Aussie AI

What is Depth Pruning?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What is Depth Pruning?

Depth pruning is the removal of layers from a Transformer to adjust the “depth” to which computation proceeds. Transformers have a stack of layers in their encoder and/or decoder, which can be “deep” with many layers, or “shallow” with only a few. Layers can be statically pruned from the model file, or skipped at runtime via early exit.
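As a concrete picture, depth is simply the number of layers that the inference loop executes. The following is a minimal C++ sketch, with hypothetical Tensor and Layer types and a placeholder apply_layer function (none of these names are from a real engine): a statically depth-pruned model ships with a shorter layer vector, so the same loop naturally runs fewer iterations.

#include <cstddef>
#include <vector>

// Hypothetical placeholder types: one layer's weights, and activations.
struct Layer  { /* attention and FFN weights would live here */ };
struct Tensor { std::vector<float> data; };

// Stand-in for a full Transformer layer (attention + FFN).
Tensor apply_layer(const Layer&, Tensor x) {
    for (float& v : x.data) v *= 1.01f;  // placeholder computation
    return x;
}

// Execute only the first 'depth' layers of the stack. A statically
// depth-pruned model simply ships with a smaller layers.size().
Tensor run_stack(const std::vector<Layer>& layers, Tensor x, std::size_t depth) {
    for (std::size_t i = 0; i < depth && i < layers.size(); ++i)
        x = apply_layer(layers[i], x);
    return x;
}

int main() {
    std::vector<Layer> layers(12);           // a 12-layer stack
    Tensor x{std::vector<float>(8, 1.0f)};   // dummy activations
    x = run_stack(layers, x, 6);             // depth-pruned: run 6 of 12 layers
    return 0;
}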

The most common type of depth pruning is layer pruning, of which the dynamic form is called early exit inference. However, there are other types of depth pruning in non-Transformer architectures, such as cascades in DNNs/CNNs.

Like all types of pruning, depth pruning can be performed statically or dynamically. The two main forms of dynamic depth pruning are “early exit” and layer skipping, both types of dynamic layer pruning. Static depth pruning is a form of model compression, such as static layer pruning or layer fusion, where entire layers of weights are removed from the model file.
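To make the static/dynamic distinction concrete, here is a hypothetical early-exit loop in C++. All types are invented placeholders, and the confidence function is a stand-in for a real per-layer “exit head” classifier (e.g., thresholding the maximum softmax probability): inference stops as soon as the confidence estimate crosses a threshold, so easy inputs use fewer layers.

#include <algorithm>
#include <vector>

// Hypothetical placeholder types, as before.
struct Layer  { };
struct Tensor { std::vector<float> data; };

Tensor apply_layer(const Layer&, Tensor x) {
    for (float& v : x.data) v *= 1.01f;  // placeholder layer computation
    return x;
}

// Placeholder confidence estimate. A real system would run a small
// per-layer "exit head" and, e.g., threshold the max softmax probability.
float confidence(const Tensor& x) {
    return x.data.empty() ? 0.0f
                          : *std::max_element(x.data.begin(), x.data.end());
}

// Early exit: stop as soon as the model is confident enough,
// skipping all remaining layers for this input.
Tensor run_with_early_exit(const std::vector<Layer>& layers, Tensor x,
                           float threshold) {
    for (const Layer& layer : layers) {
        x = apply_layer(layer, x);
        if (confidence(x) >= threshold)
            break;  // dynamic depth pruning: exit the stack early
    }
    return x;
}

int main() {
    std::vector<Layer> layers(12);
    Tensor x{std::vector<float>(8, 1.0f)};
    x = run_with_early_exit(layers, x, 1.05f);  // exits after ~5 layers here
    return 0;
}

The trade-off is the extra cost of the confidence check after each layer, which must be cheap relative to a full layer for early exit to pay off.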

Types of Depth Pruning. Various subtypes of depth pruning include:

  • Static layer pruning
  • Early exit (dynamic layer pruning)
  • Layer skipping (see the sketch after this list)
  • Layer fusion
  • Layer reordering
  • Cascades (in DNNs/CNNs)
  • Shallow decoder Transformer architecture
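
Of these subtypes, layer skipping is the easiest to sketch in code. Below is a hypothetical C++ loop (all names and types invented for illustration) where a per-layer boolean mask chooses which layers run at inference time; the model file itself keeps all of its layers.

#include <cstddef>
#include <vector>

// Hypothetical placeholder types, as before.
struct Layer  { };
struct Tensor { std::vector<float> data; };

Tensor apply_layer(const Layer&, Tensor x) {
    for (float& v : x.data) v *= 1.01f;  // placeholder layer computation
    return x;
}

// Layer skipping: a per-layer mask chooses which layers run at
// runtime; the model file is unchanged.
Tensor run_with_skipping(const std::vector<Layer>& layers, Tensor x,
                         const std::vector<bool>& run_mask) {
    for (std::size_t i = 0; i < layers.size() && i < run_mask.size(); ++i) {
        if (run_mask[i])
            x = apply_layer(layers[i], x);  // false in the mask = skip layer
    }
    return x;
}

int main() {
    std::vector<Layer> layers(12);
    std::vector<bool> mask(layers.size());
    for (std::size_t i = 0; i < mask.size(); ++i)
        mask[i] = (i % 2 == 0);              // skip every second layer
    Tensor x{std::vector<float>(8, 1.0f)};
    x = run_with_skipping(layers, x, mask);  // runs 6 of the 12 layers
    return 0;
}

Static layer pruning achieves the same effect offline by deleting the masked-out layers from the model file, so that the loop itself shrinks.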

There are multiple dimensions along which a model can be pruned. Depth pruning is orthogonal to pruning in the other model dimensions: width pruning and length pruning. As such, depth pruning can be combined with these other types of pruning, as in dual pruning and triple pruning (generally called “multi-dimensional pruning”).

 
