Aussie AI

Layer Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


Layer pruning is a type of structured pruning because it prunes entire layers. More precisely, it is a form of "depth pruning", because it reduces the depth of the stacks of encoders and/or decoders in the Transformer architecture. The technique is sometimes called "layer compaction".

Dynamic layer pruning is called "early exiting" if all remaining layers are skipped, or "layer skipping" if only the current layer is skipped. Reducing the number of layers in the decoder gives a "shallow decoder" architecture, which has been found to be effective because encoder layers are more important in a Transformer than decoder layers. Layer pruning is also related to "layer fusion" (usually via parameter sharing) and layer reordering.
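The two dynamic variants can be sketched as decisions inside the main layer loop of the inference engine. This is a minimal illustration, not a real Transformer: run_layer is a stand-in for a full attention-plus-FFN layer, and the function and parameter names (infer_with_depth_pruning, skip_layer, exit_threshold) are hypothetical.

```cpp
#include <vector>

// Stand-in for one Transformer layer's computation; a real layer would
// run attention and the FFN. Here it just nudges a scalar "confidence".
static float run_layer(int layer, float state) {
    return state + 0.1f * (layer + 1);
}

// Dynamic depth pruning: per-layer skipping plus early exit.
// Returns the number of layers actually executed.
int infer_with_depth_pruning(int num_layers,
                             const std::vector<bool>& skip_layer, // layer skipping
                             float exit_threshold,                // early exiting
                             float& state) {
    int executed = 0;
    for (int i = 0; i < num_layers; ++i) {
        if (skip_layer[i]) continue;        // "layer skipping": skip only this layer
        state = run_layer(i, state);
        ++executed;
        if (state >= exit_threshold) break; // "early exiting": skip all remaining layers
    }
    return executed;
}
```

In a real engine, skip_layer would be driven by an adaptive criterion (e.g. per-layer importance measured offline), and exit_threshold would compare against a confidence estimate computed from the hidden state.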

In practice, layer pruning removes one or more entire layers from the model. Most AI models have multiple hidden layers of nodes, and sometimes a layer can be removed without too great a loss in model accuracy. A layer can be removed statically, creating a new model file, or dynamically via some adaptive criterion. Most of the literature focuses on dynamic layer pruning via early exit of the inference algorithm, when it detects that a threshold of accuracy has been achieved.
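The static case amounts to building a new, shallower model that keeps only a chosen subset of layers. A minimal sketch, where LayerWeights and prune_layers are hypothetical names and the keep list is assumed to come from offline analysis of each layer's contribution to accuracy:

```cpp
#include <vector>

// Stand-in for one layer's weights; a real layer would hold the
// attention and FFN weight tensors.
struct LayerWeights { int layer_id; };

// Static layer pruning: copy only the layers whose indices appear in
// `keep`, producing a shallower model with less compute per token.
std::vector<LayerWeights> prune_layers(const std::vector<LayerWeights>& model,
                                       const std::vector<int>& keep) {
    std::vector<LayerWeights> pruned;
    pruned.reserve(keep.size());
    for (int idx : keep) {
        pruned.push_back(model[idx]);  // retain this layer
    }
    return pruned;
}
```

The pruned model would then be saved as a new model file; keeping the first and last few layers (e.g. indices {0, 1, 2, 3, 10, 11} of a 12-layer stack) is one common heuristic.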

Layer pruning can be combined with many other methods to create hybrid optimizations. For example, it is orthogonal to quantization, to width pruning (e.g. attention head pruning), and to length pruning (e.g. token pruning, embeddings pruning).
