Aussie AI

Layer Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


Layer pruning is a type of structured pruning because it prunes entire layers. More precisely, it is a form of "depth pruning", because it reduces the depth of the stacks of encoders and/or decoders in the Transformer architecture. The technique is sometimes called "layer compaction".

Dynamic layer pruning is called "early exiting" if all remaining layers are skipped, or "layer skipping" if only the current layer is skipped. Reducing the number of layers in the decoder gives a "shallow decoder" architecture, which has been found to be effective because encoder layers are more important in a Transformer than decoder layers. Layer pruning is also related to "layer fusion" (usually via parameter sharing) and layer reordering.
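The two dynamic variants can be sketched as decisions inside the main layer loop of the inference engine. This is a minimal illustration, not a real Transformer: run_layer is a stand-in for a full attention-plus-FFN layer, and the function and parameter names (infer_with_depth_pruning, skip_layer, exit_threshold) are hypothetical.

```cpp
#include <vector>

// Stand-in for one Transformer layer's computation; a real layer would
// run attention and the FFN. Here it just nudges a scalar "confidence".
static float run_layer(int layer, float state) {
    return state + 0.1f * (layer + 1);
}

// Dynamic depth pruning: per-layer skipping plus early exit.
// Returns the number of layers actually executed.
int infer_with_depth_pruning(int num_layers,
                             const std::vector<bool>& skip_layer, // layer skipping
                             float exit_threshold,                // early exiting
                             float& state) {
    int executed = 0;
    for (int i = 0; i < num_layers; ++i) {
        if (skip_layer[i]) continue;        // "layer skipping": skip only this layer
        state = run_layer(i, state);
        ++executed;
        if (state >= exit_threshold) break; // "early exiting": skip all remaining layers
    }
    return executed;
}
```

In a real engine, skip_layer would be driven by an adaptive criterion (e.g. per-layer importance measured offline), and exit_threshold would compare against a confidence estimate computed from the hidden state.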

In practice, layer pruning removes one or more entire layers from the model. Most AI models have multiple hidden layers of nodes, and sometimes a layer can be removed without too great a loss in model accuracy. A layer can be removed statically, creating a new model file, or dynamically via some adaptive criterion. Most of the literature focuses on dynamic layer pruning via early exit of the inference algorithm, when it detects that a threshold of accuracy has been achieved.
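The static case amounts to building a new, shallower model that keeps only a chosen subset of layers. A minimal sketch, where LayerWeights and prune_layers are hypothetical names and the keep list is assumed to come from offline analysis of each layer's contribution to accuracy:

```cpp
#include <vector>

// Stand-in for one layer's weights; a real layer would hold the
// attention and FFN weight tensors.
struct LayerWeights { int layer_id; };

// Static layer pruning: copy only the layers whose indices appear in
// `keep`, producing a shallower model with less compute per token.
std::vector<LayerWeights> prune_layers(const std::vector<LayerWeights>& model,
                                       const std::vector<int>& keep) {
    std::vector<LayerWeights> pruned;
    pruned.reserve(keep.size());
    for (int idx : keep) {
        pruned.push_back(model[idx]);  // retain this layer
    }
    return pruned;
}
```

The pruned model would then be saved as a new model file; keeping the first and last few layers (e.g. indices {0, 1, 2, 3, 10, 11} of a 12-layer stack) is one common heuristic.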

Layer pruning can be combined with many other methods to create hybrid optimizations. For example, it is orthogonal to quantization, to width pruning (e.g. attention head pruning), and to length pruning (e.g. token pruning, embeddings pruning).
