Static vs Dynamic Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Another top-level classification involves whether pruning is done during training, after training, or during inference.

  • Static pruning. Changing the model during or after training, but not during inference.
  • Dynamic pruning. Inference decisions that effectively “skip” parts of the network for a given query.

If we wanted consistency between the pruning and quantization research literature, we really should call these Pruning-Aware Training (PAT) and Post-Training Pruning (PTP), but hey, who needs consistency? These terms are rarely used in pruning research papers, so I'm leading you astray.

Static pruning creates a fixed model, with lots of zeros, that doesn't change during inference. The pruned model should run faster, and should also be smaller in memory, provided the zeros are stored in a compressed sparse format rather than as explicit zero weights.
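As a rough illustration of the speed benefit, here is a minimal C++ sketch of a dot product kernel that tests for pruned (zero) weights and skips their multiplications. All of the names here are my own for illustration, not from any particular engine:

    #include <vector>

    // Dot product over a statically pruned weight vector:
    // zeroed (pruned) weights are tested and skipped.
    // Assumes weights and inputs have the same length.
    float dot_product_skip_zeros(const std::vector<float>& weights,
                                 const std::vector<float>& inputs)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < weights.size(); i++) {
            if (weights[i] != 0.0f) {  // skip pruned weights
                sum += weights[i] * inputs[i];
            }
        }
        return sum;
    }

In practice, the branch may cost more than the multiplication it avoids, which is one reason real engines prefer compressed sparse formats or structured pruning over per-weight zero tests.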

Dynamic pruning doesn't actually cut anything out of the model's data structures, but “prunes” parts of the model at runtime by ignoring some of the weights. It effectively removes weights by selectively skipping small weights or whole structures, depending on the user's input query. The weights are not permanently zeroed in dynamic pruning, and may be used again for the next user's query, which might skip different parts of the model. Hence, dynamic pruning improves runtime efficiency with fewer computations and fewer memory-to-cache data transfers, but does not reduce the size of the model in memory.

Note that static pruning can be unstructured or structured, and most pruning techniques can be applied statically. More specifically, unstructured pruning is usually done during training, whereas structured pruning can be done during or after training, with or without additional fine-tuning. For example, static pruning could mean the pre-inference removal of individual low-magnitude weights scattered throughout the model (unstructured magnitude pruning) or of entire layers (structured layer pruning).
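Here is a minimal C++ sketch of what a naive post-training magnitude pruning pass could look like, permanently zeroing the small weights. The function name and threshold parameter are illustrative assumptions, not an established API:

    #include <cmath>
    #include <vector>

    // Static unstructured magnitude pruning: permanently zero
    // every weight whose absolute value is below the threshold.
    void magnitude_prune(std::vector<float>& weights, float threshold)
    {
        for (float& w : weights) {
            if (std::fabs(w) < threshold) {
                w = 0.0f;  // pruned for all subsequent queries
            }
        }
    }

The threshold is typically tuned so that accuracy remains acceptable, and a round of fine-tuning often follows to recover any accuracy that was lost.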

However, dynamic pruning doesn't make much sense for unstructured pruning, because the model weights and other parameters are static during inference. I guess it's theoretically possible to do “dynamic unstructured pruning” by turning some of the weights on and off, such as by setting a magnitude threshold below which weights are not used in vector dot products, or by using bit vectors to mask different sets of weights, but I don't think I've seen a paper on it.
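For what it's worth, that hypothetical scheme would only be a small change to the dot product kernel, moving the magnitude test from a one-off pruning pass into the inference path. Again, this is a speculative sketch with made-up names, not a method from the literature:

    #include <cmath>
    #include <vector>

    // Hypothetical dynamic unstructured pruning: weights below
    // the runtime threshold are skipped for this query only.
    // The stored weights are untouched, and remain available
    // in full for the next query.
    float dot_product_dynamic_prune(const std::vector<float>& weights,
                                    const std::vector<float>& inputs,
                                    float threshold)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < weights.size(); i++) {
            if (std::fabs(weights[i]) >= threshold) {
                sum += weights[i] * inputs[i];
            }
        }
        return sum;
    }

Note that this saves multiplications but not memory traffic, since every weight must still be loaded into cache to test its magnitude.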

Dynamic pruning during inference almost always means structured pruning, where we avoid processing an entire structure, such as a layer or an attention head. Dynamic layer pruning, where whole model layers are skipped at runtime, is usually called “early exit,” “layer skipping,” or “depth pruning” in the papers. Also possible is dynamic pruning along the width dimension, such as skipping whole attention heads, which is known as “attention head pruning” or “width pruning” in the literature.
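The control flow for dynamic layer pruning is simple in outline. Here is a minimal C++ sketch of an early-exit decoding loop, with the layer computation and the exit criterion stubbed out; all of these names are illustrative assumptions, not any engine's real API:

    #include <vector>

    struct Layer { std::vector<float> weights; };  // placeholder type

    // Stub for a full Transformer layer (attention + FFN).
    std::vector<float> run_layer(const Layer& layer, std::vector<float> x)
    {
        (void)layer;  // real code would use the layer's weights here
        return x;
    }

    // Stub exit criterion, e.g. a confidence or entropy test
    // on the intermediate activations.
    bool confident_enough(const std::vector<float>& x)
    {
        (void)x;
        return false;  // real code decides per query
    }

    // Early exit (dynamic layer pruning): stop processing layers
    // once the intermediate result is judged good enough, skipping
    // the remaining layers for this query only.
    std::vector<float> forward_early_exit(const std::vector<Layer>& layers,
                                          std::vector<float> x)
    {
        for (const Layer& layer : layers) {
            x = run_layer(layer, x);
            if (confident_enough(x)) {
                break;  // dynamically skip the rest of the layers
            }
        }
        return x;
    }

The model itself is unchanged by this loop, which is exactly the point: the same query-dependent decision can skip different layers for the next query.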

 
