Aussie AI

Unstructured and Structured Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


The main categories of pruning are distinguished by the strategy used to select which weights to prune:

  • Unstructured pruning. Remove low-magnitude weights regardless of where they occur in the model.
  • Structured pruning. Everything else, meaning the removal of whole groups of weights from a specific part of the model's structure (e.g. all weights in a layer, channel, or filter).

You can do both, which is called “hybrid pruning” or “semi-structured pruning.” For example, a hybrid strategy might apply unstructured magnitude pruning to all the weights within one structural component, rather than removing that component entirely (e.g. layer-specific magnitude pruning).

Weight pruning, or “magnitude pruning,” is unstructured pruning that removes weights of very low magnitude, including small negative weights. Technically, weight pruning is the general class of pruning decisions made about one weight at a time, whereas magnitude pruning is the specific case of cutting weights whose absolute value is near zero (below some threshold). Magnitude pruning obviously also removes exact zero weights, including the nebulous oddity called “negative zero” if it is found.
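
As a minimal sketch of magnitude pruning (the threshold value and function name are illustrative, not from any particular engine), the core operation is simply to zero any weight whose absolute value falls below a cutoff:

    #include <cmath>
    #include <cstddef>

    // Unstructured magnitude pruning (illustrative sketch):
    // zero every weight whose absolute value is below the threshold.
    // Returns the number of weights that were pruned.
    size_t magnitude_prune(float* weights, size_t n, float threshold)
    {
        size_t pruned = 0;
        for (size_t i = 0; i < n; ++i) {
            if (std::fabs(weights[i]) < threshold) {  // also catches +0.0 and -0.0
                weights[i] = 0.0f;
                ++pruned;
            }
        }
        return pruned;
    }

The same loop can be run on the weights of a single layer or channel, which is the basis of a hybrid, layer-specific strategy as described above.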

Unstructured pruning removes small weights wherever it finds candidates. We don't know in advance which structures of the model will be pruned; the resulting sparsity pattern is effectively random, and at runtime we don't know which elements of any given vector will be zero.

The underlying basis of pruning is that low-magnitude weights contribute little to the overall inference, and hence skipping these calculations should not greatly affect the overall accuracy of the model. In theory this works quite well, and magnitude pruning is the simplest and longest-established type of pruning, with a large body of research behind it.

An important practical aspect of unstructured pruning is that it offers no latency benefit unless multiplication by zeros in vector dot products can be efficiently avoided. For example, if a vector with some zeroed elements is still sent in full to an accelerator, any benefit depends on the characteristics of the hardware, or on the algorithms used by the deep learning compiler (or graph compiler), and the latency improvement may be limited.
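
As a rough illustration of why this is so (assuming a simple sequential CPU loop, not any particular accelerator kernel), compare a plain dot product with a zero-skipping variant: the zero test itself costs time on every element, so little is gained unless sparsity is very high or the nonzero weights are compacted into a sparse storage format.

    #include <cstddef>

    // Dense dot product: multiplies every element, including zeroed weights.
    float dot_dense(const float* w, const float* v, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += w[i] * v[i];
        return sum;
    }

    // Zero-skipping variant: avoids the multiply for pruned weights, but the
    // per-element test adds overhead, so sequential code sees little speedup
    // unless the zeros are instead stored in a compacted (sparse) format.
    float dot_skip_zeros(const float* w, const float* v, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (w[i] != 0.0f) {
                sum += w[i] * v[i];
            }
        }
        return sum;
    }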

Instead of hoping that training gives us lots of small weights to prune, there are ways to create them intentionally. For example, quantization techniques can also increase the number of zero weights throughout the quantized model: because only a smaller number of discrete weight values is available, quantization may round low-magnitude weights down to zero. However, this depends on the level of quantization, and on the choices made about rounding versus truncation in the quantization method. Additional research has examined various other techniques to increase the number of zero or low weights, thereby making the weight matrices sparser; such work is addressed under “sparsification” and “sparse matrix” theory.
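
As a hedged sketch of how quantization creates extra zeros (the 8-bit symmetric scheme and the scale parameter here are illustrative assumptions, not a specific method from the research), rounding maps any weight with magnitude below half a quantization step to exactly zero, while truncation zeroes an even wider band:

    #include <cmath>
    #include <cstdint>

    // Illustrative 8-bit symmetric quantization of one weight.
    // With rounding, any weight with |w| < scale/2 becomes the integer 0.
    int8_t quantize_round(float w, float scale)
    {
        float q = std::round(w / scale);
        if (q > 127.0f) q = 127.0f;    // clamp to int8 range
        if (q < -128.0f) q = -128.0f;
        return static_cast<int8_t>(q);
    }

    // With truncation, the zeroed band widens to |w| < scale.
    int8_t quantize_trunc(float w, float scale)
    {
        float q = std::trunc(w / scale);
        if (q > 127.0f) q = 127.0f;
        if (q < -128.0f) q = -128.0f;
        return static_cast<int8_t>(q);
    }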

 

