Aussie AI

Model Compression

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Model compression is the general technique of optimizing AI inference by making the model smaller, usually at the cost of some degree of accuracy. This can mean using fewer weights (e.g. pruning) or storing weights in smaller data types (e.g. quantization). The main established techniques with a longstanding body of research include:

  • Quantization of models. This is a well-known method whereby high-precision 32-bit floating-point weights are replaced with lower-precision data, such as 16-bit floating-point numbers, or more often integers, allowing faster integer multiplication. Integer quantization has been researched for 8-bit integers all the way down to 4-bit, 3-bit, 2-bit and even 1-bit (binary) representations. Quantizing a pre-trained model for faster inference is called Post-Training Quantization (PTQ); alternatively, quantization can be applied during model training using Quantization-Aware Training (QAT) algorithms. Quantization not only improves speed, but also reduces the model size for storage or transmission (e.g. 8-bit quantization of a 32-bit floating-point model reduces storage size by a factor of four). A minimal quantization sketch appears after this list.
  • Pruning (static and dynamic). This technique optimizes away weighted links in LLM models, for example “magnitude pruning,” which removes weights with very small magnitudes (indicating connections that contribute little to the output). With fewer model links, fewer multiplications are required. See research on model pruning and Transformer-specific pruning ideas, and the magnitude pruning sketch after this list.
  • Layer pruning / layer compaction (“pancaking”). This is conceptually a generalization of pruning, whereby whole model layers are pruned. Models typically contain multiple hidden layers of weights and nodes, each performing the same underlying inference algorithm with a different set of weights. Some layers can often be removed without significant loss of model accuracy, giving a proportional speedup; see the layer-skipping sketch after this list.
  • Knowledge distillation. This method trains a smaller, more-efficient “student” model to mimic a larger pre-trained “teacher” model. The smaller student model is then used for faster inference in the application; a sketch of the distillation loss appears after this list.
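
To make the quantization bullet concrete, here is a minimal sketch of symmetric 8-bit post-training quantization using a single per-tensor scale factor. The struct and function names are illustrative only (not from any particular framework); production quantizers typically add per-channel scales, zero-points, and fused integer matrix kernels.

// Minimal sketch of symmetric post-training quantization of a weight
// vector to 8-bit integers (illustrative names, not a library API).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedWeights {
    std::vector<int8_t> q;   // quantized weights
    float scale;             // dequantization scale factor
};

// Quantize 32-bit floats to 8-bit integers with one per-tensor scale.
QuantizedWeights quantize_int8(const std::vector<float>& w) {
    float maxabs = 0.0f;
    for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
    QuantizedWeights out;
    out.scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
    out.q.reserve(w.size());
    for (float x : w) {
        int v = (int)std::lround(x / out.scale);
        v = std::max(-127, std::min(127, v));  // clamp to int8 range
        out.q.push_back((int8_t)v);
    }
    return out;
}

// Dequantize a single weight back to float (in practice this step is
// usually fused into the integer matrix-multiplication kernel).
float dequantize_one(const QuantizedWeights& qw, size_t i) {
    return qw.q[i] * qw.scale;
}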
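
The magnitude pruning sketch referred to above is shown next: any weight whose absolute value falls below a threshold is set to zero. The function name and threshold are illustrative; real pruners usually choose the threshold to hit a target sparsity level and rely on sparse storage formats or sparse kernels to actually skip the zeroed multiplications.

// Minimal sketch of magnitude pruning: zero out any weight whose
// absolute value is below a threshold (illustrative names only).
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the number of weights that were pruned (set to zero).
size_t magnitude_prune(std::vector<float>& weights, float threshold) {
    size_t pruned = 0;
    for (float& w : weights) {
        if (std::fabs(w) < threshold) {
            w = 0.0f;   // pruned weight: its multiplication can be skipped
            ++pruned;
        }
    }
    return pruned;
}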
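
The layer-skipping sketch for layer pruning follows. The Layer and Vector types and the keep flags are hypothetical placeholders for whatever an inference engine actually uses; the point is simply that running a reduced subset of layers cuts the per-token work roughly in proportion to the number of layers removed.

// Minimal sketch of static layer pruning: run inference over a reduced
// subset of layers (Layer and Vector are hypothetical placeholders).
#include <cstddef>
#include <vector>

struct Vector { std::vector<float> data; };
struct Layer {
    Vector forward(const Vector& x) const { return x; }  // placeholder layer
};

// Run only the layers flagged as kept; keep.size() must equal layers.size().
// Pruning half the layers roughly halves the per-token inference cost.
Vector forward_pruned(const std::vector<Layer>& layers,
                      const std::vector<bool>& keep, Vector x) {
    for (size_t i = 0; i < layers.size(); ++i) {
        if (keep[i]) x = layers[i].forward(x);
    }
    return x;
}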
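
Finally, a sketch of a typical knowledge-distillation loss term: the student's temperature-softened output distribution is scored against the teacher's. This is illustrative standalone C++ (distillation training is normally done in a training framework); the usual recipe scales this soft-target loss by the square of the temperature and mixes it with the ordinary hard-label loss.

// Minimal sketch of a soft-target distillation loss (illustrative only).
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax with temperature T; higher T gives softer probabilities.
// Assumes a non-empty logits vector.
std::vector<float> softmax_t(const std::vector<float>& logits, float T) {
    std::vector<float> p(logits.size());
    float maxv = logits[0];
    for (float z : logits) if (z > maxv) maxv = z;
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp((logits[i] - maxv) / T);
        sum += p[i];
    }
    for (float& x : p) x /= sum;
    return p;
}

// Distillation term: -sum_i teacher_i * log(student_i), both softened.
float distillation_loss(const std::vector<float>& teacher_logits,
                        const std::vector<float>& student_logits, float T) {
    std::vector<float> pt = softmax_t(teacher_logits, T);
    std::vector<float> ps = softmax_t(student_logits, T);
    float loss = 0.0f;
    for (size_t i = 0; i < pt.size(); ++i) {
        loss -= pt[i] * std::log(ps[i] + 1e-9f);  // epsilon avoids log(0)
    }
    return loss;  // usually scaled by T*T and combined with hard-label loss
}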

There are several other less-used types of model compression, such as parameter sharing, layer fusion, and weight clustering.

 
