Model Compression

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Model compression is the general class of optimizations that “compress” a model down to a smaller size. The goal is usually both memory and speed optimization via a smaller model that requires fewer operations and/or lower-precision arithmetic. Some techniques work directly on an already-trained model, while others require a brief re-training or fine-tuning follow-up.

A key point about reducing model size is that the number of weights is usually directly correlated with: (a) the number of arithmetic operations at runtime, and (b) the total bytes of memory-to-cache data transfers required. Hence, shrinking a model can proportionally reduce its time cost, which is not true of all space optimizations.
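
To make this correlation concrete, here is a small back-of-the-envelope sketch in C++. The layer dimensions (4096 by 4096) are hypothetical example values, not figures from the book, and the counts are rough estimates for a single matrix-vector multiply:

// Rough cost estimate for one matrix-vector multiply layer:
// a weight matrix of R rows x C columns has R*C weights,
// needs about 2*R*C arithmetic operations (one multiply and one add each),
// and must stream all R*C weight values from memory on each call.
#include <cstdio>

int main() {
    const long long rows = 4096, cols = 4096;   // hypothetical layer size
    const long long weights = rows * cols;
    const long long ops = 2 * weights;          // multiply + add per weight
    const long long bytes_fp32 = weights * 4;   // 32-bit float weights
    const long long bytes_int8 = weights * 1;   // 8-bit quantized weights
    printf("Weights: %lld\n", weights);
    printf("Arithmetic ops per call: %lld\n", ops);
    printf("Weight memory traffic: %lld bytes (FP32) vs %lld bytes (INT8)\n",
        bytes_fp32, bytes_int8);
    return 0;
}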

We've already examined many of the possible ways to make a model smaller in earlier chapters. The most popular types of model compression are listed below (a minimal quantization sketch follows the list):

  • Quantization
  • Pruning
  • Distillation
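
As a minimal illustration of the first item above, the sketch below shows one simple form of post-training quantization: symmetric linear quantization of an FP32 weight array down to INT8. It is a simplified, assumption-laden example (a handful of made-up weight values, no zero-point, no per-channel scales), not the full treatment given in its own chapter:

// Minimal sketch: symmetric linear quantization of FP32 weights to INT8.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> weights = { 0.12f, -0.50f, 0.33f, -0.08f };  // example weights
    // The scale maps the largest absolute weight onto the signed 8-bit range [-127, 127].
    float maxabs = 0.0f;
    for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
    float scale = maxabs / 127.0f;
    std::vector<int8_t> q(weights.size());
    for (size_t i = 0; i < weights.size(); ++i)
        q[i] = (int8_t)std::lround(weights[i] / scale);
    // De-quantize to see how much precision was lost.
    for (size_t i = 0; i < weights.size(); ++i)
        printf("w=%+.4f  q=%+4d  dequantized=%+.4f\n",
            weights[i], (int)q[i], q[i] * scale);
    return 0;
}

Real quantization schemes add refinements such as per-channel scales, a zero-point for asymmetric ranges, and sometimes a short fine-tuning pass to recover accuracy.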

There are also a number of other, less well-known types of model compression (a low-rank example is sketched after the list):

  • Weight sharing (parameter sharing)
  • Layer fusion
  • Weight clustering
  • Sparsity (not only via pruning)
  • Low-rank matrices
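
As one example from this list, low-rank matrices replace a full m-by-n weight matrix W with the product of two thinner matrices A (m-by-r) and B (r-by-n); when the rank r is much smaller than m and n, the factorization stores far fewer parameters. The sketch below only counts the parameter savings, and the dimensions and rank are hypothetical example values:

// Minimal sketch: parameter count of a full weight matrix versus
// a low-rank factorization W ~= A * B, where A is m x r and B is r x n.
#include <cstdio>

int main() {
    const long long m = 4096, n = 4096;   // hypothetical layer dimensions
    const long long r = 64;               // chosen rank, much smaller than m and n
    const long long full_params = m * n;
    const long long lowrank_params = m * r + r * n;
    printf("Full matrix W (%lld x %lld): %lld parameters\n", m, n, full_params);
    printf("Low-rank A*B (rank %lld): %lld parameters (%.1f%% of the original)\n",
        r, lowrank_params, 100.0 * (double)lowrank_params / (double)full_params);
    return 0;
}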

There are also several ensemble multi-model architectures that offer memory efficiency by having at least one small model in the mix (a big-little skeleton is sketched after the list):

  • Mixture-of-experts
  • Big-little models
  • Speculative decoding
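
As a rough illustration of the big-little idea, the skeleton below routes a request to a small, cheap model first and only falls back to the large model when the small model's confidence is below a threshold. The model functions and the confidence threshold are hypothetical placeholders for this sketch, not the book's implementation:

// Minimal sketch of a "big-little" cascade: try the small, fast model first,
// and only call the large model when the small model is not confident.
#include <cstdio>
#include <string>

struct Prediction {
    int token;          // predicted next token id
    float confidence;   // model's probability for that token
};

// Hypothetical model stubs; a real engine would run full inference here.
Prediction small_model(const std::string& context) { (void)context; return { 42, 0.95f }; }
Prediction big_model(const std::string& context)   { (void)context; return { 42, 0.99f }; }

int next_token(const std::string& context, float threshold = 0.9f) {
    Prediction p = small_model(context);   // cheap first pass
    if (p.confidence >= threshold)
        return p.token;                    // accept the little model's answer
    return big_model(context).token;       // slow path: pay for the big model
}

int main() {
    printf("Next token id: %d\n", next_token("The capital of France is"));
    return 0;
}

Speculative decoding uses a similar small-model/large-model pairing, but instead has the large model verify a batch of draft tokens proposed by the small model, rather than being called only on low-confidence cases.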

All of these model compression techniques are discussed in separate chapters. Whenever fewer computations are required, there is the added benefit that fewer memory transfers are needed for the data at runtime.

 
