Model Compression

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Model compression is the general class of optimizations that “compress” a model down to a smaller size. The goal is usually both memory and speed optimization via a smaller model that requires fewer operations and/or lower-precision arithmetic. Some techniques work directly on an already-trained model, while others require a brief re-training or fine-tuning follow-up.

A key point about reducing model size is that the number of weights is usually directly correlated with: (a) the number of arithmetic operations at runtime, and (b) the total bytes of memory-to-cache data transfers required. Hence, shrinking a model can proportionally reduce its time cost, which is not true of all space optimizations.
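
To make this correlation concrete, here is a small back-of-the-envelope sketch in C++. The layer dimensions (4096 by 4096) are hypothetical example values, not figures from the book, and the counts are rough estimates for a single matrix-vector multiply:

// Rough cost estimate for one matrix-vector multiply layer:
// a weight matrix of R rows x C columns has R*C weights,
// needs about 2*R*C arithmetic operations (one multiply and one add each),
// and must stream all R*C weight values from memory on each call.
#include <cstdio>

int main() {
    const long long rows = 4096, cols = 4096;   // hypothetical layer size
    const long long weights = rows * cols;
    const long long ops = 2 * weights;          // multiply + add per weight
    const long long bytes_fp32 = weights * 4;   // 32-bit float weights
    const long long bytes_int8 = weights * 1;   // 8-bit quantized weights
    printf("Weights: %lld\n", weights);
    printf("Arithmetic ops per call: %lld\n", ops);
    printf("Weight memory traffic: %lld bytes (FP32) vs %lld bytes (INT8)\n",
        bytes_fp32, bytes_int8);
    return 0;
}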

We've already examined many of the possible ways to make a model smaller in earlier chapters. The most popular types of model compression are listed below (a minimal quantization sketch follows the list):

  • Quantization
  • Pruning
  • Distillation
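
As a minimal illustration of the first item above, the sketch below shows one simple form of post-training quantization: symmetric linear quantization of an FP32 weight array down to INT8. It is a simplified, assumption-laden example (a handful of made-up weight values, no zero-point, no per-channel scales), not the full treatment given in its own chapter:

// Minimal sketch: symmetric linear quantization of FP32 weights to INT8.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> weights = { 0.12f, -0.50f, 0.33f, -0.08f };  // example weights
    // The scale maps the largest absolute weight onto the signed 8-bit range [-127, 127].
    float maxabs = 0.0f;
    for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
    float scale = maxabs / 127.0f;
    std::vector<int8_t> q(weights.size());
    for (size_t i = 0; i < weights.size(); ++i)
        q[i] = (int8_t)std::lround(weights[i] / scale);
    // De-quantize to see how much precision was lost.
    for (size_t i = 0; i < weights.size(); ++i)
        printf("w=%+.4f  q=%+4d  dequantized=%+.4f\n",
            weights[i], (int)q[i], q[i] * scale);
    return 0;
}

Real quantization schemes add refinements such as per-channel scales, a zero-point for asymmetric ranges, and sometimes a short fine-tuning pass to recover accuracy.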

There are also a number of other, less well-known types of model compression (a low-rank example is sketched after the list):

  • Weight sharing (parameter sharing)
  • Layer fusion
  • Weight clustering
  • Sparsity (not only via pruning)
  • Low-rank matrices
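
As one example from this list, low-rank matrices replace a full m-by-n weight matrix W with the product of two thinner matrices A (m-by-r) and B (r-by-n); when the rank r is much smaller than m and n, the factorization stores far fewer parameters. The sketch below only counts the parameter savings, and the dimensions and rank are hypothetical example values:

// Minimal sketch: parameter count of a full weight matrix versus
// a low-rank factorization W ~= A * B, where A is m x r and B is r x n.
#include <cstdio>

int main() {
    const long long m = 4096, n = 4096;   // hypothetical layer dimensions
    const long long r = 64;               // chosen rank, much smaller than m and n
    const long long full_params = m * n;
    const long long lowrank_params = m * r + r * n;
    printf("Full matrix W (%lld x %lld): %lld parameters\n", m, n, full_params);
    printf("Low-rank A*B (rank %lld): %lld parameters (%.1f%% of the original)\n",
        r, lowrank_params, 100.0 * (double)lowrank_params / (double)full_params);
    return 0;
}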

There are also several ensemble multi-model architectures that offer memory efficiency by having at least one small model in the mix (a big-little skeleton is sketched after the list):

  • Mixture-of-experts
  • Big-little models
  • Speculative decoding
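
As a rough illustration of the big-little idea, the skeleton below routes a request to a small, cheap model first and only falls back to the large model when the small model's confidence is below a threshold. The model functions and the confidence threshold are hypothetical placeholders for this sketch, not the book's implementation:

// Minimal sketch of a "big-little" cascade: try the small, fast model first,
// and only call the large model when the small model is not confident.
#include <cstdio>
#include <string>

struct Prediction {
    int token;          // predicted next token id
    float confidence;   // model's probability for that token
};

// Hypothetical model stubs; a real engine would run full inference here.
Prediction small_model(const std::string& context) { (void)context; return { 42, 0.95f }; }
Prediction big_model(const std::string& context)   { (void)context; return { 42, 0.99f }; }

int next_token(const std::string& context, float threshold = 0.9f) {
    Prediction p = small_model(context);   // cheap first pass
    if (p.confidence >= threshold)
        return p.token;                    // accept the little model's answer
    return big_model(context).token;       // slow path: pay for the big model
}

int main() {
    printf("Next token id: %d\n", next_token("The capital of France is"));
    return 0;
}

Speculative decoding uses a similar small-model/large-model pairing, but instead has the large model verify a batch of draft tokens proposed by the small model, rather than being called only on low-confidence cases.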

All of these model compression techniques are discussed in separate chapters. Whenever fewer computations are required, there is the added benefit that fewer memory transfers are needed for the data at runtime.

 
