
Model Compression

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Model compression is the general class of AI optimizations that reduce the size of the model. These methods are generally considered “static” optimizations, because the model is shrunk during or after training and does not change at runtime during inference. The goal of model compression is two-fold:

    (a) model size reduction, and

    (b) latency optimization.

The model size is reduced either by storing fewer weights (e.g. pruning and sparsity) or by using smaller data types (e.g. quantization). The AI engine then runs inference faster on the more compact model by (a) requiring fewer calculations overall, (b) using integer arithmetic (with some techniques), and (c) improving memory-bound inference, because fewer memory-to-cache data transfers are needed.
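
To make the smaller-data-type idea concrete, here is a minimal C++ sketch of symmetric post-training quantization of a weight vector to 8-bit integers, with dequantization for comparison. The function names and the simple per-tensor scaling scheme are illustrative assumptions for this sketch, not the code of any particular engine.

    #include <cstdint>
    #include <cmath>
    #include <vector>
    #include <algorithm>
    #include <cstdio>

    // Illustrative per-tensor symmetric quantization: map floats in
    // [-max|w|, +max|w|] onto the int8 range [-127, +127].
    float compute_scale(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        return maxabs > 0.0f ? maxabs / 127.0f : 1.0f;
    }

    std::vector<int8_t> quantize(const std::vector<float>& w, float scale) {
        std::vector<int8_t> q(w.size());
        for (size_t i = 0; i < w.size(); ++i) {
            float r = std::round(w[i] / scale);          // nearest integer level
            r = std::max(-127.0f, std::min(127.0f, r));  // clamp to int8 range
            q[i] = static_cast<int8_t>(r);
        }
        return q;
    }

    float dequantize(int8_t q, float scale) { return q * scale; }

    int main() {
        std::vector<float> w = {0.12f, -0.5f, 0.33f, -0.07f};
        float scale = compute_scale(w);
        std::vector<int8_t> q = quantize(w, scale);
        for (size_t i = 0; i < w.size(); ++i)
            std::printf("%+.4f -> %4d -> %+.4f\n",
                        w[i], q[i], dequantize(q[i], scale));
        // Storage drops from 4 bytes to 1 byte per weight,
        // plus one scale factor per tensor.
    }

With this scheme each weight shrinks from 4 bytes (float) to 1 byte (int8_t), which is the familiar 4x size reduction of int8 quantization; the dequantized values show the small rounding error that the model must tolerate.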

Model compression techniques have been highly successful and are widely used, second only to hardware acceleration in their impact on the AI industry. The three main model compression techniques with widespread industry usage are as follows, with a small pruning example sketched after the list:

  • Quantization
  • Model pruning
  • Knowledge distillation
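
Of these three, pruning is the simplest to sketch in code. Below is a minimal, hypothetical example of unstructured magnitude pruning: any weight whose absolute value falls below a threshold is zeroed, after which a sparsity-aware kernel or sparse storage format can skip it. The threshold value and in-place zeroing are assumptions for illustration only.

    #include <cmath>
    #include <vector>
    #include <cstdio>

    // Illustrative unstructured magnitude pruning: zero any weight whose
    // magnitude is below the threshold, and report the resulting sparsity.
    double prune_by_magnitude(std::vector<float>& weights, float threshold) {
        size_t zeroed = 0;
        for (float& w : weights) {
            if (std::fabs(w) < threshold) {
                w = 0.0f;  // pruned weight; a sparse kernel can skip it
                ++zeroed;
            }
        }
        return static_cast<double>(zeroed) / weights.size();
    }

    int main() {
        std::vector<float> w = {0.9f, -0.01f, 0.3f, 0.002f, -0.6f, 0.05f};
        double sparsity = prune_by_magnitude(w, 0.1f);  // illustrative threshold
        std::printf("sparsity after pruning: %.0f%%\n", sparsity * 100.0);
    }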

Various lesser-known model compression methods also appear in the research literature (see the low-rank sketch after this list):

  • Low-rank factorization of matrices (tensor decomposition)
  • Weight sharing
  • Layer fusion
  • Weight clustering
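
To make the low-rank idea concrete, here is a short C++ sketch of the savings: instead of multiplying by an m-by-n weight matrix W, a rank-r factorization stores two smaller matrices A (m-by-r) and B (r-by-n) and computes y = A(Bx) in two steps. The matrices below are dummy data and the sizes are assumptions; the point is the operation and parameter counts, not a real decomposition algorithm (which would typically use a truncated SVD).

    #include <vector>
    #include <cstdio>

    // Multiply a dense (rows x cols) row-major matrix by vector x.
    std::vector<float> matvec(const std::vector<float>& m, int rows, int cols,
                              const std::vector<float>& x) {
        std::vector<float> y(rows, 0.0f);
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                y[i] += m[i * cols + j] * x[j];
        return y;
    }

    int main() {
        const int M = 512, N = 512, R = 32;  // illustrative sizes, rank R << M,N
        std::vector<float> A(M * R, 0.01f);  // dummy factor A: M x R
        std::vector<float> B(R * N, 0.01f);  // dummy factor B: R x N
        std::vector<float> x(N, 1.0f);

        // y = A * (B * x): two thin matrix-vector products replace one big one.
        std::vector<float> y = matvec(A, M, R, matvec(B, R, N, x));

        std::printf("full W: %d params; factors A+B: %d params (%.1fx smaller)\n",
                    M * N, M * R + R * N, double(M * N) / (M * R + R * N));
        std::printf("y[0] = %f\n", y[0]);
    }

With these sizes, the factors hold 32,768 parameters instead of 262,144, an 8x reduction, at the cost of some approximation error in the reconstructed weights.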

Some other research techniques share the goal of a smaller, simpler model (see the adder-network sketch after this list):

  • Big-little architectures
  • Speculative decoding
  • Logarithmic models
  • Zero-multiplication models (e.g. adder networks)
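
As one example from this list, adder networks replace the multiply-accumulate of a dot product with additions only, typically scoring similarity as a negative L1 distance between the weight vector and the input. The sketch below shows that substitution in isolation, assuming the negative-L1 formulation from the AdderNet line of research; real implementations also adapt normalization and training.

    #include <cmath>
    #include <vector>
    #include <cstdio>

    // Conventional dot product: one multiplication per weight.
    float dot(const std::vector<float>& w, const std::vector<float>& x) {
        float s = 0.0f;
        for (size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
        return s;
    }

    // Adder-network style similarity: negative L1 distance, using only
    // subtraction, absolute value, and addition (no multiplications).
    float adder_similarity(const std::vector<float>& w,
                           const std::vector<float>& x) {
        float s = 0.0f;
        for (size_t i = 0; i < w.size(); ++i) s -= std::fabs(w[i] - x[i]);
        return s;
    }

    int main() {
        std::vector<float> w = {0.5f, -0.2f, 0.8f};
        std::vector<float> x = {0.4f, -0.1f, 0.9f};
        std::printf("multiply-based: %f\n", dot(w, x));
        std::printf("addition-based: %f\n", adder_similarity(w, x));
    }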

Survey papers. Various survey papers on model compression techniques include:

  1. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang (2023), A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
  2. Canwen Xu, Julian McAuley (2022), A Survey on Model Compression and Acceleration for Pretrained Language Models, arXiv preprint arXiv:2202.07105, Nov 2022, https://arxiv.org/abs/2202.07105
  3. T Choudhary, V Mishra, A Goswami (2020), A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, 2020, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
  4. Y Cheng, D Wang, P Zhou, T Zhang (2020), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, June 2020 (revised), https://arxiv.org/abs/1710.09282
  5. K Nan, S Liu, J Du, H Liu (2019), Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762, PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
  6. Yu Cheng, Duo Wang, Pan Zhou, Tao Zhang (2018), Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine (Volume 35, Issue 1, January 2018), https://ieeexplore.ieee.org/document/8253600
  7. G Menghani (2023), Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
  8. L Deng, G Li, S Han, L Shi, Y Xie (2020), Model compression and hardware acceleration for neural networks: A comprehensive survey, Proceedings of the IEEE (Volume 108, Issue 4, April 2020), https://ieeexplore.ieee.org/abstract/document/9043731
  9. K Ramesh, A Chavan, S Pandit (2023), A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
  10. W Li, H Hacid, E Almazrouei, M Debbah (2023), A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing on edge devices, including model compression.)
  11. A Jaiswal, Z Gan, X Du, B Zhang, Z Wang, Y Yang (2023), Compressing LLMs: The Truth is Rarely Pure and Never Simple, arXiv preprint arXiv:2310.01382, Oct 2023, https://browse.arxiv.org/pdf/2310.01382.pdf

For more general research on model compression, refer to https://www.aussieai.com/research/model-compression. All of the individual model compression strategies (e.g. quantization, pruning) are discussed in detail in the following chapters.
