
Model Compression

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Model compression is the general class of AI optimizations that reduce the size of the model. These methods are generally considered “static” optimizations, because the model is shrunk during or after training and does not change at runtime during inference. The goal of model compression is two-fold:

    (a) model size reduction, and

    (b) latency optimization.

The model size is reduced either by storing fewer weights (e.g. pruning and sparsity) or by using smaller data types (e.g. quantization). The AI engine then runs inference faster on the more compact model by (a) requiring fewer calculations overall, (b) using integer arithmetic (with some techniques), and (c) improving memory-bound inference, because fewer memory-to-cache data transfers are needed.
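
To make the smaller-data-type idea concrete, here is a minimal C++ sketch of symmetric post-training quantization of a weight vector to 8-bit integers, with dequantization for comparison. The function names and the simple per-tensor scaling scheme are illustrative assumptions for this sketch, not the code of any particular engine.

    #include <cstdint>
    #include <cmath>
    #include <vector>
    #include <algorithm>
    #include <cstdio>

    // Illustrative per-tensor symmetric quantization: map floats in
    // [-max|w|, +max|w|] onto the int8 range [-127, +127].
    float compute_scale(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        return maxabs > 0.0f ? maxabs / 127.0f : 1.0f;
    }

    std::vector<int8_t> quantize(const std::vector<float>& w, float scale) {
        std::vector<int8_t> q(w.size());
        for (size_t i = 0; i < w.size(); ++i) {
            float r = std::round(w[i] / scale);          // nearest integer level
            r = std::max(-127.0f, std::min(127.0f, r));  // clamp to int8 range
            q[i] = static_cast<int8_t>(r);
        }
        return q;
    }

    float dequantize(int8_t q, float scale) { return q * scale; }

    int main() {
        std::vector<float> w = {0.12f, -0.5f, 0.33f, -0.07f};
        float scale = compute_scale(w);
        std::vector<int8_t> q = quantize(w, scale);
        for (size_t i = 0; i < w.size(); ++i)
            std::printf("%+.4f -> %4d -> %+.4f\n",
                        w[i], q[i], dequantize(q[i], scale));
        // Storage drops from 4 bytes to 1 byte per weight,
        // plus one scale factor per tensor.
    }

With this scheme each weight shrinks from 4 bytes (float) to 1 byte (int8_t), which is the familiar 4x size reduction of int8 quantization; the dequantized values show the small rounding error that the model must tolerate.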

Model compression techniques have been highly successful and are widely used, second only to hardware acceleration in their impact on the AI industry. The three main model compression techniques with widespread industry usage are as follows, with a small pruning example sketched after the list:

  • Quantization
  • Model pruning
  • Knowledge distillation
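
Of these three, pruning is the simplest to sketch in code. Below is a minimal, hypothetical example of unstructured magnitude pruning: any weight whose absolute value falls below a threshold is zeroed, after which a sparsity-aware kernel or sparse storage format can skip it. The threshold value and in-place zeroing are assumptions for illustration only.

    #include <cmath>
    #include <vector>
    #include <cstdio>

    // Illustrative unstructured magnitude pruning: zero any weight whose
    // magnitude is below the threshold, and report the resulting sparsity.
    double prune_by_magnitude(std::vector<float>& weights, float threshold) {
        size_t zeroed = 0;
        for (float& w : weights) {
            if (std::fabs(w) < threshold) {
                w = 0.0f;  // pruned weight; a sparse kernel can skip it
                ++zeroed;
            }
        }
        return static_cast<double>(zeroed) / weights.size();
    }

    int main() {
        std::vector<float> w = {0.9f, -0.01f, 0.3f, 0.002f, -0.6f, 0.05f};
        double sparsity = prune_by_magnitude(w, 0.1f);  // illustrative threshold
        std::printf("sparsity after pruning: %.0f%%\n", sparsity * 100.0);
    }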

Various lesser-known model compression methods also appear in the research literature (see the low-rank sketch after this list):

  • Low-rank factorization of matrices (tensor decomposition)
  • Weight sharing
  • Layer fusion
  • Weight clustering
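
To make the low-rank idea concrete, here is a short C++ sketch of the savings: instead of multiplying by an m-by-n weight matrix W, a rank-r factorization stores two smaller matrices A (m-by-r) and B (r-by-n) and computes y = A(Bx) in two steps. The matrices below are dummy data and the sizes are assumptions; the point is the operation and parameter counts, not a real decomposition algorithm (which would typically use a truncated SVD).

    #include <vector>
    #include <cstdio>

    // Multiply a dense (rows x cols) row-major matrix by vector x.
    std::vector<float> matvec(const std::vector<float>& m, int rows, int cols,
                              const std::vector<float>& x) {
        std::vector<float> y(rows, 0.0f);
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                y[i] += m[i * cols + j] * x[j];
        return y;
    }

    int main() {
        const int M = 512, N = 512, R = 32;  // illustrative sizes, rank R << M,N
        std::vector<float> A(M * R, 0.01f);  // dummy factor A: M x R
        std::vector<float> B(R * N, 0.01f);  // dummy factor B: R x N
        std::vector<float> x(N, 1.0f);

        // y = A * (B * x): two thin matrix-vector products replace one big one.
        std::vector<float> y = matvec(A, M, R, matvec(B, R, N, x));

        std::printf("full W: %d params; factors A+B: %d params (%.1fx smaller)\n",
                    M * N, M * R + R * N, double(M * N) / (M * R + R * N));
        std::printf("y[0] = %f\n", y[0]);
    }

With these sizes, the factors hold 32,768 parameters instead of 262,144, an 8x reduction, at the cost of some approximation error in the reconstructed weights.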

Some other research techniques share the goal of a smaller, simpler model (see the adder-network sketch after this list):

  • Big-little architectures
  • Speculative decoding
  • Logarithmic models
  • Zero-multiplication models (e.g. adder networks)
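
As one example from this list, adder networks replace the multiply-accumulate of a dot product with additions only, typically scoring similarity as a negative L1 distance between the weight vector and the input. The sketch below shows that substitution in isolation, assuming the negative-L1 formulation from the AdderNet line of research; real implementations also adapt normalization and training.

    #include <cmath>
    #include <vector>
    #include <cstdio>

    // Conventional dot product: one multiplication per weight.
    float dot(const std::vector<float>& w, const std::vector<float>& x) {
        float s = 0.0f;
        for (size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
        return s;
    }

    // Adder-network style similarity: negative L1 distance, using only
    // subtraction, absolute value, and addition (no multiplications).
    float adder_similarity(const std::vector<float>& w,
                           const std::vector<float>& x) {
        float s = 0.0f;
        for (size_t i = 0; i < w.size(); ++i) s -= std::fabs(w[i] - x[i]);
        return s;
    }

    int main() {
        std::vector<float> w = {0.5f, -0.2f, 0.8f};
        std::vector<float> x = {0.4f, -0.1f, 0.9f};
        std::printf("multiply-based: %f\n", dot(w, x));
        std::printf("addition-based: %f\n", adder_similarity(w, x));
    }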

Survey papers. Various survey papers on model compression techniques include:

  1. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang (2023), A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
  2. Canwen Xu, Julian McAuley (2022), A Survey on Model Compression and Acceleration for Pretrained Language Models, arXiv preprint arXiv:2202.07105, Nov 2022, https://arxiv.org/abs/2202.07105
  3. T Choudhary, V Mishra, A Goswami (2020), A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, 2020, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
  4. Y Cheng, D Wang, P Zhou, T Zhang (2020), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, June 2020 (revised), https://arxiv.org/abs/1710.09282
  5. K Nan, S Liu, J Du, H Liu (2019), Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762, PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
  6. Yu Cheng, Duo Wang, Pan Zhou, Tao Zhang (2018), Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine (Volume 35, Issue 1, January 2018), https://ieeexplore.ieee.org/document/8253600
  7. G Menghani (2023), Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
  8. L Deng, G Li, S Han, L Shi, Y Xie (2020), Model compression and hardware acceleration for neural networks: A comprehensive survey, Proceedings of the IEEE (Volume 108, Issue 4, April 2020), https://ieeexplore.ieee.org/abstract/document/9043731
  9. K Ramesh, A Chavan, S Pandit (2023), A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
  10. W Li, H Hacid, E Almazrouei, M Debbah (2023), A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing on edge devices, including model compression.)
  11. A Jaiswal, Z Gan, X Du, B Zhang, Z Wang, Y Yang (2023), Compressing LLMs: The Truth is Rarely Pure and Never Simple, arXiv preprint arXiv:2310.01382, Oct 2023, https://browse.arxiv.org/pdf/2310.01382.pdf

For more general research on model compression, refer to https://www.aussieai.com/research/model-compression. All of the individual model compression strategies (e.g. quantization, pruning) are discussed in detail in the following chapters.
