4-Bit Quantization (INT4)

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Quantization with four bits, or INT4 quantization, uses 4 bits per weight, which allows 2^4 = 16 distinct weight values. In terms of industry models, 4-bit quantization is one of the most popular quantization regimes in practical usage: it is far more common to see a 4-bit quantization of an open-source model than a binary, 2-bit, or 3-bit one. INT4 allows eight-fold storage compression, from 32 bits down to 4 bits per weight, which reduces memory requirements and can also speed up inference by reducing memory-cache transfer overheads on both CPUs and GPUs.
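
To illustrate where the 16 levels come from, here is a minimal C++ sketch of symmetric, per-tensor INT4 quantization. It is not code from the book or from any particular library; the function name quantize_int4 and the choice of a single per-tensor scale are assumptions for the example, which maps each float weight to a signed value in the range [-8, +7].

// Minimal sketch: symmetric linear quantization of float weights to INT4.
// The signed 4-bit range is [-8, +7]; a single per-tensor scale is assumed.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Int4Quantized {
    std::vector<int8_t> values;  // each entry holds one 4-bit value in [-8, +7]
    float scale;                 // dequantize: weight ~= value * scale
};

Int4Quantized quantize_int4(const std::vector<float>& weights) {
    // Choose the scale so the largest-magnitude weight maps to the edge of the range.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    float scale = (max_abs > 0.0f) ? (max_abs / 7.0f) : 1.0f;

    Int4Quantized q;
    q.scale = scale;
    q.values.reserve(weights.size());
    for (float w : weights) {
        int v = static_cast<int>(std::lround(w / scale));
        v = std::clamp(v, -8, 7);  // clamp into the 16-level INT4 range
        q.values.push_back(static_cast<int8_t>(v));
    }
    return q;
}

Real quantizers usually refine this with per-channel or per-group scales and zero-points, but the basic idea of mapping floats onto 16 integer levels is the same.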

This level of quantization has a reputation for offering a good trade-off: a mild accuracy decline in exchange for a significant speedup. The model compression is about eight-fold, from 32-bit floats down to 4-bit integers. The 16 distinct weight values retain enough information for reasonable accuracy compared to the full-precision model. The 4-bit weights also fit cleanly into 8-bit bytes or 32-bit integers, keeping the packing and unpacking code simple and efficient, as sketched below.
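
To make the byte-packing point concrete, here is a small C++ sketch that stores two signed 4-bit weights per byte and dequantizes them on the way out. The helper names pack_int4 and unpack_int4 are hypothetical; real engines typically pack nibbles into 32-bit words and fuse the unpacking with the matrix multiply, but the nibble arithmetic is the same idea.

// Minimal sketch: packing two signed 4-bit weights into each 8-bit byte,
// and unpacking plus dequantizing them back to floats.
#include <cstdint>
#include <vector>

// Pack: the low nibble holds the even-indexed weight, the high nibble the odd-indexed one.
std::vector<uint8_t> pack_int4(const std::vector<int8_t>& values) {
    std::vector<uint8_t> packed((values.size() + 1) / 2, 0);
    for (size_t i = 0; i < values.size(); ++i) {
        uint8_t nibble = static_cast<uint8_t>(values[i]) & 0x0F;  // two's complement nibble
        if (i % 2 == 0)
            packed[i / 2] |= nibble;         // low nibble
        else
            packed[i / 2] |= (nibble << 4);  // high nibble
    }
    return packed;
}

// Unpack one weight: extract the nibble, sign-extend it, and apply the scale.
float unpack_int4(const std::vector<uint8_t>& packed, size_t i, float scale) {
    uint8_t byte = packed[i / 2];
    int v = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    if (v >= 8) v -= 16;  // sign-extend 4-bit two's complement: [8,15] -> [-8,-1]
    return static_cast<float>(v) * scale;
}

Each FP32 weight takes 4 bytes, while each packed INT4 weight takes half a byte, which is where the eight-fold memory reduction comes from.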

Research papers on 4-bit quantization:

  1. Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry, May 2019, Post-training 4-bit quantization of convolution networks for rapid-deployment, NeurIPS 2019, https://arxiv.org/abs/1810.05723, https://proceedings.neurips.cc/paper_files/paper/2019/file/c0a62e133894cdce435bcb4a5df1db2d-Paper.pdf
  2. Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks, In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  3. Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
  4. Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir V. Arlazarov, Oct 2020, Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices, 2020 25th International Conference on Pattern Recognition (ICPR), https://arxiv.org/abs/2009.06488, https://ieeexplore.ieee.org/document/9412841
  5. Xiao Sun, Naigang Wang, Chia-yu Chen, Jia-min Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, 2020, Ultra-low precision 4-bit training of deep neural networks, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8cb45877d577-Paper.pdf
  6. Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg Rybakov. 2022, 4-bit conformer with native quantization aware training for speech recognition, In Hanseok Ko and John H. L. Hansen, editors, Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 1711–1715. ISCA, 2022. https://arxiv.org/abs/2203.15952
  7. Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. 2023, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization, CoRR, abs/2305.14152, https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
  8. HuggingFace, May 2023, Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA, HuggingFace Blog, https://huggingface.co/blog/4bit-transformers-bitsandbytes
  9. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  10. NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activations.)
  11. Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, 2022, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
  12. A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, 2020, GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference, in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
  13. Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  14. N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  15. B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  16. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
  17. J Liu, R Gong, X Wei, Z Dong, J Cai, B Zhuang, Oct 2023, QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models, arXiv preprint arXiv:2310.08041, https://arxiv.org/pdf/2310.08041.pdf (PTQ with 4-bit quantization of Llama models.)

See more papers on 4-bit quantization (INT4) at: https://www.aussieai.com/research/quantization#int4

