3-Bit Quantization (INT3)
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
3-bit quantization is uncommon and unpopular, and it's not entirely clear why. With 2^3=8 distinct weight values, it is more accurate than 2-bit quantization, and it saves 25% storage compared to its more popular 4-bit cousin while being only slightly less accurate. Maybe packing and unpacking 3-bit values into 8-bit or 32-bit words just seems too inelegant for programmers to code? But no, even 5-bit quantization gets recommended by AI experts on forums, whereas if you listen for supporters of the 3-bit version, all you hear are crickets.
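To show what that packing awkwardness looks like, here is a minimal C++ sketch of one possible scheme (not taken from any particular library; the function names and layout are illustrative assumptions only): ten 3-bit weight codes packed into a 32-bit word, with 2 bits of padding left over per word.

    // Minimal illustrative sketch (not production code): packing ten 3-bit
    // quantized weight codes into one 32-bit word, wasting 2 bits per word.
    #include <cstdint>
    #include <cstdio>

    // Pack 10 values, each in the range 0..7, into a single 32-bit word.
    uint32_t pack10x3(const uint8_t vals[10]) {
        uint32_t word = 0;
        for (int i = 0; i < 10; ++i) {
            word |= (uint32_t)(vals[i] & 0x7u) << (3 * i);
        }
        return word;  // bits 30-31 are unused padding
    }

    // Unpack the i-th 3-bit code (i in 0..9) from a packed word.
    uint8_t unpack3(uint32_t word, int i) {
        return (uint8_t)((word >> (3 * i)) & 0x7u);
    }

    int main() {
        uint8_t codes[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 3, 5 };  // 3-bit codes
        uint32_t packed = pack10x3(codes);
        for (int i = 0; i < 10; ++i) {
            printf("%d ", (int)unpack3(packed, i));
        }
        printf("\n");  // prints: 0 1 2 3 4 5 6 7 3 5
        return 0;
    }

A real INT3 scheme would also store a scale factor (and possibly a zero point), per tensor or per group of weights, to map each 3-bit code back to an approximate floating-point weight at inference time.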
Even the research papers on 3-bit quantization don't like to admit it, and you'll struggle to find “3-bit quantization” in a paper title. Here are some papers on 3-bit quantization (as if you care):
- Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee, 2023, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
- Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks, In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
- Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
- A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, 2020, GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference, in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
- N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
- B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examines 3-bit to 7-bit weights for quantization.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2, 3 and 4 bits for weights, and mixed-precision quantization.)
See more papers on 3-bit quantization (INT3) at: https://www.aussieai.com/research/quantization#int3