Aussie AI
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
8-Bit Integer Quantization (INT8)
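As background for the papers listed below, the basic idea of INT8 post-training quantization is to map each FP32 weight onto an 8-bit integer via a scale factor, and to map it back (approximately) whenever the weight is used. The following C++ sketch shows a symmetric per-tensor scheme with round-to-nearest and clamping; it is a minimal illustration under those assumptions, not the method of any particular paper below, and the function names are invented for the example.

// Minimal sketch of symmetric per-tensor INT8 quantization (illustrative only).
// The scale maps the largest-magnitude FP32 weight onto the INT8 range [-127, 127].
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>
#include <cstdio>

// Compute a per-tensor scale from the maximum absolute weight value.
float compute_scale(const std::vector<float>& weights) {
    float maxabs = 0.0f;
    for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
    return (maxabs == 0.0f) ? 1.0f : maxabs / 127.0f;
}

// Quantize: round each FP32 weight to the nearest INT8 step and clamp.
std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float scale) {
    std::vector<int8_t> q(weights.size());
    for (size_t i = 0; i < weights.size(); ++i) {
        int v = static_cast<int>(std::lround(weights[i] / scale));
        v = std::min(127, std::max(-127, v));  // clamp to the INT8 range
        q[i] = static_cast<int8_t>(v);
    }
    return q;
}

// Dequantize: recover an approximation of the original FP32 weight.
float dequantize_int8(int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}

int main() {
    std::vector<float> w = { 0.12f, -0.5f, 0.33f, -0.07f };
    float scale = compute_scale(w);
    std::vector<int8_t> qw = quantize_int8(w, scale);
    for (size_t i = 0; i < w.size(); ++i)
        std::printf("%f -> %d -> %f\n", w[i], qw[i], dequantize_int8(qw[i], scale));
    return 0;
}

Per-channel scales, asymmetric zero-points, and quantization-aware training are common refinements explored in the papers that follow.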
Research papers on 8-bit quantization:
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, Nov 2022, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, https://arxiv.org/abs/2208.07339
- A. Rock, O. Khalil, O. Shai, and P. Grouchy (Untether AI), 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
- E. Kloberdanz, W. Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2 bits to 8 bits.)
- O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, 2019, Q8BERT: Quantized 8Bit BERT, in Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), 2019, pp. 36–39. https://arxiv.org/abs/1910.06188
- B. Li, S. Lu, K. Xie, and Z. Wang, 2022, Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016 (8-bit and 16-bit integer quantization.)
- Gunho Park, Baeseong Park, Sungjae Lee, Minsub Kim, Byeongwook Kim, Se Jung Kwon, Youngjoo Lee, Dongsoo Lee, 2022, nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models, CoRR, abs/2206.09557, https://arxiv.org/abs/2206.09557v2
- Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, 2020, Towards fully 8-bit integer inference for the transformer model, 29th International Joint Conference on Artificial Intelligence and 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), https://arxiv.org/abs/2009.08034 (Quantizes not just weights, but also non-linear functions such as Softmax.)
- Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan, 2020, Towards Unified INT8 Training for Convolutional Neural Network, https://arxiv.org/abs/1912.12607, https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_Towards_Unified_INT8_Training_for_Convolutional_Neural_Network_CVPR_2020_paper.pdf
- Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization, https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
- Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee, Apr 2023, LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models, https://arxiv.org/abs/2206.09557
- V. Vanhoucke, A. Senior, and M. Z. Mao, 2011, Improving the speed of neural networks on CPUs, Proc. Deep Learn. Unsupervised Feature Learn. Workshop, pp. 1-8, https://research.google/pubs/pub37631/, PDF: https://research.google/pubs/pub37631.pdf (INT8 quantization.)
- M. A. Nasution, D. Chahyati and M. I. Fanany, 2017, Faster R-CNN with structured sparsity learning and Ristretto for mobile environment, Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, 2022, GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) https://papers.nips.cc/paper_files/paper/2022/hash/c3ba4962c05c49636d4c6206a97e9c8a-Abstract-Conference.html
- A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, 2021, Accelerating sparse deep neural networks, arXiv preprint arXiv:2104.08378, 2021. https://arxiv.org/abs/2104.08378
- Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, 2022, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
- N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- M. Giacobbe, T. A. Henzinger, M. Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Pierre-Emmanuel Novac, March 2023, MicroAI: Embedded Artificial Intelligence for Human Activity Recognition on Smart Glasses, Ph.D. Thesis, Artificial Intelligence. Université Côte d’Azur, https://theses.hal.science/tel-04049008/document (Uses INT8 and INT16 quantization.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 8-bit weights, along with 2-4 bits.)
See more papers on 8-bit quantization (INT8) at: https://www.aussieai.com/research/quantization#int8
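Several of the papers above (e.g., LLM.int8() and LUT-GEMM) focus on performing the matrix multiplications themselves in 8-bit integer arithmetic. The sketch below shows the core arithmetic pattern in scalar C++: INT8 weights and activations are multiplied, accumulated in a 32-bit integer, and rescaled to floating point at the end. This is a simplified illustration only, not code from any of the papers; production kernels add vectorized or tensor-core instructions, per-channel scales, and zero-point corrections.

// Illustrative INT8 dot product with 32-bit accumulation (not from any specific paper).
// Multiply INT8 weights and activations, accumulate in INT32, then rescale to float.
#include <cstdint>
#include <cstdio>
#include <vector>

float int8_dot_product(const std::vector<int8_t>& w, const std::vector<int8_t>& x,
                       float weight_scale, float activation_scale) {
    int32_t acc = 0;  // 32-bit accumulator avoids overflow of summed INT8 products
    for (size_t i = 0; i < w.size(); ++i) {
        acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
    }
    // Rescale the integer accumulator back to floating point.
    return static_cast<float>(acc) * weight_scale * activation_scale;
}

int main() {
    std::vector<int8_t> w = { 12, -50, 33, -7 };
    std::vector<int8_t> x = { 100, 25, -64, 3 };
    float y = int8_dot_product(w, x, 0.01f, 0.02f);
    std::printf("dot = %f\n", y);
    return 0;
}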