Aussie AI
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
8-Bit Integer Quantization (INT8)
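As background for the papers listed below, the basic idea of INT8 post-training quantization is to map each FP32 weight onto an 8-bit integer via a scale factor, and to map it back (approximately) whenever the weight is used. The following C++ sketch shows a symmetric per-tensor scheme with round-to-nearest and clamping; it is a minimal illustration under those assumptions, not the method of any particular paper below, and the function names are invented for the example.

// Minimal sketch of symmetric per-tensor INT8 quantization (illustrative only).
// The scale maps the largest-magnitude FP32 weight onto the INT8 range [-127, 127].
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>
#include <cstdio>

// Compute a per-tensor scale from the maximum absolute weight value.
float compute_scale(const std::vector<float>& weights) {
    float maxabs = 0.0f;
    for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
    return (maxabs == 0.0f) ? 1.0f : maxabs / 127.0f;
}

// Quantize: round each FP32 weight to the nearest INT8 step and clamp.
std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float scale) {
    std::vector<int8_t> q(weights.size());
    for (size_t i = 0; i < weights.size(); ++i) {
        int v = static_cast<int>(std::lround(weights[i] / scale));
        v = std::min(127, std::max(-127, v));  // clamp to the INT8 range
        q[i] = static_cast<int8_t>(v);
    }
    return q;
}

// Dequantize: recover an approximation of the original FP32 weight.
float dequantize_int8(int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}

int main() {
    std::vector<float> w = { 0.12f, -0.5f, 0.33f, -0.07f };
    float scale = compute_scale(w);
    std::vector<int8_t> qw = quantize_int8(w, scale);
    for (size_t i = 0; i < w.size(); ++i)
        std::printf("%f -> %d -> %f\n", w[i], qw[i], dequantize_int8(qw[i], scale));
    return 0;
}

Per-channel scales, asymmetric zero-points, and quantization-aware training are common refinements explored in the papers that follow.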
Research papers on 8-bit quantization:
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, Nov 2022, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, https://arxiv.org/abs/2208.07339
- A. Rock, O. Khalil, O. Shai, and P. Grouchy (Untether AI), 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
- E. Kloberdanz, W. Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2 bits to 8 bits.)
- O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, 2019, Q8BERT: Quantized 8Bit BERT, in Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), 2019, pp. 36–39. https://arxiv.org/abs/1910.06188
- B. Li, S. Lu, K. Xie, and Z. Wang, 2022, Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016 (8-bit and 16-bit integer quantization.)
- Gunho Park, Baeseong Park, Sungjae Lee, Minsub Kim, Byeongwook Kim, Se Jung Kwon, Youngjoo Lee, Dongsoo Lee, 2022, nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models, CoRR, abs/2206.09557, https://arxiv.org/abs/2206.09557v2
- Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, 2020, Towards fully 8-bit integer inference for the transformer model, 29th International Joint Conference on Artificial Intelligence and 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), https://arxiv.org/abs/2009.08034 (Quantizes not just weights, but also non-linear functions such as Softmax.)
- Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan, 2020, Towards Unified INT8 Training for Convolutional Neural Network, https://arxiv.org/abs/1912.12607, https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_Towards_Unified_INT8_Training_for_Convolutional_Neural_Network_CVPR_2020_paper.pdf
- Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization, https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
- Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee, Apr 2023, LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models, https://arxiv.org/abs/2206.09557
- V. Vanhoucke, A. Senior, and M. Z. Mao, 2011, Improving the speed of neural networks on CPUs, Proc. Deep Learn. Unsupervised Feature Learn. Workshop, pp. 1-8, https://research.google/pubs/pub37631/, PDF: https://research.google/pubs/pub37631.pdf (INT8 quantization.)
- M. A. Nasution, D. Chahyati and M. I. Fanany, 2017, Faster R-CNN with structured sparsity learning and Ristretto for mobile environment, Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, 2022, GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) https://papers.nips.cc/paper_files/paper/2022/hash/c3ba4962c05c49636d4c6206a97e9c8a-Abstract-Conference.html
- A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, 2021, Accelerating sparse deep neural networks, arXiv preprint arXiv:2104.08378, 2021. https://arxiv.org/abs/2104.08378
- Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, 2022, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
- N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- M. Giacobbe, T. A. Henzinger, M. Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Pierre-Emmanuel Novac, March 2023, MicroAI: Embedded Artificial Intelligence for Human Activity Recognition on Smart Glasses, Ph.D. Thesis, Artificial Intelligence. Université Côte d’Azur, https://theses.hal.science/tel-04049008/document (Uses INT8 and INT16 quantization.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 8-bit weights, along with 2-4 bits.)
See more papers on 8-bit quantization (INT8) at: https://www.aussieai.com/research/quantization#int8
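Several of the papers above (e.g., LLM.int8() and LUT-GEMM) focus on performing the matrix multiplications themselves in 8-bit integer arithmetic. The sketch below shows the core arithmetic pattern in scalar C++: INT8 weights and activations are multiplied, accumulated in a 32-bit integer, and rescaled to floating point at the end. This is a simplified illustration only, not code from any of the papers; production kernels add vectorized or tensor-core instructions, per-channel scales, and zero-point corrections.

// Illustrative INT8 dot product with 32-bit accumulation (not from any specific paper).
// Multiply INT8 weights and activations, accumulate in INT32, then rescale to float.
#include <cstdint>
#include <cstdio>
#include <vector>

float int8_dot_product(const std::vector<int8_t>& w, const std::vector<int8_t>& x,
                       float weight_scale, float activation_scale) {
    int32_t acc = 0;  // 32-bit accumulator avoids overflow of summed INT8 products
    for (size_t i = 0; i < w.size(); ++i) {
        acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
    }
    // Rescale the integer accumulator back to floating point.
    return static_cast<float>(acc) * weight_scale * activation_scale;
}

int main() {
    std::vector<int8_t> w = { 12, -50, 33, -7 };
    std::vector<int8_t> x = { 100, 25, -64, 3 };
    float y = int8_dot_product(w, x, 0.01f, 0.02f);
    std::printf("dot = %f\n", y);
    return 0;
}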