Aussie AI

Mixed-Precision Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Mixed-precision quantization is where two or more different data types are used in different parts of the model. For example, FP16 quantization might be used in some components and INT8 quantization in others. Different precisions might be used across different parts of the Transformer (e.g. attention heads versus FFNs), or alternatively, different sizes for the weights versus the activations. However, if the main computations are still done in FP32 (i.e. the normal single-precision size), and the weights are merely stored in a smaller quantized data type (e.g. FP16 or integers), this isn't usually considered to be mixed-precision quantization (it's sometimes called “fake quantization”).
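
As a concrete illustration, the sketch below shows one way per-layer mixed precision might be wired up in C++: INT8 weights and activations with 32-bit integer accumulation for one set of layers, and FP16 weights for another, rather than doing all of the arithmetic in FP32. This is a minimal sketch only; the per-tensor symmetric scaling scheme, the function names, and the use of the _Float16 compiler extension (GCC/Clang) are assumptions for illustration, not code from a production engine.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstddef>
#include <vector>

// INT8 tensor with one per-tensor scale factor: real value ~= q[i] * scale.
struct Int8Tensor {
    std::vector<int8_t> q;
    float scale;
};

// Per-tensor symmetric quantization of FP32 values down to INT8.
Int8Tensor quantize_int8(const std::vector<float>& x) {
    float maxabs = 1e-8f;  // avoid a zero scale for an all-zero vector
    for (float v : x) maxabs = std::max(maxabs, std::fabs(v));
    Int8Tensor t;
    t.scale = maxabs / 127.0f;
    t.q.reserve(x.size());
    for (float v : x) t.q.push_back(static_cast<int8_t>(std::lround(v / t.scale)));
    return t;
}

// INT8 dot product: integer multiply-accumulate, one dequantization at the end.
float dot_int8(const Int8Tensor& w, const Int8Tensor& x) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < w.q.size(); ++i)
        acc += static_cast<int32_t>(w.q[i]) * static_cast<int32_t>(x.q[i]);
    return static_cast<float>(acc) * w.scale * x.scale;
}

// FP16 dot product: half-precision storage, FP32 accumulation for accuracy.
float dot_fp16(const std::vector<_Float16>& w, const std::vector<_Float16>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i)
        acc += static_cast<float>(w[i]) * static_cast<float>(x[i]);
    return acc;
}

In practice, the bit width chosen per layer or per tensor is usually decided by a sensitivity analysis (e.g. the Hessian-based approach of HAWQ, cited below) or an automated search, rather than being fixed by layer type.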

Research papers on mixed-precision quantization:

  1. Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization, https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
  2. Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov, 2018, Mixed Precision Training of Convolutional Neural Networks using Integer Operations, https://arxiv.org/pdf/1802.00930
  3. M. A. Nasution, D. Chahyati and M. I. Fanany, 2017, Faster R-CNN with structured sparsity learning and Ristretto for mobile environment, Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
  4. Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer, 2019, HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293–302, https://arxiv.org/abs/1905.03696
  5. Yiren Zhou, Seyed-Mohsen Moosavi-Dezfooli, Ngai-Man Cheung, and Pascal Frossard. 2017, Adaptive quantization for deep neural network, arXiv preprint arXiv:1712.01048, 2017, https://arxiv.org/abs/1712.01048 (Layerwise different bitwidth quantization.)
  6. Sijie Zhao, Tao Yue, and Xuemei Hu, 2021, Distribution-Aware Adaptive Multi-Bit Quantization, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9281–9290, 2021, https://ieeexplore.ieee.org/document/9577892, PDF: https://openaccess.thecvf.com/content/CVPR2021/papers/Zhao_Distribution-Aware_Adaptive_Multi-Bit_Quantization_CVPR_2021_paper.pdf
  7. Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer, June 2021, HAWQV3: Dyadic Neural Network Quantization, arXiv preprint arXiv:2011.10680, https://arxiv.org/abs/2011.10680
  8. Huanrui Yang, Lin Duan, Yiran Chen, Hai Li, Feb 2021, BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization, arXiv preprint arXiv:2102.10462, https://arxiv.org/abs/2102.10462
  9. Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han, 2019, HAQ: Hardware-Aware Automated Quantization with Mixed Precision, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://arxiv.org/abs/1811.08886
  10. Zhongnan Qu, Zimu Zhou, Yun Cheng, and Lothar Thiele, June 2020, Adaptive loss-aware quantization for multi-bit networks, In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://arxiv.org/abs/1912.08883
  11. Hai Victor Habi, Roy H. Jennings, Arnon Netzer, July 2020, HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs, arXiv preprint arXiv:2007.09952, https://arxiv.org/abs/2007.09952
  12. Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  13. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
  14. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of 2-8 bits, and mixed precision quantization.)
  15. A Chauhan, U Tiwari, 2023, Post Training Mixed Precision Quantization of Neural Networks Using First-Order Information, Proceedings of the IEEE/CVF International Conference, https://openaccess.thecvf.com/content/ICCV2023W/RCV/papers/Chauhan_Post_Training_Mixed_Precision_Quantization_of_Neural_Networks_Using_First-Order_ICCVW_2023_paper.pdf
  16. Y Shang, Z Yuan, Q Wu, Z Dong, Sep 2023, PB-LLM: Partially Binarized Large Language Models, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf, Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)

More research papers on mixed-precision quantization: https://www.aussieai.com/research/quantization#mixed.

 
