Aussie AI

44. Advanced Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“When you change the way you look at things,
the things you look at change.”

— Dr. Wayne Dyer.

 

 

Binary Quantization

The extreme of quantization is to encode floating-point weights down to 1 bit. This is binary quantization (or “binarization”), where there are only two possible weight values: either 0 and 1, or alternatively -1 and +1. This compresses the model by a factor of 32 in terms of space, and reduces the inference computations to simpler arithmetic.

The downside of binary quantization is the loss of accuracy. Hence, binary networks haven't really caught on in widespread industry usage. However, there is a continual stream of research papers attempting to improve them.

The attraction of binary quantization is that its runtime efficiency is hard to beat. Binary quantization's use of minimal weights changes multiplication by a floating-point weight into a simple addition (for 1) or a null test (for 0). Alternatively, for binary weights of -1 and +1, the -1 becomes a subtraction and the +1 an addition, which is usually further optimized with a sign-bit trick.
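
To make this concrete, here is a minimal C++ sketch of a multiplication-free dot product with binarized weights. The function name and storage layout are hypothetical: each +1/-1 weight is assumed to be packed as a single bit (1 meaning +1, 0 meaning -1) into 32-bit words.

    #include <cstdint>
    #include <vector>

    // Multiplication-free dot product: each weight is one bit (1 => +1, 0 => -1),
    // so each "multiply" becomes either an addition or a subtraction.
    // Assumes packed_weights holds at least ceil(activations.size() / 32) words.
    float binary_dot_product(const std::vector<float>& activations,
                             const std::vector<uint32_t>& packed_weights)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < activations.size(); ++i) {
            uint32_t word = packed_weights[i / 32];        // 32 weights per word
            bool plus_one = (word >> (i % 32)) & 1u;       // extract the weight bit
            sum += plus_one ? activations[i] : -activations[i];
        }
        return sum;
    }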

Binary quantization is not the only way to use single bits for AI models. There are also other incarnations of binary neural network architectures that use only bitwise operations, such as XNOR networks and Weightless Neural Networks (WNNs).

Research papers on binary quantization:

  1. H. Yang, M. Fritzsche, C. Bartz, and C. Meinel (2017), Bmxnet: An open-source binary neural network implementation based on mxnet, CoRR, vol. abs/1705.09864, 2017, https://arxiv.org/abs/1705.09864, Code: https://github.com/hpi-xnor
  2. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi (2016), Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, Springer, 525–542, https://arxiv.org/abs/1603.05279
  3. B. McDanel, S. Teerapittayanon, and H. Kung (2017), Embedded binarized neural networks, arXiv preprint arXiv:1709.02260, 2017, https://arxiv.org/abs/1709.02260
  4. Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio (2016), Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 Feb 2016, https://arxiv.org/abs/1602.02830
  5. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio (2016), Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 4114–4122, https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
  6. Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio (2016), Neural Networks with Few Multiplications, Feb 2016, https://arxiv.org/abs/1510.03009v1
  7. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi (2016), Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016, https://arxiv.org/abs/1603.05279
  8. Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David (2015), Binaryconnect: Training deep neural networks with binary weights during propagations, In NeurIPS, pages 3123–3131, 2015, https://arxiv.org/abs/1511.00363
  9. Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos (2017). Deep learning with low precision by half-wave gaussian quantization. In CVPR, pages 5918–5926, 2017, https://arxiv.org/abs/1702.00953
  10. Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, and Hong Zhou (2022). Bivit: Extremely compressed binary vision transformer. arXiv preprint arXiv:2211.07091, 2022. https://arxiv.org/abs/2211.07091 (Softmax-aware binarization)
  11. Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, and Yashar Mehdad (2022). Bit: Robustly binarized multi-distilled transformer. arXiv preprint arXiv:2205.13016, 2022. https://arxiv.org/abs/2205.13016, Code: https://github.com/facebookresearch/bit
  12. Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides (2017). Local binary convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 19–28, 2017. https://arxiv.org/abs/1608.06049
  13. Zechun Liu, Zhiqiang Shen, Marios Savvides, and KwangTing Cheng (2020). Reactnet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, pages 143–159. Springer, 2020. https://arxiv.org/abs/2003.03488
  14. Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder (2019). Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533–7544. 2019. https://arxiv.org/abs/1906.02107, Code: https://github.com/plumerai/rethinking-bnn-optimization
  15. Lin, X.; Zhao, C.; and Pan, W. (2017). Towards Accurate Binary Convolutional Neural Network. Advances in Neural Information Processing Systems, 30, https://arxiv.org/abs/1711.11294
  16. Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat (2023), Binarized Neural Machine Translation, Feb 2023, https://arxiv.org/abs/2302.04907
  17. Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe (2017), BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS, Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
  18. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou (2016), DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
  19. R. Andri, L. Cavigelli, D. Rossi and L. Benini (2016), YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights, Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), pp. 236-241, Jul. 2016. https://arxiv.org/abs/1606.05487v1
  20. Z. Cai, X. He, J. Sun and N. Vasconcelos (2017), Deep learning with low precision by half-wave Gaussian quantization, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
  21. R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu (2018). Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  22. Taylor Simons and Dah-Jye Lee (2019), A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
  23. Xiaofan Lin, Cong Zhao, and Wei Pan (2017). Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Uses multiple single-bit weights combined to create a multi-binary quantization method.)
  24. Y Shang, Z Yuan, Q Wu, Z Dong (2023), PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf, Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)

See more papers on binary quantization at: https://www.aussieai.com/research/quantization#binary

Ternary Quantization

Ternary quantization (or “ternarization”) uses only 3 weight values: -1, 0, and +1. This requires 2 bits to represent each weight in the model, so why wouldn't you just use 4 weight values? The answer is that ternary quantization can use zero-multiplication arithmetic in the inference algorithm, with an addition (for +1), a subtraction (for -1), and a null test (for 0).
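
Here is a minimal C++ sketch of that zero-multiplication arithmetic. The encoding is a hypothetical one: a single 2-bit code stored per byte for clarity (0 => 0, 1 => +1, 2 => -1).

    #include <cstdint>
    #include <vector>

    // Ternary dot product: addition for +1, subtraction for -1, skip for 0.
    float ternary_dot_product(const std::vector<float>& activations,
                              const std::vector<uint8_t>& codes)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < activations.size(); ++i) {
            switch (codes[i]) {
                case 1:  sum += activations[i]; break;   // weight +1: addition
                case 2:  sum -= activations[i]; break;   // weight -1: subtraction
                default: break;                          // weight 0: null test, no work
            }
        }
        return sum;
    }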

However, like binary quantization, ternary quantization still suffers from accuracy degradation. It is highly efficient in terms of space and time, but the model loses some capabilities. Nevertheless, there are many research papers attempting to improve this.

Research papers on ternary quantization:

  1. N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey, 2017, Ternary neural networks with fine-grained quantization, CoRR, vol. abs/1705.01462, 2017, https://arxiv.org/abs/1705.01462
  2. Fengfu Li, Bo Zhang, and Bin Liu, 2016, Ternary weight networks, arXiv preprint arXiv:1605.04711 (2016), https://arxiv.org/abs/1605.04711
  3. Zhu, C.; Han, S.; Mao, H.; and Dally, W. J., 2016, Trained ternary quantization, arXiv preprint arXiv:1612.01064, https://arxiv.org/abs/1612.01064
  4. D Liu, X Liu, 2023, Ternary Quantization: A Survey, arXiv preprint arXiv:2303.01505, 2023, https://arxiv.org/abs/2303.01505
  5. E Yvinec, A Dapogny, K Bailly, 2023, Designing strong baselines for ternary neural network quantization through support and mass equalization, arXiv preprint arXiv:2306.17442, 2023, https://arxiv.org/abs/2306.17442
  6. Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, Junchi Yan, Nov 2022, Ternary Weight Networks, https://arxiv.org/abs/1605.04711, Code: https://github.com/Thinklab-SJTU/twns
  7. M Kim, S Lee, J Lee, S Hong, DS Chang, 2023, Token-Scaled Logit Distillation for Ternary Weight Generative Language Models, 2023, https://arxiv.org/abs/2308.06744
  8. Dan Liu, Xi Chen, Chen Ma, Xue Liu, Dec 2022, Hyperspherical Quantization: Toward Smaller and More Accurate Models, https://arxiv.org/abs/2212.12653
  9. Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe, 2017, BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS, Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
  10. S. K. Esser et al., 2016, Convolutional networks for fast energy-efficient neuromorphic computing, Proc. Nat. Acad. Sci. USA, vol. 113, no. 41, pp. 11441-11446, 2016. https://arxiv.org/abs/1603.08270 (Ternary weights, binary activations.)

See more papers on ternary quantization at: https://www.aussieai.com/research/quantization#ternary

2-Bit Quantization (INT2)

This section refers to non-ternary 2-bit quantization, using 4 distinct weight values. In practice, 2-bit quantization is regarded as still having some problems with model accuracy, whereas 4-bit integer quantization is considered a more reasonable speed-versus-accuracy tradeoff. On the other hand, maybe this reputation is unwarranted: Liu et al. (2022) tested numerous models at 2 bits, 3 bits, and 4 bits (see Table 1 in their paper), and the extra accuracy of 4 bits over 2 bits was usually only a couple of percentage points (for double the space).
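
As a sketch of how INT2 storage can be handled, the hypothetical C++ helpers below pack sixteen 2-bit weight codes into one 32-bit word and unpack one code back out. The mapping from the four codes (0..3) to actual weight values is left to whatever quantization scheme is in use.

    #include <cstdint>

    // Pack sixteen 2-bit codes (each in 0..3) into a single 32-bit word.
    uint32_t pack_int2(const uint8_t codes[16])
    {
        uint32_t word = 0;
        for (int i = 0; i < 16; ++i)
            word |= static_cast<uint32_t>(codes[i] & 0x3u) << (2 * i);
        return word;
    }

    // Unpack the i-th 2-bit code (i in 0..15) from a packed word.
    uint8_t unpack_int2(uint32_t word, int i)
    {
        return static_cast<uint8_t>((word >> (2 * i)) & 0x3u);
    }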

Research papers on 2-bit quantization:

  1. Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, July 2018, Bridging the Accuracy Gap for 2-bit Quantized Neural Networks (QNN), https://arxiv.org/abs/1807.06964
  2. Jungwook Choi, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Kailash Gopalakrishnan, Zhuo Wang, Pierce Chuang, 2019, Accurate and Efficient 2-bit Quantized Neural Networks, Proceedings of Machine Learning and Systems 1 (MLSys 2019), https://proceedings.mlsys.org/paper/2019/file/006f52e9102a8d3be2fe5614f42ba989-Paper.pdf
  3. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, 2016, DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
  4. Z. Cai, X. He, J. Sun and N. Vasconcelos, 2017, Deep learning with low precision by half-wave Gaussian quantization, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
  5. Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks, In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  6. Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models with 2-bit weights and 2-bit activations, and also 3-bits and 4-bits.)
  7. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  8. Xiaofan Lin, Cong Zhao, and Wei Pan. 2017, Towards accurate binary convolutional neural network, Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Unique 2-bit quantization approach is really a double-binarized quantization method.)
  9. NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  10. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
  11. Yuji Chai, John Gkountouras, Glenn G. Ko, David Brooks, Gu-Yeon Wei, June 2023, INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation, arXiv preprint arXiv:2306.08162, https://arxiv.org/abs/2306.08162

See more papers on 2-bit quantization (INT2) at: https://www.aussieai.com/research/quantization#int2

3-Bit Quantization (INT3)

3-bit quantization is uncommon and unpopular, and it's not entirely clear why. It has better accuracy than 2 bits, and it saves 25% storage compared to its more popular 4-bit cousin while being only slightly less accurate, since it still allows 2^3=8 distinct weight values. Maybe cramming 3-bit values into 8-bit or 32-bit words for packing and unpacking just seems too inelegant for programmers to code? But, no, even 5-bit quantization gets recommended by AI experts on forums, whereas if you listen for supporters of the 3-bit version, all you hear are crickets.
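
For the record, here is a hypothetical C++ sketch of what that cramming looks like: ten 3-bit codes packed into a 32-bit word, with the top two bits wasted.

    #include <cstdint>

    // Pack ten 3-bit codes (each in 0..7) into one 32-bit word; 2 bits are wasted.
    uint32_t pack_int3(const uint8_t codes[10])
    {
        uint32_t word = 0;
        for (int i = 0; i < 10; ++i)
            word |= static_cast<uint32_t>(codes[i] & 0x7u) << (3 * i);
        return word;
    }

    // Unpack the i-th 3-bit code (i in 0..9) from a packed word.
    uint8_t unpack_int3(uint32_t word, int i)
    {
        return static_cast<uint8_t>((word >> (3 * i)) & 0x7u);
    }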

Even the research papers on 3-bit quantization don't like to admit to it, and you'll struggle to find “3-bit quantization” in a paper title.

Research papers on 3-bit quantization (as if you care):

  1. Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee, 2023, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization, CoRR, abs/2305.14152, https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
  2. Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks, In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  3. Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
  4. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  5. NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  6. A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, 2020, GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference, in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
  7. N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  8. B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  9. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2, 3 and 4 bits for weights, and mixed-precision quantization.)

See more papers on 3-bit quantization (INT3) at: https://www.aussieai.com/research/quantization#int3

4-Bit Quantization (INT4)

Quantization with four bits, or INT4 quantization, uses 4 bits for weights, and thus allows 2^4=16 distinct weights. In terms of industry models, 4-bit quantization is one of the most popular quantization regimes in practical usage. It is far more common to see a 4-bit quantization of an open source model than binary, 2-bits, or 3-bits. INT4 allows eight-fold storage compression of 32-bits down to 4-bits, which reduces memory requirements, and can also speed up inference by reducing memory-cache transfer overheads in both CPU and GPU.

This level of quantization has a reputation for offering a good trade-off: a mild accuracy decline in exchange for a significant speedup. The model compression is about eight-fold, from 32-bit floats down to 4-bit integers. The 16 distinct weight values retain enough information for reasonable accuracy compared to the full-precision model. The 4-bit weights also fit cleanly into 8-bit bytes or 32-bit integers, making the unpacking code simple and efficient.
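
As an example of how cleanly 4-bit weights map onto bytes, here is a minimal C++ sketch of unpacking and dequantizing one nibble. The symmetric scheme assumed here (code re-centered around zero, then scaled) is only one of several INT4 layouts in use, and the function name is hypothetical.

    #include <cstdint>

    // Two 4-bit weights are stored per byte; dequantize one nibble back to float.
    // Assumes a symmetric scheme: w ~= scale * (code - 8), with code in 0..15.
    float unpack_int4(uint8_t byte, bool high_nibble, float scale)
    {
        int code = high_nibble ? (byte >> 4) : (byte & 0x0F);
        return scale * static_cast<float>(code - 8);
    }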

Research papers on 4-bit quantization:

  1. Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry, May 2019, Post-training 4-bit quantization of convolution networks for rapid-deployment, NeurIPS 2019, https://arxiv.org/abs/1810.05723, https://proceedings.neurips.cc/paper_files/paper/2019/file/c0a62e133894cdce435bcb4a5df1db2d-Paper.pdf
  2. Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks, In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  3. Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
  4. Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir V. Arlazarov, Oct 2020, Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices, 2020 25th International Conference on Pattern Recognition (ICPR), https://arxiv.org/abs/2009.06488, https://ieeexplore.ieee.org/document/9412841
  5. Xiao Sun, Naigang Wang, Chia-yu Chen, Jia-min Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, 2020, Ultra-low precision 4-bit training of deep neural networks, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8cb45877d577-Paper.pdf
  6. Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg Rybakov. 2022, 4-bit conformer with native quantization aware training for speech recognition, In Hanseok Ko and John H. L. Hansen, editors, Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 1711–1715. ISCA, 2022. https://arxiv.org/abs/2203.15952
  7. Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. 2023, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization, CoRR, abs/2305.14152, https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
  8. HuggingFace, May 2023, Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA, HuggingFace Blog, https://huggingface.co/blog/4bit-transformers-bitsandbytes
  9. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  10. NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  11. Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, 2022, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
  12. A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, 2020, GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference, in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
  13. Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28 092–28 103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  14. N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  15. B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  16. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
  17. J Liu, R Gong, X Wei, Z Dong, J Cai, B Zhuang, Oct 2023, QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models, arXiv preprint arXiv:2310.08041, https://arxiv.org/pdf/2310.08041.pdf (PTQ with 4-bit quantization of Llama models.)

See more papers on 4-bit quantization (INT4) at: https://www.aussieai.com/research/quantization#int4

5-Bit Quantization (INT5)

Research papers on 5-bit quantization:

  1. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  2. NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  3. B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  4. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)

See more papers on 5-bit quantization (INT5) at: https://www.aussieai.com/research/quantization#int5

6-Bit Quantization (INT6)

Research papers on 6-bit quantization:

  1. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  2. Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, 2022, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
  3. Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28 092–28 103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  4. M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
  5. B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  6. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)

See more papers on 6-bit quantization (INT6) at: https://www.aussieai.com/research/quantization#int6

7-Bit Quantization (INT7)

Research papers on 7-bit quantization:

  1. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  2. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
  3. M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
  4. B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)

See more papers on 7-bit quantization (INT7) at: https://www.aussieai.com/research/quantization#int7

8-Bit Integer Quantization (INT8)

Research papers on 8-bit quantization:

  1. Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, 2022, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Nov 2022, https://arxiv.org/abs/2208.07339
  2. A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
  3. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  4. O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, 2019, Q8BERT: Quantized 8Bit BERT, in Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), 2019, pp. 36–39. https://arxiv.org/abs/1910.06188
  5. B. Li, S. Lu, K. Xie, and Z. Wang, 2022, Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016 (8-bit and 16-bit integer quantization.)
  6. Gunho Park, Baeseong Park, Sungjae Lee, Minsub Kim, Byeongwook Kim, Se Jung Kwon, Youngjoo Lee, Dongsoo Lee, 2022, nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models, CoRR, abs/2206.09557, https://arxiv.org/abs/2206.09557v2
  7. Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, 2020, Towards fully 8-bit integer inference for the transformer model, the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI2020), 2020. https://arxiv.org/abs/2009.08034 (Quantizes not just weights, but also non-linear functions such as Softmax.)
  8. Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan, 2020, Towards Unified INT8 Training for Convolutional Neural Network, https://arxiv.org/abs/1912.12607, https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_Towards_Unified_INT8_Training_for_Convolutional_Neural_Network_CVPR_2020_paper.pdf
  9. Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization, https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
  10. Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee, Apr 2023, LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models, https://arxiv.org/abs/2206.09557
  11. V. Vanhoucke, A. Senior and M. Z. Mao, 2011, Improving the speed of neural networks on CPUs, Proc. Deep Learn. Unsupervised Feature Learn. Workshop, pp. 1-8, 2011. https://research.google/pubs/pub37631/, PDF: https://research.google/pubs/pub37631.pdf (INT8 quantization.)
  12. M. A. Nasution, D. Chahyati and M. I. Fanany, 2017, Faster R-CNN with structured sparsity learning and Ristretto for mobile environment, Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
  13. Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, 2022, GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) https://papers.nips.cc/paper_files/paper/2022/hash/c3ba4962c05c49636d4c6206a97e9c8a-Abstract-Conference.html
  14. A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, 2021, Accelerating sparse deep neural networks, arXiv preprint arXiv:2104.08378, 2021. https://arxiv.org/abs/2104.08378
  15. Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, 2022, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
  16. Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28 092–28 103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  17. N. Frumkin, D. Gope, and D. Marculescu, 2022, CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  18. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
  19. M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
  20. Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
  21. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
  22. Pierre-Emmanuel Novac, March 2023, MicroAI: Embedded Artificial Intelligence for Human Activity Recognition on Smart Glasses, Ph.D. Thesis, Artificial Intelligence. Université Côte d’Azur, https://theses.hal.science/tel-04049008/document (Uses INT8 and INT16 quantization.)
  23. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 8-bit weights, along with 2-4 bits.)

See more papers on 8-bit quantization (INT8) at: https://www.aussieai.com/research/quantization#int8

9-Bit Quantization (INT9)

Research papers on 9-bit quantization:

  1. M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
  2. W Jiang, P Liu, F Wen, 2017, An improved vector quantization method using deep neural network, AEU - International Journal of Electronics and Communications, Volume 72, February 2017, Pages 178-183, https://www.sciencedirect.com/science/article/pii/S1434841116313954

See more papers on 9-bit quantization (INT9) at: https://www.aussieai.com/research/quantization#int9

10-Bit Quantization (INT10)

Research papers on 10-bit quantization:

  1. M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network?, TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
  2. J Shi, M Lu, F Chen, S Pu, Z Ma, 2022, Rate-Distortion Optimized Post-Training Quantization for Learned Image Compression, arXiv preprint arXiv:2211.02854, https://arxiv.org/abs/2211.02854
  3. Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
  4. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)

See more papers on 10-bit quantization (INT10) at: https://www.aussieai.com/research/quantization#int10

11-Bit Quantization (INT11)

Research papers on 11-bit quantization:

  1. G Dundar, K Rose, 1995, The effects of quantization on multilayer neural networks, IEEE Transactions on Neural Networks, Volume 6, Issue 6, November 1995, https://ieeexplore.ieee.org/abstract/document/471364
  2. Fang Tang, Denis Guangyin Chen, Bo Wang, Amine Bermak, 2013, Low-Power CMOS Image Sensor Based on Column-Parallel Single-Slope/SAR Quantization Scheme, IEEE Transactions on Electron Devices, Vol. 60, No. 8, August 2013, https://ieeexplore.ieee.org/document/6547236, PDF: https://ss-sensing.com/paper/Low-Power%20CMOS%20Image%20Sensor%20Based%20on%20Column-Parallel%20Single-Slope-SAR%20Quantization%20Scheme.pdf

See more papers on 11-bit quantization (INT11) at: https://www.aussieai.com/research/quantization#int11

12-Bit Quantization (INT12)

Research papers on 12-bit quantization:

  1. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
  2. Xishan Zhang1,2, Shaoli Liu, Rui Zhang, Chang Liu, Di Huang, Shiyi Zhou, Jiaming Guo, Qi Guo, Zidong Du, Tian Zhi, Yunji Chen, 2020, Fixed-Point Back-Propagation Training, CVPR 2020, PDF: https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Fixed-Point_Back-Propagation_Training_CVPR_2020_paper.pdf

See more papers on 12-bit quantization (INT12) at: https://www.aussieai.com/research/quantization#int12

Mixed-Precision Quantization

Mixed-precision quantization is where two different data types are used in different parts of the model. For example, FP16 and INT8 quantization might both be used. Different precisions might be used across different parts of the Transformer (e.g. attention heads versus FFNs), or alternatively, different bit widths for the weights versus the activations. However, if the main computations are still done in FP32 (i.e. the normal single-precision size), while the weights are merely stored in a smaller quantized data type (e.g. FP16 or integers), this isn't usually considered to be mixed-precision quantization (it's sometimes called “fake quantization”).
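
To illustrate the distinction, here is a minimal C++ sketch of “fake quantization” under the assumptions above: the weights are stored as INT8 to save memory, but each one is dequantized on the fly and the arithmetic stays in plain FP32. The function name and the single per-tensor scale factor are hypothetical.

    #include <cstdint>
    #include <vector>

    // "Fake quantization": INT8 storage, but all arithmetic is still FP32.
    float fake_quantized_dot(const std::vector<float>& activations,
                             const std::vector<int8_t>& q_weights,
                             float scale)   // hypothetical per-tensor scale
    {
        float sum = 0.0f;
        for (size_t i = 0; i < activations.size(); ++i) {
            float w = scale * static_cast<float>(q_weights[i]);  // dequantize on the fly
            sum += w * activations[i];                           // still full FP32 math
        }
        return sum;
    }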

Research papers on mixed-precision quantization:

  1. Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization, https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
  2. Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov, 2018, Mixed Precision Training of Convolutional Neural Networks using Integer Operations, https://arxiv.org/pdf/1802.00930
  3. M. A. Nasution, D. Chahyati and M. I. Fanany, 2017, Faster R-CNN with structured sparsity learning and Ristretto for mobile environment, Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
  4. Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer, 2019, HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293–302, https://arxiv.org/abs/1905.03696
  5. Yiren Zhou, Seyed-Mohsen Moosavi-Dezfooli, Ngai-Man Cheung, and Pascal Frossard. 2017, Adaptive quantization for deep neural network, arXiv preprint arXiv:1712.01048, 2017, https://arxiv.org/abs/1712.01048 (Layerwise different bitwidth quantization.)
  6. Sijie Zhao, Tao Yue, and Xuemei Hu. 2021, Distribution aware adaptive multi-bit quantization, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9281–9290, 2021, https://ieeexplore.ieee.org/document/9577892, PDF: https://openaccess.thecvf.com/content/CVPR2021/papers/Zhao_Distribution-Aware_Adaptive_Multi-Bit_Quantization_CVPR_2021_paper.pdf
  7. Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer, June 2021, HAWQV3: Dyadic Neural Network Quantization, arXiv preprint arXiv:2011.10680, https://arxiv.org/abs/2011.10680
  8. Huanrui Yang, Lin Duan, Yiran Chen, Hai Li, Feb 2021, BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization, arXiv preprint arXiv:2102.10462, https://arxiv.org/abs/2102.10462
  9. Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019, HAQ: Hardware-aware automated quantization, In Proceedings of the IEEE conference on computer vision and pattern recognition, https://arxiv.org/abs/1811.08886
  10. Zhongnan Qu, Zimu Zhou, Yun Cheng, and Lothar Thiele, June 2020, Adaptive loss-aware quantization for multi-bit networks, In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://arxiv.org/abs/1912.08883
  11. Hai Victor Habi, Roy H. Jennings, Arnon Netzer, July 2020, HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs, arXiv preprint arXiv:2007.09952, https://arxiv.org/abs/2007.09952
  12. Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, 2021, Post-training quantization for vision transformer, Advances in Neural Information Processing Systems, vol. 34, pp. 28 092–28 103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  13. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction, ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
  14. E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of 2-8 bits, and mixed precision quantization.)
  15. A Chauhan, U Tiwari, 2023, Post Training Mixed Precision Quantization of Neural Networks Using First-Order Information, Proceedings of the IEEE/CVF International Conference, https://openaccess.thecvf.com/content/ICCV2023W/RCV/papers/Chauhan_Post_Training_Mixed_Precision_Quantization_of_Neural_Networks_Using_First-Order_ICCVW_2023_paper.pdf
  16. Y Shang, Z Yuan, Q Wu, Z Dong, Sep 2023, PB-LLM: Partially Binarized Large Language Models, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf, Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)

More research papers on mixed-precision quantization: https://www.aussieai.com/research/quantization#mixed.

Bitshift Quantization (Power-of-Two)

The idea with bitshift quantization is to use power-of-two integer weights and bitshift operations rather than integer multiplication. There is a significant trade-off in model accuracy, since the number of distinct weights is greatly reduced. This is a well-known and active area of research, with the earliest papers dating back to 1992 and 1993. However, software algorithms using bitshifts seem unlikely to outperform hardware-accelerated integer multiplication, and hardware support for shift-based inference is limited. Extending hardware accelerators to use bitshifting, or approximate multiplication by the highest power-of-two, would presumably require fewer operations and less computing power (with reduced heat generation), and seems an open area for further research. Note that the highest set bit of an integer can be calculated efficiently using Brian Kernighan's bit-manipulation algorithm (1988).
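
Here is a minimal C++ sketch of the basic idea, assuming a hypothetical weight encoding of a sign flag plus a power-of-two exponent, so that each integer multiplication becomes a left bitshift.

    #include <cstdint>
    #include <cstdlib>

    // Multiply an integer activation by a power-of-two weight (+/- 2^exponent),
    // replacing the integer multiplication with a left bitshift.
    // Assumes the exponent is small enough that the result fits in 64 bits.
    int64_t shift_multiply(int32_t activation, unsigned exponent, bool weight_negative)
    {
        int64_t magnitude = std::llabs(static_cast<long long>(activation)) << exponent;
        bool negative = weight_negative != (activation < 0);   // XOR of the two signs
        return negative ? -magnitude : magnitude;
    }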

Research papers on bitshift power-of-two quantization:

  1. Maarten Vandersteegen, Kristof Van Beeck and Toon Goedemé, 2021, Integer-Only CNNs with 4 Bit Weights and Bit-Shift Quantization Scales at Full-Precision Accuracy, Electronics, October 2021, 10(22), 2823, https://www.mdpi.com/2079-9292/10/22/2823
  2. Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, Cheng-Zhong Xu, 2019, Focused Quantization for Sparse CNNs, Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, https://proceedings.neurips.cc/paper/2019/hash/58aaee7ae94b52697ad3b9275d46ec7f-Abstract.html
  3. Dominika Przewlocka-Rus, Syed Shakib Sarwar, H. Ekin Sumbul, Yuecheng Li, Barbara De Salvo, 2022, Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks, Feb 2022, https://arxiv.org/abs/2203.05025
  4. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, The Journal of Machine Learning Research, 18(1):6869–6898, 2017, https://arxiv.org/abs/1609.07061.
  5. T. Hokchhay, S. Hashemi, R. I. Bahar, and S. Reda, 2017, Hardware-software codesign of accurate, multiplier-free deep neural networks, in Proc. 54th Annu. Design Autom. Conf. (DAC), 2017, pp. 1–6., https://arxiv.org/abs/1705.04288
  6. Yuhang Li, Xin Dong, and Wei Wang, 2020, Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks, International Conference on Learning Representations, February 2020, https://arxiv.org/abs/1909.13144
  7. Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. 2015, Neural networks with few multiplications, CoRR, abs/1510.03009, 2015. https://arxiv.org/abs/1510.03009 (Power-of-Two Quantization)
  8. Soheil Hashemi; Nicholas Anthony; Hokchhay Tann; R. Iris Bahar; Sherief Reda, Understanding the impact of precision quantization on the accuracy and energy of neural networks, Design, Automation & Test in Europe Conference & Exhibition, March 2017, https://ieeexplore.ieee.org/abstract/document/7927224
  9. Marchesi, Michele, Orlandi, Gianni, Piazza, Francesco, and Uncini, Aurelio, 1993, Fast neural networks without multipliers, IEEE Transactions on Neural Networks , 4(1):53–62, 1993, https://ieeexplore.ieee.org/document/182695
  10. A. White and M. 1. Elmasry, 1992, The digi-neocognitron: a digital neocognitron neural network model for VLSI, IEEE Trans. Neural Networks, vol. 3. pp. 73-85, Jan. 1992, https://ieeexplore.ieee.org/document/105419
  11. Kwan, Hon Keung and Tang, CZ, 1993, Multiplierless multilayer feedforward neural network design suitable for continuous input-output mapping, Electronics Letters, 29(14):1259–1260, 1993, https://digital-library.theiet.org/content/journals/10.1049/el_19930841
  12. Sean Eron Anderson, 2023, Bit Twiddling Hacks (Kernighan Algorithm), https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
  13. Peter Wegner, 1960, A technique for counting ones in a binary computer, Communications of the ACM, Volume 3, Issue 5, May 1960, https://doi.org/10.1145/367236.367286
  14. Daisuke Miyashita, Edward H. Lee, and Boris Murmann, 2016, Convolutional Neural Networks using Logarithmic Data Representation, CoRR abs/1603.01025 (2016), https://arxiv.org/abs/1603.01025
  15. Edward H. Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, and S. Simon Wong, 2017, LogNet: Energy-efficient neural networks using logarithmic computation, In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017. 5900–5904. https://doi.org/10.1109/ICASSP.2017.7953288
  16. Elhoushi, M.; Chen, Z.; Shafiq, F.; Tian, Y. H.; and Li, J. Y., 2019, Deepshift: Towards multiplication-less neural networks, arXiv preprint arXiv:1905.13298, https://arxiv.org/abs/1905.13298
  17. Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; and Chen, Y., 2017, Incremental network quantization: Towards lossless CNNs with low-precision weight, arXiv preprint arXiv:1702.03044, https://arxiv.org/abs/1702.03044
  18. J Cai, M Takemoto, H Nakajo, 2018, A deep look into logarithmic quantization of model parameters in neural networks, https://dl.acm.org/doi/abs/10.1145/3291280.3291800
  19. HyunJin Kim; Min Soo Kim; Alberto A. Del Barrio; Nader Bagherzadeh, 2019, A cost-efficient iterative truncated logarithmic multiplication for convolutional neural networks, IEEE 26th Symposium on Computer Arithmetic (ARITH), https://ieeexplore.ieee.org/abstract/document/8877474
  20. X Li, B Liu, RH Yang, V Courville, C Xing, VP Nia, 2023, DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization, Proceedings of the IEEE/CVF, https://openaccess.thecvf.com/content/ICCV2023/papers/Li_DenseShift_Towards_Accurate_and_Efficient_Low-Bit_Power-of-Two_Quantization_ICCV_2023_paper.pdf (Extends log quantization to floating-point numbers efficiently by using a bitwise trick of integer addition on the sign and exponent bits of 32-bit IEEE 754 floats.)

See also more papers on power-of-two quantization at https://www.aussieai.com/research/quantization#logarithmic.

Sum of Two Bitshifts Quantization

The downside of logarithmic quantization is that there are relatively few unique weights, limiting precision, even if the number of bits used is maximized with a large scaling factor. An alternative implementation is to use two bitshift operations and an addition (i.e., “shift-and-add” operations). In this way, the two highest set bits of the quantized integer weight can be used, which improves model precision at the cost of more computation. This assumes that two integer shifts plus an integer addition cost less than a single integer multiplication. An early mention of this “sums of powers of two” method is in Marchesi et al. (1993).
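
A minimal C++ sketch of the shift-and-add idea follows, assuming each weight has been quantized offline to its two highest power-of-two components, stored as two exponents plus a sign (all names here are hypothetical).

    #include <cstdint>
    #include <cstdlib>

    // Approximate multiply by a weight of magnitude 2^exp1 + 2^exp2:
    // two bitshifts and an addition replace the integer multiplication.
    int64_t two_shift_multiply(int32_t activation, unsigned exp1, unsigned exp2,
                               bool weight_negative)
    {
        int64_t magnitude = std::llabs(static_cast<long long>(activation));
        int64_t result = (magnitude << exp1) + (magnitude << exp2);   // shift-and-add
        bool negative = weight_negative != (activation < 0);
        return negative ? -result : result;
    }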

Research papers on sum-of-two-bitshifts quantization:

  1. Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So, Xuehai Qian, Yanzhi Wang, and Xue Lin, 2021, Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework, 2021, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, Seoul, Korea (South), 208–220, https://doi.org/10.1109/HPCA51647.2021.00027
  2. You, H.; Chen, X.; Zhang, Y.; Li, C.; Li, S.; Liu, Z.; Wang, Z.; and Lin, Y., 2020, ShiftAddNet: A Hardware-Inspired Deep Network, In NeurIPS, https://arxiv.org/abs/2010.12785
  3. Marchesi, Michele, Orlandi, Gianni, Piazza, Francesco, and Uncini, Aurelio, 1993, Fast neural networks without multipliers, IEEE Transactions on Neural Networks , 4(1):53–62, 1993, https://ieeexplore.ieee.org/document/182695
  4. Robert Eisele, 2010, Optimizing integer multiplication, April 29th, 2010, https://www.xarg.org/2010/04/optimizing-integer-multiplication/
  5. Yuhang Li, Xin Dong, and Wei Wang, 2020, Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks, International Conference on Learning Representations, February 2020, https://arxiv.org/abs/1909.13144

See also more papers on sum-of-two-bitshifts quantization at https://www.aussieai.com/research/quantization#logarithmic.

Arbitrary Base Logarithmic Quantization

The main use of logarithms in quantization is power-of-two logarithmic quantization. This is efficient, allowing bitshifting, but its limited accuracy is a known problem. There is also some research on bases other than two, or indeed arbitrary bases, which attempt to map weights to a logarithmic format more accurately.
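
As a rough sketch of the general idea (not any particular paper's method), quantizing a weight to the nearest power of an arbitrary base b could look like the C++ below; b = 2 recovers ordinary power-of-two quantization. The function names are hypothetical.

    #include <cmath>

    // Quantize a nonzero weight to the nearest integer exponent in base b.
    int log_quantize(float w, float b)
    {
        return static_cast<int>(std::lround(std::log(std::fabs(w)) / std::log(b)));
    }

    // Reconstruct the approximate weight from its exponent and sign.
    float log_dequantize(int exponent, float b, bool negative)
    {
        float magnitude = std::pow(b, static_cast<float>(exponent));
        return negative ? -magnitude : magnitude;
    }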

Research papers on arbitrary base log quantization:

  1. S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, 2018, Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base, In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). 1–8.

See also more logarithmic bitshift quantization papers at https://www.aussieai.com/research/quantization#logarithmic.

Integer Division for Quantization?

What about using integer division instead of multiplications in quantization? After all, multiplication by a small weight like 0.003 could instead be a division by 333. Is this an avenue for optimization? It seems unlikely, since division is usually much slower than multiplication, often by an order-of-magnitude.

The one case where integer division can be efficient is division by a power of two, which can be replaced by a right bitshift, and this is effectively the mirror image of the left-bitshift quantization above. Dyadic numbers are an interesting idea along these lines: their implementation involves division by a power-of-two, usually performed via a right bitshift.

Note that division is often used in scaling operations, particularly in de-quantization. However, in such cases, it isn't the bottleneck operation, as scaling or de-quantization is performed an order-of-magnitude fewer times.

Research papers on division:

  1. LibDivide, 2023, https://libdivide.com/ and https://github.com/ridiculousfish/libdivide
  2. Ridiculous Fish, May 12th, 2021, Benchmarking division and libdivide on Apple M1 and Intel AVX512, https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html

For more division quantization research papers, see https://www.aussieai.com/research/quantization#division.

Dyadic Quantization

Dyadic numbers are a class of rational numbers that are stored and operated on as a pair of integers. The numerator is an arbitrary integer, but the denominator is restricted to be a power-of-two integer. This allows dyadic numbers to support a wide range of weights, including quite high precision in fractional weights, while still using only integer arithmetic.
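
Here is a minimal C++ sketch of the arithmetic, assuming a hypothetical dyadic weight stored as an integer numerator and a power-of-two denominator 2^shift, so that scaling reduces to an integer multiply followed by a right bitshift.

    #include <cstdint>

    // A dyadic weight: numerator / 2^shift.
    struct Dyadic {
        int32_t numerator;   // arbitrary integer
        uint8_t shift;       // denominator is 2^shift
    };

    // Multiply an integer activation by a dyadic weight using integer arithmetic only.
    int32_t dyadic_multiply(int32_t activation, Dyadic w)
    {
        int64_t product = static_cast<int64_t>(activation) * w.numerator;  // widen to avoid overflow
        return static_cast<int32_t>(product >> w.shift);  // divide by 2^shift (arithmetic shift on typical platforms)
    }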

Research papers on dyadic quantization:

  1. Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, 2021, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
  2. Renato J. Cintra; Stefan, Duffner; Christophe Garcia; André Leite, 2018, Low-Complexity Approximate Convolutional Neural Networks, IEEE Transactions on Neural Networks and Learning Systems, Volume 29, Issue 12, December 2018, pp.5981-5992, https://ieeexplore.ieee.org/abstract/document/8334697
  3. Fernanda Botelho, Max Garzon, On Dynamical Properties of Neural Networks, 1991, Complex Systems 5 (1991), p.401-413, https://wpmedia.wolfram.com/uploads/sites/13/2018/02/05-4-4.pdf

More dyadic quantization research papers are available at https://www.aussieai.com/research/quantization#dyadic and also papers on dyadic numbers at https://www.aussieai.com/research/advanced-ai-mathematics#dyadic

Stochastic Quantization

Stochastic quantization is a research area that examines intentionally inserting some randomness or statistical variation into the quantization algorithms, which may result in higher accuracy. This idea can be used in conjunction with Post-Training Quantization (PTQ) or with Quantization-Aware Training (QAT).
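
One simple and common form of this randomness is stochastic rounding of each value during quantization: round up or down at random, with probability proportional to the fractional part, so the rounding error is zero on average. A minimal C++ sketch (an illustration of the idea, not any specific paper's algorithm):

    #include <cmath>
    #include <random>

    // Stochastically round x to an integer: the closer x is to the next integer up,
    // the more likely it is to be rounded up.
    int stochastic_round(float x, std::mt19937& rng)
    {
        float lower = std::floor(x);
        float frac = x - lower;                               // fractional part in [0, 1)
        std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
        return static_cast<int>(lower) + (uniform(rng) < frac ? 1 : 0);
    }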

Research papers on stochastic quantization:

  1. Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin, 2020, Training with quantization noise for extreme model compression, arXiv e-prints, pages arXiv–2004, https://arxiv.org/abs/2004.07320
  2. Jianfei Chen, Yu Gai, Zhewei Yao, Michael W Mahoney, and Joseph E Gonzalez, 2020, A statistical framework for low-bitwidth training of deep neural networks, arXiv preprint arXiv:2010.14298, 2020, https://arxiv.org/abs/2010.14298
  3. J Zhang, 2023, Quantization for High-dimensional Data and Neural Networks: Theory and Algorithms, Ph.D. Thesis, University of California, San Diego, https://escholarship.org/content/qt9bd2k7gf/qt9bd2k7gf.pdf (See Chapter 5 in the thesis for stochastic quantization algorithms.)

See more updated research paper citations on stochastic quantization in the Aussie AI literature review at https://www.aussieai.com/research/quantization#stochastic.

Weight Clustering

Weight clustering is conceptually like pruning and quantization combined, and is sometimes called “cluster-based quantization”. A group of similar weights is merged so that all of them share exactly the same weight value. Hashing has also been used to group weights into clusters.
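
After clustering, the model can store one small index per weight plus a shared codebook of centroid values. Here is a minimal C++ sketch of inference with clustered weights, assuming a hypothetical layout with at most 256 clusters so that each index fits in one byte.

    #include <cstdint>
    #include <vector>

    // Dot product where each weight is a 1-byte index into a shared codebook
    // of cluster centroids (the merged weight values).
    float clustered_dot_product(const std::vector<float>& activations,
                                const std::vector<uint8_t>& weight_index,
                                const std::vector<float>& codebook)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < activations.size(); ++i)
            sum += codebook[weight_index[i]] * activations[i];
        return sum;
    }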

Research papers on weight clustering:

  1. Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, Yanzhi Wang, 2018, A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM, November 2018, https://arxiv.org/abs/1811.01907
  2. Steven J. Nowlan; Geoffrey E. Hinton, 1992, Simplifying Neural Networks by Soft Weight-Sharing, Neural Computation, 4(4), July 1992, https://ieeexplore.ieee.org/abstract/document/6796174
  3. Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu, RPTQ: Reorder-based Post-training Quantization for Large Language Models, May 2023, https://arxiv.org/abs/2304.01089
  4. TensorFlow, 2023, Weight clustering, https://www.tensorflow.org/model_optimization/guide/clustering
  5. A. Zhou, A. Yao, Y. Guo, L. Xu and Y. Chen, 2017, Incremental network quantization: Towards lossless CNNs with low-precision weights, arXiv:1702.03044, 2017. https://arxiv.org/abs/1702.03044 (Groups large and small weights)
  6. W. Chen, J. T. Wilson, S. Tyree, K. Weinberger and Y. Chen, 2015, Compressing neural networks with the hashing trick, Proc. ICML, pp. 2285-2294, 2015. https://arxiv.org/abs/1504.04788 (Uses hashing to do weight clustering/grouping weights.)
  7. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
  8. Maedeh Hemmat; Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on input.)
  9. B Rokh, A Azarpeyvand, A Khanteymoori, 2023, A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, ACM Transactions on Intelligent Systems, PDF: https://dl.acm.org/doi/pdf/10.1145/3623402 (Includes a survey of weight clustering.)
  10. W Cheng, W Zhang, H Shen, Y Cai, X He, K Lv, 2023, Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv preprint arXiv:2309.05516, PDF: https://arxiv.org/pdf/2309.05516.pdf (Examination of rounding schemes in PTQ and QAT for quantization and weight clustering.)

See more updated research paper citations in the Aussie AI literature review at https://www.aussieai.com/research/quantization#weight-clustering.

 
