Aussie AI

LLM Quantization Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Quantization is an extremely popular method of model compression, with a huge number of research papers, and it has been implemented in many modern inference engines. Generally, quantization has been very successful at reducing both inference compute times and storage space, while retaining near floating-point model accuracy.

Types of Quantization

Quantization is usually separated into two main categories:

  • Post-Training Quantization (PTQ). This is where a pre-trained model is quantized for faster inference.
  • Quantization-Aware Training (QAT). This is the use of quantization during model training.

Quantization granularity specifies which numbers get quantized, and which groups of numbers share the same quantization parameters. The different sets of model floating-point numbers that can be quantized include:

  • Weights quantized (weight-only quantization)
  • Activations quantized (weight-and-activation quantization)
  • Per-layer quantization
  • Per-tensor quantization
  • Per-channel quantization

Another distinguishing feature of a quantization algorithm is the scaling algorithm, or "scale factor", by which floating-point numbers are mapped to a smaller range of numbers (a small worked sketch follows the list below). Several types include:

  • Uniform scaling (uniform quantization)
  • Uniform affine quantization
  • Symmetric uniform quantization
  • Non-uniform quantization
  • Power-of-two quantization
  • Asymmetric quantization
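To make the scale-factor idea concrete, here is a minimal sketch of uniform affine (asymmetric) INT8 quantization of a single vector, with the scale and zero-point derived from the min/max range. This is illustrative code with assumed helper names, not the kernel of any particular inference engine.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Minimal sketch of uniform affine (asymmetric) INT8 quantization.
    // The float range [fmin, fmax] is mapped onto the integer range [0, 255]
    // using a scale (step size) and a zero-point (the integer representing 0.0).
    struct AffineParams {
        float scale;
        int32_t zero_point;
    };

    AffineParams choose_params(const std::vector<float>& v) {
        float fmin = *std::min_element(v.begin(), v.end());
        float fmax = *std::max_element(v.begin(), v.end());
        fmin = std::min(fmin, 0.0f);   // ensure 0.0 is exactly representable
        fmax = std::max(fmax, 0.0f);
        AffineParams p;
        p.scale = (fmax - fmin) / 255.0f;
        if (p.scale == 0.0f) p.scale = 1.0f;  // degenerate all-zero tensor
        p.zero_point = static_cast<int32_t>(std::round(-fmin / p.scale));
        return p;
    }

    uint8_t quantize(float x, const AffineParams& p) {
        int32_t q = static_cast<int32_t>(std::round(x / p.scale)) + p.zero_point;
        return static_cast<uint8_t>(std::clamp(q, 0, 255));
    }

    float dequantize(uint8_t q, const AffineParams& p) {
        return p.scale * (static_cast<int32_t>(q) - p.zero_point);
    }

Symmetric uniform quantization is the special case where the zero-point is fixed at zero, so dequantization reduces to a single multiplication by the scale.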

There are several different technical types of quantization:

  • FP16 quantization: This uses 16-bit floating point numbers instead of 32-bit numbers. Commonly used.
  • FP8 quantization: 8-bit floating point.
  • FP4 quantization: 4-bit floating point. Occasionally used in research papers.
  • 8-bit integer quantization (INT8): This uses single 8-bit bytes for quantization. Commonly used. Weights are scaled to either -128 to 127 (signed), or 0 to 255 (unsigned bytes).
  • 4-bit quantization (INT4): Another popular size for quantization is a "nibble" (4 bits). There can be 2^4=16 weights. This is commonly used and quite effective given its low bitwidth.
  • 3-bit quantization (INT3). This uses 2^3=8 distinct weights.
  • 2-bit quantization (INT2): There are 4 distinct weights. Not commonly used.
  • Ternary quantization: This is quantization with 3 weights, usually -1, 0, and +1. The weights are stored in 2 bits, but only 3 of the 4 possible values are used. Suffers accuracy problems.
  • Binary quantization: This is 1-bit quantization with 2 possible weights (usually 0 and 1, or -1 and +1). Not highly accurate.
  • 0-bit quantization. Good luck with this algorithm.
  • Integer-only-arithmetic quantization. This refers to quantization where the actual arithmetic performed during inference is all integer multiplications. This is distinct from the rather unkindly named "fake quantization", where the integers are "dequantized" back to floating-point before using floating-point multiplication in inference calculations. Integer-only-arithmetic quantization aims to improve both speed and space, whereas integer quantization without integer-only arithmetic still reduces model size and storage space, but improves execution speed less (latency is still somewhat improved due to reduced memory-related activity).
  • Dyadic quantization: This is an uncommon quantization method using dyadic numbers, which are a mathematical representation of numbers as rational quotients where the numerator is an integer, but the denominator is always a power-of-two (allowing bitshifts).
  • Logarithmic Bitshift Quantization (Power-of-Two Quantization). This is where the weights are all powers of 2, so that faster bitshifts are used instead of integer multiplication. A generalization is "Sum of Two Bitshifts Quantization" which uses multiple bitshifts added together.

And some more quantization terminology:

  • Stochastic quantization. This is a method of intentionally introducing some non-determinism and randomness into quantization algorithms, with the goal of increased inference accuracy.
  • Extreme quantization: Usually refers to binary quantization.
  • Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
  • Fake quantization (or "simulated quantization"). Refers somewhat unkindly to integer quantization with the main goal of storage space reduction from a reduced model size, where the actual arithmetic is still performed as floating-point multiplications, rather than the "real quantization" of integer-only-arithmetic quantization.
  • Fixed point quantization. Using fixed-point arithmetic to change vector dot product from floating point multiplication/addition into integer multiplication and addition.
  • Mixed-precision quantization (or simply "mixed quantization"): Refers to more finely granular quantization where different parts of the model have different levels of quantization in terms of bits.

Quantization Theory

Research papers on the theoretical basis of quantization:

  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng, 6 Dec 2023, SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM, https://arxiv.org/abs/2312.03788
  • Jiedong Lang, Zhehao Guo, Shuyu Huang, 30 Oct 2024, A Comprehensive Study on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2411.02530
  • Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan, 7 Nov 2024, Scaling Laws for Precision, https://arxiv.org/abs/2411.04330
  • Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691

Survey Papers on Quantization

  • Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen Blankevoort, Chirag Patel, Abhijit Khobare, Neural Network Quantization with AI Model Efficiency Toolkit (AIMET), arXiv:2201.08442v1 [cs.LG], 20 Jan 2022, https://arxiv.org/pdf/2201.08442.pdf
  • Krishnamoorthi, R., Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018, https://arxiv.org/abs/1806.08342
  • Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T., "A white paper on neural network quantization", 2021, https://arxiv.org/abs/2106.08295
  • Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive 2021 survey paper including quantization.)
  • Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various topics, including PTQ and QAT quantization.)
  • Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv:2103.13630 [cs], June 2021, https://arxiv.org/abs/2103.13630
  • T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
  • T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artifcial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
  • Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
  • R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  • B Rokh, A Azarpeyvand, A Khanteymoori, ACM Transactions on Intelligent Systems, 2023, A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, PDF: https://dl.acm.org/doi/pdf/10.1145/3623402
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
  • Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877

General Research Papers on Quantization

Papers on the general theory of quantization, or specific works with relevance to the overall theoretical basis of using quantization for model compression:

Floating Point Quantization

The most straightforward quantization is to reduce 32-bit floating point (4 bytes) to 16-bit floating point (2 bytes). This halves the memory storage requirements, and does not suffer much reduction in model accuracy. All operations in matmuls are still done in floating-point arithmetic.

The classic form of floating point quantization is often abbreviated as FP16. There is also "bfloat16", which uses a different representation of numbers. An even more reduced size is FP8 quantization, which uses 8-bit floating point numbers.
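As a concrete illustration, bfloat16 conversion is particularly simple, because bfloat16 keeps the same 8-bit exponent as FP32 and just drops the low 16 mantissa bits. Here is a minimal truncating sketch; real code would usually round to nearest even or use hardware/library conversions, and IEEE FP16 is more involved because its exponent is only 5 bits.

    #include <cstdint>
    #include <cstring>

    // Minimal sketch: FP32 -> bfloat16 by truncation.
    // bfloat16 keeps the FP32 sign and 8-bit exponent but only 7 mantissa bits,
    // so conversion simply drops the low 16 bits of the FP32 bit pattern.
    uint16_t float_to_bfloat16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));  // type-pun safely via memcpy
        return static_cast<uint16_t>(bits >> 16);
    }

    // Widening back to FP32 restores the dropped mantissa bits as zeros.
    float bfloat16_to_float(uint16_t b) {
        uint32_t bits = static_cast<uint32_t>(b) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }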

Research papers on floating point quantization (there are many):

8-bit Floating-Point Quantization (FP8)

FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Research papers on FP8 quantization:

6-bit Floating-Point Quantization (FP6)

Research on FP6 quantization:

  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, 25 Jan 2024, FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, https://arxiv.org/abs/2401.14112
  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, July 2024, Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024,Santa Clara, CA, USA, 978-1-939133-41-0, https://www.usenix.org/conference/atc24/presentation/xia https://www.usenix.org/system/files/atc24-xia.pdf

4-bit Floating-Point Quantization (FP4)

Research on FP4 quantization:

  • Youngdeok Hwang; Janghwan Lee; Jiwoong Park; Jieun Lim; Jungwook Choi, Jan 2024, Searching Optimal Floating-Point Format for Sub-8-Bit Large Language Model Inference, 2024 International Conference on Electronics, Information, and Communication (ICEIC), https://ieeexplore.ieee.org/abstract/document/10457111 (Examines floating-point representations below 8 bits, and also the importance of denormalized numbers.)
  • Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
  • Xiaoxia Wu, Zhewei Yao, Yuxiong He, 2021, A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats Microsoft, https://neurips2023-enlsp.github.io/papers/paper_92.pdf Code: https://github.com/microsoft/DeepSpeed (FP4 4-bit weight quantization with 8-bit FP8 activation quantization, and showed FP8 bettered INT8 quantization and FP4 beat INT4.)
  • S Liu, Z Liu, X Huang, P Dong, KT Cheng, 2023 LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv preprint arXiv:2310.16836, https://arxiv.org/pdf/2310.16836.pdf Code: https://github.com/nbasyl/LLM-FP4

16-bit Floating-Point Quantization (FP16)

Quantization from high-precision 32-bit floating point weights (usually abbreviated "FP32" or "float32") to lower-precision 16-bit floating point (usually abbreviated "FP16" or "float16") can yield significant benefits, often without a significant loss of accuracy. There is much research in this area.

32-bit Floating-Point Quantization (FP32)

Is there an FP32 quantization technique? No, not really! It's not quantized if it's the same format as the default.

Integer Quantization

Integer quantization of AI models is a long-standing area of research, with much literature. These are only some of the many papers:

Integer-Only-Arithmetic Quantization

Integer-only quantization is integer quantization where all of the arithmetic is performed on integers. This is not true of all integer quantization algorithms: several types store weights as quantized integers, but then de-quantize them back to floating point at various points (even for weight multiplication in some algorithms). Methods that restrict the arithmetic to avoid floating-point operations are more precisely named "integer-only-arithmetic quantization" algorithms. For methods that also quantize non-linear components to integers, such as Softmax and normalization, see also end-to-end integer Transformers.
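As a rough sketch of the idea, assuming symmetric per-tensor INT8 scales for both weights and activations (one common choice, not the only one), the dot product accumulates entirely in 32-bit integers, and the final rescale uses a precomputed dyadic multiplier (integer multiply plus right shift), so no floating-point operation appears anywhere.

    #include <cstddef>
    #include <cstdint>

    // Sketch of an integer-only dot product for symmetric INT8 quantization.
    // INT8 weights and activations are multiplied and accumulated in INT32.
    // The combined scale (scale_w * scale_a / scale_out) is converted offline
    // into a dyadic multiplier m / 2^shift, so requantization is also integer-only.
    // Assumes shift >= 1 and arithmetic right shift of negatives (true in practice).
    int8_t int8_dot_product(const int8_t* w, const int8_t* a, size_t n,
                            int32_t multiplier, int shift) {
        int32_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(a[i]);
        }
        int64_t scaled = (static_cast<int64_t>(acc) * multiplier) >> shift;
        if (scaled > 127) scaled = 127;     // clamp back into INT8 range
        if (scaled < -128) scaled = -128;
        return static_cast<int8_t>(scaled);
    }

This is the style of requantization used by dyadic approaches such as HAWQ-V3 (see the Dyadic Quantization section below), although real kernels add per-channel scales, zero-points, and careful rounding.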

Low-Bit Quantization

Low-bit quantization generally refers to 4-bit quantization or less. See below for research on binary, ternary, 2-bit, 3-bit, and 4-bit quantization.

Papers on low-bit quantization in general:

  • Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016, Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, https://arxiv.org/abs/1606.06160
  • Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh, 10 Mar 2024, FrameQuant: Flexible Low-Bit Quantization for Transformers, https://arxiv.org/abs/2403.06082 (A method using 2-bit quantization.)
  • Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao, Oct 2023, Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference? https://arxiv.org/abs/2310.05079, https://arxiv.org/pdf/2310.05079.pdf
  • JH Heo, J Kim, B Kwon, B Kim, SJ Kwon, D Lee, Sep 2023, Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models, arXiv preprint arXiv:2309.15531, https://arxiv.org/pdf/2309.15531.pdf
  • Shuchang Zhou, Yuxin Wu, Zekun Ni, et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, http://arxiv.org/abs/1606.06160
  • Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che 22 May 2024 (v3), OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
  • Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
  • Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik, 30 May 2024 (v2), PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression, https://arxiv.org/abs/2405.14852 https://burlachenkok.github.io/PV-Tuning/
  • Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Hossein Katebi, Navidreza Asadi, Maziar Goudarzi, 2024, FullPack: Full Vector Utilization for Sub-Byte Quantized Vector-Matrix Multiplication on General Purpose CPUs, IEEE Computer Architecture Letters, PrePrints pp. 1-4, DOI Bookmark: 10.1109/LCA.2024.3370402, https://www.computer.org/csdl/journal/ca/5555/01/10449368/1USuDIYNOQE
  • Mohamed Mekkouri, Marc Sun, Leandro von Werra, Pedro Cuenca, Omar Sanseviero, Thomas Wolf, September 18, 2024, Fine-tuning LLMs to 1.58bit: extreme quantization made easy, https://huggingface.co/blog/1_58_llm_extreme_quantization
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
  • Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
  • Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Yuhang Li, Priyadarshini Panda, 24 Oct 2024, TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction, https://arxiv.org/abs/2410.19103
  • Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691

Binary Quantization

The extreme of quantization is to encode floating-point weights down to 1 bit. This is binary quantization (or "binarization"), where there are only 2 weights, either 0 and 1, or alternatively -1 and +1. This compresses the model by a factor of 32 in terms of space, and reduces the inference computations. In fact, binary quantization changes multiplication by a floating-point weight to a simple addition (for 1) and a null test (for 0). For binary weights -1 and +1, the -1 is a subtraction and +1 an addition, which is usually further optimized to use a sign bit tweak. There are also other variations of binary neural network architectures that use only bitwise operations, such as XNOR networks and Weightless Neural Networks (WNNs).
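As a sketch of how cheap binary inference can be, with -1/+1 weights and activations bit-packed into 64-bit words (using the assumed convention +1 = bit 1, -1 = bit 0), the dot product becomes XNOR plus popcount:

    #include <bit>       // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>

    // Sketch of a binarized dot product with -1/+1 weights and activations.
    // Each uint64_t packs 64 values (+1 stored as a 1 bit, -1 as a 0 bit).
    // Matching bits contribute +1 and differing bits -1, so for each word:
    //   dot = matches - mismatches = 2 * popcount(~(w ^ a)) - 64.
    int xnor_dot_product(const uint64_t* w, const uint64_t* a, size_t nwords) {
        int matches = 0;
        for (size_t i = 0; i < nwords; ++i) {
            matches += std::popcount(~(w[i] ^ a[i]));  // XNOR, then count 1 bits
        }
        return 2 * matches - static_cast<int>(nwords) * 64;
    }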

  • H. Yang, M. Fritzsche, C. Bartz, and C. Meinel, Bmxnet: An open-source binary neural network implementation based on mxnet, CoRR, vol. abs/1705.09864, 2017, https://arxiv.org/abs/1705.09864, Code: https://github.com/hpi-xnor
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, Springer, 525–542, https://arxiv.org/abs/1603.05279
  • B. McDanel, S. Teerapittayanon, and H. Kung, Embedded binarized neural networks, 2017, arXiv preprint arXiv:1709.02260, https://arxiv.org/abs/1709.02260
  • Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio, Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 Feb 2016, https://arxiv.org/abs/1602.02830
  • Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, 2016, Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 4114–4122, https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
  • Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio, Neural Networks with Few Multiplications, Feb 2016, https://arxiv.org/abs/1510.03009v1
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016, https://arxiv.org/abs/1603.05279
  • Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, Binaryconnect: Training deep neural networks with binary weights during propagations, In NeuriPS, pages 3123–3131, 2015, https://arxiv.org/abs/1511.00363
  • Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In CVPR, pages 5918–5926, 2017, https://arxiv.org/abs/1702.00953
  • Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, and Hong Zhou. Bivit: Extremely compressed binary vision transformer. arXiv preprint arXiv:2211.07091, 2022. https://arxiv.org/abs/2211.07091 (Softmax-aware binarization)
  • Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, and Yashar Mehdad. Bit: Robustly binarized multi-distilled transformer. arXiv preprint arXiv:2205.13016, 2022. https://arxiv.org/abs/2205.13016, Code: https://github.com/facebookresearch/bit
  • Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 19–28, 2017. https://arxiv.org/abs/1608.06049
  • Zechun Liu, Zhiqiang Shen, Marios Savvides, and KwangTing Cheng. Reactnet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, pages 143–159. Springer, 2020. https://arxiv.org/abs/2003.03488
  • Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533–7544. 2019. https://arxiv.org/abs/1906.02107, Code: https://github.com/plumerai/rethinking-bnn-optimization
  • Lin, X.; Zhao, C.; and Pan, W. 2017. Towards Accurate Binary Convolutional Neural Network. Advances in Neural Information Processing Systems, 30, https://arxiv.org/abs/1711.11294
  • Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat, Feb 2023, Binarized Neural Machine Translation, https://arxiv.org/abs/2302.04907
  • Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe, "BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS", Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
  • R. Andri, L. Cavigelli, D. Rossi and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights", Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), pp. 236-241, Jul. 2016. https://arxiv.org/abs/1606.05487v1
  • Z. Cai, X. He, J. Sun and N. Vasconcelos, "Deep learning with low precision by half-wave Gaussian quantization", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
  • R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  • Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
  • Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Uses multiple single-bit weights combined to create a multi-binary quantization method.)
  • Y Shang, Z Yuan, Q Wu, Z Dong, PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)

Ternary Quantization

Ternary quantization (or "ternarization") is the use of 3 weights: -1, 0, and 1. This requires 2 bits for representation of the weights in the model, so why wouldn't you just use 4 weights? The answer is that ternary quantization can use zero-multiplication arithmetic in the inference algorithm, with an addition (for +1), a subtraction (for -1), and a null test (for 0).
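A minimal sketch of this zero-multiplication inference, with ternary weights stored as plain int8 values -1, 0, +1 for clarity (real implementations bit-pack them):

    #include <cstddef>
    #include <cstdint>

    // Sketch of a ternary-weight dot product with no multiplications.
    // Each weight is -1, 0, or +1, so each term is a subtract, a skip, or an add.
    float ternary_dot_product(const int8_t* w, const float* x, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (w[i] > 0)      acc += x[i];  // weight +1: addition
            else if (w[i] < 0) acc -= x[i];  // weight -1: subtraction
            // weight 0: null test only, the term is skipped
        }
        return acc;
    }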

2-Bit Quantization (INT2)

This section refers to non-ternary 2-bit quantization, using 4 distinct weights. In practice, 2-bit quantization is regarded as still having some problems with model accuracy, whereas 4-bit integer quantization is considered a more reasonable tradeoff of speed-vs-accuracy. On the other hand, maybe this is unwarranted, since Liu et al (2022) tested many models with 2-bit, 3-bit, and 4-bit quantization (see Table 1 in their paper), and the extra accuracy of 4 bits over 2 bits was usually only a couple of percentage points (for double the space).

Research papers on integer quantization using 2-bits include:

3-Bit Quantization (INT3)

3-bit quantization is uncommon and unpopular, and it's not entirely clear why. It has improved accuracy over 2 bits, since it allows 2^3=8 distinct weights, and it saves 25% storage compared to its more popular 4-bit cousin while being only slightly less accurate. Maybe it just seems too inelegant for programmers to code, cramming 3-bit values into 8-bit or 32-bit words for packing and unpacking? But, no, even 5-bit quantization gets recommended by AI experts on forums, whereas if you listen for supporters of the 3-bit versions, all you hear are crickets.
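To make the packing complaint concrete, here is a minimal sketch of cramming ten 3-bit weights into one 32-bit word (wasting 2 bits), which shows why the unpacking code is fiddlier than nibble packing:

    #include <cstdint>

    // Sketch: pack ten 3-bit values (range 0..7) into one 32-bit word.
    // Ten values use 30 bits, so 2 bits of every word are wasted.
    uint32_t pack_ten_3bit(const uint8_t vals[10]) {
        uint32_t word = 0;
        for (int i = 0; i < 10; ++i) {
            word |= static_cast<uint32_t>(vals[i] & 0x7u) << (3 * i);
        }
        return word;
    }

    // Unpack the i-th 3-bit value (0 <= i < 10) from a packed word.
    uint8_t unpack_3bit(uint32_t word, int i) {
        return static_cast<uint8_t>((word >> (3 * i)) & 0x7u);
    }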

Even the research papers on 3-bit quantization don't like to admit to it, and you'll struggle to even find "3-bit quantization" in a paper title. Here are some papers on 3-bit quantization (as if you care):

  • Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, 202 https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
  • Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks. In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  • Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
  • E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  • NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  • A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
  • N. Frumkin, D. Gope, and D. Marculescu, “CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers,” arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  • B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  • Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2, 3 and 4 bits for weights, and mixed-precision quantization.)
  • W Cheng, Y Cai, K Lv, H Shen, Oct 2023, TEQ: Trainable Equivalent Transformation for Quantization of LLMs, https://arxiv.org/pdf/2310.10944.pdf
  • Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
  • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
  • Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960

4-Bit Quantization (INT4)

4-bit quantization is one of the most popular quantization regimes in practical usage. It is far more common to see a 4-bit quantization of an open source model than binary, 2-bits, or 3-bits. INT4 allows eight-fold storage compression of 32-bits down to 4-bits, which reduces memory requirements, and can also speed up inference by reducing memory-cache transfer overheads in both CPU and GPU. The 4 bits allow 2^4=16 distinct weights, which is enough for reasonable accuracy compared to the full-precision model. The 4-bit weights also fit cleanly into 8-bit bytes or 32-bit integers, making the unpacking code simple and efficient.
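By comparison, the nibble packing mentioned above is trivial. Here is a minimal sketch of packing and unpacking two unsigned 4-bit weights per byte; real INT4 kernels add the scale and zero-point handling on top of this.

    #include <cstdint>

    // Sketch: two 4-bit weights per byte, low nibble first.
    uint8_t pack_nibbles(uint8_t lo, uint8_t hi) {
        return static_cast<uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
    }

    uint8_t unpack_lo(uint8_t b) { return b & 0xF; }         // first weight
    uint8_t unpack_hi(uint8_t b) { return (b >> 4) & 0xF; }  // second weight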

Research papers on 4-bit quantization:

5-Bit Quantization (INT5)

Research papers on 5-bit quantization:

6-Bit Quantization (INT6)

Research papers on 6-bit quantization:

7-Bit Quantization (INT7)

Research papers on 7-bit quantization:

8-Bit Integer Quantization (INT8)

Research papers on 8-bit quantization:

9-Bit Quantization (INT9)

Research papers on 9-bit quantization:

10-Bit Quantization (INT10)

Research papers on 10-bit quantization:

11-Bit Quantization (INT11)

Research papers on 11-bit quantization:

12-Bit Quantization (INT12)

Research papers on 12-bit quantization:

16-Bit Integer Quantization (INT16)

INT16 is the use of 16-bit integers, so as to use half the space of FP32 weights/activations, and also using integer arithmetic kernels. Consideration of the pros and cons of integer versus floating-point computations is warranted, since FP16 quantization uses the same 16-bit memory size, but may be more accurate than quantized 16-bit integers.

Research papers on 16-bit integer quantization:

32-Bit Integer Quantization (INT32)

INT32 is not an effective form of "model compression", because it's not compressed at all! The data is no smaller than the FP32 raw weights, although it does allow integer arithmetic instead of floating-point computations. Also closely related is fixed-point quantization, using 32-bit integers and integer arithmetic.

Research papers on 32-bit integer quantization:

Mixed-Precision Quantization

Research papers on mixed-precision quantization:

Logarithmic Bitshift Quantization (Power-of-Two Quantization)

The idea with bitshift quantization is to use power-of-two integer weights and bitshift operations rather than integer multiplication. There is a significant trade-off in terms of model accuracy, since the number of distinct weights is greatly reduced. This is a well-known and active area of research, with the earliest papers dating back to 1992 and 1993. However, software algorithms using bitshifts seem unlikely to outperform hardware acceleration of integer multiplication, and hardware support for shift-based inference is limited. Extending hardware accelerators to perform bitshifting, or approximate multiplication by the highest power of two, presumably requiring fewer operations and less computing power (and reduced heat generation), seems an open area for further research. Note that the highest set bit of an integer can be efficiently calculated using Brian Kernighan's algorithm (1988).
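A minimal sketch of the bitshift idea, assuming the weights have been quantized offline to a sign flag plus an exponent (each weight representing plus or minus 2^e), and assuming non-negative integer activations (e.g., post-ReLU) so the shift is well defined:

    #include <cstddef>
    #include <cstdint>

    // Sketch of power-of-two (logarithmic) quantization at inference time.
    // Each weight is stored as a sign flag and exponent e, representing +/- 2^e,
    // so the multiplication w * x becomes a left bitshift of the activation.
    struct Pow2Weight {
        uint8_t exponent;  // weight magnitude is 2^exponent
        bool negative;     // sign of the weight
    };

    int64_t pow2_dot_product(const Pow2Weight* w, const uint32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            int64_t term = static_cast<int64_t>(x[i]) << w[i].exponent;  // shift replaces multiply
            acc += w[i].negative ? -term : term;
        }
        return acc;
    }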

Sum of Two Bitshifts Quantization

The downside of logarithmic quantization is that there are relatively few unique weights, limiting precision, even if the number of bits used is maximized with a large scaling factor. An alternative implementation is to use two bitshift operations and an addition (i.e., "shift-and-add" operations). In this way, the two highest bits of the quantized integer weight can be used, which improves model precision at the cost of more computation. This assumes that two integer shifts and an integer addition cost less than a single integer multiplication. An early mention of this "sums of powers of two" method is in Marchesi et al. (1993).
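Extending the previous sketch, a weight approximated by its two highest set bits becomes two shifts and an add; a minimal sketch under the same assumptions:

    #include <cstddef>
    #include <cstdint>

    // Sketch of "sum of two bitshifts" quantization: each weight is approximated
    // as +/- (2^e1 + 2^e2), its two highest set bits, so a multiplication becomes
    // two shifts and an addition. Activations are assumed non-negative integers.
    struct TwoShiftWeight {
        uint8_t e1;       // exponent of the highest set bit
        uint8_t e2;       // exponent of the second-highest set bit
        bool has_second;  // false if the weight is a pure power of two
        bool negative;    // sign of the weight
    };

    int64_t two_shift_dot_product(const TwoShiftWeight* w, const uint32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            int64_t term = static_cast<int64_t>(x[i]) << w[i].e1;
            if (w[i].has_second) {
                term += static_cast<int64_t>(x[i]) << w[i].e2;  // second shift-and-add
            }
            acc += w[i].negative ? -term : term;
        }
        return acc;
    }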

Arbitrary Base Logarithmic Quantization

The main use of logarithms in quantization is power-of-two logarithmic quantization. This is efficient, allowing bitshifting, but its lack of accuracy is a known problem. There is also some research on bases other than two, or indeed arbitrary bases, to try to more accurately map weights to a logarithmic format:

  • S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, 2018, Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base, In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). 1–8.

Integer Division for Quantization?

What about using integer division instead of multiplications in quantization? After all, multiplication by a small weight like 0.003 could instead be a division by 333. Is this an avenue for optimization? It seems unlikely, since division is usually much slower than multiplication, often by an order-of-magnitude.

Integer division by a power of two can be done efficiently with bitshift operations. Power-of-two division could use (right) bitshifts instead of division, which is effectively the same as the left-bitshift quantization above. Dyadic numbers are an interesting idea whose implementation involves division by a power of two, usually performed via a right bitshift.

Note that division is often used in scaling operations, particularly in de-quantization. However, in such cases, it isn't the bottleneck operation, as scaling or de-quantization is performed an order-of-magnitude fewer times.

No papers were found on "division quantization". Some research on division arithmetic algorithms:

Dyadic Quantization

Dyadic numbers are a class of numbers represented as rational numbers, but operated on as paired integers. The numerator is an integer, and the denominator is restricted to be a power-of-two integer. This allows dyadic numbers to support a wide range of weights, including quite high precision in fractional weights, while still using only integer arithmetic.
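A minimal sketch of applying a dyadic scale (numerator over 2^shift) to an integer accumulator, which is the kind of integer-only requantization step used by dyadic methods such as HAWQ-V3 (rounding and overflow handling are simplified here):

    #include <cstdint>

    // Sketch: multiply an integer value by the dyadic number (numerator / 2^shift).
    // The power-of-two denominator means the division is just a right shift,
    // so everything stays in integer arithmetic.
    // Assumes shift >= 1 and arithmetic right shift of negatives.
    int32_t dyadic_scale(int32_t value, int32_t numerator, int shift) {
        int64_t wide = static_cast<int64_t>(value) * numerator;  // widen to avoid overflow
        wide += int64_t{1} << (shift - 1);                       // round to nearest
        return static_cast<int32_t>(wide >> shift);
    }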

  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
  • Renato J. Cintra; Stefan, Duffner; Christophe Garcia; André Leite, Low-Complexity Approximate Convolutional Neural Networks, IEEE Transactions on Neural Networks and Learning Systems, Volume 29, Issue 12, December 2018, pp.5981-5992, https://ieeexplore.ieee.org/abstract/document/8334697
  • Fernanda Botelho, Max Garzon, On Dynamical Properties of Neural Networks, Complex Systems 5 (1991), p.401-413, https://wpmedia.wolfram.com/uploads/sites/13/2018/02/05-4-4.pdf
  • David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

Fixed-Point Quantization

Fixed-point numbers are a way of representing numbers that differs from floating-point numbers. For example, we can represent a dollars-and-cents amount such as $12.34 as the integer "1234". This is a fixed-point representation with a scaling factor of 100.

In practice, we can convert any fractional number to an integer by multiplying by a scaling factor (and truncating any lower-level fractional digits). Doing this is "fixed-point quantization".

The main advantage is integer arithmetic. Using fixed-point quantization changes the vector dot product to use integer multiplication and integer addition (with a bitshift). See fixed point number system.

Floating-point numbers have a per-number exponent. Fixed-point numbers are like having a single global exponent value for all numbers (not actually stored anywhere). The intermediate method is "block floating-point", where blocks of numbers (i.e., vectors) have a per-block or per-vector exponent value. Integer-only arithmetic is possible with both fixed-point and block floating-point quantization.
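A minimal sketch of fixed-point arithmetic with a global scaling factor of 2^F (here F = 8 fractional bits, an arbitrary illustrative choice), showing the bitshift that brings the product of two fixed-point numbers back to the common scale:

    #include <cstddef>
    #include <cstdint>

    // Sketch of fixed-point quantization with F fractional bits (scale 2^F).
    // A real value r is stored as the integer round(r * 2^F).
    constexpr int F = 8;

    int32_t to_fixed(float r)   { return static_cast<int32_t>(r * (1 << F)); }
    float   to_float(int32_t q) { return static_cast<float>(q) / (1 << F); }

    // Each product of two fixed-point values has 2F fractional bits, so it is
    // shifted right by F before accumulating, keeping everything at scale 2^F.
    int32_t fixed_dot_product(const int32_t* w, const int32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += (static_cast<int64_t>(w[i]) * x[i]) >> F;
        }
        return static_cast<int32_t>(acc);
    }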

Stochastic Quantization

Stochastic quantization is a research area that examines intentionally inserting some randomness or statistical variation into the quantization algorithms, which may result in higher accuracy. This idea can be used in conjunction with Post-Training Quantization (PTQ) or with Quantization-Aware Training (QAT).

Weight Clustering

Weight clustering is conceptually like pruning and quantization combined, and is sometimes called "cluster-based quantization". Groups of similar weights are merged so that all of the weights in a cluster share exactly the same value. Hashing has also been used to group weights for clustering.
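A minimal sketch of the inference side of weight clustering: the model stores a small codebook of shared centroid values plus one small index per weight, and the kernel looks each weight up in the codebook. Real implementations typically pack the indices into fewer bits and may fold the lookup into table-based kernels.

    #include <cstddef>
    #include <cstdint>

    // Sketch of cluster-based quantization at inference time.
    // Weights are stored as 8-bit indices into a 256-entry codebook of centroids,
    // so many weights share exactly the same value.
    float clustered_dot_product(const uint8_t* idx, const float* codebook,
                                const float* x, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            acc += codebook[idx[i]] * x[i];  // look up the shared centroid value
        }
        return acc;
    }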

Outliers

The issue of "outliers" refers to weights or activations that are largers (or smaller) than the vast majority of other values. There are various parts of a Transformer where the output results can depend inordinately on a small subset of values, notably in the attention computation (and hence also the KV cache).

Research papers that discuss the issue of outliers include:

Activation Quantization

The quantization of activations, which are computed dynamically during inference, is well-known and almost always used now.

Research papers on activation quantization:

  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, Jul 2018, PACT: Parameterized Clipping Activation for Quantized Neural Networks, https://arxiv.org/abs/1805.06085
  • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 10 May 2024 (v2), QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, https://arxiv.org/abs/2405.04532 Code: https://github.com/mit-han-lab/qserve
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
  • Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
  • A. Jantsch et al., "Special Session: Estimation and Optimization of DNNs for Embedded Platforms," 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Raleigh, NC, USA, 2024, pp. 21-30, doi: 10.1109/CODES-ISSS60120.2024.00013. https://ieeexplore.ieee.org/abstract/document/10740783
  • Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8_22 https://link.springer.com/chapter/10.1007/978-3-031-78169-8_22

Vector Quantization

Vector quantization is a longstanding ML technique that pre-dates all of the Transformer work, so there are many early papers in the broader ML literature. Vector quantization is related to other vector methods such as nearest-neighbor search, for example in the analysis of embedding vectors and semantic similarity, among many other applications.

Research papers on vector quantization:

Quantization Granularity

Quantization granularity refers to which parts, segments, or structures of the model's weights are quantized together. For example, granularity levels may be:

  • Layerwise quantization
  • Vector quantization
  • Block quantization

Research papers on granularity of quantization include:

  • Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, Ed H. Chi, 25 Aug 2020 (v2), Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, https://arxiv.org/abs/2002.08530
  • Lianwei Yang, Zhikai Li, Junrui Xiao, Haisong Gong, Qingyi Gu, 13 Jun 2024, MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction, https://arxiv.org/abs/2406.09229
  • Minghai Qin, 27 Aug 2024, The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study, https://arxiv.org/abs/2408.15301
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas

Layerwise Quantization

Layerwise quantization is quantization done at the granularity of layers. Each layer can have its own quantization parameters. This can be a special case of mixed-precision quantization (i.e., per-layer precision), but it is also possible to use the same precision in every layer, just with different per-layer parameters.

Research papers on layerwise quantization:

Blockwise Quantization

Blockwise quantization is a fine-grained type of quantization where each "block" of data has its own quantization parameters.

  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
  • Yanshu Wang, Wenyang He, Tong Yang, 24 May 2024, Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information, https://arxiv.org/abs/2405.17470
  • Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
  • Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
  • Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
  • Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024