Aussie AI

LLM Quantization Research

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

Quantization is an extremely popular method of model compression, with a huge number of research papers, and it has been implemented in many modern inference engines. Generally, quantization has been very successful at reducing both inference compute times and storage space, without a huge hit to model accuracy, often achieving near floating-point accuracy.

Types of Quantization

Quantization is usually separated into two main categories:

  • Post-Training Quantization (PTQ). This is where a pre-trained model is quantized for faster inference.
  • Quantization-Aware Training (QAT). This is the use of quantization during model training.

Quantization granularity specifies which of the model's floating-point numbers are quantized, and which groups of numbers share the same quantization parameters (a sketch contrasting per-tensor and per-channel scales appears after this list). The main options are:

  • Weights quantized (weight-only quantization)
  • Activations quantized (weight-and-activation quantization)
  • Per-layer quantization
  • Per-tensor quantization
  • Per-channel quantization
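
As a concrete illustration of granularity, the following is a minimal C++ sketch contrasting a single per-tensor scale with per-channel scales, assuming symmetric INT8 quantization of a row-major weight matrix with one output channel per row; the function names and layout are illustrative assumptions, not any particular library's API.

    // Sketch: per-tensor vs. per-channel symmetric INT8 scale computation.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One scale for the whole tensor: the largest magnitude maps to 127.
    float per_tensor_scale(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        return maxabs / 127.0f;   // (real code guards against maxabs == 0)
    }

    // One scale per output channel (here: per row of a row-major matrix),
    // so a single outlier weight only stretches the scale of its own channel.
    std::vector<float> per_channel_scales(const std::vector<float>& w,
                                          int rows, int cols) {
        std::vector<float> scales(rows);
        for (int r = 0; r < rows; ++r) {
            float maxabs = 0.0f;
            for (int c = 0; c < cols; ++c)
                maxabs = std::max(maxabs, std::fabs(w[r * cols + c]));
            scales[r] = maxabs / 127.0f;
        }
        return scales;
    }

Per-channel scales cost one extra float per channel, but typically recover accuracy, because a single outlier weight no longer distorts the scale for the entire tensor.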

Quantization algorithms are also distinguished by their scaling algorithm or "scale factor": the mapping by which floating-point numbers are scaled down to a smaller range of numbers. Several types include (see the sketch after this list):

  • Uniform scaling (uniform quantization)
  • Uniform affine quantization
  • Symmetric uniform quantization
  • Non-uniform quantization
  • Power-of-two quantization
  • Asymmetric quantization
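
The practical difference between symmetric and asymmetric (affine) uniform scaling is whether a zero-point offset is used alongside the scale factor. Below is a minimal C++ sketch for INT8, assuming simple round-to-nearest with clamping and pre-computed ranges; the function names are illustrative and calibration of the ranges is omitted.

    // Sketch: symmetric vs. asymmetric (affine) uniform INT8 quantization.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Symmetric: zero-point is 0; the range +/- maxabs maps onto -127..127.
    int8_t quantize_symmetric(float x, float maxabs) {
        float scale = maxabs / 127.0f;            // assumes maxabs > 0
        int q = (int)std::lround(x / scale);
        return (int8_t)std::clamp(q, -127, 127);
    }

    // Asymmetric (affine): scale and zero-point map [minv, maxv] onto 0..255.
    uint8_t quantize_affine(float x, float minv, float maxv) {
        float scale = (maxv - minv) / 255.0f;     // assumes maxv > minv
        int zero_point = (int)std::lround(-minv / scale);
        int q = (int)std::lround(x / scale) + zero_point;
        return (uint8_t)std::clamp(q, 0, 255);
    }

    // Dequantization reverses the affine mapping (up to rounding error).
    float dequantize_affine(uint8_t q, float scale, int zero_point) {
        return ((int)q - zero_point) * scale;
    }

Asymmetric scaling fits skewed ranges (such as post-ReLU activations) more tightly, at the cost of carrying a zero-point through the integer arithmetic.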

There are several different technical types of quantization:

  • FP16 quantization: This uses 16-bit floating point numbers instead of 32-bit numbers. Commonly used.
  • FP8 quantization: 8-bit floating point.
  • FP4 quantization: 4-bit floating point. Occasionally used in research papers.
  • 8-bit integer quantization (INT8): This uses single 8-bit bytes for quantization. Commonly used. Weights are scaled to either -128 to 127 (signed), or 0 to 255 (unsigned bytes).
  • 4-bit quantization (INT4): Another popular size for quantization is a "nibble" (4 bits). There can be 2^4=16 weights. This is commonly used and quite effective given its low bitwidth.
  • 3-bit quantization (INT3). This uses 2^3=8 distinct weights.
  • 2-bit quantization (INT2): There are 4 distinct weights. Not commonly used.
  • Ternary quantization: This is quantization with 3 weights, usually -1, 0, and +1. It needs 2 bits of storage, but uses only 3 of the 4 possible values. Suffers accuracy problems.
  • Binary quantization: This is 1-bit quantization with 2 possible weights (usually 0 and 1, or -1 and +1). Not highly accurate.
  • 0-bit quantization. Good luck with this algorithm.
  • Integer-only-arithmetic quantization. This refers to quantization where the actual arithmetic performed during inference is all integer multiplications. It is distinct from the rather unkindly named "fake quantization", where the integers are "dequantized" back to floating-point before using floating-point multiplication in inference calculations. Integer-only-arithmetic quantization aims to improve both speed and space, whereas integer quantization without integer-only arithmetic still reduces model size and storage space, but improves execution speed less fully (latency is still somewhat improved due to reduced memory-related activity).
  • Dyadic quantization: This is an uncommon quantization method using dyadic numbers, which are a mathematical representation of numbers as rational quotients where the numerator is an integer, but the denominator is always a power-of-two (allowing bitshifts).
  • Logarithmic Bitshift Quantization (Power-of-Two Quantization). This is where the weights are all powers of 2, so that faster bitshifts are used instead of integer multiplication. A generalization is "Sum of Two Bitshifts Quantization", which uses multiple bitshifts added together. (See the sketch after this list.)
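
For the power-of-two case specifically, here is a minimal C++ sketch of the bitshift idea, assuming weights are rounded to the nearest power of two and activations are non-negative integers (e.g., post-ReLU) so that the shifts are well-defined; the structure and names are illustrative only, and overflow guards are omitted.

    // Sketch: power-of-two (logarithmic) quantization, where the inference
    // "multiply" becomes a bitshift of the integer activation.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct Pow2Weight { int8_t sign; int8_t shift; bool zero; };

    Pow2Weight quantize_pow2(float w) {
        Pow2Weight q;
        q.sign = (w < 0.0f) ? -1 : 1;
        float mag = std::fabs(w);
        q.zero = (mag == 0.0f);
        int s = q.zero ? 0 : (int)std::lround(std::log2(mag));
        q.shift = (int8_t)std::clamp(s, -30, 30);   // keep shifts well-defined
        return q;
    }

    int32_t pow2_multiply(int32_t activation, Pow2Weight w) {
        if (w.zero) return 0;                          // weight is exactly zero
        int32_t shifted = (w.shift >= 0)
                            ? (activation << w.shift)     // multiply by 2^shift
                            : (activation >> -w.shift);   // divide by 2^-shift
        return (w.sign < 0) ? -shifted : shifted;
    }

The "sum of two bitshifts" generalization simply adds two such shifted terms per weight, trading a little extra work for noticeably better weight resolution.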

And some more quantization terminology:

  • Stochastic quantization. This is a method of intentionally introducing some non-determinism and randomness into quantization algorithms, with the goal of increased inference accuracy.
  • Extreme quantization: Usually refers to binary quantization.
  • Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
  • Fake quantization (or "simulated quantization"). Refers somewhat unkindly to integer quantization with the main goal of storage space reduction from a reduced model size, where the actual arithmetic is still performed as floating-point multiplications, rather than the "real quantization" of integer-only-arithmetic quantization.
  • Fixed-point quantization. Using fixed-point arithmetic to change vector dot products from floating-point multiplication/addition into integer multiplication and addition.
  • Mixed-precision quantization (or simply "mixed quantization"): Refers to more finely granular quantization where different parts of the model have different levels of quantization in terms of bits.

Quantization Theory

Research papers on the theoretical basis of quantization:

  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng, 6 Dec 2023, SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM, https://arxiv.org/abs/2312.03788
  • Jiedong Lang, Zhehao Guo, Shuyu Huang, 30 Oct 2024, A Comprehensive Study on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2411.02530
  • Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan, 7 Nov 2024, Scaling Laws for Precision, https://arxiv.org/abs/2411.04330
  • Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691
  • Noga Bar, Raja Giryes, 12 Jan 2025, ZOQO: Zero-Order Quantized Optimization, https://arxiv.org/abs/2501.06736
  • Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, Martino Dazzi, 3 Mar 2025, KurTail : Kurtosis-based LLM Quantization, https://arxiv.org/abs/2503.01483
  • Jiaqi Zhao, Miao Zhang, Weili Guan, Liqiang Nie, 21 May 2025, Boost Post-Training Quantization via Null Space Optimization for Large Language Models, https://arxiv.org/abs/2506.11044 https://github.com/zjq0455/q2n

Survey Papers on Quantization

  • Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen Blankevoort, Chirag Patel, Abhijit Khobare, Neural Network Quantization with AI Model Efficiency Toolkit (AIMET), arXiv:2201.08442v1 [cs.LG], 20 Jan 2022, https://arxiv.org/pdf/2201.08442.pdf
  • Krishnamoorthi, R., Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018, https://arxiv.org/abs/1806.08342
  • Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T., "A white paper on neural network quantization", 2021, https://arxiv.org/abs/2106.08295
  • Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive 2021 survey paper including quantization.)
  • Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various topics, including PTQ and QAT quantization.)
  • Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv:2103.13630 [cs], June 2021, https://arxiv.org/abs/2103.13630
  • T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
  • T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
  • Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
  • R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  • B Rokh, A Azarpeyvand, A Khanteymoori, ACM Transactions on Intelligent Systems, 2023, A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, PDF: https://dl.acm.org/doi/pdf/10.1145/3623402
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
  • Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Tom Wallace, Naser Ezzati-Jivan, Beatrice Ombuki-Berman, 16 Jan 2025, Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models, https://arxiv.org/abs/2502.00046
  • Mozhgan Navardi, Romina Aalishah, Yuzhe Fu, Yueqian Lin, Hai Li, Yiran Chen, Tinoosh Mohsenin, 19 Feb 2025, GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices, https://arxiv.org/abs/2502.15816
  • Liangdong Liu, Zhitong Zheng, Cong Wang, Tianhuang Su, Zhenyu Yang, 26 Feb 2025, Binary Neural Networks for Large Language Model: A Survey, https://arxiv.org/abs/2502.19008
  • Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
  • Minjun Kim, Jaehyeon Choi, Jongkeun Lee, Wonjin Cho, U Kang, 14 May 2025, Zero-shot Quantization: A Comprehensive Survey, https://arxiv.org/abs/2505.09188
  • Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang, 8 May 2025, Low-bit Model Quantization for Deep Neural Networks: A Survey, https://arxiv.org/abs/2505.05530 https://github.com/Kai-Liu001/Awesome-Model-Quantization
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/pdf/2507.17417

Quantization Overflow Errors

The computation of quantized activations can cause overflow to occur, in both integer and floating-point representations. Various methods to prevent or detect arithmetic overflow in quantized computations have been researched; one common safeguard is sketched below.
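
Each INT8-by-INT8 product fits in 16 bits, so summing into a 32-bit accumulator leaves room for on the order of a hundred thousand terms before overflow is possible, and any narrowing back down can saturate rather than silently wrap around. A minimal C++ illustration of this widened-accumulator pattern (not drawn from any specific paper):

    // Sketch: overflow-safe INT8 dot product via a widened 32-bit accumulator.
    #include <algorithm>
    #include <cstdint>

    int32_t int8_dot_product(const int8_t* a, const int8_t* w, int n) {
        int32_t acc = 0;                               // widened accumulator
        for (int i = 0; i < n; ++i)
            acc += (int32_t)a[i] * (int32_t)w[i];      // each product fits in 16 bits
        return acc;
    }

    // Saturating narrowing back to INT16, rather than silently wrapping around.
    int16_t saturate_to_int16(int32_t x) {
        return (int16_t)std::clamp(x, (int32_t)INT16_MIN, (int32_t)INT16_MAX);
    }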

Floating Point Quantization

The most straightforward quantization is to reduce 32-bit floating point (4 bytes) to 16-bit floating point (2 bytes). This halves the memory storage requirements, and does not suffer much reduction in model accuracy. All operations in matmuls are still done in floating-point arithmetic.

The classic form of floating-point quantization is often abbreviated as FP16. There is also "bfloat16", which uses a different 16-bit representation (it keeps the 8 exponent bits of FP32 but truncates the mantissa). An even more reduced size is FP8 quantization, which uses 8-bit floating point numbers.
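
To make the FP16/bfloat16 distinction concrete, here is a simplified C++ sketch of the bit-level conversions from FP32, assuming truncation rather than round-to-nearest and ignoring NaN and denormal handling; real converters (or the hardware instructions) do this properly.

    // Sketch: FP32 -> FP16 and FP32 -> bfloat16 bit conversions (simplified).
    #include <cstdint>
    #include <cstring>

    uint16_t fp32_to_fp16_bits(float f) {
        uint32_t b;
        std::memcpy(&b, &f, sizeof(b));                  // reinterpret FP32 bits
        uint32_t sign = (b >> 16) & 0x8000u;             // 1 sign bit
        int32_t  exp  = (int32_t)((b >> 23) & 0xFFu) - 127 + 15;  // rebias 8->5 bits
        uint32_t mant = (b >> 13) & 0x3FFu;              // keep top 10 of 23 bits
        if (exp <= 0)  return (uint16_t)sign;                // underflow -> signed zero
        if (exp >= 31) return (uint16_t)(sign | 0x7C00u);    // overflow -> infinity
        return (uint16_t)(sign | ((uint32_t)exp << 10) | mant);
    }

    // bfloat16 keeps FP32's 8 exponent bits: it is simply the top 16 bits.
    uint16_t fp32_to_bf16_bits(float f) {
        uint32_t b;
        std::memcpy(&b, &f, sizeof(b));
        return (uint16_t)(b >> 16);                      // truncation (no rounding)
    }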

Research papers on floating point quantization (there are many):

8-bit Floating-Point Quantization (FP8)

FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Research papers on FP8 quantization:

6-bit Floating-Point Quantization (FP6)

Research on FP6 quantization:

  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, 25 Jan 2024, FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, https://arxiv.org/abs/2401.14112
  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, July 2024, Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024,Santa Clara, CA, USA, 978-1-939133-41-0, https://www.usenix.org/conference/atc24/presentation/xia https://www.usenix.org/system/files/atc24-xia.pdf

4-bit Floating-Point Quantization (FP4)

Research on FP4 quantization:

  • Youngdeok Hwang; Janghwan Lee; Jiwoong Park; Jieun Lim; Jungwook Choi, Jan 2024, Searching Optimal Floating-Point Format for Sub-8-Bit Large Language Model Inference, 2024 International Conference on Electronics, Information, and Communication (ICEIC), https://ieeexplore.ieee.org/abstract/document/10457111 (Examines floating-point representations below 8 bits, and also the importance of denormalized numbers.)
  • Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
  • Xiaoxia Wu, Zhewei Yao, Yuxiong He, 2021, A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats, Microsoft, https://neurips2023-enlsp.github.io/papers/paper_92.pdf Code: https://github.com/microsoft/DeepSpeed (FP4 4-bit weight quantization with 8-bit FP8 activation quantization, and showed FP8 bettered INT8 quantization and FP4 beat INT4.)
  • S Liu, Z Liu, X Huang, P Dong, KT Cheng, 2023 LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv preprint arXiv:2310.16836, https://arxiv.org/pdf/2310.16836.pdf Code: https://github.com/nbasyl/LLM-FP4
  • Wonsuk Jang, Thierry Tambe, 2 Jan 2025, BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference, https://arxiv.org/abs/2501.01144 (Per-block granular mixed-precision quantization including FP4.)
  • Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng, 28 Jan 2025, Optimizing Large Language Model Training Using FP4 Quantization, https://arxiv.org/abs/2501.17116
  • Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh, 20 May 2025, Quartet: Native FP4 Training Can Be Optimal for Large Language Models, https://arxiv.org/abs/2505.14669
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/pdf/2507.17417
  • Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry, 10 Aug 2025, FP4 All the Way: Fully Quantized Training of LLMs, https://arxiv.org/abs/2505.19115

16-bit Floating-Point Quantization (FP16)

Quantization from high-precision 32-bit floating point weights (usually abbreviated "FP32" or "float32") to lower-precision 16-bit floating point (usually abbreviated "FP16" or "float16") can yield significant benefits, often with little loss of accuracy. There is much research in this area.

32-bit Floating-Point Quantization (FP32)

Is there an FP32 quantization technique? No, not really! It's not quantized if it's the same format as the default.

Integer Quantization

Integer quantization of AI models is a long-standing area of research, with much literature. These are only some of the many papers:

Integer-Only-Arithmetic Quantization

Integer-only-arithmetic quantization is integer quantization where the inference arithmetic is performed entirely with integer multiplications. It is a common misconception that this is true of all integer quantization algorithms: several types of integer quantization store weights as quantized integers, but then de-quantize them back to floating point at various points (even for the weight multiplications in some algorithms). Methods that strictly restrict the arithmetic to avoid floating-point operations are more precisely named "integer-only-arithmetic quantization" algorithms. For methods that also quantize non-linear components to integers, such as Softmax and normalization components, see also end-to-end integer Transformers.
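
A minimal C++ sketch of the integer-only idea for a single dot product, assuming symmetric INT8 quantization of both weights and activations: all of the multiply-accumulate work is integer, and the floating-point scales are applied once at the end (fully integer pipelines replace even that final step with a fixed-point or dyadic rescale). The names are illustrative.

    // Sketch: integer-only MACs with a single rescale at the end.
    #include <cstdint>

    float int8_dot(const int8_t* act, const int8_t* wgt, int n,
                   float act_scale, float wgt_scale) {
        int32_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int32_t)act[i] * (int32_t)wgt[i];   // integer arithmetic only
        return (float)acc * act_scale * wgt_scale;      // one rescale per dot product
    }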

Low-Bit Quantization

Low-bit quantization generally refers to 4-bit quantization or less. See below for research papers on binary, ternary, 2-bit, 3-bit, and 4-bit quantization.

Papers on low-bit quantization in general:

  • Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016, Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, https://arxiv.org/abs/1606.06160
  • Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh, 10 Mar 2024, FrameQuant: Flexible Low-Bit Quantization for Transformers, https://arxiv.org/abs/2403.06082 (A method using 2-bit quantization.)
  • Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao, Oct 2023, Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference? https://arxiv.org/abs/2310.05079, https://arxiv.org/pdf/2310.05079.pdf
  • JH Heo, J Kim, B Kwon, B Kim, SJ Kwon, D Lee, Sep 2023, Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models, arXiv preprint arXiv:2309.15531, https://arxiv.org/pdf/2309.15531.pdf
  • Shuchang Zhou, Yuxin Wu, Zekun Ni, et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, http://arxiv.org/abs/1606.06160
  • Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che 22 May 2024 (v3), OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
  • Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
  • Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik, 30 May 2024 (v2), PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression, https://arxiv.org/abs/2405.14852 https://burlachenkok.github.io/PV-Tuning/
  • Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Hossein Katebi, Navidreza Asadi, Maziar Goudarzi, 2024, FullPack: Full Vector Utilization for Sub-Byte Quantized Vector-Matrix Multiplication on General Purpose CPUs, IEEE Computer Architecture Letters, PrePrints pp. 1-4, DOI Bookmark: 10.1109/LCA.2024.3370402, https://www.computer.org/csdl/journal/ca/5555/01/10449368/1USuDIYNOQE
  • Mohamed Mekkouri, Marc Sun, Leandro von Werra, Pedro Cuenca, Omar Sanseviero, Thomas Wolf, September 18, 2024, Fine-tuning LLMs to 1.58bit: extreme quantization made easy, https://huggingface.co/blog/1_58_llm_extreme_quantization
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
  • Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
  • Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Yuhang Li, Priyadarshini Panda, 24 Oct 2024, TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction, https://arxiv.org/abs/2410.19103
  • Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691
  • Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che, 12 Dec 2024, CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs, https://arxiv.org/abs/2412.09282 (Vector quantization of low-bit or 1-bit weight vectors, with additional bits for some channels, analogous to combining mixed-precision quantization and/or weight clustering.)
  • Weilun Feng, Haotong Qin, Chuanguang Yang, Zhulin An, Libo Huang, Boyu Diao, Fei Wang, Renshuai Tao, Yongjun Xu, Michele Magno, 16 Dec 2024, MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models, https://arxiv.org/abs/2412.11549
  • Kyle Wiggers, December 23, 2024, A popular technique to make AI more efficient has drawbacks, https://techcrunch.com/2024/12/23/a-popular-technique-to-make-ai-more-efficient-has-drawbacks/
  • Dibakar Gope, David Mansell, Danny Loh, Ian Bratt, 23 Dec 2024, Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs, https://arxiv.org/abs/2501.00032 https://github.com/ggerganov/llama.cpp
  • Yuzong Chen, Xilai Dai, Chi-chih Chang, Yash Akhauri, Mohamed S. Abdelfattah, 6 Jan 2025, The Power of Negative Zero: Datatype Customization for Quantized Large Language Models, https://arxiv.org/abs/2501.04052 (Remapping negative zero to other values.)
  • Zhang, Z., Liu, S., Chen, R., Kailkhura, B., Chen, B., & Wang, Z. (2024). Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache. In P. Gibbons, G. Pekhimenko, & C. De Sa (Eds.), Proceedings of Machine Learning and Systems 6 (MLSys 2024) (pp. 381-394). MLSys. https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://github.com/VITA-Group/Q-Hitter
  • 18 Jan 2025, LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator, Guoyu Li, Shengyu Ye, Chunyun Chen, Yang Wang, Fan Yang, Ting Cao, Cheng Liu, Mohamed M. Sabry, Mao Yang, https://arxiv.org/abs/2501.10658 (Extremely low-bit quantization below 1-bit (!) with vector quantization to table lookup.)
  • Xuerui Qiu, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Malu Zhang, Haizhou Li, 23 Jan 2025, Quantized Spike-driven Transformer, https://arxiv.org/abs/2501.13492 https://github.com/bollossom/QSD-Transformer
  • Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng, 26 Feb 2025, M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type, https://arxiv.org/abs/2502.18755
  • Jaewoo Song, Fangzhen Lin, 7 Mar 2025, SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs, https://arxiv.org/abs/2503.07657
  • Vikas Natesh, H.T. Kung, 12 Apr 2025, PQS (Prune, Quantize, and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations, https://arxiv.org/abs/2504.09064
  • Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang, 8 May 2025, Low-bit Model Quantization for Deep Neural Networks: A Survey, https://arxiv.org/abs/2505.05530 https://github.com/Kai-Liu001/Awesome-Model-Quantization
  • Ba-Hien Tran, Van Minh Nguyen, 28 May 2025. Highly Efficient and Effective LLMs with Multi-Boolean Architectures, https://arxiv.org/abs/2505.22811
  • Amir Ardakani, 2025, Towards multiplier-less implementation of neural networks, PhD Thesis, McGill University, https://escholarship.mcgill.ca/downloads/nz806584h
  • Ching-Yi Lin, Sahil Shah, 4 Aug 2025 (v2), Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware, https://arxiv.org/abs/2504.18547 (Delayed dequantization until after matrix multiplication.)
  • Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng, 13 Aug 2025, Speed Always Wins: A Survey on Efficient Architectures for Large Language Models, https://arxiv.org/abs/2508.09834
  • Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, Mao Yang, 14 Aug 2025, BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache, https://arxiv.org/abs/2503.18773
  • Deyu Cao and Samin Aref, 30 Jul 2025, Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining, https://arxiv.org/abs/2504.13932
  • Chen Feng and Yicheng Lin and Shaojie Zhuo and Chenzheng Su and Ramchalam Kinattinkara Ramakrishnan and Zhaocong Yuan and Xiaopeng Zhang, 1 Aug 2025, Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models, https://arxiv.org/abs/2507.07877
  • Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang, 6 Aug 2025, PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models, https://arxiv.org/abs/2502.13179
  • Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale, 8 Aug 2025, Low-Bit Data Processing Using Multiple-Output Spiking Neurons with Non-linear Reset Feedback, https://arxiv.org/abs/2508.06292
  • Sergey Salishev, Ian Akhremchik, 19 Aug 2025, GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks, https://arxiv.org/abs/2508.14004
  • Xinlin Li, Osama Hanna, Christina Fragouli, Suhas Diggavi, 24 Aug 2025, ICQuant: Index Coding enables Low-bit LLM Quantization, https://arxiv.org/abs/2505.00850

Binary Quantization

The extreme of quantization is to encode floating-point weights down to 1 bit. This is binary quantization (or "binarization"), where there are only 2 weights, either 0 and 1, or alternatively -1 and +1. This compresses the model by a factor of 32 in terms of space (versus FP32), and reduces the inference computations. In fact, binary quantization changes multiplication by a floating-point weight into a simple addition (for 1) and a null test (for 0). Or for binary weights -1 and +1, the -1 is a subtraction and the +1 an addition, which is usually further optimized to use a sign bit tweak. There are also other incarnations of binary neural network architectures that use only bitwise operations, such as XNOR networks and Weightless Neural Networks (WNNs).
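
A minimal C++ sketch of the XNOR/popcount trick for -1/+1 binary networks, assuming both weights and activations are bit-packed into 64-bit words (a set bit meaning +1, a clear bit meaning -1) and that the vector length is a multiple of 64; this illustrates the idea rather than any particular paper's kernel.

    // Sketch: binary (-1/+1) dot product with XOR and popcount (C++20 <bit>).
    #include <bit>
    #include <cstdint>

    // n = number of packed values; assumes n is a multiple of 64.
    int binary_dot(const uint64_t* a, const uint64_t* w, int num_words, int n) {
        int mismatches = 0;
        for (int i = 0; i < num_words; ++i)
            mismatches += std::popcount(a[i] ^ w[i]);   // differing bits are -1 terms
        return n - 2 * mismatches;   // matches contribute +1, mismatches contribute -1
    }

This is why binarized networks map so well onto hardware popcount instructions: 64 "multiplications" collapse into one XOR and one popcount.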

  • H. Yang, M. Fritzsche, C. Bartz, and C. Meinel, Bmxnet: An open-source binary neural network implementation based on mxnet, CoRR, vol. abs/1705.09864, 2017, https://arxiv.org/abs/1705.09864, Code: https://github.com/hpi-xnor
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, Springer, 525–542, https://arxiv.org/abs/1603.05279
  • B. McDanel, S. Teerapittayanon, and H. Kung, Embedded binarized neural networks, 2017, arXiv preprint arXiv:1709.02260, https://arxiv.org/abs/1709.02260
  • Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio, Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 Feb 2016, https://arxiv.org/abs/1602.02830
  • Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, 2016, Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 4114–4122, https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
  • Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio, Neural Networks with Few Multiplications, Feb 2016, https://arxiv.org/abs/1510.03009v1
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016, https://arxiv.org/abs/1603.05279
  • Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, Binaryconnect: Training deep neural networks with binary weights during propagations, In NeuriPS, pages 3123–3131, 2015, https://arxiv.org/abs/1511.00363
  • Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In CVPR, pages 5918–5926, 2017, https://arxiv.org/abs/1702.00953
  • Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, and Hong Zhou. Bivit: Extremely compressed binary vision transformer. arXiv preprint arXiv:2211.07091, 2022. https://arxiv.org/abs/2211.07091 (Softmax-aware binarization)
  • Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, and Yashar Mehdad. Bit: Robustly binarized multi-distilled transformer. arXiv preprint arXiv:2205.13016, 2022. https://arxiv.org/abs/2205.13016, Code: https://github.com/facebookresearch/bit
  • Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 19–28, 2017. https://arxiv.org/abs/1608.06049
  • Zechun Liu, Zhiqiang Shen, Marios Savvides, and KwangTing Cheng. Reactnet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, pages 143–159. Springer, 2020. https://arxiv.org/abs/2003.03488
  • Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533–7544. 2019. https://arxiv.org/abs/1906.02107, Code: https://github.com/plumerai/rethinking-bnn-optimization
  • Lin, X.; Zhao, C.; and Pan, W. 2017. Towards Accurate Binary Convolutional Neural Network. Advances in Neural Information Processing Systems, 30, https://arxiv.org/abs/1711.11294
  • Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat, Feb 2023, Binarized Neural Machine Translation, https://arxiv.org/abs/2302.04907
  • Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe, "BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS", Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
  • R. Andri, L. Cavigelli, D. Rossi and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights", Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), pp. 236-241, Jul. 2016. https://arxiv.org/abs/1606.05487v1
  • Z. Cai, X. He, J. Sun and N. Vasconcelos, "Deep learning with low precision by half-wave Gaussian quantization", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
  • R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  • Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
  • Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Uses multiple single-bit weights combined to create a multi-binary quantization method.)
  • Y Shang, Z Yuan, Q Wu, Z Dong, PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)

Ternary Quantization

Ternary quantization (or "ternarization") is the use of 3 weights: -1, 0, and 1. This requires 2 bits for representation of the weights in the model, so why wouldn't you just use 4 weights? The answer is that ternary quantization can use zero-multiplication arithmetic in the inference algorithm, with an addition (for +1), a subtraction (for -1), and a null test (for 0).
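
A minimal C++ sketch of that zero-multiplication inference idea, assuming ternary weights stored one per byte and floating-point activations; real implementations bit-pack the weights and vectorize, but the arithmetic is the same.

    // Sketch: ternary dot product with no multiplications.
    #include <cstdint>

    float ternary_dot(const float* act, const int8_t* wgt, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) {
            if (wgt[i] > 0)       acc += act[i];   // +1 weight: addition
            else if (wgt[i] < 0)  acc -= act[i];   // -1 weight: subtraction
            // 0 weight: null test, no arithmetic at all
        }
        return acc;
    }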

1.58 Bit Quantization

1.58 bit quantization refers to ternary quantization with the values -1, 0, and +1. The number arises as log2(3) ≈ 1.58, which is how many binary bits are theoretically needed to store three values. It is somewhat of a misnomer, because implementations of these "1.58 bit" models do not actually use so few bits, but usually store the ternary values in a more practical data representation, such as 8 bits per weight (which wastes space, but is efficient). Like ternary models, 1.58 bit models are efficient because they don't need multiplication, but they suffer from accuracy problems, although the latest research seems to be correcting this limitation.
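
For the curious, the 1.58-bit figure is achievable in principle because five ternary digits fit into a single byte (3^5 = 243 <= 256, i.e., about 1.6 bits per weight). The hypothetical C++ packing sketch below shows the idea; as noted above, deployed models usually skip this and store one ternary value per 2 or 8 bits for speed.

    // Sketch: base-3 packing of five ternary digits ("trits") per byte.
    #include <cstdint>

    // Pack five trits (each 0, 1, or 2, encoding -1, 0, +1) into one byte.
    uint8_t pack5_trits(const uint8_t t[5]) {
        return (uint8_t)(t[0] + 3 * (t[1] + 3 * (t[2] + 3 * (t[3] + 3 * t[4]))));
    }

    // Unpack the i-th trit (i in 0..4) from a packed byte.
    uint8_t unpack_trit(uint8_t packed, int i) {
        for (int k = 0; k < i; ++k) packed /= 3;
        return packed % 3;
    }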

Research papers on 1.58 bit models:

2-Bit Quantization (INT2)

This section refers to non-ternary 2-bit quantization, using 4 distinct weights. In practice, 2-bit quantization is regarded as still having some problems with model accuracy, whereas 4-bit integer quantization is considered a more reasonable tradeoff of speed-vs-accuracy. On the other hand, maybe this is unwarranted, since Liu et al (2022) tested lots of models with 2-bits, 3-bits, and 4-bits (see Table 1 in their paper), and the extra accuracy of 4-bits over 2-bits was usually only a couple of percentage points (for double the space).

Research papers on integer quantization using 2-bits include:

3-Bit Quantization (INT3)

3-bit quantization is uncommon and unpopular, and it's not entirely clear why. It has improved accuracy over 2 bits and saves 25% storage compared to its more popular 4-bit cousin, while being only slightly less accurate, since it allows 2^3=8 distinct weights. Maybe it just seems too inelegant for programmers to code the cramming of 3-bit values into 8-bit bytes or 32-bit words for packing and unpacking (see the sketch below)? But, no, even 5-bit quantization gets recommended by AI experts on forums, whereas if you listen for supporters of the 3-bit versions, all you hear is crickets.
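
To show what that cramming looks like, here is a minimal C++ sketch that packs ten 3-bit codes into one 32-bit word (30 bits used, 2 bits wasted per word); the helper names are illustrative, and real kernels unpack many codes at once with SIMD instructions.

    // Sketch: packing and unpacking 3-bit quantization codes in 32-bit words.
    #include <cstdint>

    uint32_t pack10_3bit(const uint8_t codes[10]) {
        uint32_t word = 0;
        for (int i = 0; i < 10; ++i)
            word |= (uint32_t)(codes[i] & 0x7u) << (3 * i);
        return word;
    }

    uint8_t unpack_3bit(uint32_t word, int i) {   // i in 0..9
        return (uint8_t)((word >> (3 * i)) & 0x7u);
    }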

Even the research papers on 3-bit quantization don't like to admit to it, and you'll struggle to even find "3-bit quantization" in a paper title. Here are some papers on 3-bit quantization (as if you care):

  • Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, 202 https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
  • Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks. In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  • Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
  • E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  • NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  • A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
  • N. Frumkin, D. Gope, and D. Marculescu, “CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers,” arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  • B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  • Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2, 3 and 4 bits for weights, and mixed-precision quantization.)
  • W Cheng, Y Cai, K Lv, H Shen, Oct 2023, TEQ: Trainable Equivalent Transformation for Quantization of LLMs, https://arxiv.org/pdf/2310.10944.pdf
  • Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
  • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
  • Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
  • Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
  • Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 8 Feb 2024, Accurate LoRA-Finetuning Quantization of LLMs via Information Retention, https://arxiv.org/abs/2402.05445 Code: https://github.com/htqin/ir-qlora

4-Bit Quantization (INT4)

4-bit quantization is one of the most popular quantization regimes in practical usage. It is far more common to see a 4-bit quantization of an open source model than binary, 2-bit, or 3-bit versions. INT4 allows eight-fold storage compression of 32 bits down to 4 bits, which reduces memory requirements, and can also speed up inference by reducing memory-cache transfer overheads in both CPU and GPU. The 4 bits allow 2^4=16 distinct weights, which is enough for reasonable accuracy compared to the full-precision model. The 4-bit weights also fit cleanly into 8-bit bytes or 32-bit integers, making the unpacking code simple and efficient (see the sketch below).
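
A minimal C++ sketch of that packing, assuming two 4-bit codes per byte and a simple symmetric mapping with an implicit zero-point of 8; the names and the dequantization mapping are illustrative, since real INT4 formats differ in group sizes, scales, and zero-points.

    // Sketch: INT4 storage -- two 4-bit codes ("nibbles") per byte.
    #include <cstdint>

    uint8_t pack_nibbles(uint8_t lo, uint8_t hi) {        // each code in 0..15
        return (uint8_t)((lo & 0x0Fu) | ((hi & 0x0Fu) << 4));
    }

    // Unpack one of the two codes and dequantize it with a per-group scale.
    float dequantize_int4(uint8_t byte, int which, float scale) {
        uint8_t code = (which == 0) ? (byte & 0x0Fu) : (byte >> 4);
        return ((int)code - 8) * scale;    // map 0..15 back to roughly -8..+7
    }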

Research papers on 4-bit quantization:

5-Bit Quantization (INT5)

Research papers on 5-bit quantization:

6-Bit Quantization (INT6)

Research papers on 6-bit quantization:

7-Bit Quantization (INT7)

Research papers on 7-bit quantization:

8-Bit Integer Quantization (INT8)

Research papers on 8-bit quantization:

9-Bit Quantization (INT9)

Research papers on 9-bit quantization:

10-Bit Quantization (INT10)

Research papers on 10-bit quantization:

11-Bit Quantization (INT11)

Research papers on 11-bit quantization:

12-Bit Quantization (INT12)

Research papers on 12-bit quantization:

16-Bit Integer Quantization (INT16)

INT16 quantization is the use of 16-bit integers, so as to use half the space of FP32 weights/activations, while also using integer arithmetic kernels. The pros and cons of integer versus floating-point computations deserve consideration here, since FP16 quantization uses the same 16-bit memory size, but may be more accurate than quantized 16-bit integers.

Research papers on 16-bit integer quantization:

32-Bit Integer Quantization (INT32)

INT32 is not an effective form of "model compression", because it's not compressed at all! The data is no smaller than the FP32 raw weights, although it does allow integer arithmetic instead of floating-point computations. Also closely related is fixed-point quantization, using 32-bit integers and integer arithmetic.
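
A minimal C++ sketch of the fixed-point idea, assuming a Q16.16 format (32-bit integers with the binary point fixed 16 bits from the right); the helper names are illustrative. The product of two Q16.16 numbers is computed in 64 bits and shifted back down, which is what turns floating-point multiplication into integer multiplication.

    // Sketch: Q16.16 fixed-point arithmetic with 32-bit integers.
    #include <cstdint>

    using fix16_t = int32_t;                        // Q16.16 fixed-point value

    fix16_t to_fix16(float x)     { return (fix16_t)(x * 65536.0f); }
    float   from_fix16(fix16_t x) { return (float)x / 65536.0f; }

    fix16_t fix16_mul(fix16_t a, fix16_t b) {
        // 64-bit intermediate avoids overflow; arithmetic right shift assumed.
        return (fix16_t)(((int64_t)a * (int64_t)b) >> 16);
    }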

Research papers on 32-bit integer quantization:

Mixed-Precision Quantization

Research papers on mixed-precision quantization:

  • Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
  • Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov, 2018, Mixed Precision Training of Convolutional Neural Networks using Integer Operations, https://arxiv.org/pdf/1802.00930
  • M. A. Nasution, D. Chahyati and M. I. Fanany, "Faster R-CNN with structured sparsity learning and Ristretto for mobile environment", Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
  • Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer, 2019, HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293–302, https://arxiv.org/abs/1905.03696
  • Yiren Zhou, Seyed-Mohsen Moosavi-Dezfooli, Ngai-Man Cheung, and Pascal Frossard. Adaptive quantization for deep neural network. arXiv preprint arXiv:1712.01048, 2017, https://arxiv.org/abs/1712.01048 (Layerwise different bitwidth quantization.)
  • Sijie Zhao, Tao Yue, and Xuemei Hu. Distributionaware adaptive multi-bit quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9281–9290, 2021, https://ieeexplore.ieee.org/document/9577892, PDF: https://openaccess.thecvf.com/content/CVPR2021/papers/Zhao_Distribution-Aware_Adaptive_Multi-Bit_Quantization_CVPR_2021_paper.pdf
  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer, June 2021, HAWQV3: Dyadic Neural Network Quantization, arXiv preprint arXiv:2011.10680, https://arxiv.org/abs/2011.10680
  • Huanrui Yang, Lin Duan, Yiran Chen, Hai Li, Feb 2021, BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization, arXiv preprint arXiv:2102.10462, https://arxiv.org/abs/2102.10462
  • Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019, HAQ: Hardware-aware automated quantization. In Proceedings of the IEEE conference on computer vision and pattern recognition, https://arxiv.org/abs/1811.08886
  • Zhongnan Qu, Zimu Zhou, Yun Cheng, and Lothar Thiele, June 2020, Adaptive loss-aware quantization for multi-bit networks, In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://arxiv.org/abs/1912.08883
  • Hai Victor Habi, Roy H. Jennings, Arnon Netzer, July 2020, HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs, arXiv preprint arXiv:2007.09952, https://arxiv.org/abs/2007.09952
  • Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
  • Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
  • E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of 2-8 bits, and mixed precision quantization.)
  • A Chauhan, U Tiwari, 2023, Post Training Mixed Precision Quantization of Neural Networks Using First-Order Information, Proceedings of the IEEE/CVF International Conference, https://openaccess.thecvf.com/content/ICCV2023W/RCV/papers/Chauhan_Post_Training_Mixed_Precision_Quantization_of_Neural_Networks_Using_First-Order_ICCVW_2023_paper.pdf
  • Y Shang, Z Yuan, Q Wu, Z Dong, PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)
  • Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park, 24 Jan 2024, OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models, https://arxiv.org/abs/2306.02272 Code: https://github.com/xvyaward/owq (Stores some weights in different quantization levels based on their values.)
  • G Rutishauser, 2024, Agile and Efficient Inference of Quantized Neural Networks, Ph.D. Thesis, ETH Zurich, https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/675547/1/thesis.pdf
  • N Martynov, A Goncharov, G Kumichev, E Egorov, 2024, On the Way to Lossless Compression of Language Transformers: Exploring Cross-Domain Properties of Quantization, https://aclanthology.org/2024.lrec-main.1089.pdf (Quantize 95% of weights to INT8, leaving the remainder at FP32.)
  • M Rakka, ME Fouda, P Khargonekar, F Kurdahi, 29 April 2024, A Review of State-of-the-Art Mixed-Precision Neural Network Frameworks, IEEE Transactions on Pattern Analysis and Machine Intelligence (Early Access), Pages 1 - 20, DOI: 10.1109/TPAMI.2024.3394390, https://doi.org/10.1109/TPAMI.2024.3394390 https://ieeexplore.ieee.org/abstract/document/10509805/
  • Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi, 23 May 2024, SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models, https://arxiv.org/abs/2405.14917 Code: https://github.com/Aaronhuang-778/SliM-LLM
  • Utkarsh Saxena, Kaushik Roy, McQueen: Mixed Precision Quantization of Early Exit Networks, https://papers.bmvc2023.org/0511.pdf (Combination of mixed-precision quantization, with precision specifiable staticly to a layerwise granularity, with early exit dynamic depth optimizations.)
  • JY Jeon, XT Nguyen, S Ryu, HJ Lee, 2024, USDN: A Unified Sample-wise Dynamic Network with Mixed-Precision and Early-Exit, https://openaccess.thecvf.com/content/WACV2024/papers/Jeon_USDN_A_Unified_Sample-Wise_Dynamic_Network_With_Mixed-Precision_and_Early-Exit_WACV_2024_paper.pdf
  • Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
  • Mariam Rakka, Mohammed E. Fouda, Pramod Khargonekar, Fadi Kurdahi, 11 Aug 2022, Mixed-Precision Neural Networks: A Survey, https://arxiv.org/abs/2208.06064
  • Piotr Kluska, Adri´an Castello, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
  • Behnam Ghavami, Amin Kamjoo, Lesley Shannon, Steve Wilton, 3 Apr 2024, DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization, https://arxiv.org/abs/2404.02947
  • Dimitrios Danopoulos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 12 Feb 2024, TransAxx: Efficient Transformers with Approximate Computing, https://arxiv.org/abs/2402.07545 (Using approximations in Vision Transformer architectures.)
  • Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, and Kan Zhou, 2023, SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision https://aclanthology.org/2023.emnlp-industry.13.pdf
  • Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He, and Li Jiang, 2022, DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture, https://mxhx7199.github.io/files/%5BDATE-22%5DDTQAtten_preprint.pdf
  • David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020, Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020, https://arxiv.org/abs/2001.00281 Code: https://github.com/amirgholami/ZeroQ
  • JH Park, JS Choi, JH Ko, 2020, Dual-Precision Deep Neural Network, https://dl.acm.org/doi/abs/10.1145/3430199.3430228 https://arxiv.org/pdf/2009.02191
  • Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
  • Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
  • Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang, 25 Jun 2024, T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge, https://arxiv.org/abs/2407.00088 Code: https://github.com/microsoft/T-MAC (Table lookup for low-bit quantization on CPUs.)
  • Yipin Guo, Yilin Lang, Qinyuan Ren, 3 Jul 2024, GPTQT: Quantize Large Language Models Twice to Push the Efficiency, https://arxiv.org/abs/2407.02891 (Two-phase quantization, first to high-bit, then to binary quantization.)
  • Md Fahim Faysal Khan, May 2024, Constraint Driven Multimodal Edge Intelligence, Ph.D. Thesis, Electrical Engineering and Computer Science, Pennsylvania State University, https://etda.libraries.psu.edu/files/final_submissions/29680 (Layer-specific quantization levels for mixed-precision quantization.)
  • J Wu, M Song, J Zhao, HKH So, 2024, A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference, https://wujiajunic.cn/publication/ipdpsw2024/IPDPSW2024.pdf
  • Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
  • Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • P. -C. Chen, Y. -T. Liu, G. -Y. Zeng and T. -D. Chiueh, 2024, Design and Implementation of an Easy-to-Deploy Energy-Efficient Inference Acceleration System for Multi-Precision Neural Networks, 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, United Arab Emirates, 2024, pp. 587-591, doi: 10.1109/AICAS59952.2024.10595940, https://ieeexplore.ieee.org/document/10595940
  • Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
  • Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
  • Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
  • W Byun, J Woo, S Mukhopadhyay, 2024, Hardware-friendly Hessian-driven Row-wise Quantization and FPGA Acceleration for Transformer-based Models, https://dl.acm.org/doi/pdf/10.1145/3665314.3670806
  • Junfeng Gong, Cheng Liu, Long Cheng, Huawei Li, Xiaowei Li, 17 Jul 2024, MCU-MixQ: A HW/SW Co-optimized Mixed-precision Neural Network Design Framework for MCUs, https://arxiv.org/abs/2407.18267
  • Bernard Ryhede Bengtsson, Joel Bengs, 2024, Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM, MSc Thesis, Lund University, Sweden, https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9174462&fileOId=9174463
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao, 9 Oct 2024, Scaling Laws for Mixed Quantization, https://arxiv.org/abs/2410.06722
  • Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang, 16 Oct 2024, COMET: Towards Partical W4A4KV4 LLMs Serving, https://arxiv.org/abs/2410.12168
  • Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris, 17 Oct 2024, Progressive Mixed-Precision Decoding for Efficient LLM Inference, https://arxiv.org/abs/2410.13461
  • Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
  • Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
  • Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
  • Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo, 6 Nov 2024 (v2), HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference, https://arxiv.org/abs/2411.01433
  • Jiajun Wu, Mo Song, Jingmin Zhao, Yizhao Gao, Jia Li, Hayden Kwok-Hay So, 6 Nov 2024, TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture, https://arxiv.org/abs/2411.03697
  • Qinyang Bao, MPBRQ- A Framework for Mixed-Precision Quantization for Large Language Models, Masters in Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto 2024, https://tspace.library.utoronto.ca/bitstream/1807/141039/1/Bao_Qinyang_202411_MAS_thesis.pdf
  • Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
  • Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah, 18 Nov 2024, BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration, https://arxiv.org/abs/2411.11745
  • Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian, 27 Nov 2024, FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs--Down to 2 Bits! https://arxiv.org/abs/2411.18055
  • Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo, 3 Dec 2024, CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models, https://arxiv.org/abs/2412.03599
  • Yiming Fang, Li Chen, Yunfei Chen, Weidong Wang, Changsheng You, 5 Dec 2024 (v2), Mixed-Precision Quantization: Make the Best Use of Bits Where They Matter Most, https://arxiv.org/abs/2412.03101
  • Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che, 12 Dec 2024, CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs, https://arxiv.org/abs/2412.09282 (Vector quantization of low-bit or 1-bit weight vectors, with additional bits for some channels, analogous to combining mixed-precision quantization and/or weight clustering.)
  • Mukul Lokhande, Gopal Raut, Santosh Kumar Vishvakarma, 16 Dec 2024, Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads, https://arxiv.org/abs/2412.11702
  • Weilun Feng, Haotong Qin, Chuanguang Yang, Zhulin An, Libo Huang, Boyu Diao, Fei Wang, Renshuai Tao, Yongjun Xu, Michele Magno, 16 Dec 2024, MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models, https://arxiv.org/abs/2412.11549
  • Zhen Zheng, Xiaonan Song, Chuanjie Liu, 19 Dec 2024, MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design, https://arxiv.org/abs/2412.14590
  • Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang, 18 Dec 2024, ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals, https://arxiv.org/abs/2412.14363 https://github.com/utkarsh-dmx/project-resq
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • Wonsuk Jang, Thierry Tambe, 2 Jan 2025, BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference, https://arxiv.org/abs/2501.01144 (Per-block granular mixed-precision quantization including FP4.)
  • S. Kim, H. Lee, S. Kim, C. Kim and W. W. Ro, "AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 645-652, doi: 10.1109/ICCD63220.2024.00103. https://ieeexplore.ieee.org/abstract/document/10818069
  • Haoning Xu, Zhaoqing Li, Zengrui Jin, Huimeng Wang, Youjun Chen, Guinan Li, Mengzhe Geng, Shujie Hu, Jiajun Deng, Xunying Liu, 7 Jan 2025, Effective and Efficient Mixed Precision Quantization of Speech Foundation Models, https://arxiv.org/abs/2501.03643
  • A. Xu et al., "GausiQ: Generalized Automatic Hybrid-Precision Quantization for MIMO Detection," in IEEE Wireless Communications Letters, doi: 10.1109/LWC.2024.3509269. https://ieeexplore.ieee.org/abstract/document/10839390
  • Jeongseok Kim, Jemin Lee, Yongin Kwon, Daeyoung Kim, 13 Jan 2025, QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications, https://arxiv.org/abs/2501.07161
  • Michael Wu, Arnab Raha, Deepak A. Mathaikutty, Martin Langhammer, Engin Tunali, 31 Jan 2025, StruM: Structured Mixed Precision for Efficient Deep Learning Hardware Codesign, https://arxiv.org/abs/2501.18953
  • Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin, 9 May 2025, MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design, https://arxiv.org/abs/2505.05799 https://github.com/cat538/MxMoE
  • C Zhang, X Zhu, L Chen, T Yang, E Pan, G Yu, Y Zhao, 2025, Enhancing LLM Inference Performance on ARM CPUs through Software and Hardware Co-optimization Strategies, DOI 10.23919/ICS.2025.3568404, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10994252
  • Raunak Shah, Zhaoheng Li, Yongjoo Park, 7 May 2025, QStore: Quantization-Aware Compressed Model Storage, https://arxiv.org/abs/2505.04081
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/pdf/2507.17417
  • Wang, J., Liu, H., Li, R., Feng, D., Ding, J., Ding, B. (2025). ALMP: Automatic Layer-By-Layer Mixed-Precision Quantization for Large Language Models. In: Huang, DS., Li, B., Chen, H., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2025. Lecture Notes in Computer Science(), vol 15864. Springer, Singapore. https://doi.org/10.1007/978-981-95-0014-7_13 https://link.springer.com/chapter/10.1007/978-981-95-0014-7_13
  • Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí, 13 Jun 2025, The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference, https://arxiv.org/abs/2506.11728
  • Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park, 8 Aug 2025, DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment, https://arxiv.org/abs/2508.06041 (Dynamic layer-wise mixed-precision quantization.)
  • Qingcheng Zhu, Yangyang Ren, Linlin Yang, Mingbao Lin, Yanjing Li, Sheng Xu, Zichao Feng, Haodong Zhu, Yuguang Yang, Juan Zhang, Runqi Wang, Baochang Zhang, 24 Jul 2025, Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method, https://arxiv.org/abs/2507.18073
  • Binrui Shen, Yuan Liang, Shengxin Zhu, 26 Jul 2025, FRAM: Frobenius-Regularized Assignment Matching with Mixed-Precision Computing, https://arxiv.org/abs/2508.00887
  • Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, Xindian Ma, 4 Aug 2025, MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models, https://arxiv.org/abs/2508.02343
  • Haidong Kang, Lianbo Ma, Guo Yu and Shangce Gao, 5 Aug 2025, Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization, https://arxiv.org/abs/2508.03002
  • Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri, 6 Aug 2025, InfoQ: Mixed-Precision Quantization via Global Information Flow, https://arxiv.org/abs/2508.04753
  • Zijun Jiang and Yangdi Lyu, 13 Aug 2025, MiCo: End-to-End Mixed Precision Neural Network Co-Exploration Framework for Edge AI, https://arxiv.org/abs/2508.09500
  • Tejas Chaudhari, Akarsh J., Tanushree Dewangan, Mukul Lokhande, and Santosh Kumar Vishvakarma, 18 Aug 2025, XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads, https://arxiv.org/abs/2508.13049

Logarithmic Bitshift Quantization (Power-of-Two Quantization)

The idea with bitshift quantization is to use power-of-two integer weights and bitshift operations rather than integer multiplications. There is a significant trade-off in model accuracy, since the number of distinct weights is greatly reduced. This is a well-known and active area of research, with the earliest papers dating back to 1992 and 1993. However, software algorithms using bitshifts seem unlikely to outperform hardware-accelerated integer multiplication, and hardware support for shift-based arithmetic is limited. Extending hardware accelerators to perform bitshifting, or highest-power-of-two approximate multiplication, directly in hardware would presumably require fewer operations and less computing power (with reduced heat generation), and seems an open area for further research. Note that the highest set bit of an integer can be calculated efficiently in software, for example by repeatedly clearing the lowest set bit with Brian Kernighan's bit trick (1988) until only the highest bit remains.
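
To illustrate, here is a minimal C++ sketch of a dot product over power-of-two quantized weights, where each weight is stored as a shift amount plus a sign; the struct layout and function names are illustrative assumptions, not taken from any particular inference engine.

    #include <cstdint>
    #include <cstdlib>

    // Illustrative layout: each quantized weight is a power of two, stored as
    // a shift amount (the exponent) and a sign; the scale factor is separate.
    struct Pow2Weight {
        uint8_t shift;   // weight magnitude is 2^shift (before rescaling)
        int8_t  sign;    // +1 or -1
    };

    // Dot product of power-of-two weights against integer activations:
    // each multiplication becomes a left bitshift of the activation magnitude.
    int64_t pow2_dot_product(const Pow2Weight* w, const int32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            int64_t mag = static_cast<int64_t>(std::llabs(x[i])) << w[i].shift;
            int sign = ((x[i] < 0) ? -1 : 1) * w[i].sign;
            acc += sign * mag;
        }
        return acc;  // the caller rescales the accumulator back to floating point
    }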

Sum of Two Bitshifts Quantization

The downside of logarithmic quantization is that there are relatively few unique weights, limiting precision, even if the number of bits used is maximized via a large scaling factor. An alternative implementation uses two bitshift operations plus an addition (i.e., "shift-and-add" operations). In this way, the two highest set bits of the quantized integer weight are used, which improves model precision at the cost of extra computation. This assumes that two integer shifts and an integer addition cost less than a single integer multiplication. An early mention of this "sums of powers of two" method is in Marchesi et al. (1993).
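
A minimal C++ sketch of the idea, assuming a positive integer weight and keeping only its two highest set bits (the helper and function names are illustrative assumptions):

    #include <cstdint>
    #include <cstdlib>

    // Return the position of the highest set bit (assumes v != 0).
    static int highest_bit(uint32_t v) {
        int pos = 0;
        while (v >>= 1) ++pos;
        return pos;
    }

    // Approximate x * w using only the two highest set bits of the positive
    // integer weight w: two shifts and one addition replace the multiplication.
    int64_t two_shift_multiply(int32_t x, uint32_t w) {
        if (w == 0) return 0;
        int64_t xmag = std::llabs(x);                 // shift the magnitude, reapply sign later
        int b1 = highest_bit(w);
        int64_t result = xmag << b1;                  // highest power-of-two component
        uint32_t rest = w & ~(1u << b1);              // clear the highest set bit
        if (rest != 0) {
            result += xmag << highest_bit(rest);      // second-highest component
        }
        return (x < 0) ? -result : result;            // exact when w has at most two set bits
    }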

Arbitrary Base Logarithmic Quantization

The main use of logarithms in quantization is power-of-two logarithmic quantization. This is efficient, allowing bitshifting, but its lack of accuracy is a known problem. There is also some research on bases other than two, or indeed arbitrary bases, to try to more accurately map weights to a logarithmic format:

  • S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, 2018, Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base, In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). 1–8.

Integer Division for Quantization?

What about using integer division instead of multiplications in quantization? After all, multiplication by a small weight like 0.003 could instead be a division by 333. Is this an avenue for optimization? It seems unlikely, since division is usually much slower than multiplication, often by an order-of-magnitude.

Integer division by a power of two can be implemented efficiently as a right bitshift, which is effectively the mirror image of the left-bitshift quantization above. Dyadic numbers are an interesting idea along these lines: their implementation involves division by a power of two, usually performed via a right bitshift.

Note that division is often used in scaling operations, particularly in de-quantization. However, in such cases, it isn't the bottleneck operation, as scaling or de-quantization is performed an order-of-magnitude fewer times.

No papers were found specifically on "division quantization", although there is some general research on division arithmetic algorithms.

Dyadic Quantization

Dyadic numbers are rational numbers that are stored and operated on as a pair of integers: the numerator is an arbitrary integer, while the denominator is restricted to a power of two. This allows dyadic numbers to represent a wide range of weights, including fairly high precision in fractional weights, while still permitting integer-only arithmetic (the division becomes a right bitshift).
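
A minimal C++ sketch of multiplying an integer activation by a dyadic weight, assuming an arithmetic right shift for negative intermediates (true on mainstream compilers and guaranteed from C++20); the struct and names are illustrative assumptions:

    #include <cstdint>

    // A dyadic weight: numerator / 2^shift, i.e., the denominator is a power of two.
    struct Dyadic {
        int32_t numerator;
        uint8_t shift;    // denominator is 1 << shift
    };

    // Multiply an integer activation by a dyadic weight using only an integer
    // multiply and a right bitshift (no floating point, no division).
    int64_t dyadic_multiply(int32_t x, Dyadic d) {
        int64_t product = static_cast<int64_t>(x) * d.numerator;
        int64_t rounding = (d.shift > 0) ? (int64_t{1} << (d.shift - 1)) : 0;  // round-to-nearest
        return (product + rounding) >> d.shift;   // right shift replaces the division
    }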

  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
  • Renato J. Cintra, Stefan Duffner, Christophe Garcia, André Leite, Low-Complexity Approximate Convolutional Neural Networks, IEEE Transactions on Neural Networks and Learning Systems, Volume 29, Issue 12, December 2018, pp.5981-5992, https://ieeexplore.ieee.org/abstract/document/8334697
  • Fernanda Botelho, Max Garzon, On Dynamical Properties of Neural Networks, Complex Systems 5 (1991), p.401-413, https://wpmedia.wolfram.com/uploads/sites/13/2018/02/05-4-4.pdf
  • David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

Fixed-Point Quantization

Fixed-point numbers are an alternative way of representing fractional numbers that differs from floating-point. For example, a dollars-and-cents amount such as $12.34 can be represented as the integer "1234", which is a fixed-point number with a scaling factor of 100.

In practice, we can convert any fractional number to an integer by multiplying by a scaling factor (and truncating any lower-order fractional digits). Doing this is "fixed-point quantization".

The main advantage is integer arithmetic. Using fixed-point quantization changes the vector dot product into integer multiplications and integer additions (plus a bitshift to rescale). See fixed-point number system.

Floating-point numbers have a per-number exponent. Fixed-point numbers are like having a single global exponent value shared by all numbers (and not actually stored anywhere). The intermediate method is "block floating-point", where blocks of numbers (i.e., vectors) have a per-block or per-vector exponent value. Integer-only arithmetic is possible with both fixed-point and block floating-point quantization.
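
A minimal C++ sketch of a fixed-point dot product with a single global scale of 2^8 (the scale and the names are illustrative assumptions; real engines choose per-tensor or per-block scales, and the right shift of a negative sum assumes an arithmetic shift):

    #include <cmath>
    #include <cstdint>
    #include <cstdlib>

    constexpr int FRAC_BITS = 8;   // global scale factor is 2^8 = 256 (illustrative)

    int32_t to_fixed(float x) {
        return static_cast<int32_t>(std::lround(x * (1 << FRAC_BITS)));
    }

    float from_fixed(int64_t x) {
        return static_cast<float>(x) / (1 << FRAC_BITS);
    }

    // Fixed-point dot product: integer multiply-accumulate, then one bitshift
    // to remove the extra scale factor introduced by the multiplications.
    float fixed_dot_product(const int32_t* a, const int32_t* b, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += static_cast<int64_t>(a[i]) * b[i];   // scaled by 2^(2*FRAC_BITS)
        }
        return from_fixed(acc >> FRAC_BITS);            // shift back down to 2^FRAC_BITS
    }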

Quantized Model Slowdown

Quantized model is slow? Although a quantized model should run inference queries more efficiently, there are various reasons why a quantized version of a model may run slower than the non-quantized FP32 model. Some of the possibilities for a slow quantized model include:

  • Quantized kernel is not fully specialized for that bit size (and it's doing something slow like converting back to FP32 or using bit unpacking prior to integer arithmetic).
  • Kernel fusion of your quantized kernel is lacking for a followup component, such as activation functions or Softmax (whereas this was fused in the FP32 version).
  • Sometimes GPUs can be faster on different sizes (e.g., FP16 quantization may be faster than INT8 quantization, even though the former is twice the byte size).
  • Your quantized version is "fake quantization" and it's doing too many conversions back and forth between the integer and FP32 data types.
  • Double quantization, where you have quantized an already-quantized model.
  • The CUDA C++ optimizer is confused by something in the integer kernel code and does less auto-optimization.

Generally, it means there's something wrong with the C++ kernel code for the quantized versions.

Stochastic Quantization

Stochastic quantization is a research area that intentionally inserts some randomness or statistical variation into the quantization algorithm, which may result in higher accuracy (for example, via stochastic rounding, where rounding errors cancel out on average). This idea can be used in conjunction with Post-Training Quantization (PTQ) or with Quantization-Aware Training (QAT).
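
For example, here is a minimal C++ sketch of stochastic rounding to INT8, where a value is rounded up or down with probability proportional to its fractional part so that the rounding error is zero in expectation (the names and the symmetric scaling are illustrative assumptions):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <random>

    // Stochastically round value/scale to an INT8 level.
    int8_t stochastic_quantize(float value, float scale, std::mt19937& rng) {
        float scaled = value / scale;
        float low = std::floor(scaled);
        float frac = scaled - low;                      // distance above the lower level
        std::uniform_real_distribution<float> uni(0.0f, 1.0f);
        float rounded = (uni(rng) < frac) ? low + 1.0f : low;
        rounded = std::max(-128.0f, std::min(127.0f, rounded));   // clamp to INT8 range
        return static_cast<int8_t>(rounded);
    }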

Weight Clustering

Weight clustering is conceptually like pruning and quantization combined, and is sometimes called "cluster-based quantization". A group of similar weights is merged so that all of them share exactly the same value, which can then be stored once in a small codebook and referenced by index. Hashing has also been used to group weights for clustering.
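
A minimal C++ sketch of cluster-based quantization: each weight is replaced by the index of its nearest centroid in a small codebook (the centroids would normally be learned, e.g., by k-means; here they are assumed to be given, and all names are illustrative):

    #include <cmath>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    struct ClusteredWeights {
        std::vector<float>   codebook;   // e.g., 16 centroids for 4-bit indices
        std::vector<uint8_t> indices;    // one codebook index per original weight
    };

    ClusteredWeights cluster_weights(const float* w, size_t n,
                                     const std::vector<float>& centroids) {
        ClusteredWeights out;
        out.codebook = centroids;
        out.indices.resize(n);
        for (size_t i = 0; i < n; ++i) {
            size_t best = 0;
            float best_dist = std::fabs(w[i] - centroids[0]);
            for (size_t c = 1; c < centroids.size(); ++c) {
                float d = std::fabs(w[i] - centroids[c]);
                if (d < best_dist) { best_dist = d; best = c; }
            }
            out.indices[i] = static_cast<uint8_t>(best);   // similar weights now share a value
        }
        return out;
    }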

Outliers

The issue of "outliers" refers to weights or activations that are largers (or smaller) than the vast majority of other values. There are various parts of a Transformer where the output results can depend inordinately on a small subset of values, notably in the attention computation (and hence also the KV cache).

Research papers that discuss the issue of outliers include:

  • Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Robin Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu, 4 Apr 2024, Outlier-Efficient Hopfield Layers for Large Transformer-Based Models, https://arxiv.org/abs/2404.03828 Code: https://github.com/MAGICS-LAB/OutEffHop (Addresses outliers in quantization with a modified Softmax and an advanced Hopfield memory model.)
  • Xing Hu, Yuan Chen, Dawei Yang, Sifan Zhou, Zhihang Yuan, Jiangyong Yu, Chen Xu, 28 May 2024, I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models, https://arxiv.org/abs/2405.17849 Code: https://anonymous.4open.science/r/I-LLM-F242/
  • Wanyun Cui, Qianle Wang, 3 Apr 2024, Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models, https://arxiv.org/abs/2404.02837 (Examines which weights most affect inference, including outlier values.)
  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
  • Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
  • Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Lianwei Yang, Haisong Gong, 6 Aug 2024, DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers, https://arxiv.org/abs/2408.03291
  • Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
  • Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
  • Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 4 Jun 2024 (v2), APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, ICML 2024 Oral, https://arxiv.org/abs/2401.12200 https://github.com/ROIM1998/APT
  • Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
  • Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou, 30 Sep 2024, Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference, https://arxiv.org/abs/2409.20361 (Handling of outliers in INT4 quantization.)
  • Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo, 7 Oct 2024, PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs, https://arxiv.org/abs/2410.05265 https://github.com/ChenMnZ/PrefixQuant (Puts outliers into the KV cache as a prefix.)
  • Akshat Ramachandran, Souvik Kundu, Tushar Krishna, 12 Nov 2024 (v2), MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization, https://arxiv.org/abs/2411.05282
  • Dongwei Wang, Huanrui Yang, 8 Dec 2024, Taming Sensitive Weights : Noise Perturbation Fine-tuning for Robust LLM Quantization, https://arxiv.org/abs/2412.06858
  • H Kang, Q Zhang, S Kundu, G Jeong, Z Liu, T Krishna, Dec 2024, GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference, https://neurips2024-enlsp.github.io/papers/paper_3.pdf (Use extra information in low-rank and sparse matrices to efficiently alleviate lossy KV cache quantization issues such as outliers.)
  • Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung Hsu, Wei-Fen Lin, 11 Mar 2024, QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning, https://arxiv.org/abs/2403.06497 (Outlier-correcting fine-tuning and quantization method.)
  • Kyle Wiggers, December 23, 2024, A popular technique to make AI more efficient has drawbacks, https://techcrunch.com/2024/12/23/a-popular-technique-to-make-ai-more-efficient-has-drawbacks/
  • Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
  • S. Kim, H. Lee, S. Kim, C. Kim and W. W. Ro, "AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 645-652, doi: 10.1109/ICCD63220.2024.00103. https://ieeexplore.ieee.org/abstract/document/10818069
  • Xuerui Qiu, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Malu Zhang, Haizhou Li, 23 Jan 2025, Quantized Spike-driven Transformer, https://arxiv.org/abs/2501.13492 https://github.com/bollossom/QSD-Transformer
  • Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MHLA), and multi-token decoding.)
  • Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383 (INT2 KV caching with special handling of outliers, RoPE, and attention sinks, and the resulting architecture works in Chain-of-Thought.)
  • Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang, 3 Feb 2025, Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding, https://arxiv.org/abs/2502.01563 https://github.com/MingyuJ666/Rope_with_LLM (Finds that outliers in attention are important, and arise by being generated by RoPE.)
  • Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan, 1 Feb 2025, PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration, https://arxiv.org/abs/2502.00527
  • G. Wang, S. Cai, W. Li, D. Lyu and G. He, "OFQ-LLM: Outlier-Flexing Quantization for Efficient Low-Bit Large Language Model Acceleration," in IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2025.3547732. https://ieeexplore.ieee.org/abstract/document/10924797
  • Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang, 16 May 2025, Accurate KV Cache Quantization with Outlier Tokens Tracing, https://arxiv.org/abs/2505.10938
  • P Czakó, G Kertész, S Szénási, 2025, Addressing Activation Outliers in LLMs: A Systematic Review of Post-Training Quantization Techniques, IEEE Access, 2025, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10994764
  • M. Seo, J. Hyun, S. Jeong, X. T. Nguyen, H. -J. Lee and H. Lee, "OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems," in IEEE Computer Architecture Letters, doi: 10.1109/LCA.2025.3567844, https://ieeexplore.ieee.org/abstract/document/10990150
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/pdf/2507.17417
  • Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim, 17 Jul 2025, DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization, https://arxiv.org/abs/2507.12933
  • Joonsung Kang, 23 Jul 2025, Doubly robust outlier resistant inference on causal treatment effect, https://arxiv.org/abs/2507.17439
  • Ivan Letteri, 20 Jul 2025, A Comparative Analysis of Statistical and Machine Learning Models for Outlier Detection in Bitcoin Limit Order Books, https://arxiv.org/abs/2507.14960
  • Arend Hintze and Clifford Bohm, 11 Aug 2025, Rethinking Self-Replication: Detecting Distributed Selfhood in the Outlier Cellular Automaton, https://arxiv.org/abs/2508.08047
  • Tanvir Islam, 26 Jul 2025, Extended Histogram-based Outlier Score (EHBOS), https://arxiv.org/abs/2502.05719
  • Marcello D'Orazio, 28 Jul 2025, An empirical comparison of some outlier detection methods with longitudinal data, https://arxiv.org/abs/2507.21203
  • Katharine M. Clark and Paul D. McNicholas, 31 Jul 2025, funOCLUST: Clustering Functional Data with Outliers, https://arxiv.org/abs/2508.00110
  • Jiaxi Li, Lu Yin, Xilu Wang, 4 Aug 2025, OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework, https://arxiv.org/abs/2411.07711
  • Muhammad Rajabinasab, Farhad Pakdaman, Moncef Gabbouj, Peter Schneider-Kamp, Arthur Zimek, 18 Aug 2025, Randomized PCA Forest for Outlier Detection, https://arxiv.org/abs/2508.12776
  • Mingyu Kim, Daniel Stilwell, Jorge Jimenez, 18 Aug 2025, Outlier Detection of Poisson-Distributed Targets Using a Seabed Sensor Network, https://arxiv.org/abs/2508.13099
  • Kévin Ducharlet, Louise Travé-Massuyès, Jean-Bernard Lasserre, Marie-Véronique Le Lann, Youssef Miloudi, 13 Aug 2025, Leveraging the Christoffel Function for Outlier Detection in Data Streams, https://arxiv.org/abs/2508.16617
  • Sunwoo Kim, 17 Aug 2025, Deep Learning and Matrix Completion-aided IoT Network Localization in the Outlier Scenarios, https://arxiv.org/abs/2508.18225
  • Ryan Faulkner, Ian Reid, Simon Ratcliffe, Tat-Jun Chin, 25 Aug 2025, Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes, https://arxiv.org/abs/2508.17634
  • Paul Fogel (1), Christophe Geissler (1), George Luta (2) ((1) Data Services, Forvis Mazars, Levallois, France, (2) Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC, USA), 22 Aug 2025, The Target Polish: A New Approach to Outlier-Resistant Non-Negative Matrix Factorization, https://arxiv.org/abs/2507.10484

Activation Quantization

Quantizing the dynamically computed activations, in addition to the static weights, is well-known and is used in almost all modern quantized inference schemes. Because activations are only known at runtime, their quantization parameters are either computed dynamically during inference or calibrated in advance.
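
A minimal C++ sketch of dynamic, symmetric per-tensor INT8 activation quantization, where the scale is computed at runtime from the current activations (the names and the symmetric scheme are illustrative assumptions):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdlib>

    // Compute a symmetric scale from the largest absolute activation value.
    float compute_activation_scale(const float* act, size_t n) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            max_abs = std::max(max_abs, std::fabs(act[i]));
        }
        return (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    }

    // Quantize the activations to INT8 using that scale (no zero-point).
    void quantize_activations(const float* act, int8_t* out, size_t n, float scale) {
        for (size_t i = 0; i < n; ++i) {
            out[i] = static_cast<int8_t>(std::lround(act[i] / scale));
        }
    }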

Research papers on activation quantization:

  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, Jul 2018, PACT: Parameterized Clipping Activation for Quantized Neural Networks, https://arxiv.org/abs/1805.06085
  • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 10 May 2024 (v2), QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, https://arxiv.org/abs/2405.04532 Code: https://github.com/mit-han-lab/qserve
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
  • Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
  • A. Jantsch et al., "Special Session: Estimation and Optimization of DNNs for Embedded Platforms," 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Raleigh, NC, USA, 2024, pp. 21-30, doi: 10.1109/CODES-ISSS60120.2024.00013. https://ieeexplore.ieee.org/abstract/document/10740783
  • Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8_22 https://link.springer.com/chapter/10.1007/978-3-031-78169-8_22
  • Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 18 Jul 2024 (v5), AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, https://arxiv.org/abs/2306.00978 https://github.com/mit-han-lab/llm-awq

Vector Quantization

Vector quantization is a longstanding ML technique that pre-dates all of the Transformer work, so there are many early papers on the topic. Vector quantization is related to other vector methods such as nearest-neighbor search, as used in the analysis of embedding vectors and semantic similarity, amongst many other applications. The basic idea is to replace each small group (vector) of values with the index of its nearest codeword in a codebook, as sketched below.
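
A minimal C++ sketch of the core vector quantization step (the sub-vector size, codebook layout, and names are illustrative assumptions; the codebook itself would typically be learned, e.g., by k-means):

    #include <cstdlib>
    #include <limits>
    #include <vector>

    constexpr size_t D = 4;   // sub-vector dimension (illustrative)

    // Return the codebook index whose codeword is closest (squared Euclidean
    // distance) to the given D-dimensional sub-vector of weights.
    size_t nearest_codeword(const float* vec,
                            const std::vector<std::vector<float>>& codebook) {
        size_t best = 0;
        float best_dist = std::numeric_limits<float>::max();
        for (size_t c = 0; c < codebook.size(); ++c) {
            float dist = 0.0f;
            for (size_t j = 0; j < D; ++j) {
                float diff = vec[j] - codebook[c][j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        return best;   // store this index instead of the D original weights
    }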

Research papers on vector quantization:

Quantization Granularity

Quantization granularity refers to which parts, segments, or structures of the weights (or activations) share the same quantization parameters. For example, granularity levels may be (see also the per-tensor versus per-channel sketch after this list):

  • Layerwise quantization
  • Vector quantization
  • Block quantization
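
As a concrete illustration of granularity, here is a C++ sketch contrasting a single per-tensor scale with per-channel scales for a row-major weight matrix (one row per output channel); the layout and names are illustrative assumptions:

    #include <algorithm>
    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // Per-tensor granularity: a single symmetric scale shared by the whole matrix.
    float per_tensor_scale(const float* w, size_t rows, size_t cols) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < rows * cols; ++i)
            max_abs = std::max(max_abs, std::fabs(w[i]));
        return (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    }

    // Per-channel granularity: one scale per output channel (row), tracking each
    // channel's range more tightly at the cost of storing more scales.
    std::vector<float> per_channel_scales(const float* w, size_t rows, size_t cols) {
        std::vector<float> scales(rows);
        for (size_t r = 0; r < rows; ++r) {
            float max_abs = 0.0f;
            for (size_t c = 0; c < cols; ++c)
                max_abs = std::max(max_abs, std::fabs(w[r * cols + c]));
            scales[r] = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        }
        return scales;
    }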

Research papers on granularity of quantization include:

  • Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, Ed H. Chi, 25 Aug 2020 (v2), Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, https://arxiv.org/abs/2002.08530
  • Lianwei Yang, Zhikai Li, Junrui Xiao, Haisong Gong, Qingyi Gu, 13 Jun 2024, MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction, https://arxiv.org/abs/2406.09229
  • Minghai Qin, 27 Aug 2024, The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study, https://arxiv.org/abs/2408.15301
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
  • S. Kim, H. Lee, S. Kim, C. Kim and W. W. Ro, "AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 645-652, doi: 10.1109/ICCD63220.2024.00103. https://ieeexplore.ieee.org/abstract/document/10818069
  • M Raji, AG Ahsaei, K Soroush, B Ghavami, Jan 2025, Progressive Bitwidth Assignment Approaches for Efficient Capsule Networks Quantization, IEEE Access, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10854429
  • Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891

Layerwise Quantization

Layerwise quantization is quantization done at the granularity of layers, so that each layer can have its own quantization parameters. This can be a special case of mixed-precision quantization (i.e., per-layer precision quantization), but it is also possible to use the same precision in every layer while still using different quantization parameters per layer, as in the minimal sketch below.
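
A minimal C++ sketch of what per-layer quantization parameters might look like (the field names and the particular choice of parameters are illustrative assumptions, not any framework's API):

    #include <vector>

    // Each layer carries its own quantization parameters; for mixed-precision
    // variants, each layer may also carry its own bitwidth.
    struct LayerQuantParams {
        int   bit_width;    // e.g., 8 for most layers, 4 for less sensitive ones
        float scale;        // per-layer scaling factor
        int   zero_point;   // per-layer zero-point (asymmetric quantization)
    };

    struct QuantizedModelConfig {
        std::vector<LayerQuantParams> layers;   // one entry per layer
    };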

Research papers on layerwise quantization:

Blockwise Quantization

Blockwise quantization is a fine-grained type of quantization where each "block" of data (e.g., a group of 32 or 64 consecutive values) has its own quantization parameters, as in the sketch below.
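
A minimal C++ sketch of block-wise symmetric INT8 quantization, with one scale per block of consecutive values (the block size and names are illustrative assumptions):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    constexpr size_t BLOCK_SIZE = 64;   // illustrative block size

    void quantize_blockwise(const float* x, size_t n,
                            std::vector<int8_t>& q, std::vector<float>& block_scales) {
        q.resize(n);
        block_scales.clear();
        for (size_t start = 0; start < n; start += BLOCK_SIZE) {
            size_t end = std::min(start + BLOCK_SIZE, n);
            float max_abs = 0.0f;
            for (size_t i = start; i < end; ++i)
                max_abs = std::max(max_abs, std::fabs(x[i]));
            float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
            block_scales.push_back(scale);    // one scale per block
            for (size_t i = start; i < end; ++i)
                q[i] = static_cast<int8_t>(std::lround(x[i] / scale));
        }
    }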

  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
  • Yanshu Wang, Wenyang He, Tong Yang, 24 May 2024, Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information, https://arxiv.org/abs/2405.17470
  • Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
  • Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
  • Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
  • Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024 (v2), Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization, https://arxiv.org/abs/2309.11856
  • Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, 20 Jun 2022 (v2), 8-bit Optimizers via Block-wise Quantization, https://arxiv.org/abs/2110.02861
  • Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
  • W Byun, J Woo, S Mukhopadhyay, 2024, Hardware-friendly Hessian-driven Row-wise Quantization and FPGA Acceleration for Transformer-based Models, https://dl.acm.org/doi/pdf/10.1145/3665314.3670806
  • G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, Zero++: Extremely efficient collective communication for giant model training, arXiv preprint arXiv:2306.10209, 2023. https://arxiv.org/abs/2306.10209
  • David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Alireza Khodamoradi, Kristof Denolf, Eric Dellinger, 15 Oct 2024, Error Diffusion: Post Training Quantization with Block-Scaled Number Formats for Neural Networks, https://arxiv.org/abs/2410.11203 https://github.com/ROCm/tensorcast
  • Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah, 18 Nov 2024, BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration, https://arxiv.org/abs/2411.11745
  • Wonsuk Jang, Thierry Tambe, 2 Jan 2025, BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference, https://arxiv.org/abs/2501.01144 (Per-block granular mixed-precision quantization including FP4.)
  • A. Xu et al., "GausiQ: Generalized Automatic Hybrid-Precision Quantization for MIMO Detection," in IEEE Wireless Communications Letters, doi: 10.1109/LWC.2024.3509269. https://ieeexplore.ieee.org/abstract/document/10839390
  • M Raji, AG Ahsaei, K Soroush, B Ghavami, Jan 2025, Progressive Bitwidth Assignment Approaches for Efficient Capsule Networks Quantization, IEEE Access, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10854429
  • Michael Wu, Arnab Raha, Deepak A. Mathaikutty, Martin Langhammer, Engin Tunali, 31 Jan 2025, StruM: Structured Mixed Precision for Efficient Deep Learning Hardware Codesign, https://arxiv.org/abs/2501.18953
  • Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng, 26 Feb 2025, M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type, https://arxiv.org/abs/2502.18755

Quantization-Aware Training (QAT)

A small selection of papers on QAT:

  • Haocheng Xi, Yuxiang Chen, Kang Zhao, Kaijun Zheng, Jianfei Chen, Jun Zhu, 19 Mar 2024, Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 (Quantization during pre-training using INT8 quantization and low-granularity per-block quantization.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • Saleh Ashkboos, Bram Verhoef, Torsten Hoefler, Evangelos Eleftheriou, Martino Dazzi, 17 Nov 2024, EfQAT: An Efficient Framework for Quantization-Aware Training, https://arxiv.org/abs/2411.11038
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen, 12 Mar 2025 (v2), Accurate INT8 Training Through Dynamic Block-Level Fallback, https://arxiv.org/abs/2503.08040

Post-Training Quantization (PTQ)

A brief selection of papers on PTQ:

  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Yipin Guo, Yilin Lang, Qinyuan Ren, 3 Jul 2024, GPTQT: Quantize Large Language Models Twice to Push the Efficiency, https://arxiv.org/abs/2407.02891 (Two-phase quantization, first to high-bit, then to binary quantization.)
  • Lianwei Yang, Haisong Gong, 6 Aug 2024, DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers, https://arxiv.org/abs/2408.03291
  • Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab, 25 Sep 2024, Accumulator-Aware Post-Training Quantization, https://arxiv.org/abs/2409.17092
  • Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • H. M. Jeddi, M. Grailoo and J. Nunez-Yanez, "Leveraging Dynamic Range Analysis for Efficient Post-Training Quantization in Graph Convolutional Networks," 2024 IEEE Nordic Circuits and Systems Conference (NorCAS), Lund, Sweden, 2024, pp. 1-7, doi: 10.1109/NorCAS64408.2024.10752486. https://ieeexplore.ieee.org/abstract/document/10752486
  • Donghyeon Yi, Seoyoung Lee, Jongho Kim, Junyoung Kim, Sohmyung Ha, Ik Joon Chang, Minkyu Je, 22 Nov 2024, FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration, https://arxiv.org/abs/2411.14733
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Deus Ex Machina, Dec 2024, Overview of Post-training Quantization and Examples of Algorithms and Implementations, https://deus-ex-machina-ism.com/?p=62443
  • Tollef Emil Jørgensen, 13 May 2025, Resource-Efficient Language Models: Quantization for Fast and Accessible Inference, https://arxiv.org/abs/2505.08620

KV Caching and Quantization

There is a lot of analogous research on applying quantization to the KV cache data. The aims are similar to weight and activation quantization: smaller data sizes mean lower memory usage and tighter arithmetic kernels. Read more about these KV cache research areas.
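
As a rough C++ sketch (illustrative only, not any engine's actual layout), a KV cache entry can be quantized to INT8 when it is written and dequantized when it is read back for the attention computation:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdlib>

    constexpr size_t HEAD_DIM = 128;   // illustrative head dimension

    struct QuantizedKV {
        int8_t data[HEAD_DIM];   // INT8 key or value vector
        float  scale;            // per-vector scale stored alongside
    };

    // Quantize a key/value vector when appending it to the KV cache (dim <= HEAD_DIM).
    void kv_cache_append(const float* kv, QuantizedKV& slot, size_t dim) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < dim; ++i)
            max_abs = std::max(max_abs, std::fabs(kv[i]));
        slot.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        for (size_t i = 0; i < dim; ++i)
            slot.data[i] = static_cast<int8_t>(std::lround(kv[i] / slot.scale));
    }

    // Dequantize the vector when it is read back for attention.
    void kv_cache_read(const QuantizedKV& slot, float* out, size_t dim) {
        for (size_t i = 0; i < dim; ++i)
            out[i] = slot.data[i] * slot.scale;
    }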

Dequantization

Dequantization is the reverse mapping from quantized integers back to floating-point weights. In practice, the dequantized result loses precision compared to the original values that were quantized. Dequantization is not often considered in detail by papers on quantization, but there is still some research specific to dequantization issues (a minimal code sketch also follows the reference list below):

  • Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, July 2024, Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024,Santa Clara, CA, USA, 978-1-939133-41-0, https://www.usenix.org/conference/atc24/presentation/xia https://www.usenix.org/system/files/atc24-xia.pdf
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • A. Jantsch et al., "Special Session: Estimation and Optimization of DNNs for Embedded Platforms," 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Raleigh, NC, USA, 2024, pp. 21-30, doi: 10.1109/CODES-ISSS60120.2024.00013. https://ieeexplore.ieee.org/abstract/document/10740783
  • Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
  • Zhang, Y., Lu, L., Zhao, R. et al. An efficient quantized GEMV implementation for large language models inference with matrix core. J Supercomput 81, 496 (2025). https://doi.org/10.1007/s11227-025-06993-6 https://link.springer.com/article/10.1007/s11227-025-06993-6
  • C Zhang, X Zhu, L Chen, T Yang, E Pan, G Yu, Y Zhao, 2025, Enhancing LLM Inference Performance on ARM CPUs through Software and Hardware Co-optimization Strategies, DOI 10.23919/ICS.2025.3568404, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10994252
  • Ching-Yi Lin, Sahil Shah, 4 Aug 2025 (v2), Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware, https://arxiv.org/abs/2504.18547 (Delayed dequantization until after matrix multiplication.)
  • Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello and Manuel Roveri, 7 Aug 2025, DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic, https://arxiv.org/abs/2508.09176
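As mentioned above, the example below is a hedged sketch in C++ (with made-up values and illustrative names, not code from any of the papers listed) that quantizes a few weights to INT8 and then dequantizes them, printing the residual error that the reverse mapping cannot recover:

    // Sketch: symmetric INT8 quantize/dequantize round trip showing precision loss.
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main() {
        const float w[4] = {0.8123f, -0.3371f, 0.0042f, -0.9999f};  // original weights
        // Quantize: choose a scale that maps the largest magnitude onto 127.
        float max_abs = 0.0f;
        for (float v : w) max_abs = std::fmax(max_abs, std::fabs(v));
        const float scale = max_abs / 127.0f;
        int8_t q[4];
        for (int i = 0; i < 4; ++i)
            q[i] = static_cast<int8_t>(std::lround(w[i] / scale));
        // Dequantize: multiply back by the scale; the rounding error remains.
        for (int i = 0; i < 4; ++i) {
            float deq = q[i] * scale;
            std::printf("w=%+.4f  q=%4d  dequantized=%+.4f  error=%+.5f\n",
                        w[i], q[i], deq, deq - w[i]);
        }
        return 0;
    }

The printed error column is the precision lost during quantization; no choice of dequantization can recover it.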

General Research Papers on Quantization

Papers on the general theory of quantization, or specific works with relevance to the overall theoretical basis of using quantization for model compression:

  • Frantar, E.; Ashkboos, S.; Hoefler, T.; and Alistarh, D. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 https://arxiv.org/abs/2210.17323, Code: https://github.com/IST-DASLab/gptq
  • L Li, Q Li, B Zhang, X Chu, 2023, Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, arXiv preprint arXiv:2309.02784, https://arxiv.org/pdf/2309.02784.pdf (Novel quantization optimization strategy based on tweaking LayerNorm weights.)
  • O Gordon, HV Habi, A Netzer, 2023, EPTQ: Enhanced Post-Training Quantization via Label-Free Hessian, arXiv preprint arXiv:2309.11531, https://arxiv.org/pdf/2309.11531.pdf, Code: https://github.com/sony/model_optimization
  • J Liu, R Gong, X Wei, Z Dong, J Cai, B Zhuang, Oct 2023, QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models, arXiv preprint arXiv:2310.08041, https://arxiv.org/pdf/2310.08041.pdf (PTQ with 4-bit quantization of Llama models.)
  • Z Li, X Liu, B Zhu, Z Dong, Q Gu, K Keutzer, Oct 2023, QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources, arXiv preprint arXiv:2310.07147, https://arxiv.org/pdf/2310.07147.pdf
  • G Rutishauser, 2024, Agile and Efficient Inference of Quantized Neural Networks, Ph.D. Thesis, ETH Zurich, https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/675547/1/thesis.pdf
  • Sayeh Sharify, Zifei Xu, Wanzin Yazar, Xin Wang, 12 May 2024, Combining multiple post-training techniques to achieve most efficient quantized LLMs, https://arxiv.org/abs/2405.07135
  • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
  • Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin, 1 May 2024, When Quantization Affects Confidence of Large Language Models? https://arxiv.org/abs/2405.00632
  • Dayou Du, Gu Gong, Xiaowen Chu, 1 May 2024, Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey, https://arxiv.org/abs/2405.00314
  • Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 19 Apr 2024, decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points, https://arxiv.org/abs/2404.12759 Code: https://github.com/bytedance/decoupleQ (Decouple parameters into integer and floating-point parts for more accurate quantization at low bitwidths.)
  • J Wu, M Song, J Zhao, HKH So, 2024, A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference, https://wujiajunic.cn/publication/ipdpsw2024/IPDPSW2024.pdf
  • Wojciech Czaja, Sanghoon Na, 11 Apr 2024, Frame Quantization of Neural Networks, https://arxiv.org/abs/2404.08131 (Deep theory of quantization.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (Shows that the big three of model compression work not just for compressing big LLMs, but also for making small models even smaller.)
  • Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
  • Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
  • B Jiang, X Cheng, Y Li, J Zhang, S Fu, Q Yang, M Liu, 2023, Output-Directed Dynamic Quantization for DNN Acceleration, ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 645–654, https://doi.org/10.1145/3605573.3605580, https://dl.acm.org/doi/abs/10.1145/3605573.3605580
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
  • Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282
  • Marina Neseem, Conor McCullough, Randy Hsin, Chas Leichner, Shan Li, In Suk Chong, Andrew G. Howard, Lukasz Lew, Sherief ,Reda, Ville-Mikko Rautio, Daniele Moro, 29 Mar 2024, PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks, https://arxiv.org/abs/2404.00103
  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Yaman Umuroglu, 19 Jan 2024, A2Q+: Improving Accumulator-Aware Weight Quantization, https://arxiv.org/abs/2401.10432
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
  • Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan, 11 Mar 2024, What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation, https://arxiv.org/abs/2403.06408 (Deep analysis of quantization and how it works or fails.)
  • Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 8 Feb 2024, Accurate LoRA-Finetuning Quantization of LLMs via Information Retention, https://arxiv.org/abs/2402.05445 Code: https://github.com/htqin/ir-qlora
  • Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, Muhammad Shafique, Apr 2023, SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers, https://arxiv.org/abs/2304.03986 Code: https://github.com/albertomarchisio/SwiftTron
  • Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
  • S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
  • MA Shafique, A Munir, J Kong, Oct 2023, Deep Learning Performance Characterization on GPUs for Various Quantization Frameworks, https://www.mdpi.com/2673-2688/4/4/47
  • MWU Rahman, MM Abrar, HG Copening, S Hariri, Oct 2023, Quantized Transformer Language Model Implementations on Edge Devices, https://arxiv.org/pdf/2310.03971.pdf (Uses a "FlatBuffer" format on TensorFlow-Lite.)
  • PE Novac, G Boukli Hacene, A Pegatoquet, 2021, Quantization and deployment of deep neural networks on microcontrollers, Sensors, 2021, https://www.mdpi.com/1424-8220/21/9/2984
  • Shuchang Zhou, Yuxin Wu, Zekun Ni, et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 http://arxiv.org/abs/1606.06160
  • L Chen, P Lou, 2022, Clipping-Based Post Training 8-Bit Quantization of Convolution Neural Networks for Object Detection, https://www.mdpi.com/2076-3417/12/23/12405/pdf
  • B Moons, K Goetschalckx, 2017, Minimum energy quantized neural networks, https://arxiv.org/pdf/1711.00215
  • A Kuzmin, M Van Baalen, Y Ren, 2022, https://proceedings.neurips.cc/paper_files/paper/2022/file/5e07476b6bd2497e1fbd11b8f0b2de3c-Paper-Conference.pdf
  • B Jacob, S Kligys, B Chen, M Zhu, 2018, Quantization and training of neural networks for efficient integer-arithmetic-only inference, http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
  • H Lin, J Lou, L Xiong, Integer-arithmetic-only certified robustness for quantized neural networks, 2021, https://openaccess.thecvf.com/content/ICCV2021/papers/Lin_Integer-Arithmetic-Only_Certified_Robustness_for_Quantized_Neural_Networks_ICCV_2021_paper.pdf
  • GW Jeon, SE Yu, JS Lee, 2023, Integer Quantized Learned Image Compression, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222336 (Quantizes both weights and activations using integer quantization.)
  • K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
  • David Spuler, March 2024, Chapter 32. Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Xiaotian Zhao; Ruge Xu; Xinfei Guo, Post-training Quantization or Quantization-aware Training? That is the Question, https://ieeexplore.ieee.org/abstract/document/10219214
  • Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020, Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020, https://arxiv.org/abs/2001.00281 Code: https://github.com/amirgholami/ZeroQ
  • Natarajan Vaidhyanathan Mar 7, 2024, How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100, https://www.qualcomm.com/developer/blog/2024/03/how-quadruple-llm-decoding-performance-speculative-decoding-spd-and-microscaling-mx-formats
  • J. Choi and S. Venkataramani, 2019, Approximate Computing Techniques for Deep Neural Networks. Cham: Springer, 2019, pp. 307–329, Chapter 15, https://link.springer.com/chapter/10.1007/978-3-319-99322-5_15
  • Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018. https://arxiv.org/abs/1807.11164
  • Olivia Weng, Jan 2023, Neural Network Quantization for Efficient Inference: A Survey, https://arxiv.org/abs/2112.06126
  • Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna, 19 Jun 2024, SDQ: Sparse Decomposed Quantization for LLM Inference, https://arxiv.org/abs/2406.13868 (Combining sparsity and quantization.)
  • Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
  • Milan Tamang June 30, 2024, How I built my own custom 8-bit Quantizer from scratch: a step-by-step guide using PyTorch, https://towardsai.net/p/machine-learning/how-i-built-my-own-custom-8-bit-quantizer-from-scratch-a-step-by-step-guide-using-pytorch
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Yipin Guo, Yilin Lang, Qinyuan Ren, 3 Jul 2024, GPTQT: Quantize Large Language Models Twice to Push the Efficiency, https://arxiv.org/abs/2407.02891 (Two-phase quantization, first to high-bit, then to binary quantization.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
  • Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Gihwan Kim, Jemin Lee, Sihyeong Park, Yongin Kwon, Hyungshin Kim, 26 Jul 2024, Mixed Non-linear Quantization for Vision Transformers, https://arxiv.org/abs/2407.18437 Code: https://gitlab.com/ones-ai/mixed-non-linear-quantization
  • Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
  • Lianwei Yang, Haisong Gong, 6 Aug 2024, DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers, https://arxiv.org/abs/2408.03291
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Zhikai Li, Xuewen Liu, Jing Zhang, Qingyi Gu, 8 Feb 2024, RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization, https://arxiv.org/abs/2402.05628 (Quantization with focus on LayerNorm and Softmax activation quantization.)
  • Sophia R. Cunningham,Dominique Archambault,Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
  • Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
  • Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. 2024. Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 648, 1–19. https://doi.org/10.1145/3613904.3642628 https://dl.acm.org/doi/full/10.1145/3613904.3642628
  • Sean I. Young, 3 Sep 2024, Foundations of Large Language Model Compression -- Part 1: Weight Quantization, https://arxiv.org/abs/2409.02026 https://github.com/seannz/cvxq
  • David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
  • Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631, https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
  • Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab, 25 Sep 2024, Accumulator-Aware Post-Training Quantization, https://arxiv.org/abs/2409.17092
  • Elias Frantar, September, 2024, Compressing Large Neural Networks Algorithms, Systems and Scaling Laws, Ph.D. Thesis, Graduate School, Institute of Science and Technology, Austria, https://research-explorer.ista.ac.at/download/17485/17880/frantar_thesis_final.pdf
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
  • Jiedong Lang, Zhehao Guo, Shuyu Huang, 30 Oct 2024, A Comprehensive Study on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2411.02530
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8_22 https://link.springer.com/chapter/10.1007/978-3-031-78169-8_22
  • Thanaphon Suwannaphong, Ferdian Jovan, Ian Craddock, Ryan McConville, 12 Dec 2024, Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices, https://arxiv.org/abs/2412.09289
  • Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, https://arxiv.org/abs/2412.13437 (A survey of not just deployment, but many inference optimization techniques.)
  • Giordano d'Aloisio, Luca Traini, Federica Sarro, Antinisca Di Marco, 18 Dec 2024, On the Compression of Language Models for Code: An Empirical Study on CodeBERT, https://arxiv.org/abs/2412.13737 (Quantization, pruning and distillation on code generation models.)
  • Kyle Wiggers, December 23, 2024, A popular technique to make AI more efficient has drawbacks, https://techcrunch.com/2024/12/23/a-popular-technique-to-make-ai-more-efficient-has-drawbacks/
  • Tollef Emil Jørgensen, 13 May 2025, Resource-Efficient Language Models: Quantization for Fast and Accessible Inference, https://arxiv.org/abs/2505.08620
  • Eduardo Alvarez, Huizi Mao, Wei-Ming Chen and Omri Almog, Aug 01, 2025, Optimizing LLMs for Performance and Accuracy with Post-Training Quantization, https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/
  • Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang, 14 Aug 2025, PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks, https://arxiv.org/abs/2508.10557
  • Yanxia Deng, Aozhong Zhang, Selcuk Gurses, Naigang Wang, Zi Yang and Penghang Yin, 14 Aug 2025, CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization, https://arxiv.org/abs/2501.18475
  • Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo, 14 Aug 2025, Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free, https://arxiv.org/abs/2505.03810
  • Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha, 22 Jul 2025, SiLQ: Simple Large Language Model Quantization-Aware Training, https://arxiv.org/abs/2507.16933
  • Yutong Liu, Cairong Zhao, Guosheng Hu, 23 Jul 2025, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2507.17417
  • Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Xinkui Zhao, Kingsum Chow, Gang Xiong, Shuiguang Deng, 23 Jul 2025, SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models, https://arxiv.org/abs/2507.14811
  • Haitian Wang, Xinyu Wang, Yiren Wang, Karen Lee, Zichen Geng, Xian Zhang, Kehkashan Kiran, Yu Zhang, and Bo Miao, 21 Jul 2025, Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices, https://arxiv.org/abs/2507.15958
  • Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li, 22 Jul 2025, Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance, https://arxiv.org/abs/2407.17029
  • Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee, 22 Jul 2025, FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization, https://arxiv.org/abs/2506.23516
  • Yujia Tong, Jingling Yuan, Chuang Hu, 17 Jul 2025, Enhancing Quantization-Aware Training on Edge Devices via Relative Entropy Coreset Selection and Cascaded Layer Correction, https://arxiv.org/abs/2507.17768
  • Qingcheng Zhu, Yangyang Ren, Linlin Yang, Mingbao Lin, Yanjing Li, Sheng Xu, Zichao Feng, Haodong Zhu, Yuguang Yang, Juan Zhang, Runqi Wang, Baochang Zhang, 24 Jul 2025, Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method, https://arxiv.org/abs/2507.18073
  • Jiale Chen, Torsten Hoefler, Dan Alistarh, 24 Jul 2025, The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm, https://arxiv.org/abs/2507.18553
  • Wonsuk Jang, Thierry Tambe, 24 Jul 2025, BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference, https://arxiv.org/abs/2501.01144
  • Oussama Bouaggad, Natalia Grabar, 18 Jul 2025, Search-Optimized Quantization in Biomedical Ontology Alignment, https://arxiv.org/abs/2507.13742
  • Daehyeon Baek, Jieun Choi, Jimyoung Son, Kyungmin Bin, Seungbeom Choi, Kihyo Moon, Minsung Jang, Hyojung Lee, 18 Jul 2025, FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration, https://arxiv.org/abs/2505.20839
  • Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, Chuang Hu, 19 Jul 2025, DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning, https://arxiv.org/abs/2507.14481
  • Sebastian A. Cruz Romero, Wilfredo E. Lugo Beauchamp, 20 Jul 2025, Performance Analysis of Post-Training Quantization for CNN-based Conjunctival Pallor Anemia Detection, https://arxiv.org/abs/2507.15151
  • Li Jiao, Qiuxia Lai, Yu Li, Qiang Xu, 20 Jul 2025, Vector Quantization Prompting for Continual Learning, https://arxiv.org/abs/2410.20444
  • Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang, 8 Aug 2025, Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning, https://arxiv.org/abs/2508.06588
  • Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang, 10 Aug 2025, Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative, https://arxiv.org/abs/2508.07329
  • Beilong Tang, Xiaoxiao Miao, Xin Wang, Ming Li, 9 Aug 2025, SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization, https://arxiv.org/abs/2508.07086
  • Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal, 9 Aug 2025, ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization, https://arxiv.org/abs/2311.13171
  • Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry, 10 Aug 2025, FP4 All the Way: Fully Quantized Training of LLMs, https://arxiv.org/abs/2505.19115
  • JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan, 10 Aug 2025, MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization, https://arxiv.org/abs/2502.00425
  • Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao, 10 Aug 2025, FlatQuant: Flatness Matters for LLM Quantization, https://arxiv.org/abs/2410.09426
  • Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Lean Fu, Xing Mei, 27 Jul 2025, GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference, https://arxiv.org/abs/2412.17560
  • Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy, 26 Jul 2025, NestQuant: Nested Lattice Quantization for Matrix Products and LLMs, https://arxiv.org/abs/2502.09720
  • Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song, 27 Jul 2025, GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance, https://arxiv.org/abs/2505.07004
  • Aotao Wang, Haikuo Shao, Shaobo Ma, Zhongfeng Wang, 28 Jul 2025, FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization, https://arxiv.org/abs/2505.18975
  • Jiwoong Park, Chaeun Lee, Yongseok Choi, Sein Park, Deokki Hong, Jungwook Choi, 29 Jul 2025, Enhancing Generalization in Data-free Quantization via Mixup-class Prompting, https://arxiv.org/abs/2507.21947
  • Jihao Xin, Marco Canini, Peter Richtárik, Samuel Horváth, 29 Jul 2025, Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees, https://arxiv.org/abs/2305.18627
  • Hao Wang, Ligong Han, Kai Xu, Akash Srivastava, 28 Jul 2025, SQuat: Subspace-orthogonal KV Cache Quantization, https://arxiv.org/abs/2503.24358
  • Patrik Czakó, Gábor Kertész and Sándor Szénási, 29 Jul 2025, SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs, https://arxiv.org/abs/2506.05413
  • Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, and Hai Li, 30 Jul 2025, KLLM: Fast LLM Inference with K-Means Quantization, https://arxiv.org/abs/2507.23035
  • Ran Ben-Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, Shay Vargaftik, 31 Jul 2025, Optimal and Near-Optimal Adaptive Vector Quantization, https://arxiv.org/abs/2402.03158
  • Feng Jiang, Zihao Zheng, Xiuping Cui, Maoliang Li, JIayu Chen, Xiang Chen, 31 Jul 2025, EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models, https://arxiv.org/abs/2505.21567
  • Seokho Han, Seoyeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, Jong Hwan Ko, 30 Jul 2025, MSQ: Memory-Efficient Bit Sparsification Quantization, https://arxiv.org/abs/2507.22349
  • Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, Linli Xu, 30 Jul 2025, Addressing Representation Collapse in Vector Quantized Models with One Linear Layer, https://arxiv.org/abs/2411.02038
  • Deyu Cao and Samin Aref, 30 Jul 2025, Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining, https://arxiv.org/abs/2504.13932
  • Chuan He, Yang Chen, Wuliang Huang, Tianyi Zheng, Jianhu Chen, Bin Dou, Yice Luo, Yun Zhu, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Xin-Wei Yao, 1 Aug 2025, Learning Unified User Quantized Tokenizers for User Representation, https://arxiv.org/abs/2508.00956
  • Johann Birnick, 1 Aug 2025, The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's algorithm, https://arxiv.org/abs/2508.01077
  • Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren, 3 Aug 2025, Privacy-Preserving Inference for Quantized BERT Models, https://arxiv.org/abs/2508.01636
  • Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, Xindian Ma, 4 Aug 2025, MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models, https://arxiv.org/abs/2508.02343
  • Chen Feng and Yicheng Lin and Shaojie Zhuo and Chenzheng Su and Ramchalam Kinattinkara Ramakrishnan and Zhaocong Yuan and Xiaopeng Zhang, 1 Aug 2025, Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models, https://arxiv.org/abs/2507.07877
  • Haidong Kang, Lianbo Ma, Guo Yu and Shangce Gao, 5 Aug 2025, Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization, https://arxiv.org/abs/2508.03002
  • He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong, 5 Aug 2025, Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models, https://arxiv.org/abs/2508.03332
  • Yufei Xue, Yushi Huang, Jiawei Shao, Jun Zhang, 5 Aug 2025, VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation, https://arxiv.org/abs/2508.03351
  • Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, Zhiguo Zhang, 5 Aug 2025, BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation, https://arxiv.org/abs/2410.14971
  • Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He, 6 Aug 2025, FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design, https://arxiv.org/abs/2508.04405
  • Jianqiao Chen, Tingting Zhu, Huishi Song, Nan Ma, and Xiaodong Xu, 1 Aug 2025, VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission, https://arxiv.org/abs/2508.03740
  • Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang, 6 Aug 2025, PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models, https://arxiv.org/abs/2502.13179
  • Jiaqi Zhao, Weili Guan, Ming Li, Miao Zhang, 6 Aug 2025, Boost Post-Training Quantization via Null Space Optimization for Large Language Models, https://arxiv.org/abs/2506.11044
  • Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri, 6 Aug 2025, InfoQ: Mixed-Precision Quantization via Global Information Flow, https://arxiv.org/abs/2508.04753
  • Haoyu Zhang, Shihao Zhang, Ian Colbert, Rayan Saab, 6 Aug 2025, Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos, https://arxiv.org/abs/2508.04853
  • Vignesh Ethiraj, Ashwath David, Sidhanth Menon, Divya Vijay, 5 Aug 2025, Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS, https://arxiv.org/abs/2508.04721
  • Youngeun Kim, Seunghwan Lee, Aecheon Jung, Bogon Ryu, Sungeun Hong, 7 Aug 2025, Task Vector Quantization for Memory-Efficient Model Merging, https://arxiv.org/abs/2503.06921
  • Constantinos Tsakonas, Konstantinos Chatzilygeroudis, 7 Aug 2025, Vector Quantized-Elites: Unsupervised and Problem-Agnostic Quality-Diversity Optimization, https://arxiv.org/abs/2504.08057
  • Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim, 7 Aug 2025, MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation, https://arxiv.org/abs/2506.05952
  • Zeyu Cao, Boyang Gu, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Xitong Gao, Yiren Zhao, 6 Aug 2025, Scaling Laws For Mixed Quantization, https://arxiv.org/abs/2410.06722
  • Jeffrey Uhlmann, 8 Aug 2025, The Fourth State: Signed-Zero Ternary for Stable LLM Quantization (and More), https://arxiv.org/abs/2508.05905
  • Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello and Manuel Roveri, 7 Aug 2025, DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic, https://arxiv.org/abs/2508.09176
  • Jinhao Zhang, Yunquan Zhang, Boyang Zhang, Zeyu Liu, Daning Cheng, 9 Aug 2025, MoQE: Improve Quantization Model performance via Mixture of Quantization Experts, https://arxiv.org/abs/2508.09204
  • Marco Sälzer, François Schwarzentruber, Nicolas Troquard, 13 Aug 2025, Verifying Quantized Graph Neural Networks is PSPACE-complete, https://arxiv.org/abs/2502.16244
  • Aakash Kumar, Emanuele Natale, 14 Aug 2025, Quantization vs Pruning: Insights from the Strong Lottery Ticket Hypothesis, https://arxiv.org/abs/2508.11020
  • Jianhao Ma, Lin Xiao, 14 Aug 2025, Quantization through Piecewise-Affine Regularization: Optimization and Statistical Guarantees, https://arxiv.org/abs/2508.11112
  • Mohammad Mozaffari, Amir Yazdanbakhsh, Maryam Mehri Dehnavi, 14 Aug 2025, SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression, https://arxiv.org/abs/2410.09615
  • Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou, 18 Aug 2025, Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models, https://arxiv.org/abs/2504.04823
  • Sergey Salishev, Ian Akhremchik, 19 Aug 2025, GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks, https://arxiv.org/abs/2508.14004
  • Kim Hammar and Tao Li, 20 Aug 2025, Online Incident Response Planning under Model Misspecification through Bayesian Learning and Belief Quantization, https://arxiv.org/abs/2508.14385
  • Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, 20 Aug 2025, Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs, https://arxiv.org/abs/2508.14896
  • Hamza A. Abushahla, Dara Varam, Ariel J. N. Panopio, Mohamed I. AlHajri, 20 Aug 2025, Quantized Neural Networks for Microcontrollers: A Comprehensive Review of Methods, Platforms, and Applications, https://arxiv.org/abs/2508.15008
  • Xinshuang Liu, Runfa Blark Li, Keito Suzuki, Truong Nguyen, 21 Aug 2025, Image-Conditioned 3D Gaussian Splat Quantization, https://arxiv.org/abs/2508.15372
  • Yi Xu, Moyu Zhang, Chenxuan Li, Zhihao Liao, Haibo Xing, Hao Deng, Jinxin Hu, Yu Zhang, Xiaoyi Zeng, Jing Zhang, 21 Aug 2025, MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation, https://arxiv.org/abs/2508.15281
  • Lucas Maisonnave, Karim Haroun, Tom Pegeot, 22 Aug 2025, Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers, https://arxiv.org/abs/2508.16311
  • Manpreet Singh, Hassan Sajjad, 22 Aug 2025, Interpreting the Effects of Quantization on LLMs, https://arxiv.org/abs/2508.16785
  • Aditri Paul, Archan Paul, 25 Aug 2025, AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration, https://arxiv.org/abs/2508.18025
  • Tianyao Shi, Yi Ding, 22 Aug 2025, Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective, https://arxiv.org/abs/2508.16712
  • Federico Marcuzzi, Xuefei Ning, Roy Schwartz, and Iryna Gurevych, 25 Aug 2025, How Quantization Shapes Bias in Large Language Models, https://arxiv.org/abs/2508.18088
  • Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren, 25 Aug 2025, Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance, https://arxiv.org/abs/2508.18177
  • Xinlin Li, Osama Hanna, Christina Fragouli, Suhas Diggavi, 24 Aug 2025, ICQuant: Index Coding enables Low-bit LLM Quantization, https://arxiv.org/abs/2505.00850

More Deep AI Research Areas

Read more about:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging