Aussie AI

LLM Quantization Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Quantization is an extremely popular method of model compression, with a huge number of research papers, and it has been implemented in many modern inference engines. Generally, quantization has been very successful at reducing both inference compute times and storage space, while retaining near floating-point model accuracy.

Types of Quantization

Quantization is usually separated into two main categories:

  • Post-Training Quantization (PTQ). This is where a pre-trained model is quantized for faster inference.
  • Quantization-Aware Training (QAT). This is the use of quantization during model training.

Quantization granularity specifies which numbers get quantized, and which groups of numbers share the same quantization parameters. The different sets of model floating-point numbers that can be quantized include:

  • Weights quantized (weight-only quantization)
  • Activations quantized (weight-and-activation quantization)
  • Per-layer quantization
  • Per-tensor quantization
  • Per-channel quantization

Another distinguishing feature of a quantization algorithm is the scaling algorithm, or "scale factor", by which floating-point numbers are mapped to a smaller range of numbers (a small worked sketch follows the list below). Several types include:

  • Uniform scaling (uniform quantization)
  • Uniform affine quantization
  • Symmetric uniform quantization
  • Non-uniform quantization
  • Power-of-two quantization
  • Asymmetric quantization
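To make the scale-factor idea concrete, here is a minimal sketch of uniform affine (asymmetric) INT8 quantization of a single vector, with the scale and zero-point derived from the min/max range. This is illustrative code with assumed helper names, not the kernel of any particular inference engine.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Minimal sketch of uniform affine (asymmetric) INT8 quantization.
    // The float range [fmin, fmax] is mapped onto the integer range [0, 255]
    // using a scale (step size) and a zero-point (the integer representing 0.0).
    struct AffineParams {
        float scale;
        int32_t zero_point;
    };

    AffineParams choose_params(const std::vector<float>& v) {
        float fmin = *std::min_element(v.begin(), v.end());
        float fmax = *std::max_element(v.begin(), v.end());
        fmin = std::min(fmin, 0.0f);   // ensure 0.0 is exactly representable
        fmax = std::max(fmax, 0.0f);
        AffineParams p;
        p.scale = (fmax - fmin) / 255.0f;
        if (p.scale == 0.0f) p.scale = 1.0f;  // degenerate all-zero tensor
        p.zero_point = static_cast<int32_t>(std::round(-fmin / p.scale));
        return p;
    }

    uint8_t quantize(float x, const AffineParams& p) {
        int32_t q = static_cast<int32_t>(std::round(x / p.scale)) + p.zero_point;
        return static_cast<uint8_t>(std::clamp(q, 0, 255));
    }

    float dequantize(uint8_t q, const AffineParams& p) {
        return p.scale * (static_cast<int32_t>(q) - p.zero_point);
    }

Symmetric uniform quantization is the special case where the zero-point is fixed at zero, so dequantization reduces to a single multiplication by the scale.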

There are several different technical types of quantization:

  • FP16 quantization: This uses 16-bit floating point numbers instead of 32-bit numbers. Commonly used.
  • FP8 quantization: 8-bit floating point.
  • FP4 quantization: 4-bit floating point. Occasionally used in research papers.
  • 8-bit integer quantization (INT8): This uses single 8-bit bytes for quantization. Commonly used. Weights are scaled to either -128 to 127 (signed), or 0 to 255 (unsigned bytes).
  • 4-bit quantization (INT4): Another popular size for quantization is a "nibble" (4 bits). There can be 2^4=16 weights. This is commonly used and quite effective given its low bitwidth.
  • 3-bit quantization (INT3). This uses 2^3=8 distinct weights.
  • 2-bit quantization (INT2): There are 4 distinct weights. Not commonly used.
  • Ternary quantization: This is quantization with 3 weights, usually -1, 0, and +1. The weights are stored in 2 bits, but only 3 of the 4 possible values are used. Suffers accuracy problems.
  • Binary quantization: This is 1-bit quantization with 2 possible weights (usually 0 and 1, or -1 and +1). Not highly accurate.
  • 0-bit quantization. Good luck with this algorithm.
  • Integer-only-arithmetic quantization. This refers to quantization where the actual arithmetic performed during inference is all integer multiplications. This is distinct from the rather unkindly named "fake quantization", where the integers are "dequantized" back to floating-point before using floating-point multiplication in inference calculations. Integer-only-arithmetic quantization aims to improve both speed and space, whereas integer quantization without integer-only arithmetic still reduces model size and storage space, but improves execution speed less (latency is still somewhat improved due to reduced memory-related activity).
  • Dyadic quantization: This is an uncommon quantization method using dyadic numbers, which are a mathematical representation of numbers as rational quotients where the numerator is an integer, but the denominator is always a power-of-two (allowing bitshifts).
  • Logarithmic Bitshift Quantization (Power-of-Two Quantization). This is where the weights are all powers of 2, so that faster bitshifts are used instead of integer multiplication. A generalization is "Sum of Two Bitshifts Quantization" which uses multiple bitshifts added together.

And some more quantization terminology:

  • Stochastic quantization. This is a method of intentionally introducing some non-determinism and randomness into quantization algorithms, with the goal of increased inference accuracy.
  • Extreme quantization: Usually refers to binary quantization.
  • Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
  • Fake quantization (or "simulated quantization"). Refers somewhat unkindly to integer quantization with the main goal of storage space reduction from a reduced model size, where the actual arithmetic is still performed as floating-point multiplications, rather than the "real quantization" of integer-only-arithmetic quantization.
  • Fixed point quantization. Using fixed-point arithmetic to change vector dot product from floating point multiplication/addition into integer multiplication and addition.
  • Mixed-precision quantization (or simply "mixed quantization"): Refers to more finely granular quantization where different parts of the model have different levels of quantization in terms of bits.

Quantization Theory

Research papers on the theoretical basis of quantization:

  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng, 6 Dec 2023, SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM, https://arxiv.org/abs/2312.03788
  • Jiedong Lang, Zhehao Guo, Shuyu Huang, 30 Oct 2024, A Comprehensive Study on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2411.02530
  • Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan, 7 Nov 2024, Scaling Laws for Precision, https://arxiv.org/abs/2411.04330
  • Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691

Survey Papers on Quantization

  • Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen Blankevoort, Chirag Patel, Abhijit Khobare, Neural Network Quantization with AI Model Efficiency Toolkit (AIMET), arXiv:2201.08442v1 [cs.LG], 20 Jan 2022, https://arxiv.org/pdf/2201.08442.pdf
  • Krishnamoorthi, R., Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018, https://arxiv.org/abs/1806.08342
  • Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T., "A white paper on neural network quantization", 2021, https://arxiv.org/abs/2106.08295
  • Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive 2021 survey paper including quantization.)
  • Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various topics, including PTQ and QAT quantization.)
  • Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv:2103.13630 [cs], June 2021, https://arxiv.org/abs/2103.13630
  • T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
  • T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artifcial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
  • Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
  • R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  • B Rokh, A Azarpeyvand, A Khanteymoori, ACM Transactions on Intelligent Systems, 2023, A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, PDF: https://dl.acm.org/doi/pdf/10.1145/3623402
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
  • Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877

General Research Papers on Quantization

Papers on the general theory of quantization, or specific works with relevance to the overall theoretical basis of using quantization for model compression:

Floating Point Quantization

The most straightforward quantization is to reduce 32-bit floating point (4 bytes) to 16-bit floating point (2 bytes). This halves the memory storage requirements, and does not suffer much reduction in model accuracy. All operations in matmuls are still done in floating-point arithmetic.

The classic form of floating point quantization is often abbreviated as FP16. There is also "bfloat16", which uses a different representation of numbers. An even more reduced size is FP8 quantization, which uses 8-bit floating point numbers.
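As a concrete illustration, bfloat16 conversion is particularly simple, because bfloat16 keeps the same 8-bit exponent as FP32 and just drops the low 16 mantissa bits. Here is a minimal truncating sketch; real code would usually round to nearest even or use hardware/library conversions, and IEEE FP16 is more involved because its exponent is only 5 bits.

    #include <cstdint>
    #include <cstring>

    // Minimal sketch: FP32 -> bfloat16 by truncation.
    // bfloat16 keeps the FP32 sign and 8-bit exponent but only 7 mantissa bits,
    // so conversion simply drops the low 16 bits of the FP32 bit pattern.
    uint16_t float_to_bfloat16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));  // type-pun safely via memcpy
        return static_cast<uint16_t>(bits >> 16);
    }

    // Widening back to FP32 restores the dropped mantissa bits as zeros.
    float bfloat16_to_float(uint16_t b) {
        uint32_t bits = static_cast<uint32_t>(b) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }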

Research papers on floating point quantization (there are many):

8-bit Floating-Point Quantization (FP8)

FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Research papers on FP8 quantization:

6-bit Floating-Point Quantization (FP6)

Research on FP6 quantization:

  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, 25 Jan 2024, FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, https://arxiv.org/abs/2401.14112
  • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, July 2024, Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024,Santa Clara, CA, USA, 978-1-939133-41-0, https://www.usenix.org/conference/atc24/presentation/xia https://www.usenix.org/system/files/atc24-xia.pdf

4-bit Floating-Point Quantization (FP4)

Research on FP4 quantization:

  • Youngdeok Hwang; Janghwan Lee; Jiwoong Park; Jieun Lim; Jungwook Choi, Jan 2024, Searching Optimal Floating-Point Format for Sub-8-Bit Large Language Model Inference, 2024 International Conference on Electronics, Information, and Communication (ICEIC), https://ieeexplore.ieee.org/abstract/document/10457111 (Examines floating-point representations below 8 bits, and also the importance of denormalized numbers.)
  • Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
  • Xiaoxia Wu, Zhewei Yao, Yuxiong He, 2021, A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats Microsoft, https://neurips2023-enlsp.github.io/papers/paper_92.pdf Code: https://github.com/microsoft/DeepSpeed (FP4 4-bit weight quantization with 8-bit FP8 activation quantization, and showed FP8 bettered INT8 quantization and FP4 beat INT4.)
  • S Liu, Z Liu, X Huang, P Dong, KT Cheng, 2023 LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv preprint arXiv:2310.16836, https://arxiv.org/pdf/2310.16836.pdf Code: https://github.com/nbasyl/LLM-FP4

16-bit Floating-Point Quantization (FP16)

Quantization from high-precision 32-bit floating point weights (usually abbreviated "FP32" or "float32") to lower-precision 16-bit floating point (usually abbreviated "FP16" or "float16") can yield significant benefits, often without a significant loss of accuracy. There is much research in this area.

32-bit Floating-Point Quantization (FP32)

Is there an FP32 quantization technique? No, not really! It's not quantized if it's the same format as the default.

Integer Quantization

Integer quantization of AI models is a long-standing area of research, with much literature. These are only some of the many papers:

Integer-Only-Arithmetic Quantization

Integer-only quantization is integer quantization where all of the arithmetic is performed on integers. This is not true of all integer quantization algorithms: several types store weights as quantized integers, but then de-quantize them back to floating point at various points (even for weight multiplication in some algorithms). Methods that restrict the arithmetic to avoid floating-point operations are more precisely named "integer-only-arithmetic quantization" algorithms. For methods that also quantize non-linear components to integers, such as Softmax and normalization, see also end-to-end integer Transformers.
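As a rough sketch of the idea, assuming symmetric per-tensor INT8 scales for both weights and activations (one common choice, not the only one), the dot product accumulates entirely in 32-bit integers, and the final rescale uses a precomputed dyadic multiplier (integer multiply plus right shift), so no floating-point operation appears anywhere.

    #include <cstddef>
    #include <cstdint>

    // Sketch of an integer-only dot product for symmetric INT8 quantization.
    // INT8 weights and activations are multiplied and accumulated in INT32.
    // The combined scale (scale_w * scale_a / scale_out) is converted offline
    // into a dyadic multiplier m / 2^shift, so requantization is also integer-only.
    // Assumes shift >= 1 and arithmetic right shift of negatives (true in practice).
    int8_t int8_dot_product(const int8_t* w, const int8_t* a, size_t n,
                            int32_t multiplier, int shift) {
        int32_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(a[i]);
        }
        int64_t scaled = (static_cast<int64_t>(acc) * multiplier) >> shift;
        if (scaled > 127) scaled = 127;     // clamp back into INT8 range
        if (scaled < -128) scaled = -128;
        return static_cast<int8_t>(scaled);
    }

This is the style of requantization used by dyadic approaches such as HAWQ-V3 (see the Dyadic Quantization section below), although real kernels add per-channel scales, zero-points, and careful rounding.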

Low-Bit Quantization

Low-bit quantization generally refers to 4-bit quantization or less. See below for research on binary, ternary, 2-bit, 3-bit, and 4-bit quantization.

Papers on low-bit quantization in general:

  • Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016, Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, https://arxiv.org/abs/1606.06160
  • Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh, 10 Mar 2024, FrameQuant: Flexible Low-Bit Quantization for Transformers, https://arxiv.org/abs/2403.06082 (A method using 2-bit quantization.)
  • Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao, Oct 2023, Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference? https://arxiv.org/abs/2310.05079, https://arxiv.org/pdf/2310.05079.pdf
  • JH Heo, J Kim, B Kwon, B Kim, SJ Kwon, D Lee, Sep 2023, Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models, arXiv preprint arXiv:2309.15531, https://arxiv.org/pdf/2309.15531.pdf
  • Shuchang Zhou, Yuxin Wu, Zekun Ni, et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, http://arxiv.org/abs/1606.06160
  • Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che 22 May 2024 (v3), OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
  • Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
  • Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik, 30 May 2024 (v2), PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression, https://arxiv.org/abs/2405.14852 https://burlachenkok.github.io/PV-Tuning/
  • Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Hossein Katebi, Navidreza Asadi, Maziar Goudarzi, 2024, FullPack: Full Vector Utilization for Sub-Byte Quantized Vector-Matrix Multiplication on General Purpose CPUs, IEEE Computer Architecture Letters, PrePrints pp. 1-4, DOI Bookmark: 10.1109/LCA.2024.3370402, https://www.computer.org/csdl/journal/ca/5555/01/10449368/1USuDIYNOQE
  • Mohamed Mekkouri, Marc Sun, Leandro von Werra, Pedro Cuenca, Omar Sanseviero, Thomas Wolf, September 18, 2024, Fine-tuning LLMs to 1.58bit: extreme quantization made easy, https://huggingface.co/blog/1_58_llm_extreme_quantization
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
  • Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
  • Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
  • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
  • Yuhang Li, Priyadarshini Panda, 24 Oct 2024, TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction, https://arxiv.org/abs/2410.19103
  • Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691

Binary Quantization

The extreme of quantization is to encode floating-point weights down to 1 bit. This is binary quantization (or "binarization"), where there are only 2 weights, either 0 and 1, or alternatively -1 and +1. This compresses the model by a factor of 32 in terms of space, and reduces the inference computations. In fact, binary quantization changes multiplication by a floating-point weight to a simple addition (for 1) and a null test (for 0). For binary weights -1 and +1, the -1 is a subtraction and +1 an addition, which is usually further optimized to use a sign bit tweak. There are also other variations of binary neural network architectures that use only bitwise operations, such as XNOR networks and Weightless Neural Networks (WNNs).
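As a sketch of how cheap binary inference can be, with -1/+1 weights and activations bit-packed into 64-bit words (using the assumed convention +1 = bit 1, -1 = bit 0), the dot product becomes XNOR plus popcount:

    #include <bit>       // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>

    // Sketch of a binarized dot product with -1/+1 weights and activations.
    // Each uint64_t packs 64 values (+1 stored as a 1 bit, -1 as a 0 bit).
    // Matching bits contribute +1 and differing bits -1, so for each word:
    //   dot = matches - mismatches = 2 * popcount(~(w ^ a)) - 64.
    int xnor_dot_product(const uint64_t* w, const uint64_t* a, size_t nwords) {
        int matches = 0;
        for (size_t i = 0; i < nwords; ++i) {
            matches += std::popcount(~(w[i] ^ a[i]));  // XNOR, then count 1 bits
        }
        return 2 * matches - static_cast<int>(nwords) * 64;
    }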

  • H. Yang, M. Fritzsche, C. Bartz, and C. Meinel, Bmxnet: An open-source binary neural network implementation based on mxnet, CoRR, vol. abs/1705.09864, 2017, https://arxiv.org/abs/1705.09864, Code: https://github.com/hpi-xnor
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, Springer, 525–542, https://arxiv.org/abs/1603.05279
  • B. McDanel, S. Teerapittayanon, and H. Kung, Embedded binarized neural networks, 2017, arXiv preprint arXiv:1709.02260, https://arxiv.org/abs/1709.02260
  • Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio, Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 Feb 2016, https://arxiv.org/abs/1602.02830
  • Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, 2016, Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 4114–4122, https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
  • Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio, Neural Networks with Few Multiplications, Feb 2016, https://arxiv.org/abs/1510.03009v1
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016, https://arxiv.org/abs/1603.05279
  • Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, Binaryconnect: Training deep neural networks with binary weights during propagations, In NeuriPS, pages 3123–3131, 2015, https://arxiv.org/abs/1511.00363
  • Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In CVPR, pages 5918–5926, 2017, https://arxiv.org/abs/1702.00953
  • Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, and Hong Zhou. Bivit: Extremely compressed binary vision transformer. arXiv preprint arXiv:2211.07091, 2022. https://arxiv.org/abs/2211.07091 (Softmax-aware binarization)
  • Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, and Yashar Mehdad. Bit: Robustly binarized multi-distilled transformer. arXiv preprint arXiv:2205.13016, 2022. https://arxiv.org/abs/2205.13016, Code: https://github.com/facebookresearch/bit
  • Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 19–28, 2017. https://arxiv.org/abs/1608.06049
  • Zechun Liu, Zhiqiang Shen, Marios Savvides, and KwangTing Cheng. Reactnet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, pages 143–159. Springer, 2020. https://arxiv.org/abs/2003.03488
  • Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533–7544. 2019. https://arxiv.org/abs/1906.02107, Code: https://github.com/plumerai/rethinking-bnn-optimization
  • Lin, X.; Zhao, C.; and Pan, W. 2017. Towards Accurate Binary Convolutional Neural Network. Advances in Neural Information Processing Systems, 30, https://arxiv.org/abs/1711.11294
  • Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat, Feb 2023, Binarized Neural Machine Translation, https://arxiv.org/abs/2302.04907
  • Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe, "BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS", Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
  • R. Andri, L. Cavigelli, D. Rossi and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights", Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), pp. 236-241, Jul. 2016. https://arxiv.org/abs/1606.05487v1
  • Z. Cai, X. He, J. Sun and N. Vasconcelos, "Deep learning with low precision by half-wave Gaussian quantization", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
  • R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
  • Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
  • Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Uses multiple single-bit weights combined to create a multi-binary quantization method.)
  • Y Shang, Z Yuan, Q Wu, Z Dong, PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)

Ternary Quantization

Ternary quantization (or "ternarization") is the use of 3 weights: -1, 0, and 1. This requires 2 bits for representation of the weights in the model, so why wouldn't you just use 4 weights? The answer is that ternary quantization can use zero-multiplication arithmetic in the inference algorithm, with an addition (for +1), a subtraction (for -1), and a null test (for 0).
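A minimal sketch of this zero-multiplication inference, with ternary weights stored as plain int8 values -1, 0, +1 for clarity (real implementations bit-pack them):

    #include <cstddef>
    #include <cstdint>

    // Sketch of a ternary-weight dot product with no multiplications.
    // Each weight is -1, 0, or +1, so each term is a subtract, a skip, or an add.
    float ternary_dot_product(const int8_t* w, const float* x, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (w[i] > 0)      acc += x[i];  // weight +1: addition
            else if (w[i] < 0) acc -= x[i];  // weight -1: subtraction
            // weight 0: null test only, the term is skipped
        }
        return acc;
    }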

2-Bit Quantization (INT2)

This section refers to non-ternary 2-bit quantization, using 4 distinct weights. In practice, 2-bit quantization is regarded as still having some problems with model accuracy, whereas 4-bit integer quantization is considered a more reasonable tradeoff of speed-vs-accuracy. On the other hand, maybe this is unwarranted, since Liu et al (2022) tested many models with 2-bit, 3-bit, and 4-bit quantization (see Table 1 in their paper), and the extra accuracy of 4 bits over 2 bits was usually only a couple of percentage points (for double the space).

Research papers on integer quantization using 2-bits include:

3-Bit Quantization (INT3)

3-bit quantization is uncommon and unpopular, and it's not entirely clear why. It has improved accuracy over 2 bits, since it allows 2^3=8 distinct weights, and it saves 25% storage compared to its more popular 4-bit cousin while being only slightly less accurate. Maybe it just seems too inelegant for programmers to code, cramming 3-bit values into 8-bit or 32-bit words for packing and unpacking? But, no, even 5-bit quantization gets recommended by AI experts on forums, whereas if you listen for supporters of the 3-bit versions, all you hear are crickets.
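To make the packing complaint concrete, here is a minimal sketch of cramming ten 3-bit weights into one 32-bit word (wasting 2 bits), which shows why the unpacking code is fiddlier than nibble packing:

    #include <cstdint>

    // Sketch: pack ten 3-bit values (range 0..7) into one 32-bit word.
    // Ten values use 30 bits, so 2 bits of every word are wasted.
    uint32_t pack_ten_3bit(const uint8_t vals[10]) {
        uint32_t word = 0;
        for (int i = 0; i < 10; ++i) {
            word |= static_cast<uint32_t>(vals[i] & 0x7u) << (3 * i);
        }
        return word;
    }

    // Unpack the i-th 3-bit value (0 <= i < 10) from a packed word.
    uint8_t unpack_3bit(uint32_t word, int i) {
        return static_cast<uint8_t>((word >> (3 * i)) & 0x7u);
    }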

Even the research papers on 3-bit quantization don't like to admit to it, and you'll struggle to even find "3-bit quantization" in a paper title. Here are some papers on 3-bit quantization (as if you care):

  • Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, 202 https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
  • Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks. In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
  • Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
  • E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
  • NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
  • A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
  • N. Frumkin, D. Gope, and D. Marculescu, “CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers,” arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
  • B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
  • Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2, 3 and 4 bits for weights, and mixed-precision quantization.)
  • W Cheng, Y Cai, K Lv, H Shen, Oct 2023, TEQ: Trainable Equivalent Transformation for Quantization of LLMs, https://arxiv.org/pdf/2310.10944.pdf
  • Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
  • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
  • Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960

4-Bit Quantization (INT4)

4-bit quantization is one of the most popular quantization regimes in practical usage. It is far more common to see a 4-bit quantization of an open source model than binary, 2-bits, or 3-bits. INT4 allows eight-fold storage compression of 32-bits down to 4-bits, which reduces memory requirements, and can also speed up inference by reducing memory-cache transfer overheads in both CPU and GPU. The 4 bits allow 2^4=16 distinct weights, which is enough for reasonable accuracy compared to the full-precision model. The 4-bit weights also fit cleanly into 8-bit bytes or 32-bit integers, making the unpacking code simple and efficient.
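By comparison, the nibble packing mentioned above is trivial. Here is a minimal sketch of packing and unpacking two unsigned 4-bit weights per byte; real INT4 kernels add the scale and zero-point handling on top of this.

    #include <cstdint>

    // Sketch: two 4-bit weights per byte, low nibble first.
    uint8_t pack_nibbles(uint8_t lo, uint8_t hi) {
        return static_cast<uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
    }

    uint8_t unpack_lo(uint8_t b) { return b & 0xF; }         // first weight
    uint8_t unpack_hi(uint8_t b) { return (b >> 4) & 0xF; }  // second weight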

Research papers on 4-bit quantization:

5-Bit Quantization (INT5)

Research papers on 5-bit quantization:

6-Bit Quantization (INT6)

Research papers on 6-bit quantization:

7-Bit Quantization (INT7)

Research papers on 7-bit quantization:

8-Bit Integer Quantization (INT8)

Research papers on 8-bit quantization:

9-Bit Quantization (INT9)

Research papers on 9-bit quantization:

10-Bit Quantization (INT10)

Research papers on 10-bit quantization:

11-Bit Quantization (INT11)

Research papers on 11-bit quantization:

12-Bit Quantization (INT12)

Research papers on 12-bit quantization:

16-Bit Integer Quantization (INT16)

INT16 is the use of 16-bit integers, so as to use half the space of FP32 weights/activations, and also using integer arithmetic kernels. Consideration of the pros and cons of integer versus floating-point computations is warranted, since FP16 quantization uses the same 16-bit memory size, but may be more accurate than quantized 16-bit integers.

Research papers on 16-bit integer quantization:

32-Bit Integer Quantization (INT32)

INT32 is not an effective form of "model compression", because it's not compressed at all! The data is no smaller than the FP32 raw weights, although it does allow integer arithmetic instead of floating-point computations. Also closely related is fixed-point quantization, using 32-bit integers and integer arithmetic.

Research papers on 32-bit integer quantization:

Mixed-Precision Quantization

Research papers on mixed-precision quantization:

Logarithmic Bitshift Quantization (Power-of-Two Quantization)

The idea with bitshift quantization is to use power-of-two integer weights and bitshift operations rather than integer multiplication. There is a significant trade-off in terms of model accuracy, since the number of distinct weights is greatly reduced. This is a well-known and active area of research, with the earliest papers dating back to 1992 and 1993. However, software algorithms using bitshifts seem unlikely to outperform hardware acceleration of integer multiplication, and hardware support for shift-based inference is limited. Extending hardware accelerators to perform bitshifting, or approximate multiplication by the highest power of two, presumably requiring fewer operations and less computing power (and reduced heat generation), seems an open area for further research. Note that the highest set bit of an integer can be efficiently calculated using Brian Kernighan's algorithm (1988).
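A minimal sketch of the bitshift idea, assuming the weights have been quantized offline to a sign flag plus an exponent (each weight representing plus or minus 2^e), and assuming non-negative integer activations (e.g., post-ReLU) so the shift is well defined:

    #include <cstddef>
    #include <cstdint>

    // Sketch of power-of-two (logarithmic) quantization at inference time.
    // Each weight is stored as a sign flag and exponent e, representing +/- 2^e,
    // so the multiplication w * x becomes a left bitshift of the activation.
    struct Pow2Weight {
        uint8_t exponent;  // weight magnitude is 2^exponent
        bool negative;     // sign of the weight
    };

    int64_t pow2_dot_product(const Pow2Weight* w, const uint32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            int64_t term = static_cast<int64_t>(x[i]) << w[i].exponent;  // shift replaces multiply
            acc += w[i].negative ? -term : term;
        }
        return acc;
    }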

Sum of Two Bitshifts Quantization

The downside of logarithmic quantization is that there are relatively few unique weights, limiting precision, even if the number of bits used is maximized with a large scaling factor. An alternative implementation is to use two bitshift operations and an addition (i.e., "shift-and-add" operations). In this way, the two highest bits of the quantized integer weight can be used, which improves model precision at the cost of more computation. This assumes that two integer shifts and an integer addition cost less than a single integer multiplication. An early mention of this "sums of powers of two" method is in Marchesi et al. (1993).
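Extending the previous sketch, a weight approximated by its two highest set bits becomes two shifts and an add; a minimal sketch under the same assumptions:

    #include <cstddef>
    #include <cstdint>

    // Sketch of "sum of two bitshifts" quantization: each weight is approximated
    // as +/- (2^e1 + 2^e2), its two highest set bits, so a multiplication becomes
    // two shifts and an addition. Activations are assumed non-negative integers.
    struct TwoShiftWeight {
        uint8_t e1;       // exponent of the highest set bit
        uint8_t e2;       // exponent of the second-highest set bit
        bool has_second;  // false if the weight is a pure power of two
        bool negative;    // sign of the weight
    };

    int64_t two_shift_dot_product(const TwoShiftWeight* w, const uint32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            int64_t term = static_cast<int64_t>(x[i]) << w[i].e1;
            if (w[i].has_second) {
                term += static_cast<int64_t>(x[i]) << w[i].e2;  // second shift-and-add
            }
            acc += w[i].negative ? -term : term;
        }
        return acc;
    }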

Arbitrary Base Logarithmic Quantization

The main use of logarithms in quantization is power-of-two logarithmic quantization. This is efficient, allowing bitshifting, but its lack of accuracy is a known problem. There is also some research on bases other than two, or indeed arbitrary bases, to try to more accurately map weights to a logarithmic format:

  • S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, 2018, Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base, In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). 1–8.

Integer Division for Quantization?

What about using integer division instead of multiplications in quantization? After all, multiplication by a small weight like 0.003 could instead be a division by 333. Is this an avenue for optimization? It seems unlikely, since division is usually much slower than multiplication, often by an order-of-magnitude.

Integer division by a power of two can be done efficiently with bitshift operations. Power-of-two division could use (right) bitshifts instead of division, which is effectively the same as the left-bitshift quantization above. Dyadic numbers are an interesting idea whose implementation involves division by a power of two, usually performed via a right bitshift.

Note that division is often used in scaling operations, particularly in de-quantization. However, in such cases, it isn't the bottleneck operation, as scaling or de-quantization is performed an order-of-magnitude fewer times.

No papers were found on "division quantization". Some research on division arithmetic algorithms:

Dyadic Quantization

Dyadic numbers are a class of numbers represented as rational numbers, but operated on as paired integers. The numerator is an integer, and the denominator is restricted to be a power-of-two integer. This allows dyadic numbers to support a wide range of weights, including quite high precision in fractional weights, while still using only integer arithmetic.
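A minimal sketch of applying a dyadic scale (numerator over 2^shift) to an integer accumulator, which is the kind of integer-only requantization step used by dyadic methods such as HAWQ-V3 (rounding and overflow handling are simplified here):

    #include <cstdint>

    // Sketch: multiply an integer value by the dyadic number (numerator / 2^shift).
    // The power-of-two denominator means the division is just a right shift,
    // so everything stays in integer arithmetic.
    // Assumes shift >= 1 and arithmetic right shift of negatives.
    int32_t dyadic_scale(int32_t value, int32_t numerator, int shift) {
        int64_t wide = static_cast<int64_t>(value) * numerator;  // widen to avoid overflow
        wide += int64_t{1} << (shift - 1);                       // round to nearest
        return static_cast<int32_t>(wide >> shift);
    }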

  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
  • Renato J. Cintra; Stefan, Duffner; Christophe Garcia; André Leite, Low-Complexity Approximate Convolutional Neural Networks, IEEE Transactions on Neural Networks and Learning Systems, Volume 29, Issue 12, December 2018, pp.5981-5992, https://ieeexplore.ieee.org/abstract/document/8334697
  • Fernanda Botelho, Max Garzon, On Dynamical Properties of Neural Networks, Complex Systems 5 (1991), p.401-413, https://wpmedia.wolfram.com/uploads/sites/13/2018/02/05-4-4.pdf
  • David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

Fixed-Point Quantization

Fixed-point numbers are a way of representing numbers that differs from floating-point numbers. For example, we can represent a dollars-and-cents amount such as $12.34 as the integer "1234". This is a fixed-point representation with a scaling factor of 100.

In practice, we can convert any fractional number to an integer by multiplying by a scaling factor (and truncating any lower-level fractional digits). Doing this is "fixed-point quantization".

The main advantage is integer arithmetic. Using fixed-point quantization changes the vector dot product to use integer multiplication and integer addition (with a bitshift). See fixed point number system.

Floating-point numbers have a per-number exponent. Fixed-point numbers are like having a single global exponent value for all numbers (not actually stored anywhere). The intermediate method is "block floating-point", where blocks of numbers (i.e., vectors) have a per-block or per-vector exponent value. Integer-only arithmetic is possible with both fixed-point and block floating-point quantization.
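A minimal sketch of fixed-point arithmetic with a global scaling factor of 2^F (here F = 8 fractional bits, an arbitrary illustrative choice), showing the bitshift that brings the product of two fixed-point numbers back to the common scale:

    #include <cstddef>
    #include <cstdint>

    // Sketch of fixed-point quantization with F fractional bits (scale 2^F).
    // A real value r is stored as the integer round(r * 2^F).
    constexpr int F = 8;

    int32_t to_fixed(float r)   { return static_cast<int32_t>(r * (1 << F)); }
    float   to_float(int32_t q) { return static_cast<float>(q) / (1 << F); }

    // Each product of two fixed-point values has 2F fractional bits, so it is
    // shifted right by F before accumulating, keeping everything at scale 2^F.
    int32_t fixed_dot_product(const int32_t* w, const int32_t* x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += (static_cast<int64_t>(w[i]) * x[i]) >> F;
        }
        return static_cast<int32_t>(acc);
    }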

Stochastic Quantization

Stochastic quantization is a research area that examines intentionally inserting some randomness or statistical variation into the quantization algorithms, which may result in higher accuracy. This idea can be used in conjunction with Post-Training Quantization (PTQ) or with Quantization-Aware Training (QAT).

Weight Clustering

Weight clustering is conceptually like pruning and quantization combined, and is sometimes called "cluster-based quantization". Groups of similar weights are merged so that all of the weights in a cluster share exactly the same value. Hashing has also been used to group weights for clustering.
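A minimal sketch of the inference side of weight clustering: the model stores a small codebook of shared centroid values plus one small index per weight, and the kernel looks each weight up in the codebook. Real implementations typically pack the indices into fewer bits and may fold the lookup into table-based kernels.

    #include <cstddef>
    #include <cstdint>

    // Sketch of cluster-based quantization at inference time.
    // Weights are stored as 8-bit indices into a 256-entry codebook of centroids,
    // so many weights share exactly the same value.
    float clustered_dot_product(const uint8_t* idx, const float* codebook,
                                const float* x, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            acc += codebook[idx[i]] * x[i];  // look up the shared centroid value
        }
        return acc;
    }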

Outliers

The issue of "outliers" refers to weights or activations that are largers (or smaller) than the vast majority of other values. There are various parts of a Transformer where the output results can depend inordinately on a small subset of values, notably in the attention computation (and hence also the KV cache).

Research papers that discuss the issue of outliers include:

Activation Quantization

The quantization of activations, which are computed dynamically during inference, is well-known and almost always used now.

Research papers on activation quantization:

  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, Jul 2018, PACT: Parameterized Clipping Activation for Quantized Neural Networks, https://arxiv.org/abs/1805.06085
  • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 10 May 2024 (v2), QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, https://arxiv.org/abs/2405.04532 Code: https://github.com/mit-han-lab/qserve
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
  • Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
  • Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
  • Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  • Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
  • A. Jantsch et al., "Special Session: Estimation and Optimization of DNNs for Embedded Platforms," 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Raleigh, NC, USA, 2024, pp. 21-30, doi: 10.1109/CODES-ISSS60120.2024.00013. https://ieeexplore.ieee.org/abstract/document/10740783
  • Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8_22 https://link.springer.com/chapter/10.1007/978-3-031-78169-8_22

Vector Quantization

Vector quantization is a longstanding ML technique that pre-dates all of the Transformer work, so there are many early papers in the broader ML literature. Vector quantization is related to other vector methods such as nearest-neighbor search, for example in the analysis of embedding vectors and semantic similarity, among many other applications.

Research papers on vector quantization:

Quantization Granularity

Quantization granularity refers to which parts, segments, or structures of the model's weights are quantized together. For example, granularity levels may be:

  • Layerwise quantization
  • Vector quantization
  • Block quantization

Research papers on granularity of quantization include:

  • Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, Ed H. Chi, 25 Aug 2020 (v2), Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, https://arxiv.org/abs/2002.08530
  • Lianwei Yang, Zhikai Li, Junrui Xiao, Haisong Gong, Qingyi Gu, 13 Jun 2024, MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction, https://arxiv.org/abs/2406.09229
  • Minghai Qin, 27 Aug 2024, The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study, https://arxiv.org/abs/2408.15301
  • Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
  • David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas

Layerwise Quantization

Layerwise quantization is quantization done at the granularity of layers. Each layer can have its own quantization parameters. This can be a special case of mixed-precision quantization (i.e., per-layer precision), but it is also possible to use the same precision in every layer, just with different per-layer parameters.

Research papers on layerwise quantization:

Blockwise Quantization

Blockwise quantization is a fine-grained type of quantization where each "block" of data has its own quantization parameters.

  • Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
  • Yanshu Wang, Wenyang He, Tong Yang, 24 May 2024, Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information, https://arxiv.org/abs/2405.17470
  • Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
  • Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
  • Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
  • Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024