Aussie AI
LLM Quantization Research
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Quantization is an extremely popular method of model compression, with a huge number of research papers, and it has been implemented in many modern inference engines. Quantization has generally been very successful at reducing both inference compute time and storage space without a large hit to model accuracy, often achieving near-floating-point accuracy.
Types of Quantization
Quantization is usually separated into two main categories:
- Post-Training Quantization (PTQ). This is where a pre-trained model is quantized for faster inference.
- Quantization-Aware Training (QAT). This is the use of quantization during model training.
Quantization granularity specifies which numbers are quantized, and which groups of numbers share the same quantization parameters. Different sets of a model's floating-point numbers can be quantized (a per-tensor versus per-channel sketch follows the list below):
- Weights quantized (weight-only quantization)
- Activations quantized (weight-and-activation quantization)
- Per-layer quantization
- Per-tensor quantization
- Per-channel quantization
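To illustrate the granularity distinction, here is a minimal sketch in Python with NumPy (hypothetical function names and an invented example matrix, not code from any particular engine) contrasting a single per-tensor scale with per-channel scales for an INT8 weight matrix:

```python
import numpy as np

def per_tensor_scale(w: np.ndarray, bits: int = 8) -> float:
    # One symmetric scale shared by the whole weight tensor.
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8
    return float(np.max(np.abs(w))) / qmax

def per_channel_scales(w: np.ndarray, bits: int = 8) -> np.ndarray:
    # One symmetric scale per output channel (row), usually more accurate.
    qmax = 2 ** (bits - 1) - 1
    return np.max(np.abs(w), axis=1) / qmax

w = np.random.randn(4, 8).astype(np.float32)
w[0] *= 10.0                               # pretend one channel has outliers
print(per_tensor_scale(w))                 # the outlier channel inflates the single scale
print(per_channel_scales(w))               # only channel 0 gets the large scale
```

Finer granularity (per-channel rather than per-tensor) generally costs a few extra scale parameters but protects the other channels from outlier-dominated scales.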
Another distinction between quantization algorithms is the scaling algorithm, or "scale factor," by which floating-point numbers are mapped to a smaller range of values (a sketch of symmetric versus asymmetric scaling follows the list). Several types include:
- Uniform scaling (uniform quantization)
- Uniform affine quantization
- Symmetric uniform quantization
- Non-uniform quantization
- Power-of-two quantization
- Asymmetric quantization
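As a rough sketch of the scaling distinction, the following hypothetical NumPy code contrasts symmetric uniform quantization (zero-point fixed at zero) with asymmetric (affine) quantization that adds a zero-point offset; it assumes 8-bit targets and ignores edge cases such as all-zero tensors.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    # Symmetric uniform quantization: zero-point is implicitly 0.
    qmax = 2 ** (bits - 1) - 1                       # 127
    scale = float(np.max(np.abs(x))) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                  # dequantize: q * scale

def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    # Asymmetric (affine) quantization: scale plus an integer zero-point.
    qmin, qmax = 0, 2 ** bits - 1                    # 0..255
    scale = (float(np.max(x)) - float(np.min(x))) / (qmax - qmin)
    zero_point = int(round(qmin - float(np.min(x)) / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point                      # dequantize: (q - zero_point) * scale

x = np.array([-0.9, -0.1, 0.0, 0.4, 1.7], dtype=np.float32)
print(quantize_symmetric(x))
print(quantize_asymmetric(x))
```

Asymmetric scaling uses the full integer range even when the data is skewed to one side, at the cost of carrying a zero-point through the arithmetic.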
There are several different technical types of quantization:
- FP16 quantization: This uses 16-bit floating point numbers instead of 32-bit numbers. Commonly used.
- FP8 quantization: 8-bit floating point.
- FP4 quantization: 4-bit floating point. Occasionally used in research papers.
- 8-bit integer quantization (INT8): This uses single 8-bit bytes for quantization. Commonly used. Weights are scaled to either -128 to 127 (signed), or 0 to 255 (unsigned bytes).
- 4-bit quantization (INT4): Another popular size for quantization is a "nibble" (4 bits). There are 2^4=16 distinct weight values. This is commonly used and quite effective given its low bitwidth.
- 3-bit quantization (INT3). This uses 2^3=8 distinct weights.
- 2-bit quantization (INT2): There are 4 distinct weights. Not commonly used.
- Ternary quantization: This is quantization with 3 weight values, usually -1, 0, and +1. Ternary weights are stored in 2 bits, but use only 3 of the 4 possible values. Suffers accuracy problems.
- Binary quantization: This is 1-bit quantization with 2 possible weights (usually 0 and 1, or -1 and +1). Not highly accurate.
- 0-bit quantization. Good luck with this algorithm.
- Integer-only-arithmetic quantization. This refers to quantization where all of the actual arithmetic performed during inference is integer multiplication. It is not true of every integer quantization algorithm: it is distinct from the rather unkindly named "fake quantization", where the integers are "dequantized" back to floating-point before using floating-point multiplication in inference calculations. Integer-only-arithmetic quantization aims to improve both speed and space, whereas fake quantization still reduces model size and storage space, but improves execution speed less fully (latency is still somewhat improved due to reduced memory-related activity).
- Dyadic quantization: This is an uncommon quantization method using dyadic numbers, which are a mathematical representation of numbers as rational quotients where the numerator is an integer, but the denominator is always a power-of-two (allowing bitshifts).
- Logarithmic Bitshift Quantization (Power-of-Two Quantization). This is where the weights are all powers of 2, so that faster bitshifts are used instead of integer multiplication. A generalization is "Sum of Two Bitshifts Quantization" which uses multiple bitshifts added together.
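As a small illustrative sketch of power-of-two (bitshift) quantization, the hypothetical NumPy code below rounds each weight to its nearest signed power of two, so a kernel working on integer activations could replace the multiplication with a bitshift; it is not taken from any specific paper.

```python
import numpy as np

def quantize_power_of_two(weights: np.ndarray):
    # Store each weight as (sign, shift) with w ~= sign * 2**shift.
    sign = np.sign(weights).astype(np.int8)
    mag = np.abs(weights).astype(np.float32)
    mag = np.where(mag == 0.0, np.finfo(np.float32).tiny, mag)   # guard against log2(0)
    shift = np.round(np.log2(mag)).astype(np.int8)               # nearest power of two
    return sign, shift

def dequantize_power_of_two(sign: np.ndarray, shift: np.ndarray) -> np.ndarray:
    return sign * np.exp2(shift.astype(np.float32))

w = np.array([0.23, -0.48, 0.9, 0.06], dtype=np.float32)
s, k = quantize_power_of_two(w)
print(dequantize_power_of_two(s, k))     # [0.25, -0.5, 1.0, 0.0625]
# In an integer kernel, multiplying an activation x by 2**shift becomes x << shift
# (or a right shift when the shift is negative).
```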
And some more quantization terminology:
- Stochastic quantization. This is a method of intentionally introducing some non-determinism and randomness into quantization algorithms with the goal of increased inference accuracy.
- Extreme quantization: Usually refers to binary quantization.
- Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
- Fake quantization (or "simulated quantization"). Refers somewhat unkindly to integer quantization whose main goal is the storage space reduction from a smaller model size, where the actual arithmetic is still performed as floating-point multiplication, rather than the "real quantization" of integer-only-arithmetic quantization (see the sketch after this list).
- Fixed point quantization. Uses fixed-point arithmetic to change vector dot products from floating-point multiplication and addition into integer multiplication and addition.
- Mixed-precision quantization (or simply "mixed quantization"): Refers to more finely granular quantization where different parts of the model have different levels of quantization in terms of bits.
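Relating "fake quantization" to integer-only-arithmetic quantization, here is a hedged sketch (hypothetical NumPy code, per-tensor scales assumed) of the same INT8 dot product computed both ways: the first path dequantizes back to floating point before multiplying, while the second keeps all multiply-accumulates in integers and applies the scales once at the end.

```python
import numpy as np

def fake_quantized_dot(q_w, q_x, scale_w, scale_x):
    # "Fake" quantization: dequantize to float, then multiply in floating point.
    return np.dot(q_w.astype(np.float32) * scale_w,
                  q_x.astype(np.float32) * scale_x)

def integer_only_dot(q_w, q_x, scale_w, scale_x):
    # Integer-only arithmetic: accumulate in int32, rescale once at the end.
    acc = np.dot(q_w.astype(np.int32), q_x.astype(np.int32))
    return acc * (scale_w * scale_x)

q_w = np.array([12, -47, 88, 5], dtype=np.int8)     # hypothetical INT8 weights
q_x = np.array([101, 33, -7, 64], dtype=np.int8)    # hypothetical INT8 activations
sw, sx = 0.02, 0.01                                 # per-tensor scales
print(fake_quantized_dot(q_w, q_x, sw, sx))         # same value either way,
print(integer_only_dot(q_w, q_x, sw, sx))           # but only the second avoids float MACs
```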
Quantization Theory
Research papers on the theoretical basis of quantization:
- Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
- Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng, 6 Dec 2023, SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM, https://arxiv.org/abs/2312.03788
- Jiedong Lang, Zhehao Guo, Shuyu Huang, 30 Oct 2024, A Comprehensive Study on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2411.02530
- Tanishq Kumar, Zachary Ankner, Benjamin F. Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, Aditi Raghunathan, 7 Nov 2024, Scaling Laws for Precision, https://arxiv.org/abs/2411.04330
- Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691
Survey Papers on Quantization
- Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen Blankevoort, Chirag Patel, Abhijit Khobare, Neural Network Quantization with AI Model Efficiency Toolkit (AIMET), arXiv:2201.08442v1 [cs.LG], 20 Jan 2022, https://arxiv.org/pdf/2201.08442.pdf
- Krishnamoorthi, R., Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018, https://arxiv.org/abs/1806.08342
- Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T., "A white paper on neural network quantization", 2021, https://arxiv.org/abs/2106.08295
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive 2021 survey paper including quantization.)
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various topics, including PTQ and QAT quantization.)
- Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv:2103.13630 [cs], June 2021, https://arxiv.org/abs/2103.13630
- T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
- T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
- Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
- R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
- B Rokh, A Azarpeyvand, A Khanteymoori, ACM Transactions on Intelligent Systems, 2023, A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, PDF: https://dl.acm.org/doi/pdf/10.1145/3623402
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877
General Research Papers on Quantization
Papers on the general theory of quantization, or specific works with relevance to the overall theoretical basis of using quantization for model compression:
- Frantar, E.; Ashkboos, S.; Hoefler, T.; and Alistarh, D. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 https://arxiv.org/abs/2210.17323, Code: https://github.com/IST-DASLab/gptq
- L Li, Q Li, B Zhang, X Chu, 2023, Norm Tweaking: High-performance Low-bit Quantization of Large Language Models, arXiv preprint arXiv:2309.02784, https://arxiv.org/pdf/2309.02784.pdf (Novel quantization optimization strategy based on tweaking LayerNorm weights.)
- O Gordon, HV Habi, A Netzer, 2023, EPTQ: Enhanced Post-Training Quantization via Label-Free Hessian, arXiv preprint arXiv:2309.11531, https://arxiv.org/pdf/2309.11531.pdf, Code: https://github.com/sony/model_optimization
- J Liu, R Gong, X Wei, Z Dong, J Cai, B Zhuang, Oct 2023, QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models, arXiv preprint arXiv:2310.08041, https://arxiv.org/pdf/2310.08041.pdf (PTQ with 4-bit quantization of Llama models.)
- Z Li, X Liu, B Zhu, Z Dong, Q Gu, K Keutzer, Oct 2023, QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources, arXiv preprint arXiv:2310.07147, https://arxiv.org/pdf/2310.07147.pdf
- G Rutishauser, 2024, Agile and Efficient Inference of Quantized Neural Networks, Ph.D. Thesis, ETH Zurich, https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/675547/1/thesis.pdf
- Sayeh Sharify, Zifei Xu, Wanzin Yazar, Xin Wang, 12 May 2024, Combining multiple post-training techniques to achieve most efficient quantized LLMs, https://arxiv.org/abs/2405.07135
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin, 1 May 2024, When Quantization Affects Confidence of Large Language Models? https://arxiv.org/abs/2405.00632
- Dayou Du, Gu Gong, Xiaowen Chu, 1 May 2024, Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey, https://arxiv.org/abs/2405.00314
- Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 19 Apr 2024, decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points, https://arxiv.org/abs/2404.12759 Code: https://github.com/bytedance/decoupleQ (Decouple parameters into integer and floating-point parts for more accurate quantization at low bitwidths.)
- J Wu, M Song, J Zhao, HKH So, 2024, A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference, https://wujiajunic.cn/publication/ipdpsw2024/IPDPSW2024.pdf
- Wojciech Czaja, Sanghoon Na, 11 Apr 2024, Frame Quantization of Neural Networks, https://arxiv.org/abs/2404.08131 (Deep theory of quantization.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (General article showing that the big three of model compression work not just for compressing big LLMs, but also for making small models even smaller.)
- Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
- Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
- B Jiang, X Cheng, Y Li, J Zhang, S Fu, Q Yang, M Liu, 2023, Output-Directed Dynamic Quantization for DNN Acceleration, ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 645–654, https://doi.org/10.1145/3605573.3605580, https://dl.acm.org/doi/abs/10.1145/3605573.3605580
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282
- Marina Neseem, Conor McCullough, Randy Hsin, Chas Leichner, Shan Li, In Suk Chong, Andrew G. Howard, Lukasz Lew, Sherief ,Reda, Ville-Mikko Rautio, Daniele Moro, 29 Mar 2024, PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks, https://arxiv.org/abs/2404.00103
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Yaman Umuroglu, 19 Jan 2024, A2Q+: Improving Accumulator-Aware Weight Quantization, https://arxiv.org/abs/2401.10432
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan, 11 Mar 2024, What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation, https://arxiv.org/abs/2403.06408 (Deep analysis of quantization and how it works or fails.)
- Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 8 Feb 2024, Accurate LoRA-Finetuning Quantization of LLMs via Information Retention, https://arxiv.org/abs/2402.05445 Code: https://github.com/htqin/ir-qlora
- Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, Muhammad Shafique, Apr 2023, SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers, https://arxiv.org/abs/2304.03986 Code: https://github.com/albertomarchisio/SwiftTron
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
- MA Shafique, A Munir, J Kong, Oct 2023, Deep Learning Performance Characterization on GPUs for Various Quantization Frameworks, https://www.mdpi.com/2673-2688/4/4/47
- MWU Rahman, MM Abrar, HG Copening, S Hariri, Oct 2023, Quantized Transformer Language Model Implementations on Edge Devices, https://arxiv.org/pdf/2310.03971.pdf (Uses a "FlatBuffer" format on TensorFlow-Lite.)
- PE Novac, G Boukli Hacene, A Pegatoquet, 2021, Quantization and deployment of deep neural networks on microcontrollers, Sensors, 2021, https://www.mdpi.com/1424-8220/21/9/2984
- Shuchang Zhou, Yuxin Wu, Zekun Ni, et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 http://arxiv.org/abs/1606.06160
- L Chen, P Lou, 2022, Clipping-Based Post Training 8-Bit Quantization of Convolution Neural Networks for Object Detection, https://www.mdpi.com/2076-3417/12/23/12405/pdf
- B Moons, K Goetschalckx, 2017, Minimum energy quantized neural networks, https://arxiv.org/pdf/1711.00215
- A Kuzmin, M Van Baalen, Y Ren, 2022, FP8 Quantization: The Power of the Exponent, NeurIPS 2022, https://proceedings.neurips.cc/paper_files/paper/2022/file/5e07476b6bd2497e1fbd11b8f0b2de3c-Paper-Conference.pdf
- B Jacob, S Kligys, B Chen, M Zhu, 2018, Quantization and training of neural networks for efficient integer-arithmetic-only inference, http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- H Lin, J Lou, L Xiong, Integer-arithmetic-only certified robustness for quantized neural networks, 2021, https://openaccess.thecvf.com/content/ICCV2021/papers/Lin_Integer-Arithmetic-Only_Certified_Robustness_for_Quantized_Neural_Networks_ICCV_2021_paper.pdf
- GW Jeon, SE Yu, JS Lee, 2023, Integer Quantized Learned Image Compression, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222336 (Quantizes both weights and activations using integer quantization.)
- K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
- David Spuler, March 2024, Chapter 32. Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Xiaotian Zhao; Ruge Xu; Xinfei Guo, Post-training Quantization or Quantization-aware Training? That is the Question, https://ieeexplore.ieee.org/abstract/document/10219214
- Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020, Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020, https://arxiv.org/abs/2001.00281 Code: https://github.com/amirgholami/ZeroQ
- Natarajan Vaidhyanathan Mar 7, 2024, How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100, https://www.qualcomm.com/developer/blog/2024/03/how-quadruple-llm-decoding-performance-speculative-decoding-spd-and-microscaling-mx-formats
- J. Choi and S. Venkataramani, 2019, Approximate Computing Techniques for Deep Neural Networks. Cham: Springer, 2019, pp. 307–329, Chapter 15, https://link.springer.com/chapter/10.1007/978-3-319-99322-5_15
- Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018. https://arxiv.org/abs/1807.11164
- Olivia Weng, Jan 2023, Neural Network Quantization for Efficient Inference: A Survey, https://arxiv.org/abs/2112.06126
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna, 19 Jun 2024, SDQ: Sparse Decomposed Quantization for LLM Inference, https://arxiv.org/abs/2406.13868 (Combining sparsity and quantization.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Milan Tamang June 30, 2024, How I built my own custom 8-bit Quantizer from scratch: a step-by-step guide using PyTorch, https://towardsai.net/p/machine-learning/how-i-built-my-own-custom-8-bit-quantizer-from-scratch-a-step-by-step-guide-using-pytorch
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Yipin Guo, Yilin Lang, Qinyuan Ren, 3 Jul 2024, GPTQT: Quantize Large Language Models Twice to Push the Efficiency, https://arxiv.org/abs/2407.02891 (Two-phase quantization, first to high-bit, then to binary quantization.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Gihwan Kim, Jemin Lee, Sihyeong Park, Yongin Kwon, Hyungshin Kim, 26 Jul 2024, Mixed Non-linear Quantization for Vision Transformers, https://arxiv.org/abs/2407.18437 Code: https://gitlab.com/ones-ai/mixed-non-linear-quantization
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- Lianwei Yang, Haisong Gong, 6 Aug 2024, DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers, https://arxiv.org/abs/2408.03291
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
- Zhikai Li, Xuewen Liu, Jing Zhang, Qingyi Gu, 8 Feb 2024, RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization, https://arxiv.org/abs/2402.05628 (Quantization with focus on LayerNorm and Softmax activation quantization.)
- Sophia R. Cunningham,Dominique Archambault,Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
- Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. 2024. Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 648, 1–19. https://doi.org/10.1145/3613904.3642628 https://dl.acm.org/doi/full/10.1145/3613904.3642628
- Sean I. Young, 3 Sep 2024, Foundations of Large Language Model Compression -- Part 1: Weight Quantization, https://arxiv.org/abs/2409.02026 https://github.com/seannz/cvxq
- David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631. https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
- Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab, 25 Sep 2024, Accumulator-Aware Post-Training Quantization, https://arxiv.org/abs/2409.17092
- Elias Frantar, September, 2024, Compressing Large Neural Networks Algorithms, Systems and Scaling Laws, Ph.D. Thesis, Graduate School, Institute of Science and Technology, Austria, https://research-explorer.ista.ac.at/download/17485/17880/frantar_thesis_final.pdf
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- Jiedong Lang, Zhehao Guo, Shuyu Huang, 30 Oct 2024, A Comprehensive Study on Quantization Techniques for Large Language Models, https://arxiv.org/abs/2411.02530
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8_22 https://link.springer.com/chapter/10.1007/978-3-031-78169-8_22
Floating Point Quantization
The most straightforward quantization is to reduce 32-bit floating point (4 bytes) to 16-bit floating point (2 bytes). This halves the memory storage requirements without much reduction in model accuracy. All operations in matmuls are still done in floating-point arithmetic.
The classic form of floating point quantization is often abbreviated as FP16. There is also "bfloat16", which uses a different representation of numbers. An even more reduced size is FP8 quantization, which uses 8-bit floating point numbers.
Research papers on floating point quantization (there are many):
- Song Han, Huizi Mao, William J. Dally, Feb 2016, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv:1510.00149 http://arxiv.org/abs/1510.00149
- Post-training float16 quantization (TensorFlow), https://www.tensorflow.org/lite/performance/post_training_float16_quant
- N. Wang, J. Choi, D. Brand, C. Chen and K. Gopalakrishnan, "Training deep neural networks with 8-bit floating point numbers", arXiv:1812.08011, 2018. https://arxiv.org/abs/1812.08011 (FP8 quantization.)
8-bit Floating-Point Quantization (FP8)
FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Research papers on FP8 quantization:
- Léopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Mehran Nekuii, Oguz H Elibol, Hanlin Tang, Jan 2020, Shifted and squeezed 8-bit floating point format for low-precision training of deep neural networks, arXiv preprint, https://arxiv.org/abs/2001.05674
- Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu, Sep 2022, FP8 Formats for Deep Learning, arXiv preprint, https://arxiv.org/abs/2209.05433
- Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort, Aug 2022, FP8 Quantization: The Power of the Exponent, https://arxiv.org/abs/2208.09225
- Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan, Dec 2018, Training Deep Neural Networks with 8-bit Floating Point Numbers, NeurIPS, 2018, https://arxiv.org/abs/1812.08011 https://proceedings.neurips.cc/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf
- Youngdeok Hwang; Janghwan Lee; Jiwoong Park; Jieun Lim; Jungwook Choi, Jan 2024, Searching Optimal Floating-Point Format for Sub-8-Bit Large Language Model Inference, 2024 International Conference on Electronics, Information, and Communication (ICEIC), https://ieeexplore.ieee.org/abstract/document/10457111 (Examines floating-point representations below 8 bits, and also the importance of denormalized numbers.)
- Jeffrey Yu, Kartik Prabhu, Yonatan Urman, Robert M. Radway, Eric Han, Priyanka Raina, 27 April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 5–21, https://doi.org/10.1145/3620666.3651368 https://dl.acm.org/doi/abs/10.1145/3620666.3651368
- Fireworks.ai, Jan 9, 2024, FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs, https://blog.fireworks.ai/fireattention-serving-open-source-models-4x-faster-than-vllm-by-quantizing-with-no-tradeoffs-a29a85ad28d0
- 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2, HippoML Blog, Jan 2024, https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482
- Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng, Dec 2023, FP8-LM: Training FP8 Large Language Models, https://arxiv.org/abs/2310.18313 Code: https://github.com/Azure/MS-AMP
- Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu, Dec 2023, FP8-BERT: Post-Training Quantization for Transformer, https://arxiv.org/abs/2312.05725
- Xiaoxia Wu, Zhewei Yao, Yuxiong He, 2023, A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats, Microsoft, https://neurips2023-enlsp.github.io/papers/paper_92.pdf Code: https://github.com/microsoft/DeepSpeed (FP4 4-bit weight quantization with 8-bit FP8 activation quantization, showing FP8 bettered INT8 quantization and FP4 beat INT4.)
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Nov 2022, https://arxiv.org/abs/2208.07339
- Z Zhang, Y Zhang, G Shi, Y Shen, X Wei, R Gong, X Xia, Oct 2023. Exploring the Potential of Flexible 8-bit Format: Design and Algorithm, https://arxiv.org/pdf/2310.13513.pdf
- Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon, Sep 2023, Training and inference of large language models using 8-bit floating point, https://arxiv.org/abs/2309.17224
- Zihao Ye,, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- Together AI, September 5, 2024, Supercharging NVIDIA H200 and H100 GPU Cluster Performance With Together Kernel Collection, https://www.together.ai/blog/nvidia-h200-and-h100-gpu-cluster-performance-together-kernel-collection
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release Latest, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- Markus Rabe, Carl Case, November 14, 2024, Rethinking LLM Inference: Why Developer AI Needs a Different Approach, https://www.augmentcode.com/blog/rethinking-llm-inference-why-developer-ai-needs-a-different-approach
6-bit Floating-Point Quantization (FP6)
Research on FP6 quantization:
- Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, 25 Jan 2024, FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, https://arxiv.org/abs/2401.14112
- Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song, July 2024, Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs, Proceedings of the 2024 USENIX Annual Technical Conference. July 10–12, 2024,Santa Clara, CA, USA, 978-1-939133-41-0, https://www.usenix.org/conference/atc24/presentation/xia https://www.usenix.org/system/files/atc24-xia.pdf
4-bit Floating-Point Quantization (FP4)
Research on FP4 quantization:
- Youngdeok Hwang; Janghwan Lee; Jiwoong Park; Jieun Lim; Jungwook Choi, Jan 2024, Searching Optimal Floating-Point Format for Sub-8-Bit Large Language Model Inference, 2024 International Conference on Electronics, Information, and Communication (ICEIC), https://ieeexplore.ieee.org/abstract/document/10457111 (Examines floating-point representations below 8 bits, and also the importance of denormalized numbers.)
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Xiaoxia Wu, Zhewei Yao, Yuxiong He, 2023, A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats, Microsoft, https://neurips2023-enlsp.github.io/papers/paper_92.pdf Code: https://github.com/microsoft/DeepSpeed (FP4 4-bit weight quantization with 8-bit FP8 activation quantization, showing FP8 bettered INT8 quantization and FP4 beat INT4.)
- S Liu, Z Liu, X Huang, P Dong, KT Cheng, 2023 LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv preprint arXiv:2310.16836, https://arxiv.org/pdf/2310.16836.pdf Code: https://github.com/nbasyl/LLM-FP4
16-bit Floating-Point Quantization (FP16)
Quantization from high-precision 32-bit floating point weights (usually abbreviated "FP32" or "float32") to lower-precision 16-bit floating point (usually abbreviated "FP16" or "float16") can yield significant benefits, often without a significant loss of accuracy. There is much research in this area.
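A minimal sketch of what FP16 quantization amounts to in practice: casting FP32 weights to float16 halves the storage, at the cost of a small rounding error. This NumPy example uses a hypothetical random weight matrix purely for illustration.

```python
import numpy as np

w32 = np.random.randn(1024, 1024).astype(np.float32)   # hypothetical FP32 weight matrix
w16 = w32.astype(np.float16)                            # FP16 quantization is just a cast

print(w32.nbytes // 2**20, "MB ->", w16.nbytes // 2**20, "MB")   # 4 MB -> 2 MB
err = np.max(np.abs(w32 - w16.astype(np.float32)))
print("max rounding error:", err)   # on the order of 1e-3 for standard-normal values
```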
- Post-training float16 quantization (TensorFlow), https://www.tensorflow.org/lite/performance/post_training_float16_quant
- R Tiwari, A Chavan, D Gupta, G Mago, A Gupta, 2023, RCV2023 Challenges: Benchmarking Model Training and Inference for Resource-Constrained Deep Learning, PDF: https://openaccess.thecvf.com/content/ICCV2023W/RCV/papers/Tiwari_RCV2023_Challenges_Benchmarking_Model_Training_and_Inference_for_Resource-Constrained_Deep_ICCVW_2023_paper.pdf
- Radha Gulhane, 2024, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication , Masters Thesis, Computer Science and Engineering , The Ohio State University, https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=osu1713381834648517&disposition=inline
- B. Li, S. Lu, K. Xie, and Z. Wang, “Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016
- PyTorch, 2023, Model inference optimization checklist, https://pytorch.org/serve/performance_checklist.html
- Srivastava, Nitish. Improving neural networks with dropout. 2013, Master’s thesis, U. Toronto, https://www.semanticscholar.org/paper/Improving-Neural-Networks-with-Dropout-Srivastava/5d5d4f49d6443c8529a6f5ebef5c499d47a869da PDF: http://www.cs.toronto.edu/~nitish/msc_thesis.pdf
- Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE, Los Alamitos, CA, 609–622. https://doi.org/10.1109/MICRO.2014.58
- Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. 2018. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Red Hook, NY, 7675–7684. http://papers.nips.cc/paper/7994-training-deep-neural-networks-with-8-bit-floating-pointnumbers.pdf.
- Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W. Fletcher. 2018. UCNN: Exploiting computational reuse in deep neural networks via weight repetition. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 674–687. https://doi.org/10.1109/ISCA.2018.00062
- Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). ACM, New York, NY, 27–40. https://doi.org/10.1145/3079856.3080254
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Jiajun Wu, Mo Song, Jingmin Zhao, Yizhao Gao, Jia Li, Hayden Kwok-Hay So, 6 Nov 2024, TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture, https://arxiv.org/abs/2411.03697
32-bit Floating-Point Quantization (FP32)
Is there an FP32 quantization technique? No, not really! It's not quantized if it's the same format as the default.
Integer Quantization
Integer quantization of AI models is a long-standing area of research, with much literature. These are only some of the many papers:
- Post-training integer quantization (TensorFlow), https://www.tensorflow.org/lite/performance/post_training_integer_quant
- Julien Simon, "Smaller is better: Q8-Chat, an efficient generative AI experience on Xeon", May 16th 2023, https://huggingface.co/blog/generative-ai-models-on-intel-cpu
- Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius, "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation", arXiv:2004.09602v1 [cs.LG] 20 Apr 2020, https://arxiv.org/abs/2004.09602
- Song Han, Huizi Mao, William J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149v5 [cs.CV], 15 Feb 2016, https://arxiv.org/abs/1510.00149
- Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, Yanzhi Wang, "A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM", November 2018, https://arxiv.org/abs/1811.01907
- Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu, "RPTQ: Reorder-based Post-training Quantization for Large Language Models", May 2023, https://arxiv.org/abs/2304.01089
- Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu, "Brecq: Pushing the limit of post-training quantization by block reconstruction", arXiv preprint arXiv:2102.05426, 2021, https://arxiv.org/abs/2102.05426
- Soheil Hashemi; Nicholas Anthony; Hokchhay Tann; R. Iris Bahar; Sherief Reda, "Understanding the impact of precision quantization on the accuracy and energy of neural networks", Design, Automation & Test in Europe Conference & Exhibition, March 2017, https://ieeexplore.ieee.org/abstract/document/7927224
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations", The Journal of Machine Learning Research, 18(1):6869–6898, 2017, https://arxiv.org/abs/1609.07061.
- Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry, "Improving post training neural quantization: Layer-wise calibration and integer programming", arXiv preprint arXiv:2006.10518, 2020, https://arxiv.org/abs/2006.10518
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Pierre-Emmanuel Novac, March 2023, MicroAI: Embedded Artificial Intelligence for Human Activity Recognition on Smart Glasses, Ph.D. Thesis, Artificial Intelligence. Université Côte d’Azur, https://theses.hal.science/tel-04049008/document (Uses INT8 and INT16 quantization.)
- Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee, Apr 2023, LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models https://arxiv.org/abs/2206.09557
- Jaeyong Jang; Yulhwa Kim; Juheun Lee; Jae-Joon Kim, March 2024, FIGNA: Integer Unit-Based Accelerator Design for FP-INT GEMM Preserving Numerical Accuracy, 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/abstract/document/10476470
- B. Li, S. Lu, K. Xie, and Z. Wang, “Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016
- David Spuler, March 2024, Chapter 32. Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
- David Spuler, March 2024, Integer Quantization, in Generative AI in C++, https://www.aussieai.com/book/ch32-integer-quantization
Integer-Only-Arithmetic Quantization
Integer-only quantization is integer quantization where only integer multiplication is performed. It is often assumed that this is true of all integer quantization algorithms, but it is not. Several types of integer quantization store weights as quantized integers, but then de-quantize them back to floating point at various points (even for weight multiplication in some algorithms). Methods that strictly restrict arithmetic to avoid floating-point operations are more precisely named "integer-only-arithmetic quantization algorithms". For methods that also quantize non-linear components to integers, such as Softmax and normalization components, see also end-to-end integer Transformers.
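One common building block of integer-only-arithmetic inference (in the spirit of the Jacob et al. and HAWQ-V3 lines of work cited below) is requantizing the int32 accumulator back to int8 with a fixed-point "dyadic" multiplier, so no floating point is needed at inference time. The sketch below is hypothetical NumPy code with made-up scale values, not code from those papers.

```python
import numpy as np

def dyadic_approximation(real_scale: float, shift: int = 16) -> int:
    # Choose an integer multiplier so that multiplier / 2**shift ~= real_scale.
    return int(round(real_scale * (1 << shift)))

def requantize_int32_to_int8(acc: np.ndarray, multiplier: int, shift: int) -> np.ndarray:
    # Integer multiply, arithmetic right shift, then clamp to the int8 range.
    scaled = (acc.astype(np.int64) * multiplier) >> shift
    return np.clip(scaled, -128, 127).astype(np.int8)

real_scale = 0.00305                      # hypothetical combined scale (s_w * s_x / s_out)
mult = dyadic_approximation(real_scale)   # ~200 for shift=16
acc = np.array([41210, -9083, 127004], dtype=np.int32)   # int32 accumulator outputs
print(requantize_int32_to_int8(acc, mult, 16))            # int8 results, integers only
```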
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Nov 2022, https://arxiv.org/abs/2208.07339
- Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer, I-BERT: Integer-only BERT Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5506-5518, 2021, https://arxiv.org/abs/2101.01321, https://proceedings.mlr.press/v139/kim21d.html
- Peng Peng, Mingyu You, Weisheng Xu, and Jiaxin Li. Fully integer-based quantization for mobile convolutional neural network inference. Neurocomputing, 432:194–205, 2021, https://www.sciencedirect.com/science/article/abs/pii/S0925231220319354
- Y. Lin, Y. Li, T. Liu et al., “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI, 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034
- Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., Keutzer, K., HAWQ: Hessian AWare Quantization of neural networks with mixed-precision. In The IEEE International Conference on Computer Vision (ICCV), October 2019. https://ieeexplore.ieee.org/document/9009512, https://arxiv.org/abs/1905.03696
- A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
- Radha Gulhane, 2024, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication , Masters Thesis, Computer Science and Engineering , The Ohio State University, https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=osu1713381834648517&disposition=inline
- P Dong, L Lu, C Wu, C Lyu, G Yuan, H Tang, Y Wang, 2023, PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile, https://openreview.net/pdf?id=N56hAiQvot Code: https://github.com/PeiyanFlying/PackQViT
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
- M Huang, J Luo, C Ding, Z Wei, S Huang, H Yu, Oct 2023, An Integer-Only and Group-Vector Systolic Accelerator for Efficiently Mapping Vision Transformer on Edge, IEEE Transactions on Circuits and Systems I: Regular Papers ( Early Access ), https://ieeexplore.ieee.org/abstract/document/10288182/
- H Lin, J Lou, L Xiong, Integer-arithmetic-only certified robustness for quantized neural networks, 2021, https://openaccess.thecvf.com/content/ICCV2021/papers/Lin_Integer-Arithmetic-Only_Certified_Robustness_for_Quantized_Neural_Networks_ICCV_2021_paper.pdf
- David Spuler, March 2024, Chapter 32. Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Z Li, Q Gu, 2022, I-ViT: integer-only quantization for efficient vision transformer inference, https://arxiv.org/abs/2207.01405
- S Kim, A Gholami, Z Yao, N Lee, P Wang, 2022, Integer-only zero-shot quantization for efficient speech recognition, https://arxiv.org/pdf/2103.16827
- Z Zhang, B He, Z Zhang, 2023, Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization, https://proceedings.mlsys.org/paper_files/paper/2023/hash/023560744aae353c03f7ae787f2998dd-Abstract-mlsys2023.html
- E Yvinec, A Dapogny, K Bailly, 2023, NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search, https://arxiv.org/abs/2308.05600
- VP Nia, E Sari, V Courville, M Asgharian, 2023, Training Integer-Only Deep Recurrent Neural Networks, https://link.springer.com/article/10.1007/s42979-023-01920-z https://arxiv.org/pdf/2212.11791
- Y Zhang, L Zhao, S Cao, W Wang, T Cao, 2023, Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models, https://arxiv.org/abs/2305.12356
- H Je, D Ryu, H Lee, K Kim, 2023, Neural Network Quantization is All You Need for Energy Efficient ISP, https://ieeexplore.ieee.org/abstract/document/10103005/
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
- David Spuler, March 2024, Integer-Only-Arithmetic Quantization, in Generative AI in C++, https://www.aussieai.com/book/ch32-integer-only-quantization
Low-Bit Quantization
Low-bit quantization generally refers to quantization at 4 bits or fewer. See the sections below for research papers on binary, ternary, 2-bit, 3-bit, and 4-bit quantization.
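As a concrete (if simplified) illustration of the general idea, here is a minimal C++ sketch of symmetric uniform quantization of a weight vector down to a chosen small bit width. The function name, the per-tensor max-absolute scaling, and the simple round-and-clamp step are illustrative assumptions, not the algorithm of any particular paper below.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Symmetric uniform quantization of a weight vector to 'bits' bits (2..8).
    // Each weight becomes a small signed integer code plus one shared scale factor.
    struct QuantizedVector {
        std::vector<int8_t> codes;  // each code fits within 'bits' bits
        float scale;                // dequantized weight = code * scale
    };

    QuantizedVector quantize_low_bit(const std::vector<float>& weights, int bits) {
        const int qmax = (1 << (bits - 1)) - 1;  // e.g., 7 for 4 bits, 1 for 2 bits
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
        QuantizedVector q;
        q.scale = (max_abs > 0.0f) ? (max_abs / qmax) : 1.0f;
        for (float w : weights) {
            int code = (int)std::lround(w / q.scale);
            code = std::min(std::max(code, -qmax), qmax);  // clamp to the representable range
            q.codes.push_back((int8_t)code);
        }
        return q;
    }

    int main() {
        std::vector<float> w = {0.12f, -0.5f, 0.33f, -0.07f};
        QuantizedVector q = quantize_low_bit(w, 4);  // 4-bit symmetric quantization
        for (size_t i = 0; i < w.size(); ++i)
            printf("%f -> %d -> %f\n", w[i], q.codes[i], q.codes[i] * q.scale);
        return 0;
    }

Note that at 2 bits this simple symmetric scheme only ever uses the codes -1, 0, and +1 (effectively ternary), which is one reason the very low bit widths need the more careful methods surveyed in the papers below.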
Papers on low-bit quantization in general:
- Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016, Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, https://arxiv.org/abs/1606.06160
- Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh, 10 Mar 2024, FrameQuant: Flexible Low-Bit Quantization for Transformers, https://arxiv.org/abs/2403.06082 (A method using 2-bit quantization.)
- Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao, Oct 2023, Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference? https://arxiv.org/abs/2310.05079, https://arxiv.org/pdf/2310.05079.pdf
- JH Heo, J Kim, B Kwon, B Kim, SJ Kwon, D Lee, Sep 2023, Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models, arXiv preprint arXiv:2309.15531, https://arxiv.org/pdf/2309.15531.pdf
- Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che 22 May 2024 (v3), OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
- Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
- Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik, 30 May 2024 (v2), PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression, https://arxiv.org/abs/2405.14852 https://burlachenkok.github.io/PV-Tuning/
- Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
- Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
- Hossein Katebi, Navidreza Asadi, Maziar Goudarzi, 2024, FullPack: Full Vector Utilization for Sub-Byte Quantized Vector-Matrix Multiplication on General Purpose CPUs, IEEE Computer Architecture Letters, PrePrints pp. 1-4, DOI Bookmark: 10.1109/LCA.2024.3370402, https://www.computer.org/csdl/journal/ca/5555/01/10449368/1USuDIYNOQE
- Mohamed Mekkouri, Marc Sun, Leandro von Werra, Pedro Cuenca, Omar Sanseviero, Thomas Wolf, September 18, 2024, Fine-tuning LLMs to 1.58bit: extreme quantization made easy, https://huggingface.co/blog/1_58_llm_extreme_quantization
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
- Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang, 18 Oct 2024, Understanding the difficulty of low-precision post-training quantization of large language models, https://arxiv.org/abs/2410.14570
- Yuhang Li, Priyadarshini Panda, 24 Oct 2024, TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction, https://arxiv.org/abs/2410.19103
- Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, Dong Yu, 27 Nov 2024 (v2), Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens, https://arxiv.org/abs/2411.17691
Binary Quantization
The extreme of quantization is to encode floating-point weights down to a single bit. This is binary quantization (or "binarization"), where there are only 2 possible weights, either 0 and 1, or alternatively -1 and +1. This compresses the model by a factor of 32 in storage (versus 32-bit floats), and simplifies the inference computations. In fact, binary quantization replaces multiplication by a floating-point weight with a simple addition (for 1) and a null test (for 0). For binary weights of -1 and +1, the -1 becomes a subtraction and the +1 an addition, which is usually further optimized into a sign-bit tweak. There are also other incarnations of binary neural network architectures that use only bitwise operations, such as XNOR networks and Weightless Neural Networks (WNNs).
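To make the no-multiplication point concrete, here is a minimal C++ sketch of a dot product with -1/+1 binary weights, stored one weight per byte purely for clarity; a real binarized kernel would pack 8 weights per byte and, when the activations are also binarized, use XNOR and popcount instead. The function name and layout are illustrative assumptions, not code from any of the papers below.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // With weights restricted to -1/+1, a dot product needs no multiplications:
    // each weight either adds or subtracts the corresponding activation.
    float binary_dot(const std::vector<float>& activations,
                     const std::vector<uint8_t>& sign_bits) {  // 1 means +1, 0 means -1
        float sum = 0.0f;
        for (size_t i = 0; i < activations.size(); ++i)
            sum += sign_bits[i] ? activations[i] : -activations[i];
        return sum;
    }

    int main() {
        std::vector<float> act = {0.5f, -1.25f, 2.0f};
        std::vector<uint8_t> w = {1, 0, 1};     // weights +1, -1, +1
        printf("%f\n", binary_dot(act, w));     // 0.5 + 1.25 + 2.0 = 3.75
        return 0;
    }

Packing 8 sign bits per byte is what gives the 32x storage reduction mentioned above.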
- H. Yang, M. Fritzsche, C. Bartz, and C. Meinel, Bmxnet: An open-source binary neural network implementation based on mxnet, CoRR, vol. abs/1705.09864, 2017, https://arxiv.org/abs/1705.09864, Code: https://github.com/hpi-xnor
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, Springer, 525–542, https://arxiv.org/abs/1603.05279
- B. McDanel, S. Teerapittayanon, and H. Kung, Embedded binarized neural networks, 2017, arXiv preprint arXiv:1709.02260, https://arxiv.org/abs/1709.02260
- Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio, Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 Feb 2016, https://arxiv.org/abs/1602.02830
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, 2016, Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 4114–4122, https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
- Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio, Neural Networks with Few Multiplications, Feb 2016, https://arxiv.org/abs/1510.03009v1
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016, https://arxiv.org/abs/1603.05279
- Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, BinaryConnect: Training deep neural networks with binary weights during propagations, In NeurIPS, pages 3123–3131, 2015, https://arxiv.org/abs/1511.00363
- Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In CVPR, pages 5918–5926, 2017, https://arxiv.org/abs/1702.00953
- Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, and Hong Zhou. Bivit: Extremely compressed binary vision transformer. arXiv preprint arXiv:2211.07091, 2022. https://arxiv.org/abs/2211.07091 (Softmax-aware binarization)
- Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, and Yashar Mehdad. Bit: Robustly binarized multi-distilled transformer. arXiv preprint arXiv:2205.13016, 2022. https://arxiv.org/abs/2205.13016, Code: https://github.com/facebookresearch/bit
- Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 19–28, 2017. https://arxiv.org/abs/1608.06049
- Zechun Liu, Zhiqiang Shen, Marios Savvides, and KwangTing Cheng. Reactnet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, pages 143–159. Springer, 2020. https://arxiv.org/abs/2003.03488
- Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533–7544. 2019. https://arxiv.org/abs/1906.02107, Code: https://github.com/plumerai/rethinking-bnn-optimization
- Lin, X.; Zhao, C.; and Pan, W. 2017. Towards Accurate Binary Convolutional Neural Network. Advances in Neural Information Processing Systems, 30, https://arxiv.org/abs/1711.11294
- Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat, Feb 2023, Binarized Neural Machine Translation, https://arxiv.org/abs/2302.04907
- Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe, "BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS", Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
- R. Andri, L. Cavigelli, D. Rossi and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights", Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), pp. 236-241, Jul. 2016. https://arxiv.org/abs/1606.05487v1
- Z. Cai, X. He, J. Sun and N. Vasconcelos, "Deep learning with low precision by half-wave Gaussian quantization", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
- R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu. Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
- Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
- Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Uses multiple single-bit weights combined to create a multi-binary quantization method.)
- Y Shang, Z Yuan, Q Wu, Z Dong, PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)
Ternary Quantization
Ternary quantization (or "ternarization") is the use of 3 weights: -1, 0, and +1. Storing three values still requires 2 bits per weight, so why not just use 4 weights? The answer is that ternary quantization can use zero-multiplication arithmetic in the inference algorithm, with an addition (for +1), a subtraction (for -1), and a null test (for 0).
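Here is a minimal C++ sketch of that zero-multiplication dot product, with one ternary weight per signed byte for clarity; real ternary kernels pack the codes more tightly and apply a scale factor, so treat the names and layout here as illustrative assumptions only.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Ternary weights take only the values -1, 0, or +1, so the dot product
    // needs no multiplications: add, subtract, or skip each activation.
    float ternary_dot(const std::vector<float>& activations,
                      const std::vector<int8_t>& weights) {  // each weight is -1, 0, or +1
        float sum = 0.0f;
        for (size_t i = 0; i < activations.size(); ++i) {
            if (weights[i] > 0)      sum += activations[i];   // +1: addition
            else if (weights[i] < 0) sum -= activations[i];   // -1: subtraction
            // 0: null test, the activation is skipped entirely
        }
        return sum;
    }

    int main() {
        std::vector<float> act = {0.5f, -1.25f, 2.0f, 0.75f};
        std::vector<int8_t> w = {1, 0, -1, 1};
        printf("%f\n", ternary_dot(act, w));    // 0.5 - 2.0 + 0.75 = -0.75
        return 0;
    }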
- N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey, Ternary neural networks with fine-grained quantization, CoRR, vol. abs/1705.01462, 2017, https://arxiv.org/abs/1705.01462
- Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016), https://arxiv.org/abs/1605.04711
- Zhu et al. 2016] Zhu, C.; Han, S.; Mao, H.; and Dally, W. J. 2016. Trained ternary quantization. arXiv preprint arXiv:1612.01064, https://arxiv.org/abs/1612.01064
- D Liu, X Liu, Ternary Quantization: A Survey, arXiv preprint arXiv:2303.01505, 2023, https://arxiv.org/abs/2303.01505
- E Yvinec, A Dapogny, K Bailly, Designing strong baselines for ternary neural network quantization through support and mass equalization, arXiv preprint arXiv:2306.17442, 2023, https://arxiv.org/abs/2306.17442
- Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, Junchi Yan, Nov 2022, Ternary Weight Networks, https://arxiv.org/abs/1605.04711, Code: https://github.com/Thinklab-SJTU/twns
- M Kim, S Lee, J Lee, S Hong, DS Chang, Token-Scaled Logit Distillation for Ternary Weight Generative Language Models, 2023, https://arxiv.org/abs/2308.06744,
- Hyperspherical Quantization: Toward Smaller and More Accurate Models, Dan Liu, Xi Chen, Chen Ma, Xue Liu, Dec 2022, https://arxiv.org/abs/2212.12653
- Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe, "BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS", Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
- S. K. Esser et al., "Convolutional networks for fast energy-efficient neuromorphic computing", Proc. Nat. Acad. Sci. USA, vol. 113, no. 41, pp. 11441-11446, 2016. https://arxiv.org/abs/1603.08270 (Ternary weights, binary activations.)
- Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian, 4 Jun 2024, Scalable MatMul-free Language Modeling, https://arxiv.org/abs/2406.02528 Code: https://github.com/ridgerchu/matmulfreellm (Uses addition via ternary quantization and elementwise Hadamard products to replace MatMul.)
- G Rutishauser, 2024, Agile and Efficient Inference of Quantized Neural Networks, Ph.D. Thesis, ETH Zurich, https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/675547/1/thesis.pdf
- Luca Dordoni, Dec 2023, Sparsification of deep neural network via ternary quantization, Masters Thesis, POLITECNICO DI TORINO, Italy, https://webthesis.biblio.polito.it/29424/1/tesi.pdf
- Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2016. Trained ternary quantization. arXiv:1612.01064 http://arxiv.org/abs/1612.01064
- David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- B Liu, F Li, X Wang, B Zhang, 2023, Ternary weight networks, https://ieeexplore.ieee.org/abstract/document/10094626/
- Z Liu, B Oguz, A Pappu, Y Shi, 2023, Binary and Ternary Natural Language Generation, https://arxiv.org/abs/2306.01841 Code: https://github.com/facebookresearch/Ternary_Binary_Transformer
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs] (Dec. 2015). http://arxiv.org/abs/1512.03385
- Xiaojing Fan, Chunliang Tao, 8 Aug 2024, Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness, https://arxiv.org/abs/2408.04585
- R. Agrawal, N. S. Abhijith, U. Anil Kumar, S. Veeramachaneni and S. E. Ahmed, 2024, Energy-Efficient Ternary Multiplier, 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, United Arab Emirates, 2024, pp. 382-387, doi: 10.1109/AICAS59952.2024.10595938, https://ieeexplore.ieee.org/abstract/document/10595938
- Mohamed Mekkouri, Marc Sun, Leandro von Werra, Pedro Cuenca, Omar Sanseviero, Thomas Wolf, September 18, 2024, Fine-tuning LLMs to 1.58bit: extreme quantization made easy, https://huggingface.co/blog/1_58_llm_extreme_quantization
2-Bit Quantization (INT2)
This section refers to non-ternary 2-bit quantization, using 4 distinct weights. In practice, 2-bit quantization is regarded as still having some problems with model accuracy, whereas 4-bit integer quantization is considered a more reasonable speed-versus-accuracy tradeoff. On the other hand, maybe this reputation is unwarranted, since Liu et al. (2022) tested many models at 2, 3, and 4 bits (see Table 1 in their paper), and the extra accuracy of 4 bits over 2 bits was usually only a couple of percentage points (for double the space).
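For illustration, here is a minimal C++ sketch of 2-bit packing: four codes per byte, each mapping to one of 4 reconstruction values via a tiny lookup table. The codebook values and helper names are made-up assumptions for the sketch, not taken from any of the papers listed below.

    #include <cstdint>
    #include <cstdio>

    // Four 2-bit codes pack into one byte; each code indexes a tiny table
    // of 4 reconstruction values (the "4 distinct weights").
    uint8_t pack4(uint8_t c0, uint8_t c1, uint8_t c2, uint8_t c3) {
        return (uint8_t)((c0 & 3) | ((c1 & 3) << 2) | ((c2 & 3) << 4) | ((c3 & 3) << 6));
    }

    float unpack_weight(uint8_t packed, int index, const float table[4]) {
        uint8_t code = (packed >> (2 * index)) & 3;  // extract the 2-bit code
        return table[code];                          // dequantize via lookup
    }

    int main() {
        const float table[4] = {-1.0f, -0.33f, 0.33f, 1.0f};  // illustrative codebook
        uint8_t byte = pack4(0, 3, 1, 2);
        for (int i = 0; i < 4; ++i)
            printf("weight %d = %f\n", i, unpack_weight(byte, i, table));
        return 0;
    }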
Research papers on integer quantization using 2-bits include:
- Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, July 2018, Bridging the Accuracy Gap for 2-bit Quantized Neural Networks (QNN), https://arxiv.org/abs/1807.06964
- Jungwook Choi, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Kailash Gopalakrishnan, Zhuo Wang, Pierce Chuang, 2019, Accurate and Efficient 2-bit Quantized Neural Networks, Proceedings of Machine Learning and Systems 1 (MLSys 2019), https://proceedings.mlsys.org/paper/2019/file/006f52e9102a8d3be2fe5614f42ba989-Paper.pdf
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
- Z. Cai, X. He, J. Sun and N. Vasconcelos, "Deep learning with low precision by half-wave Gaussian quantization", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5918-5926, Jul. 2017. https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
- Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks. In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
- Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models with 2-bit weights and 2-bit activations, and also 3-bits and 4-bits.)
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Unique 2-bit quantization approach is really a double-binarized quantization method.)
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
- Yuji Chai, John Gkountouras, Glenn G. Ko, David Brooks, Gu-Yeon Wei, June 2023, INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation, arXiv preprint arXiv:2306.08162, https://arxiv.org/abs/2306.08162
- Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 19 Apr 2024, decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points, https://arxiv.org/abs/2404.12759 Code: https://github.com/bytedance/decoupleQ (Decouple parameters into integer and floating-point parts for more accurate quantization at low bitwidths.)
- S Ashfaq, A Hoffman, S Mitra, S Sah, Sep 2023, DeepliteRT: Computer Vision at the Edge, https://arxiv.org/pdf/2309.10878.pdf
- Yang Sui, Yanyu Li, Anil Kag, Yerlan Idelbayev, Junli Cao, Ju Hu, Dhritiman Sagar, Bo Yuan, Sergey Tulyakov, Jian Ren, June 2024, BitsFusion: 1.99 bits Weight Quantization of Diffusion Model, https://snap-research.github.io/BitsFusion/
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 5 Feb 2024, KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, https://arxiv.org/abs/2402.02750 Code: https://github.com/jy-yuan/KIVI (KV cache 2-bit quantization on Llama-2, Falcon and Mistral models.)
- Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei, 27 Feb 2024, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, https://arxiv.org/abs/2402.17764 (Uses ternary weights -1, 0, and +1; the "1.58 bits" is log2(3), the information content per ternary weight.)
- Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 8 Feb 2024, Accurate LoRA-Finetuning Quantization of LLMs via Information Retention, https://arxiv.org/abs/2402.05445 Code: https://github.com/htqin/ir-qlora
- Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
- Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, et al. 2019. Learned step size quantization. arXiv:1902.08153 http://arxiv.org/abs/1902.08153
- Xingyu Zheng, Haotong Qin, Xudong Ma, Mingyuan Zhang, Haojie Hao, Jiakai Wang, Zixiang Zhao, Jinyang Guo, Xianglong Liu, 8 Apr 2024, BinaryDM: Towards Accurate Binarization of Diffusion Model, https://arxiv.org/abs/2404.05662 (Binary quantization applied to diffusion models.)
- U Bamba, N Anand, S Aggarwal, DK Prasad, DK Gupta, 2024, Partial Binarization of Neural Networks for Budget-Aware Efficient Learning, https://openaccess.thecvf.com/content/WACV2024/papers/Bamba_Partial_Binarization_of_Neural_Networks_for_Budget-Aware_Efficient_Learning_WACV_2024_paper.pdf Code: https://github.com/transmuteAI/trailmet
- M. Kim and P. Smaragdis. 2016, Bitwise neural networks. arXiv preprint arXiv:1601.06071, https://arxiv.org/abs/1601.06071
- Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che, 17 Feb 2024, OneBit: Towards Extremely Low-bit Large Language Models, https://arxiv.org/abs/2402.11295
- Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi, 6 Feb 2024, BiLLM: Pushing the Limit of Post-Training Quantization for LLMs, https://arxiv.org/abs/2402.04291 (Working on 1-bit quantization through methods to allocate weights to binary distributions.)
- B Zhang, S Xu, M Lin, T Wang, D Doermann, 2023 Binary Neural Networks: Algorithms, Architectures, and Applications, https://books.google.com.au/books?hl=en&lr=&id=r9jeEAAAQBAJ&oi=fnd&pg=PT3&ots=e7uJtDHhPl&sig=IrLor3Mh44oIbvTHWgKqoqoYE3U&redir_esc=y#v=onepage&q&f=false
- N Phipps, JJ Shang, TH Teo, IC Wey, 2023, Pre-Computing Batch Normalisation Parameters for Edge Devices on a Binarized Neural Network, https://www.mdpi.com/1424-8220/23/12/5556
- David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Z Liu, B Oguz, A Pappu, Y Shi, 2023, Binary and Ternary Natural Language Generation, https://arxiv.org/abs/2306.01841 Code: https://github.com/facebookresearch/Ternary_Binary_Transformer
- Yeyu Huang, Feb 21, 2024, The 2-bit Quantization is Insane! See How to Run Mixtral-8x7B on Free-tier Colab, https://levelup.gitconnected.com/the-2-bit-quantization-is-insane-see-how-to-run-mixtral-8x7b-on-free-tier-colab-2803e39b9b9d
- Yipin Guo, Yilin Lang, Qinyuan Ren, 3 Jul 2024, GPTQT: Quantize Large Language Models Twice to Push the Efficiency, https://arxiv.org/abs/2407.02891 (Two-phase quantization, first to high-bit, then to binary quantization.)
- Edgar Solis Romeu; Dmitry Vadimovich Shashev, May 2024, Binary Convolution Model for Image Classification, 2024 X International Conference on Information Technology and Nanotechnology (ITNT), 20-24 May 2024, https://ieeexplore.ieee.org/abstract/document/10582370/
- Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh, 10 Mar 2024, FrameQuant: Flexible Low-Bit Quantization for Transformers, https://arxiv.org/abs/2403.06082 (A method using 2-bit quantization.)
- Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa, 15 Jan 2024 (v2), QuIP: 2-Bit Quantization of Large Language Models With Guarantees, https://arxiv.org/abs/2307.13304 Code: https://github.com/Cornell-RelaxML/QuIP
- Jinendra Malekar, Mohammed E. Elbtity, Ramtin Zand, 21 Aug 2024, Matmul or No Matmul in the Era of 1-bit LLMs, https://arxiv.org/abs/2408.11939
- Van Minh Nguyen, Cristian Ocampo, Aymen Askri, Louis Leconte, Ba-Hien Tran, 25 May 2024, BOLD: Boolean Logic Deep Learning, https://arxiv.org/abs/2405.16339 (Unique method of training using Booleans, which is similar to binary networks or zero-multiplication models.)
- A. T. L. Bacellar, Z. Susskind, M. Breternitz, L. K. John, F. M. G. França and P. M. V. Lima, "Soon Filter: Advancing Tiny Neural Architectures for High Throughput Edge Inference," 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650678, https://ieeexplore.ieee.org/abstract/document/10650678
- Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed S. Abdelfattah, 8 Sep 2024, BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration, https://arxiv.org/abs/2409.05227
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Elias Frantar, September, 2024, Compressing Large Neural Networks Algorithms, Systems and Scaling Laws, Ph.D. Thesis, Graduate School, Institute of Science and Technology, Austria, https://research-explorer.ista.ac.at/download/17485/17880/frantar_thesis_final.pdf
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
- Qian Tao, Wenyuan Yu, Jingren Zhou, 17 Oct 2024, AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations, https://arxiv.org/abs/2410.13212
- Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
- Majid Daliri, Zhao Song, Chiwun Yang, 3 Nov 2024, Unlocking the Theory Behind Scaling 1-Bit Neural Networks, https://arxiv.org/abs/2411.01663
- Hongyu Wang, Shuming Ma, Furu Wei, November 2024, BitNet a4.8: 4-bit Activations for 1-bit LLMs, arXiv preprint arXiv:2411.04965, https://ui.adsabs.harvard.edu/abs/2024arXiv241104965W/abstract
- Ben Dickson, November 13, 2024, How Microsoft’s next-gen BitNet architecture is turbocharging LLM efficiency, https://venturebeat.com/ai/how-microsofts-next-gen-bitnet-architecture-is-turbocharging-llm-efficiency/
- Felix Petersen, Hilde Kuehne, Christian Borgelt, Julian Welzel, Stefano Ermon, 7 Nov 2024, Convolutional Differentiable Logic Gate Networks, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://arxiv.org/abs/2411.04732
- Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen, 18 Nov 2024, Bi-Mamba: Towards Accurate 1-Bit State Space Models, https://arxiv.org/abs/2411.11843
- Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian, 27 Nov 2024, FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs--Down to 2 Bits! https://arxiv.org/abs/2411.18055
- Luca Colombo, Fabrizio Pittorino, Manuel Roveri, 28 Nov 2024, Training Multi-Layer Binary Neural Networks With Local Binary Error Signals, https://arxiv.org/abs/2412.00119
- Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti, 6 Dec 2024, BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits, https://arxiv.org/abs/2412.05225
3-Bit Quantization (INT3)
3-bit quantization is uncommon and unpopular, and it's not entirely clear why. With 2^3=8 distinct weights, it has better accuracy than 2-bit quantization and saves 25% of the storage of its more popular 4-bit cousin, while being only slightly less accurate. Maybe it just seems too inelegant for programmers to cram 3-bit values into 8-bit or 32-bit words for packing and unpacking? But no, even 5-bit quantization gets recommended by AI experts on forums, whereas if you listen for supporters of the 3-bit versions, all you hear is crickets.
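To show the packing awkwardness concretely, here is a minimal C++ sketch of one obvious layout: ten 3-bit codes per 32-bit word, with 2 bits left over wasted. The helper names are illustrative assumptions; real 3-bit kernels use various other layouts.

    #include <cstdint>
    #include <cstdio>

    // Ten 3-bit codes fit into a 32-bit word, with 2 bits left over wasted,
    // which is part of why 3-bit packing feels awkward compared to 4-bit.
    uint32_t pack10x3(const uint8_t codes[10]) {
        uint32_t word = 0;
        for (int i = 0; i < 10; ++i)
            word |= (uint32_t)(codes[i] & 7) << (3 * i);
        return word;
    }

    uint8_t unpack3(uint32_t word, int index) {       // index in 0..9
        return (uint8_t)((word >> (3 * index)) & 7);  // one of 2^3=8 codes
    }

    int main() {
        uint8_t codes[10] = {7, 0, 3, 5, 1, 6, 2, 4, 7, 0};
        uint32_t word = pack10x3(codes);
        for (int i = 0; i < 10; ++i)
            printf("%d ", unpack3(word, i));          // prints the original codes
        printf("\n");
        return 0;
    }

An alternative is to pack eight 3-bit codes into three bytes, but either way the indexing is messier than 4-bit nibbles.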
Even the research papers on 3-bit quantization don't like to admit to it, and you'll struggle to even find "3-bit quantization" in a paper title. Here are some papers on 3-bit quantization (as if you care):
- Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, 2023, https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
- Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks. In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
- Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
- A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
- N. Frumkin, D. Gope, and D. Marculescu, “CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers,” arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
- B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2, 3 and 4 bits for weights, and mixed-precision quantization.)
- W Cheng, Y Cai, K Lv, H Shen, Oct 2023, TEQ: Trainable Equivalent Transformation for Quantization of LLMs, https://arxiv.org/pdf/2310.10944.pdf
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
- Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
4-Bit Quantization (INT4)
4-bit quantization is one of the most popular quantization regimes in practical usage. It is far more common to see a 4-bit quantization of an open source model than binary, 2-bit, or 3-bit versions. INT4 gives eight-fold storage compression (from 32 bits down to 4 bits), which reduces memory requirements and can also speed up inference by reducing memory-cache transfer overheads on both CPU and GPU. The 4 bits allow 2^4=16 distinct weights, which is enough for reasonable accuracy compared to the full-precision model. The 4-bit weights also fit cleanly into 8-bit bytes or 32-bit integers, making the packing and unpacking code simple and efficient.
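Here is a minimal C++ sketch of nibble packing and symmetric dequantization: two signed 4-bit codes per byte, with a single per-tensor scale. The scale value and function names are illustrative assumptions, not any particular paper's or library's method.

    #include <cstdint>
    #include <cstdio>

    // Two signed 4-bit codes (nibbles) pack into one byte; dequantization maps
    // each code back to a float with a single per-tensor scale (zero point 0).
    uint8_t pack_nibbles(int lo, int hi) {            // each code in -8..7
        return (uint8_t)((lo & 0xF) | ((hi & 0xF) << 4));
    }

    int unpack_nibble(uint8_t byte, int which) {      // which = 0 (low) or 1 (high)
        int code = (byte >> (4 * which)) & 0xF;
        return (code >= 8) ? (code - 16) : code;      // sign-extend 4-bit two's complement
    }

    int main() {
        const float scale = 0.05f;                    // illustrative per-tensor scale
        uint8_t byte = pack_nibbles(-3, 7);
        for (int which = 0; which < 2; ++which) {
            int code = unpack_nibble(byte, which);
            printf("code %d -> weight %f\n", code, code * scale);
        }
        return 0;
    }

Asymmetric schemes add a zero-point offset, and per-channel or group-wise scales are common refinements on this basic layout.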
Research papers on 4-bit quantization:
- Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry, May 2019, Post-training 4-bit quantization of convolution networks for rapid-deployment, NeurIPS 2019, https://arxiv.org/abs/1810.05723, https://proceedings.neurips.cc/paper_files/paper/2019/file/c0a62e133894cdce435bcb4a5df1db2d-Paper.pdf
- Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. 2022. BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks. In European Conference on Computer Vision, Cham: Springer Nature Switzerland, 17-33. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_2 (Evaluates quantization precision from 2-bits to 4-bits.)
- Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen. Apr 2022. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4942-4952, https://arxiv.org/abs/2111.14826, Code: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization (Contains an extensive review of models from 2-bits to 4-bits for both weights and activations.)
- Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir V. Arlazarov, Oct 2020, Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices, 2020 25th International Conference on Pattern Recognition (ICPR), https://arxiv.org/abs/2009.06488, https://ieeexplore.ieee.org/document/9412841
- Xiao Sun, Naigang Wang, Chia-yu Chen, Jia-min Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, 2020, Ultra-low precision 4-bit training of deep neural networks 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8cb45877d577-Paper.pdf
- Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg Rybakov. 4-bit conformer with native quantization aware training for speech recognition. In Hanseok Ko and John H. L. Hansen, editors, Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 1711–1715. ISCA, 2022. https://arxiv.org/abs/2203.15952
- Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152, 2023, https://arxiv.org/abs/2305.14152 (Quantization to 3-bit and 4-bit levels.)
- HuggingFace, May 2023, Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA, HuggingFace Blog, https://huggingface.co/blog/4bit-transformers-bitsandbytes
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
- Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization,” in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
- A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 811–824. https://arxiv.org/abs/2005.03842 (Compares to BERT at 3-bit and 4-bit quantization levels.)
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
- N. Frumkin, D. Gope, and D. Marculescu, “CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers,” arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
- B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
- J Liu, R Gong, X Wei, Z Dong, J Cai, B Zhuang, Oct 2023, QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models, arXiv preprint arXiv:2310.08041, https://arxiv.org/pdf/2310.08041.pdf (PTQ with 4-bit quantization of Llama models.)
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng, Dec 2023, Efficient LLM Inference on CPUs, Intel, NeurIPS 2023, https://arxiv.org/abs/2311.00502 Code: https://github.com/intel/intel-extension-for-transformers
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and highly optimized kernel with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on LLamam2 3B to 20B with 20-80ms latency per token.)
- Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin, 1 May 2024, When Quantization Affects Confidence of Large Language Models? https://arxiv.org/abs/2405.00632
- Martin Thissen, April 20, 2024, Llama 3 on Your Local Computer | Free GPT-4 Alternative, https://medium.com/@martin-thissen/llama-3-on-your-local-computer-free-gpt-4-alternative-1f533e9abff7 (Llama3-70B with 4-bit quantization using vLLM for inference on NVIDIA RTX 6000 Ada GPU.)
- Robert Wolfe, Isaac Slaughter, Bin Han, Bingbing Wen, Yiwei Yang, Lucas Rosenblatt, Bernease Herman, Eva Brown, Zening Qu, Nic Weber, and Bill Howe. 2024. Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings. In ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT ’24), June 3–6, 2024, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3630106.3658966 https://arxiv.org/pdf/2405.16820
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- P Dong, L Lu, C Wu, C Lyu, G Yuan, H Tang, Y Wang, 2023, PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile, https://openreview.net/pdf?id=N56hAiQvot Code: https://github.com/PeiyanFlying/PackQViT
- Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman, 30 Mar 2024, QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, https://arxiv.org/abs/2404.00456 Code: https://github.com/spcl/QuaRot
- Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng, 6 Dec 2023, SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM, https://arxiv.org/abs/2312.03788 (Using 4-bit quantization to run Code Llama-34B model on an A100 40GB GPU.)
- Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 8 Feb 2024, Accurate LoRA-Finetuning Quantization of LLMs via Information Retention, https://arxiv.org/abs/2402.05445 Code: https://github.com/htqin/ir-qlora
- Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci, Nov 2023, Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, https://arxiv.org/abs/2310.19102
- Ignacio de Gregorio, June 2024, My Thoughts on Apple Intelligence: Leveling the Stakes & Betraying the Essence, https://readmedium.com/en/my-thoughts-on-apple-intelligence-16a793359cb5
- W Cheng, Y Cai, K Lv, H Shen, Oct 2023, TEQ: Trainable Equivalent Transformation for Quantization of LLMs, https://arxiv.org/pdf/2310.10944.pdf
- LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun, 3 Aug 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 Code: https://github.com/OpenBMB/MiniCPM-V
- Gavin Li, August 3rd, 2024, Crazy Challenge: Run Llama 405B on a 8GB VRAM GPU, https://ai.gopubby.com/crazy-challenge-run-llama-405b-on-a-8gb-vram-gpu-ab5a280a3889 (Run Llama's 405B model on a low-end GPU via 4-bit quantization and layer-by-layer inference, both to save memory.)
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Zihao Ye,, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Intel, Jul 3, 2024, Reduce LLM Footprint with OpenVINO™ Toolkit Weight Compression, OpenVINO™ toolkit, https://medium.com/openvino-toolkit/reduce-llm-footprint-with-openvino-toolkit-weight-compression-4d276ad824e5
- Sophia R. Cunningham,Dominique Archambault,Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
- Shikhar Bajpai, Sep 27, 2024, Shrinking Elephants: A Funny Guide to 4-bit and 8-bit Quantization for LLMs with LoRA, https://medium.com/@shikharstruck/shrinking-elephants-a-funny-guide-to-4-bit-and-8-bit-quantization-for-llms-with-lora-ddf9f1a62070
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou, 30 Sep 2024, Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference, https://arxiv.org/abs/2409.20361 (Handling of outliers in INT4 quantization.)
- L. F. H. Duarte, G. B. Nardes, W. Grignani, D. R. Melo and C. A. Zeferino, "Deep Nibble: A 4-bit Number Format for Efficient DNN Training and Inference in FPGA," 2024 37th SBC/SBMicro/IEEE Symposium on Integrated Circuits and Systems Design (SBCCI), Joao Pessoa, Brazil, 2024, pp. 1-5, doi: 10.1109/SBCCI62366.2024.10703994. https://ieeexplore.ieee.org/abstract/document/10703994 (Log quantization method in 4-bits.)
- Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang, 16 Oct 2024, COMET: Towards Partical W4A4KV4 LLMs Serving, https://arxiv.org/abs/2410.12168
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
- Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
- David Spuler, October 26, 2024, Weight Clustering Needs a Refresh, Aussie AI Blog, https://www.aussieai.com/blog/weight-clustering
- Yang Zhou, Zhen Dong, Ellick Chan, Dhiraj Kalamkar, Diana Marculescu, Kurt Keutzer, 26 Oct 2024, DQRM: Deep Quantized Recommendation Models, https://arxiv.org/abs/2410.20046
- X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- Datadrifters, Aug 14, 2024, Llama 3.1 INT4 Quantization: Cut Costs by 75% Without Sacrificing Performance! https://medium.com/@datadrifters/llama-3-1-int4-quantization-cut-costs-by-75-without-sacrificing-performance-420c58da01ab
- Hongyu Wang, Shuming Ma, Furu Wei, November 2024, BitNet a4.8: 4-bit Activations for 1-bit LLMs, arXiv preprint arXiv:2411.04965, https://ui.adsabs.harvard.edu/abs/2024arXiv241104965W/abstract
- Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen, 17 Nov 2024, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2411.10958 https://github.com/thu-ml/SageAttention
- Daniel & Michael, Unsloth, Unsloth - Dynamic 4-bit Quantization, Dec 4, 2024, https://unsloth.ai/blog/dynamic-4bit
5-Bit Quantization (INT5)
Research papers on 5-bit quantization:
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf (2–3 bits for weights and 2–5 bits for activation.)
- B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
6-Bit Quantization (INT6)
Research papers on 6-bit quantization:
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization,” in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 28092–28103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
- M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network? TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
7-Bit Quantization (INT7)
Research papers on 7-bit quantization:
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network? TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- B Gouin-Ferland, R Coffee, AC Therrien, 2022, Data reduction through optimized scalar quantization for more compact neural networks, Frontiers in Physics, https://www.frontiersin.org/articles/10.3389/fphy.2022.957128/full (Examined 3 to 7 bit weights for quantization.)
8-Bit Integer Quantization (INT8)
Research papers on 8-bit quantization:
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Nov 2022, https://arxiv.org/abs/2208.07339
- A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of quantization from 2-bits to 8-bits.)
- O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, “Q8BERT: Quantized 8Bit BERT,” in Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), 2019, pp. 36–39. https://arxiv.org/abs/1910.06188
- B. Li, S. Lu, K. Xie, and Z. Wang, “Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016 (8-bit and 16-bit integer quantization.)
- Gunho Park, Baeseong Park, Sungjae Lee, Minsub Kim, Byeongwook Kim, Se Jung Kwon, Youngjoo Lee, Dongsoo Lee, 2022, nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models CoRR, abs/2206.09557, https://arxiv.org/abs/2206.09557v2
- Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, “Towards fully 8-bit integer inference for the transformer model,” the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI2020), 2020. https://arxiv.org/abs/2009.08034 (Quantizes not just weights, but also non-linear functions such as Softmax.)
- Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan, 2020, Towards Unified INT8 Training for Convolutional Neural Network, https://arxiv.org/abs/1912.12607, https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_Towards_Unified_INT8_Training_for_Convolutional_Neural_Network_CVPR_2020_paper.pdf
- Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
- Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee, Apr 2023, LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models https://arxiv.org/abs/2206.09557
- V. Vanhoucke, A. Senior and M. Z. Mao, "Improving the speed of neural networks on CPUs", Proc. Deep Learn. Unsupervised Feature Learn. Workshop, pp. 1-8, 2011. https://research.google/pubs/pub37631/, PDF: https://research.google/pubs/pub37631.pdf (INT8 quantization.)
- M. A. Nasution, D. Chahyati and M. I. Fanany, "Faster R-CNN with structured sparsity learning and Ristretto for mobile environment", Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, 2022, GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) https://papers.nips.cc/paper_files/paper/2022/hash/c3ba4962c05c49636d4c6206a97e9c8a-Abstract-Conference.html
- A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, “Accelerating sparse deep neural networks,” arXiv preprint arXiv:2104.08378, 2021. https://arxiv.org/abs/2104.08378
- Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization,” in European Conference on Computer Vision. Springer, 2022, pp. 191–207. https://link.springer.com/chapter/10.1007/978-3-031-19775-8_12, PDF: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136720190.pdf (Has 4-bit, 6-bit and 8-bit quantization.)
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 28 092–28 103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
- N. Frumkin, D. Gope, and D. Marculescu, “CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers,” arXiv preprint arXiv:2211.09643, 2022. https://arxiv.org/abs/2211.09643 (Examines 3-bit, 4-bit, and 8-bit.)
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713, https://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network? TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Pierre-Emmanuel Novac, March 2023, MicroAI: Embedded Artificial Intelligence for Human Activity Recognition on Smart Glasses, Ph.D. Thesis, Artificial Intelligence. Université Côte d’Azur, https://theses.hal.science/tel-04049008/document (Uses INT8 and INT16 quantization.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 8-bit weights, along with 2-4 bits.)
- Radha Gulhane, 2024, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication, Masters Thesis, Computer Science and Engineering, The Ohio State University, https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=osu1713381834648517&disposition=inline
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2, HippoML Blog, Jan 2024, https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482
- Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He, Oct 2023, ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers, https://arxiv.org/abs/2310.17723
- Y Hua, L Yu, X Meng, Z Qin, 2021, A Dynamic Balance Quantization Method for YOLOv3, Journal of Physics, PDF: https://iopscience.iop.org/article/10.1088/1742-6596/1848/1/012157/pdf
- PE Novac, G Boukli Hacene, A Pegatoquet, 2021, Quantization and deployment of deep neural networks on microcontrollers, Sensors, 2021, https://www.mdpi.com/1424-8220/21/9/2984
- PyTorch, 2023, Model inference optimization checklist, https://pytorch.org/serve/performance_checklist.html
- M. Horowitz, Feb 2014, "Computing’s energy problem (and what we can do about it)", IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 10-14, Feb. 2014. https://ieeexplore.ieee.org/document/6757323 (INT8 quantization.)
- J Wu, M Song, J Zhao, HKH So, 2024, A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference, https://wujiajunic.cn/publication/ipdpsw2024/IPDPSW2024.pdf
- P Dong, L Lu, C Wu, C Lyu, G Yuan, H Tang, Y Wang, 2023, PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile, https://openreview.net/pdf?id=N56hAiQvot Code: https://github.com/PeiyanFlying/PackQViT
- JINGXUAN YANG, XIAOQIN WANG, AND YIYING JIAN, 17 February 2024, CANET: Quantized Neural Network Inference With 8-bit Carry-Aware Accumulator, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10445180
- D. Ilin, E. Limonova, V. Arlazarov, and D. Nikolaev, “Fast integer approximations in convolutional neural networks using layer-by-layer training,” in Ninth International Conference on Machine Vision, 103410Q–103410Q, International Society for Optics and Photonics (2017). DOI: 10.1117/12.2268722. https://spie.org/Publications/Proceedings/Paper/10.1117/12.2268722?SSO=1
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
- Milan Tamang June 30, 2024, How I built my own custom 8-bit Quantizer from scratch: a step-by-step guide using PyTorch, https://towardsai.net/p/machine-learning/how-i-built-my-own-custom-8-bit-quantizer-from-scratch-a-step-by-step-guide-using-pytorch
- QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, Shira Guskin, Moshe Wasserblat, Chang Wang, Haihao Shen, May 2023, https://arxiv.org/abs/2210.17114 (Intel labs paper. Low-bit quantization, distillation, and Length-Adaptive Transformer (LAT) technique. )
- Intel, Fast DistilBERT on CPUs, 2022, https://arxiv.org/pdf/2211.07715.pdf
- Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng, 6 Jun 2024, Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp, https://arxiv.org/abs/2406.10816
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
- Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, 20 Jun 2022 (v2), 8-bit Optimizers via Block-wise Quantization, https://arxiv.org/abs/2110.02861
- Intel, Jul 3, 2024, Reduce LLM Footprint with OpenVINO™ Toolkit Weight Compression, OpenVINO™ toolkit, https://medium.com/openvino-toolkit/reduce-llm-footprint-with-openvino-toolkit-weight-compression-4d276ad824e5
- Sophia R. Cunningham,Dominique Archambault,Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
- Fabien Geyer, Johannes Freitag, Tobias Schulz, Sascha Uhrig, 16 Jan 2024, Efficient and Mathematically Robust Operations for Certified Neural Networks Inference, https://arxiv.org/abs/2401.08225 https://arxiv.org/pdf/2401.08225 (Finds that fixed point arithmetic is more efficient with comparable accuracy to using floating point.)
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
- Jaxpruner: A Concise Library for Sparsity Research, Joo Hyung Lee, Wonpyo Park, Nicole Elyse Mitchell, Jonathan Pilault, Johan Samir Obando Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Woohyun Han, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart J.C. Bik, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci, Conference on Parsimony and Learning, PMLR 234:515-528, 2024. https://proceedings.mlr.press/v234/lee24a.html https://proceedings.mlr.press/v234/lee24a/lee24a.pdf https://openreview.net/forum?id=H2rCZCfXkS https://openreview.net/pdf?id=H2rCZCfXkS
- Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang, 26 Sep 2024 (v2), INT-FlashAttention: Enabling Flash Attention for INT8 Quantization, https://arxiv.org/abs/2409.16997 https://github.com/INT-FlashAttention2024/INT-FlashAttention
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- Jiajun Wu, Mo Song, Jingmin Zhao, Yizhao Gao, Jia Li, Hayden Kwok-Hay So, 6 Nov 2024, TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture, https://arxiv.org/abs/2411.03697
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
9-Bit Quantization (INT9)
Research papers on 9-bit quantization:
- M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network? TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- W Jiang, P Liu, F Wen, 2017, An improved vector quantization method using deep neural network, AEU - International Journal of Electronics and Communications, Volume 72, February 2017, Pages 178-183, https://www.sciencedirect.com/science/article/pii/S1434841116313954
- Y Nakahara, M Kiyama, M Amagasaki, 2020, Relationship between recognition accuracy and numerical precision in convolutional neural network models, https://www.jstage.jst.go.jp/article/transinf/E103.D/12/E103.D_2020PAL0002/_pdf
10-Bit Quantization (INT10)
Research papers on 10-bit quantization:
- M Giacobbe, TA Henzinger, M Lechner, 2020, How many bits does it take to quantize your neural network? TACAS 2020, https://link.springer.com/chapter/10.1007/978-3-030-45237-7_5, PDF: https://link.springer.com/content/pdf/10.1007/978-3-030-45237-7_5.pdf (Ran experiments from 6-bit to 10-bit quantization.)
- J Shi, M Lu, F Chen, S Pu, Z Ma, 2022, Rate-Distortion Optimized Post-Training Quantization for Learned Image Compression arXiv preprint arXiv:2211.02854, https://arxiv.org/abs/2211.02854
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
11-Bit Quantization (INT11)
Research papers on 11-bit quantization:
- G Dundar, K Rose, 1995, The effects of quantization on multilayer neural networks, IEEE Transactions on Neural Networks, Volume 6, Issue 6, November 1995, https://ieeexplore.ieee.org/abstract/document/471364
- Fang Tang, Denis Guangyin Chen, Bo Wang, Amine Bermak, 2013, Low-Power CMOS Image Sensor Based on Column-Parallel Single-Slope/SAR Quantization Scheme, IEEE Transactions on Electron Devices, Vol. 60, No. 8, August 2013, https://ieeexplore.ieee.org/document/6547236, PDF: https://ss-sensing.com/paper/Low-Power%20CMOS%20Image%20Sensor%20Based%20on%20Column-Parallel%20Single-Slope-SAR%20Quantization%20Scheme.pdf
12-Bit Quantization (INT12)
Research papers on 12-bit quantization:
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Xishan Zhang, Shaoli Liu, Rui Zhang, Chang Liu, Di Huang, Shiyi Zhou, Jiaming Guo, Qi Guo, Zidong Du, Tian Zhi, Yunji Chen, 2020, Fixed-Point Back-Propagation Training, CVPR 2020, PDF: https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Fixed-Point_Back-Propagation_Training_CVPR_2020_paper.pdf
16-Bit Integer Quantization (INT16)
INT16 quantization uses 16-bit integers, halving the space of FP32 weights/activations and allowing integer arithmetic kernels. The pros and cons of integer versus floating-point computation deserve consideration here, since FP16 quantization uses the same 16-bit memory size but may be more accurate than quantized 16-bit integers.
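As an illustration, here is a minimal C++ sketch of symmetric per-tensor INT16 quantization, where a single scale is derived from the maximum absolute weight; the struct and function names are illustrative assumptions, not any particular library's API.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Quantized tensor: 16-bit integer values plus one FP32 scale for the whole tensor.
    struct QuantizedInt16 {
        std::vector<int16_t> values;
        float scale;  // dequantized weight = value * scale
    };

    QuantizedInt16 quantize_int16(const std::vector<float>& weights) {
        float maxabs = 0.0f;
        for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
        QuantizedInt16 q;
        q.scale = (maxabs > 0.0f) ? maxabs / 32767.0f : 1.0f;
        for (float w : weights)
            q.values.push_back((int16_t)std::lround(w / q.scale));
        return q;
    }

    int main() {
        QuantizedInt16 q = quantize_int16({ 0.5f, -1.25f, 0.003f });
        for (size_t i = 0; i < q.values.size(); ++i)
            printf("%d -> %f\n", (int)q.values[i], q.values[i] * q.scale);
        return 0;
    }

Accuracy hinges on the scale choice; per-channel scales typically improve on a single per-tensor scale.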
Research papers on 16-bit integer quantization:
- B. Li, S. Lu, K. Xie, and Z. Wang, “Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022, pp. 410–413. https://ieeexplore.ieee.org/document/9912016
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf (Evaluates INT8, INT10, and INT16 quantization.)
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling, 2019, Data-free quantization through weight equalization and bias correction, PDF: https://openaccess.thecvf.com/content_ICCV_2019/papers/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.pdf (Evaluates INT5, INT6, INT8, INT10, INT12, and INT16.)
- Pierre-Emmanuel Novac, March 2023, MicroAI: Embedded Artificial Intelligence for Human Activity Recognition on Smart Glasses, Ph.D. Thesis, Artificial Intelligence. Université Côte d’Azur, https://theses.hal.science/tel-04049008/document (Uses INT8 and INT16 quantization.)
- X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
32-Bit Integer Quantization (INT32)
INT32 is not an effective form of "model compression", because it's not compressed at all! The data is no smaller than the FP32 raw weights, although it does allow integer arithmetic instead of floating-point computations. Also closely related is fixed-point quantization, using 32-bit integers and integer arithmetic.
Research papers on 32-bit integer quantization:
- B Jacob, S Kligys, B Chen, M Zhu, 2018, Quantization and training of neural networks for efficient integer-arithmetic-only inference, http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
- Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius, 20 Apr 2020, Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation, https://arxiv.org/abs/2004.09602
- Z Li, Q Gu, 2022, I-ViT: integer-only quantization for efficient vision transformer inference, https://arxiv.org/abs/2207.01405
Mixed-Precision Quantization
Research papers on mixed-precision quantization:
- Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand, 2022, Accelerating RNN-Based Speech Enhancement on a Multi-core MCU with Mixed FP16-INT8 Post-training Quantization https://arxiv.org/abs/2210.07692 (Mixed precision FP16-INT8 quantization.)
- Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, Vadim Pirogov, 2018, Mixed Precision Training of Convolutional Neural Networks using Integer Operations, https://arxiv.org/pdf/1802.00930
- M. A. Nasution, D. Chahyati and M. I. Fanany, "Faster R-CNN with structured sparsity learning and Ristretto for mobile environment", Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
- Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer, 2019, HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293–302, https://arxiv.org/abs/1905.03696
- Yiren Zhou, Seyed-Mohsen Moosavi-Dezfooli, Ngai-Man Cheung, and Pascal Frossard. Adaptive quantization for deep neural network. arXiv preprint arXiv:1712.01048, 2017, https://arxiv.org/abs/1712.01048 (Layerwise different bitwidth quantization.)
- Sijie Zhao, Tao Yue, and Xuemei Hu. Distributionaware adaptive multi-bit quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9281–9290, 2021, https://ieeexplore.ieee.org/document/9577892, PDF: https://openaccess.thecvf.com/content/CVPR2021/papers/Zhao_Distribution-Aware_Adaptive_Multi-Bit_Quantization_CVPR_2021_paper.pdf
- Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer, June 2021, HAWQV3: Dyadic Neural Network Quantization, arXiv preprint arXiv:2011.10680, https://arxiv.org/abs/2011.10680
- Huanrui Yang, Lin Duan, Yiran Chen, Hai Li, Feb 2021, BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization, arXiv preprint arXiv:2102.10462, https://arxiv.org/abs/2102.10462
- Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019, HAQ: Hardware-aware automated quantization. In Proceedings of the IEEE conference on computer vision and pattern recognition, https://arxiv.org/abs/1811.08886
- Zhongnan Qu, Zimu Zhou, Yun Cheng, and Lothar Thiele, June 2020, Adaptive loss-aware quantization for multi-bit networks, In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://arxiv.org/abs/1912.08883
- Hai Victor Habi, Roy H. Jennings, Arnon Netzer, July 2020, HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs, arXiv preprint arXiv:2007.09952, https://arxiv.org/abs/2007.09952
- Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 28 092–28 103, 2021. https://arxiv.org/abs/2106.14156 (Has evaluations of 4-bit, 6-bit, and 8-bit quantization; also mixed-precision.)
- Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S., 2021, BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ArXiv, abs/2102.05426. https://arxiv.org/abs/2102.05426 Code: https://github.com/yhhhli/BRECQ (Tests 2 to 4 bits for weights, and mixed-precision quantization.)
- E Kloberdanz, W Le, Sep 2023, MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search, arXiv preprint arXiv:2309.17341, https://arxiv.org/pdf/2309.17341.pdf (Various tests of 2-8 bits, and mixed precision quantization.)
- A Chauhan, U Tiwari, 2023, Post Training Mixed Precision Quantization of Neural Networks Using First-Order Information, Proceedings of the IEEE/CVF International Conference, https://openaccess.thecvf.com/content/ICCV2023W/RCV/papers/Chauhan_Post_Training_Mixed_Precision_Quantization_of_Neural_Networks_Using_First-Order_ICCVW_2023_paper.pdf
- Y Shang, Z Yuan, Q Wu, Z Dong, PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)
- Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park, 24 Jan 2024, OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models, https://arxiv.org/abs/2306.02272 Code: https://github.com/xvyaward/owq (Stores some weights in different quantization levels based on their values.)
- G Rutishauser, 2024, Agile and Efficient Inference of Quantized Neural Networks, Ph.D. Thesis, ETH Zurich, https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/675547/1/thesis.pdf
- N Martynov, A Goncharov, G Kumichev, E Egorov, 2024, On the Way to Lossless Compression of Language Transformers: Exploring Cross-Domain Properties of Quantization, https://aclanthology.org/2024.lrec-main.1089.pdf (Quantize 95% of weights to INT8, leaving the remainder at FP32.)
- M Rakka, ME Fouda, P Khargonekar, F Kurdahi, 29 April 2024, A Review of State-of-the-Art Mixed-Precision Neural Network Frameworks, IEEE Transactions on Pattern Analysis and Machine Intelligence (Early Access), Pages 1 - 20, DOI: 10.1109/TPAMI.2024.3394390, https://doi.org/10.1109/TPAMI.2024.3394390 https://ieeexplore.ieee.org/abstract/document/10509805/
- Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi, 23 May 2024, SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models, https://arxiv.org/abs/2405.14917 Code: https://github.com/Aaronhuang-778/SliM-LLM
- Utkarsh Saxena, Kaushik Roy, McQueen: Mixed Precision Quantization of Early Exit Networks, https://papers.bmvc2023.org/0511.pdf (Combination of mixed-precision quantization, with precision specifiable statically at a layerwise granularity, with early exit dynamic depth optimizations.)
- JY Jeon, XT Nguyen, S Ryu, HJ Lee, 2024, USDN: A Unified Sample-wise Dynamic Network with Mixed-Precision and Early-Exit, https://openaccess.thecvf.com/content/WACV2024/papers/Jeon_USDN_A_Unified_Sample-Wise_Dynamic_Network_With_Mixed-Precision_and_Early-Exit_WACV_2024_paper.pdf
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Mariam Rakka, Mohammed E. Fouda, Pramod Khargonekar, Fadi Kurdahi, 11 Aug 2022, Mixed-Precision Neural Networks: A Survey, https://arxiv.org/abs/2208.06064
- Piotr Kluska, Adri´an Castello, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
- Behnam Ghavami, Amin Kamjoo, Lesley Shannon, Steve Wilton, 3 Apr 2024, DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization, https://arxiv.org/abs/2404.02947
- Dimitrios Danopoulos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 12 Feb 2024, TransAxx: Efficient Transformers with Approximate Computing, https://arxiv.org/abs/2402.07545 (Using approximations in Vision Transformer architectures.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, and Kan Zhou, 2023, SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision https://aclanthology.org/2023.emnlp-industry.13.pdf
- Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He, and Li Jiang, 2022, DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture, https://mxhx7199.github.io/files/%5BDATE-22%5DDTQAtten_preprint.pdf
- David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020, Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020, https://arxiv.org/abs/2001.00281 Code: https://github.com/amirgholami/ZeroQ
- JH Park, JS Choi, JH Ko, 2020, Dual-Precision Deep Neural Network, https://dl.acm.org/doi/abs/10.1145/3430199.3430228 https://arxiv.org/pdf/2009.02191
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
- Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang, 25 Jun 2024, T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge, https://arxiv.org/abs/2407.00088 Code: https://github.com/microsoft/T-MAC (Table lookup for low-bit quantization on CPUs.)
- Yipin Guo, Yilin Lang, Qinyuan Ren, 3 Jul 2024, GPTQT: Quantize Large Language Models Twice to Push the Efficiency, https://arxiv.org/abs/2407.02891 (Two-phase quantization, first to high-bit, then to binary quantization.)
- Md Fahim Faysal Khan, May 2024, Constraint Driven Multimodal Edge Intelligence, Ph.D. Thesis, Electrical Engineering and Computer Science, Pennsylvania State University, https://etda.libraries.psu.edu/files/final_submissions/29680 (Layer-specific quantization levels for mixed-precision quantization.)
- J Wu, M Song, J Zhao, HKH So, 2024, A Case for Low Bitwidth Floating Point Arithmetic on FPGA for Transformer Based DNN Inference, https://wujiajunic.cn/publication/ipdpsw2024/IPDPSW2024.pdf
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
- Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
- P. -C. Chen, Y. -T. Liu, G. -Y. Zeng and T. -D. Chiueh, 2024, Design and Implementation of an Easy-to-Deploy Energy-Efficient Inference Acceleration System for Multi-Precision Neural Networks, 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, United Arab Emirates, 2024, pp. 587-591, doi: 10.1109/AICAS59952.2024.10595940, https://ieeexplore.ieee.org/document/10595940
- Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
- Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
- Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
- W Byun, J Woo, S Mukhopadhyay, 2024, Hardware-friendly Hessian-driven Row-wise Quantization and FPGA Acceleration for Transformer-based Models, https://dl.acm.org/doi/pdf/10.1145/3665314.3670806
- Junfeng Gong, Cheng Liu, Long Cheng, Huawei Li, Xiaowei Li, 17 Jul 2024, MCU-MixQ: A HW/SW Co-optimized Mixed-precision Neural Network Design Framework for MCUs, https://arxiv.org/abs/2407.18267
- Bernard Ryhede Bengtsson, Joel Bengs, 2024, Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM, MSc Thesis, Lund University, Sweden, https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9174462&fileOId=9174463
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao, 9 Oct 2024, Scaling Laws for Mixed Quantization, https://arxiv.org/abs/2410.06722
- Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang, 16 Oct 2024, COMET: Towards Partical W4A4KV4 LLMs Serving, https://arxiv.org/abs/2410.12168
- Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris, 17 Oct 2024, Progressive Mixed-Precision Decoding for Efficient LLM Inference, https://arxiv.org/abs/2410.13461
- Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo, 6 Nov 2024 (v2), HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference, https://arxiv.org/abs/2411.01433
- Jiajun Wu, Mo Song, Jingmin Zhao, Yizhao Gao, Jia Li, Hayden Kwok-Hay So, 6 Nov 2024, TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture, https://arxiv.org/abs/2411.03697
- Qinyang Bao, MPBRQ- A Framework for Mixed-Precision Quantization for Large Language Models, Masters in Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto 2024, https://tspace.library.utoronto.ca/bitstream/1807/141039/1/Bao_Qinyang_202411_MAS_thesis.pdf
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
- Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah, 18 Nov 2024, BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration, https://arxiv.org/abs/2411.11745
- Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian, 27 Nov 2024, FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs--Down to 2 Bits! https://arxiv.org/abs/2411.18055
- Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo, 3 Dec 2024, CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models, https://arxiv.org/abs/2412.03599
- Yiming Fang, Li Chen, Yunfei Chen, Weidong Wang, Changsheng You, 5 Dec 2024 (v2), Mixed-Precision Quantization: Make the Best Use of Bits Where They Matter Most, https://arxiv.org/abs/2412.03101
Logarithmic Bitshift Quantization (Power-of-Two Quantization)
The idea with bitshift quantization is to use power-of-two integer weights, so that multiplications can be replaced with bitshift operations. There is a significant trade-off in model accuracy, since the number of distinct weights is greatly reduced. This is a well-known and active area of research, with the earliest papers dating back to 1992 and 1993. However, software algorithms using bitshifts seem unlikely to outperform hardware-accelerated integer multiplication, and dedicated hardware support for shift-based arithmetic remains limited. Extending hardware accelerators to use bitshifting, or approximate multiplication by the highest power of two, presumably requires fewer operations and less computing power (with reduced heat generation), and seems an open area for further research. Note that the set bits of an integer can be examined efficiently with bit-twiddling techniques such as Brian Kernighan's algorithm (1988), while the highest set bit can be found with count-leading-zeros instructions or similar bit tricks.
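As a rough C++ sketch of the basic idea (assuming symmetric weights with magnitudes at most 1 and a 3-bit exponent; the names and ranges are illustrative, not any particular paper's scheme):

    #include <cstdint>
    #include <cmath>
    #include <cstdio>

    // Quantize a float weight to a signed power-of-two: a sign plus a shift amount.
    struct Pow2Weight {
        int8_t sign;    // +1, -1, or 0 for a zero weight
        uint8_t shift;  // weight magnitude ~= 2^-shift (shift in 0..7 for a 3-bit exponent)
    };

    Pow2Weight quantize_pow2(float w) {
        Pow2Weight q;
        if (w == 0.0f) { q.sign = 0; q.shift = 0; return q; }
        q.sign = (w < 0.0f) ? -1 : +1;
        // Round log2(|w|) to the nearest integer and clamp to the representable range.
        int e = (int)std::lround(std::log2(std::fabs(w)));
        if (e > 0) e = 0;
        if (e < -7) e = -7;
        q.shift = (uint8_t)(-e);
        return q;
    }

    // Multiply an integer activation by a power-of-two weight using a right shift,
    // avoiding an integer multiplication entirely.
    int32_t mul_pow2(int32_t activation, Pow2Weight q) {
        if (q.sign == 0) return 0;
        int32_t shifted = activation >> q.shift;  // activation * 2^-shift
        return (q.sign > 0) ? shifted : -shifted;
    }

    int main() {
        Pow2Weight q = quantize_pow2(0.30f);  // nearest power of two is 0.25 = 2^-2
        printf("shift=%d, 1000*0.30 ~= %d\n", (int)q.shift, mul_pow2(1000, q));
        return 0;
    }

Each weight then costs only a sign bit plus a small exponent, and the inner loop of a dot product needs no multiplications at all.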
- Maarten Vandersteegen, Kristof Van Beeck and Toon Goedemé, "Integer-Only CNNs with 4 Bit Weights and Bit-Shift Quantization Scales at Full-Precision Accuracy", Electronics, October 2021, 10(22), 2823, https://www.mdpi.com/2079-9292/10/22/2823
- Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, Cheng-Zhong Xu, "Focused Quantization for Sparse CNNs", Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, https://proceedings.neurips.cc/paper/2019/hash/58aaee7ae94b52697ad3b9275d46ec7f-Abstract.html
- Dominika Przewlocka-Rus, Syed Shakib Sarwar, H. Ekin Sumbul, Yuecheng Li, Barbara De Salvo, "Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks", Feb 2022, https://arxiv.org/abs/2203.05025
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations", The Journal of Machine Learning Research, 18(1):6869–6898, 2017, https://arxiv.org/abs/1609.07061.
- T. Hokchhay, S. Hashemi, R. I. Bahar, and S. Reda, “Hardware-software codesign of accurate, multiplier-free deep neural networks,” in Proc. 54th Annu. Design Autom. Conf. (DAC), 2017, pp. 1–6., https://arxiv.org/abs/1705.04288
- Yuhang Li, Xin Dong, and Wei Wang, "Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks", International Conference on Learning Representations, February 2020, https://arxiv.org/abs/1909.13144
- Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015. https://arxiv.org/abs/1510.03009 (Power-of-Two Quantization)
- Soheil Hashemi; Nicholas Anthony; Hokchhay Tann; R. Iris Bahar; Sherief Reda, "Understanding the impact of precision quantization on the accuracy and energy of neural networks", Design, Automation & Test in Europe Conference & Exhibition, March 2017, https://ieeexplore.ieee.org/abstract/document/7927224
- Marchesi, Michele, Orlandi, Gianni, Piazza, Francesco, and Uncini, Aurelio, "Fast neural networks without multipliers", IEEE Transactions on Neural Networks , 4(1):53–62, 1993, https://ieeexplore.ieee.org/document/182695
- A. White and M. I. Elmasry, "The digi-neocognitron: a digital neocognitron neural network model for VLSI", IEEE Trans. Neural Networks, vol. 3. pp. 73-85, Jan. 1992, https://ieeexplore.ieee.org/document/105419
- Kwan, Hon Keung and Tang, CZ, "Multiplierless multilayer feedforward neural network design suitable for continuous input-output mapping", Electronics Letters, 29(14):1259–1260, 1993, https://digital-library.theiet.org/content/journals/10.1049/el_19930841
- Sean Eron Anderson, "Bit Twiddling Hacks" (Kernighan Algorithm), https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
- Peter Wegner, "A technique for counting ones in a binary computer", Communications of the ACM, Volume 3, Issue 5, May 1960, https://doi.org/10.1145/367236.367286
- Daisuke Miyashita, Edward H. Lee, and Boris Murmann, Convolutional Neural Networks using Logarithmic Data Representation, 2016, CoRR abs/1603.01025 (2016), https://arxiv.org/abs/1603.01025
- Edward H. Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, and S. Simon Wong, LogNet: Energy-efficient neural networks using logarithmic computation, 2017, In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017. 5900–5904. https://doi.org/10.1109/ICASSP.2017.7953288
- Elhoushi, M.; Chen, Z.; Shafiq, F.; Tian, Y. H.; and Li, J. Y., 2019, Deepshift: Towards multiplication-less neural networks, arXiv preprint arXiv:1905.13298, https://arxiv.org/abs/1905.13298
- Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; and Chen, Y., 2017, Incremental network quantization: Towards lossless CNNs with low-precision weight, arXiv preprint arXiv:1702.03044, https://arxiv.org/abs/1702.03044
- J Cai, M Takemoto, H Nakajo, A deep look into logarithmic quantization of model parameters in neural networks, 2018, https://dl.acm.org/doi/abs/10.1145/3291280.3291800
- HyunJin Kim; Min Soo Kim; Alberto A. Del Barrio; Nader Bagherzadeh, A cost-efficient iterative truncated logarithmic multiplication for convolutional neural networks, 2019, IEEE 26th Symposium on Computer Arithmetic (ARITH), https://ieeexplore.ieee.org/abstract/document/8877474
- X Li, B Liu, RH Yang, V Courville, C Xing, VP Nia, 2023, DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization, Proceedings of the IEEE/CVF, https://openaccess.thecvf.com/content/ICCV2023/papers/Li_DenseShift_Towards_Accurate_and_Efficient_Low-Bit_Power-of-Two_Quantization_ICCV_2023_paper.pdf (Extends log quantization to floating point numbers efficiently by using a bitwise trick of integer addition on the sign and exponent bits of 32-bit IEEE 754 floats.)
- P Dong, L Lu, C Wu, C Lyu, G Yuan, H Tang, Y Wang, 2023, PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile, https://openreview.net/pdf?id=N56hAiQvot Code: https://github.com/PeiyanFlying/PackQViT
- Bahareh Khabbazan, Marc Riera, Antonio González, Oct 2023, An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses, https://arxiv.org/abs/2310.18181
- H Madadum, Y Becerikli, 2022, A resource-efficient convolutional neural network accelerator using fine-grained logarithmic quantization, https://www.academia.edu/download/91076689/pdf.pdf
- MY Malsagov, EM Khayrov, MM Pushkareva, 2019, Exponential discretization of weights of neural network connections in pre-trained neural networks, https://arxiv.org/pdf/2002.00623
- H. Tann, S. Hashemi, I. Bahar, and S. Reda, “Hardware-software codesign of accurate, multiplier-free deep neural networks,” CoRR, vol. abs/1705.04288, 2017. [Online]. Available: http://arxiv.org/abs/1705.04288
- Mogaka, O.M., Zewail, R., Inoue, K. et al. TinyEmergencyNet: a hardware-friendly ultra-lightweight deep learning model for aerial scene image classification. J Real-Time Image Proc 21, 51 (2024). https://doi.org/10.1007/s11554-024-01430-y https://link.springer.com/article/10.1007/s11554-024-01430-y#citeas (Use of both power-of-two quantization and channel pruning for fast image analysis.)
- Fangzhou He, Ke Ding, Dingjiang Yan, Jie Li, Jiajun Wang, Mingzhe Chen, A Novel Quantization and Model Compression Approach for Hardware Accelerators in Edge Computing, https://cdn.techscience.cn/files/cmc/2024/TSP_CMC-80-2/TSP_CMC_53632/TSP_CMC_53632.pdf (Power-of-two quantization with bitshifting further accelerated with LUTs.)
- X. Geng, S. Liu, J. Jiang, K. Jiang and H. Jiang, 2024, Compact Powers-of-Two: An Efficient Non-Uniform Quantization for Deep Neural Networks, 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), Valencia, Spain, 2024, pp. 1-6, doi: 10.23919/DATE58400.2024.10546652, https://ieeexplore.ieee.org/abstract/document/10546652
- Rappy Saha, Jude Haris, José Cano, 30 Sep 2024, Accelerating PoT Quantization on Edge Devices, https://arxiv.org/abs/2409.20403 https://github.com/gicLAB/PoTAcc (Power-of-two/bitshift logarithmic quantization on edge devices.)
- L. F. H. Duarte, G. B. Nardes, W. Grignani, D. R. Melo and C. A. Zeferino, "Deep Nibble: A 4-bit Number Format for Efficient DNN Training and Inference in FPGA," 2024 37th SBC/SBMicro/IEEE Symposium on Integrated Circuits and Systems Design (SBCCI), Joao Pessoa, Brazil, 2024, pp. 1-5, doi: 10.1109/SBCCI62366.2024.10703994. https://ieeexplore.ieee.org/abstract/document/10703994 (Log quantization method in 4-bits.)
- Mohammadreza Doostmohammadian, Sérgio Pequito, 27 Oct 2024, Logarithmically Quantized Distributed Optimization over Dynamic Multi-Agent Networks. https://arxiv.org/abs/2410.20345
- W. Wang, W. Sun and Y. Liu, "Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3488572. https://ieeexplore.ieee.org/abstract/document/10738457
Sum of Two Bitshifts Quantization
The downside of logarithmic quantization is that there are relatively few unique weights, limiting precision, even if the number of bits used is maximized with a large scaling factor. An alternative implementation uses two bitshift operations and an addition (i.e., "shift-and-add" operations). In this way, the two highest set bits of the quantized integer weight are used, which improves model precision at the cost of more computation. This assumes that two integer shifts and an integer addition cost less than a single integer multiplication. An early mention of this "sums of powers of two" method is in Marchesi et al. (1993).
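A minimal C++ sketch of the shift-and-add idea, approximating an integer weight by its two highest set bits (the helper names are illustrative):

    #include <cstdint>
    #include <cstdio>

    // Position of the highest set bit (assumes w > 0).
    static int highest_bit(uint32_t w) {
        int pos = -1;
        while (w) { w >>= 1; ++pos; }
        return pos;
    }

    // Approximate x * w using at most two shift-and-add terms.
    int64_t mul_two_shifts(int64_t x, uint32_t w) {
        if (w == 0) return 0;
        int a = highest_bit(w);
        int64_t result = x << a;           // contribution of the top bit
        uint32_t rest = w & ~(1u << a);    // clear the top bit
        if (rest) {
            int b = highest_bit(rest);
            result += x << b;              // contribution of the second-highest bit
        }
        return result;
    }

    int main() {
        // 100 * 37: 37 = 32 + 4 + 1, approximated as 32 + 4 = 36, so the result is 3600.
        printf("approx 100*37 = %lld (exact 3700)\n", (long long)mul_two_shifts(100, 37));
        return 0;
    }

Whether this wins in software depends on the platform; on most modern CPUs and GPUs an integer multiply is already a single fast instruction, so the benefit is mainly for custom hardware.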
- Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So, Xuehai Qian, Yanzhi Wang, and Xue Lin, "Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework", 2021, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, Seoul, Korea (South), 208–220, https://doi.org/10.1109/HPCA51647.2021.00027
- You, H.; Chen, X.; Zhang, Y.; Li, C.; Li, S.; Liu, Z.; Wang, Z.; and Lin, Y., 2020, ShiftAddNet: A Hardware-Inspired Deep Network, In NeurIPS, https://arxiv.org/abs/2010.12785
- Marchesi, Michele, Orlandi, Gianni, Piazza, Francesco, and Uncini, Aurelio, "Fast neural networks without multipliers", IEEE Transactions on Neural Networks , 4(1):53–62, 1993, https://ieeexplore.ieee.org/document/182695
- Robert Eisele, Optimizing integer multiplication, April 29th, 2010, https://www.xarg.org/2010/04/optimizing-integer-multiplication/
- Yuhang Li, Xin Dong, and Wei Wang, "Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks", International Conference on Learning Representations, February 2020, https://arxiv.org/abs/1909.13144
- Li, Yanyu, Aug 2024, Accelerating large scale generative AI : a comprehensive study, Ph.D. Thesis, Northeastern University, Boston, Massachusetts, USA, https://hdl.handle.net/2047/D20669654 https://repository.library.northeastern.edu/files/neu:ms35wj107 https://repository.library.northeastern.edu/files/neu:ms35wj107/fulltext.pdf
Arbitrary Base Logarithmic Quantization
The main use of logarithms in quantization is power-of-two logarithmic quantization. This is efficient because it allows bitshifting, but its limited accuracy is a known problem. There is also some research on bases other than two, or indeed arbitrary bases, to map weights more accurately to a logarithmic format (a small sketch follows the reference below):
- S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, 2018, Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base, In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). 1–8.
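As a rough sketch of the arbitrary-base idea, a weight can be rounded to the nearest integer power of a base chosen to fit the weight distribution; the base sqrt(2) below is an illustrative assumption, not a recommendation from the paper above.

    #include <cmath>
    #include <cstdio>

    struct LogQuant { int exponent; int sign; };

    // Round |w| to the nearest power of the given base, keeping the sign separately.
    LogQuant quantize_logbase(float w, double base) {
        LogQuant q{0, 0};
        if (w == 0.0f) return q;
        q.sign = (w < 0.0f) ? -1 : +1;
        q.exponent = (int)std::lround(std::log(std::fabs(w)) / std::log(base));
        return q;
    }

    float dequantize_logbase(LogQuant q, double base) {
        if (q.sign == 0) return 0.0f;
        return (float)(q.sign * std::pow(base, q.exponent));
    }

    int main() {
        double base = std::sqrt(2.0);  // a finer-grained base than 2
        LogQuant q = quantize_logbase(0.3f, base);
        printf("0.3 -> exponent %d -> %.4f\n", q.exponent, dequantize_logbase(q, base));
        return 0;
    }

A smaller base gives more representable magnitudes (better accuracy) but loses the simple bitshift implementation that base two provides.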
Integer Division for Quantization?
What about using integer division instead of multiplications in quantization? After all, multiplication by a small weight like 0.003 could instead be a division by 333. Is this an avenue for optimization? It seems unlikely, since division is usually much slower than multiplication, often by an order-of-magnitude.
The main exception is division by a power of two, which can be implemented efficiently as a right bitshift instead of a true division; this is effectively the mirror image of the left-bitshift quantization above. Dyadic numbers are an interesting example: their implementation involves division by a power of two, usually performed via a right bitshift.
Note that division is often used in scaling operations, particularly in de-quantization. However, in such cases, it isn't the bottleneck operation, as scaling or de-quantization is performed an order-of-magnitude fewer times.
No papers were found on "division quantization". Some research on division arithmetic algorithms:
- LibDivide, https://libdivide.com/ and https://github.com/ridiculousfish/libdivide
- Benchmarking division and libdivide on Apple M1 and Intel AVX512, May 12th, 2021, https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
- Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee, 16 Jul 2024 (v2), FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization, https://arxiv.org/abs/2306.00317
Dyadic Quantization
Dyadic numbers are rational numbers that are stored and manipulated as a pair of integers. The numerator is an unrestricted integer, but the denominator is restricted to be a power-of-two integer. This allows dyadic numbers to support a wide range of weights, including quite high precision in fractional weights, while still using only integer arithmetic.
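A minimal C++ sketch of dyadic arithmetic, assuming a fixed 2^15 denominator (the names and the denominator choice are illustrative):

    #include <cstdint>
    #include <cmath>
    #include <cstdio>

    struct Dyadic {
        int32_t numerator;  // integer numerator
        uint8_t shift;      // denominator is 2^shift
    };

    // Quantize a float weight to a dyadic number with denominator 2^15.
    Dyadic quantize_dyadic(float w) {
        Dyadic d;
        d.shift = 15;
        d.numerator = (int32_t)std::lround((double)w * (1 << d.shift));
        return d;
    }

    // Apply the dyadic weight to an integer activation: (x * numerator) / 2^shift.
    int32_t apply_dyadic(int32_t x, Dyadic d) {
        int64_t product = (int64_t)x * d.numerator;  // widen to avoid overflow
        return (int32_t)(product >> d.shift);        // power-of-two division by bitshift
    }

    int main() {
        Dyadic d = quantize_dyadic(0.3f);
        printf("1000 * 0.3 ~= %d\n", apply_dyadic(1000, d));  // prints roughly 300
        return 0;
    }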
- Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
- Renato J. Cintra; Stefan Duffner; Christophe Garcia; André Leite, Low-Complexity Approximate Convolutional Neural Networks, IEEE Transactions on Neural Networks and Learning Systems, Volume 29, Issue 12, December 2018, pp.5981-5992, https://ieeexplore.ieee.org/abstract/document/8334697
- Fernanda Botelho, Max Garzon, On Dynamical Properties of Neural Networks, Complex Systems 5 (1991), p.401-413, https://wpmedia.wolfram.com/uploads/sites/13/2018/02/05-4-4.pdf
- David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Fixed-Point Quantization
Fixed-point numbers are a way of representing numbers that differs from floating-point. For example, we can represent the dollars-and-cents amount $12.34 as the integer "1234"; this is a fixed-point number with a scaling factor of 100.
In practice, we can convert any fractional number to an integer by multiplying by a scaling factor (and truncating any lower-order fractional digits). Doing this for model weights is "fixed-point quantization".
The main advantage is integer arithmetic. Using fixed-point quantization changes the vector dot product to use integer multiplication and integer addition (plus a rescaling bitshift). See also fixed-point number systems.
Floating-point numbers have a per-number exponent. Fixed-point numbers are like having a single global exponent for all numbers (which need not be stored at all). The intermediate method is "block floating-point", where blocks of numbers (i.e., vectors) share a per-block or per-vector exponent. Integer-only arithmetic is possible with both fixed-point and block floating-point quantization.
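To make this concrete, here is a minimal C++ sketch of a fixed-point dot product with one global scaling factor of 2^8; the scale choice and names are illustrative assumptions.

    #include <cstdint>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    const int kShift = 8;  // fixed-point scale is 2^8 = 256

    int32_t to_fixed(float x) { return (int32_t)std::lround(x * (1 << kShift)); }
    float   to_float(int32_t q) { return (float)q / (1 << kShift); }

    // Dot product in fixed point: integer multiply-accumulate, then one rescale shift.
    int32_t fixed_dot(const std::vector<int32_t>& a, const std::vector<int32_t>& b) {
        int64_t acc = 0;  // wide accumulator avoids overflow
        for (size_t i = 0; i < a.size(); ++i)
            acc += (int64_t)a[i] * b[i];   // each term has scale 2^(2*kShift)
        return (int32_t)(acc >> kShift);   // rescale back to scale 2^kShift
    }

    int main() {
        std::vector<int32_t> w = { to_fixed(0.5f), to_fixed(-1.25f) };
        std::vector<int32_t> x = { to_fixed(2.0f), to_fixed(0.4f) };
        printf("dot = %f (exact 0.5)\n", to_float(fixed_dot(w, x)));
        return 0;
    }

The only non-integer steps are the initial quantization and the final conversion back to float, which sit outside the inner loop.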
Stochastic Quantization
Stochastic quantization is a research area that examines intentionally injecting randomness or statistical variation into the quantization algorithm, which can result in higher accuracy. The idea can be used in conjunction with Post-Training Quantization (PTQ) or with Quantization-Aware Training (QAT).
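A minimal C++ sketch of one common building block, stochastic rounding, where a value is rounded up with probability equal to its fractional part so the rounding error is zero on average (the function names are illustrative):

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Round x down or up at random, with probability given by its fractional part.
    int stochastic_round(float x, std::mt19937& rng) {
        float lower = std::floor(x);
        float frac = x - lower;  // probability of rounding up
        std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
        return (int)lower + (uniform(rng) < frac ? 1 : 0);
    }

    int main() {
        std::mt19937 rng(42);
        // Quantize 0.3 * 127 = 38.1 many times; the average tends toward 38.1.
        double sum = 0.0;
        const int trials = 100000;
        for (int i = 0; i < trials; ++i)
            sum += stochastic_round(0.3f * 127.0f, rng);
        printf("average quantized value: %.3f (target 38.1)\n", sum / trials);
        return 0;
    }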
- Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin, Training with quantization noise for extreme model compression, 2020, arXiv e-prints, pages arXiv–2004, https://arxiv.org/abs/2004.07320
- Jianfei Chen, Yu Gai, Zhewei Yao, Michael W Mahoney, and Joseph E Gonzalez. A statistical framework for low-bitwidth training of deep neural networks. arXiv preprint arXiv:2010.14298, 2020, https://arxiv.org/abs/2010.14298
- J Zhang, 2023, Quantization for High-dimensional Data and Neural Networks: Theory and Algorithms, Ph.D. Thesis, University of California, San Diego, https://escholarship.org/content/qt9bd2k7gf/qt9bd2k7gf.pdf (See Chapter 5 for stochastic quantization algorithms.)
- David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- J. Choi and S. Venkataramani, 2019, Approximate Computing Techniques for Deep Neural Networks. Cham: Springer, 2019, pp. 307–329, Chapter 15, https://link.springer.com/chapter/10.1007/978-3-319-99322-5_15
Weight Clustering
Weight clustering is conceptually like pruning and quantization combined, and is sometimes called "cluster-based quantization". Similar weights are grouped into clusters and each is replaced by a single shared value (typically the cluster centroid), so that every weight in a cluster has exactly the same value. Hashing has also been used to group weights into clusters.
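A minimal C++ sketch of weight clustering via one-dimensional k-means, where each weight is replaced by its cluster centroid so the model needs only k shared values plus a small index per weight (the function names and initialization are illustrative assumptions):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    void cluster_weights(std::vector<float>& weights, int k, int iterations) {
        // Initialize centroids evenly across the weight range.
        float lo = weights[0], hi = weights[0];
        for (float w : weights) { lo = std::min(lo, w); hi = std::max(hi, w); }
        std::vector<float> centroids(k);
        for (int c = 0; c < k; ++c)
            centroids[c] = lo + (hi - lo) * (c + 0.5f) / k;

        std::vector<int> assign(weights.size());
        for (int iter = 0; iter < iterations; ++iter) {
            // Assignment step: nearest centroid for each weight.
            for (size_t i = 0; i < weights.size(); ++i) {
                int best = 0;
                for (int c = 1; c < k; ++c)
                    if (std::fabs(weights[i] - centroids[c]) < std::fabs(weights[i] - centroids[best]))
                        best = c;
                assign[i] = best;
            }
            // Update step: recompute each centroid as its cluster mean.
            std::vector<float> sum(k, 0.0f);
            std::vector<int> count(k, 0);
            for (size_t i = 0; i < weights.size(); ++i) { sum[assign[i]] += weights[i]; ++count[assign[i]]; }
            for (int c = 0; c < k; ++c)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        // Replace every weight with its shared centroid value.
        for (size_t i = 0; i < weights.size(); ++i)
            weights[i] = centroids[assign[i]];
    }

    int main() {
        std::vector<float> w = { 0.11f, 0.09f, 0.52f, 0.48f, -0.31f, -0.29f };
        cluster_weights(w, 3, 10);
        for (float v : w) printf("%.3f ", v);  // weights collapse to 3 shared values
        printf("\n");
        return 0;
    }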
- Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, Yanzhi Wang, "A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM", November 2018, https://arxiv.org/abs/1811.01907
- Steven J. Nowlan; Geoffrey E. Hinton, "Simplifying Neural Networks by Soft Weight-Sharing", Neural Computation, 4(4), July 1992, https://ieeexplore.ieee.org/abstract/document/6796174
- Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu, "RPTQ: Reorder-based Post-training Quantization for Large Language Models", May 2023, https://arxiv.org/abs/2304.01089
- Weight clustering (TensorFlow), https://www.tensorflow.org/model_optimization/guide/clustering
- A. Zhou, A. Yao, Y. Guo, L. Xu and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights", arXiv:1702.03044, 2017. https://arxiv.org/abs/1702.03044 (Groups large and small weights)
- W. Chen, J. T. Wilson, S. Tyree, K. Weinberger and Y. Chen, "Compressing neural networks with the hashing trick", Proc. ICML, pp. 2285-2294, 2015. https://arxiv.org/abs/1504.04788 (Uses hashing to do weight clustering/grouping weights.)
- Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
- Maedeh Hemmat; Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on input.)
- B Rokh, A Azarpeyvand, A Khanteymoori, ACM Transactions on Intelligent Systems, 2023, A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, PDF: https://dl.acm.org/doi/pdf/10.1145/3623402 (Includes a survey of weight clustering.)
- W Cheng, W Zhang, H Shen, Y Cai, X He, K Lv, 2023, Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv preprint arXiv:2309.05516, PDF: https://arxiv.org/pdf/2309.05516.pdf (Examination of rounding schemes in PTQ and QAT for quantization and weight clustering.)
- TensorFlow, 2024, TensorFlow Model Optimization Toolkit — Weight Clustering API https://blog.tensorflow.org/2020/08/tensorflow-model-optimization-toolkit-weight-clustering-api.html
- TensorFlow, 2024, Weight clustering comprehensive guide https://www.tensorflow.org/model_optimization/guide/clustering/clustering_comprehensive_guide
- David Spuler, March 2024, Chapter 44. Advanced Quantization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Shixun Wu, Yitong Ding, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Bryan M. Wong, Zizhong Chen, Franck Cappello, 2 Aug 2024, FT K-Means: A High-Performance K-Means on GPU with Fault Tolerance, https://arxiv.org/abs/2408.01391
- Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
- Nathan Bailey, Mar 20, 2024, Optimizing Neural Networks— Weight Clustering Explained, https://nathanbaileyw.medium.com/optimizing-neural-network-weight-clustering-explained-be947088a974
- Song Han, Huizi Mao, William J. Dally, 15 Feb 2016 (v5), Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, https://arxiv.org/abs/1510.00149
- Vasileios Tsouvalas, Samaneh Mohammadi, Ali Balador, Tanir Ozcelebi, Francesco Flammini, Nirvana Meratnia, 13 Jun 2024, EncCluster: Scalable Functional Encryption in Federated Learning through Weight Clustering and Probabilistic Filters, https://arxiv.org/abs/2406.09152
- Vasileios Tsouvalas, Aaqib Saeed, Tanir Ozcelebi, Nirvana Meratnia, 25 Feb 2024 (v3), Communication-Efficient Federated Learning through Adaptive Weight Clustering and Server-Side Distillation https://arxiv.org/abs/2401.14211
- Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. 2019. ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). Association for Computing Machinery, New York, NY, USA, 925–938. https://doi.org/10.1145/3297858.3304076 https://dl.acm.org/doi/abs/10.1145/3297858.3304076 https://dl.acm.org/doi/pdf/10.1145/3297858.3304076 https://github.com/yeshaokai/admm-nn.
- Mariam Rakka, Mohammed E. Fouda, Pramod Khargonekar, Fadi Kurdahi, 11 Aug 2022, Mixed-Precision Neural Networks: A Survey, https://arxiv.org/abs/2208.06064
- John Edward Mixter, Oct 2024, Neural Network Reduction for Efficient Execution on Edge Devices, Ph.D. Thesis, Department of Electrical and Computer Engineering, University of Arizona, https://repository.arizona.edu/bitstream/handle/10150/670868/azu_etd_21112_sip1_m.pdf?sequence=1
- David Spuler, October 26, 2024, Weight Clustering Needs a Refresh, Aussie AI Blog, https://www.aussieai.com/blog/weight-clustering
- Babak Rokh, Ali Azarpeyvand, Alireza Khanteymoori, 23 Oct 2023 (v5), A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification, https://arxiv.org/abs/2205.07877
- Hu, J., Rao, W., Zhao, Q. (2021). aHCQ: Adaptive Hierarchical Clustering Based Quantization Framework for Deep Neural Networks. In: Karlapalem, K., et al. Advances in Knowledge Discovery and Data Mining. PAKDD 2021. Lecture Notes in Computer Science(), vol 12713. Springer, Cham. https://doi.org/10.1007/978-3-030-75765-6_17 https://link.springer.com/chapter/10.1007/978-3-030-75765-6_17
- Caro, M., Abella, J. Energy-Efficient Object Detection: Impact of Weight Clustering for Different Arithmetic Representations. J Sign Process Syst 96, 287–300 (2024). https://doi.org/10.1007/s11265-024-01917-8 https://dl.acm.org/doi/10.1007/s11265-024-01917-8 https://link.springer.com/article/10.1007/s11265-024-01917-8
Outliers
The issue of "outliers" refers to weights or activations that are much larger (or smaller) than the vast majority of other values. There are various parts of a Transformer where the output can depend inordinately on a small subset of values, notably in the attention computation (and hence also the KV cache).
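One common mitigation, sketched below in C++ with illustrative names, is to clip values at a high percentile before computing the quantization scale, so that a handful of extreme outliers saturates rather than stretching the range and crushing the precision of all the other values:

// Sketch of percentile clipping before symmetric INT8 quantization.
#include <vector>
#include <algorithm>
#include <cmath>
#include <cstdint>

float clipped_scale(std::vector<float> vals, float percentile = 0.999f) {
    for (float& v : vals) v = std::fabs(v);
    size_t idx = (size_t)(percentile * (vals.size() - 1));
    std::nth_element(vals.begin(), vals.begin() + idx, vals.end());
    float clip = vals[idx];                  // ignore the top (1 - percentile) outliers
    return clip / 127.0f;                    // scale for symmetric INT8
}

int8_t quantize_clipped(float x, float scale) {
    float q = std::round(x / scale);
    return (int8_t)std::clamp(q, -127.0f, 127.0f);  // outliers saturate instead of widening the range
}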
Research papers that discuss the issue of outliers include:
- Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Robin Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu, 4 Apr 2024, Outlier-Efficient Hopfield Layers for Large Transformer-Based Models, https://arxiv.org/abs/2404.03828 Code: https://github.com/MAGICS-LAB/OutEffHop (Addresses outliers in quantization with a modified Softmax and an advanced Hopfield memory model.)
- Xing Hu, Yuan Chen, Dawei Yang, Sifan Zhou, Zhihang Yuan, Jiangyong Yu, Chen Xu, 28 May 2024, I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models, https://arxiv.org/abs/2405.17849 Code: https://anonymous.4open.science/r/I-LLM-F242/
- Wanyun Cui, Qianle Wang, 3 Apr 2024, Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models, https://arxiv.org/abs/2404.02837 (Examines which weights most affect inference, including outlier values.)
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Lianwei Yang, Haisong Gong, 6 Aug 2024, DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers, https://arxiv.org/abs/2408.03291
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
- Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 4 Jun 2024 (v2), APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, ICML 2024 Oral, https://arxiv.org/abs/2401.12200 https://github.com/ROIM1998/APT
- Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
- Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou, 30 Sep 2024, Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference, https://arxiv.org/abs/2409.20361 (Handling of outliers in INT4 quantization.)
- Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo, 7 Oct 2024, PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs, https://arxiv.org/abs/2410.05265 https://github.com/ChenMnZ/PrefixQuant (Puts outliers into the KV cache as a prefix.)
- Akshat Ramachandran, Souvik Kundu, Tushar Krishna, 12 Nov 2024 (v2), MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization, https://arxiv.org/abs/2411.05282
Activation Quantization
Activation quantization is the quantization of the dynamically computed activation values, rather than only the static weights. It is well-known and is now almost always used alongside weight quantization.
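Since activations are only known at run time, their quantization scale is typically either calibrated offline or computed dynamically from each activation tensor. A minimal C++ sketch of dynamic per-tensor symmetric INT8 activation quantization, with illustrative names:

// Sketch of dynamic per-tensor INT8 activation quantization: the scale is
// computed at run time from the activation tensor itself, then applied elementwise.
#include <vector>
#include <cstdint>
#include <cmath>
#include <algorithm>

struct QuantizedActivations {
    std::vector<int8_t> data;
    float scale;                             // one scale for the whole tensor
};

QuantizedActivations quantize_activations(const std::vector<float>& act) {
    float max_abs = 0.0f;
    for (float v : act) max_abs = std::max(max_abs, std::fabs(v));
    QuantizedActivations out;
    out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    out.data.reserve(act.size());
    for (float v : act)
        out.data.push_back((int8_t)std::clamp(std::round(v / out.scale), -127.0f, 127.0f));
    return out;
}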
Research papers on activation quantization:
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan, Jul 2018, PACT: Parameterized Clipping Activation for Quantized Neural Networks, https://arxiv.org/abs/1805.06085
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 10 May 2024 (v2), QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, https://arxiv.org/abs/2405.04532 Code: https://github.com/mit-han-lab/qserve
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao, 27 Jun 2024, OutlierTune: Efficient Channel-Wise Quantization for Large Language Models, https://arxiv.org/abs/2406.18832
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
- Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung, 6 Sep 2024, OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models, https://arxiv.org/abs/2409.05902
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- A. Jantsch et al., "Special Session: Estimation and Optimization of DNNs for Embedded Platforms," 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Raleigh, NC, USA, 2024, pp. 21-30, doi: 10.1109/CODES-ISSS60120.2024.00013. https://ieeexplore.ieee.org/abstract/document/10740783
- Liu, J., Zhang, B., Cao, X. (2025). ROI-Aware Dynamic Network Quantization for Neural Video Compression. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15305. Springer, Cham. https://doi.org/10.1007/978-3-031-78169-8_22 https://link.springer.com/chapter/10.1007/978-3-031-78169-8_22
Vector Quantization
Vector quantization is a longstanding ML technique that pre-dates the Transformer architecture, so there are many early papers on the topic. It is related to other vector methods such as nearest-neighbor search, which is used for the analysis of embedding vectors and semantic similarity, amongst many other applications.
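At its core, vector quantization replaces each vector (or sub-vector) with the index of its nearest entry in a learned codebook, so storage reduces to small indices plus the codebook. A minimal C++ sketch of the encode and decode steps, with illustrative names and a brute-force nearest-neighbor search:

// Sketch of vector quantization: each input vector is replaced by the index
// of its nearest codebook entry (nearest neighbor under Euclidean distance).
#include <vector>
#include <cstddef>
#include <limits>

size_t vq_encode(const std::vector<float>& v,
                 const std::vector<std::vector<float>>& codebook) {
    size_t best = 0;
    float best_dist = std::numeric_limits<float>::max();
    for (size_t c = 0; c < codebook.size(); ++c) {
        float dist = 0.0f;
        for (size_t i = 0; i < v.size(); ++i) {
            float d = v[i] - codebook[c][i];
            dist += d * d;                   // squared Euclidean distance
        }
        if (dist < best_dist) { best_dist = dist; best = c; }
    }
    return best;                             // store this index instead of the vector
}

// Decoding is a simple table lookup:
const std::vector<float>& vq_decode(size_t index,
                                    const std::vector<std::vector<float>>& codebook) {
    return codebook[index];
}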
Research papers on vector quantization:
- Sebastian Bruch, Jan 2024, Foundations of Vector Retrieval, https://arxiv.org/abs/2401.09350 (Extensive 200+ pages review of vector lookup data structures such as LSH and clustering.)
- Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
- W Jiang, P Liu, F Wen, 2017, An improved vector quantization method using deep neural network, AEU, Volume 72, February 2017, Pages 178-183, https://www.sciencedirect.com/science/article/pii/S1434841116313954
- Zicheng Liu, Li Wang, Siyuan Li, Zedong Wang, Haitao Lin, Stan Z. Li, 18 Apr 2024 (v2), LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory, https://arxiv.org/abs/2404.11163
- Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li, 2 Jun 2024 (v2), VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling, https://arxiv.org/abs/2405.10812
- Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev, 18 Dec 2014, Compressing Deep Convolutional Networks using Vector Quantization, https://arxiv.org/abs/1412.6115 (A very early paper on vector quantization of CNNs that has been cited many times.)
- Qijiong Liu, Xiaoyu Dong, Jiaren Xiao, Nuo Chen, Hengchang Hu, Jieming Zhu, Chenxu Zhu, Tetsuya Sakai, Xiao-Ming Wu, 6 May 2024, Vector Quantization for Recommender Systems: A Review and Outlook, https://arxiv.org/abs/2405.03110 (Survey paper on vector quantization.)
- Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
- James Jie Pan, Jianguo Wang, Guoliang Li, 21 Oct 2023, Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021 https://link.springer.com/article/10.1007/s00778-024-00864-x
- Or Sharir, Anima Anandkumar, 27 Jul 2023, Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs, https://arxiv.org/abs/2307.14988
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
- Christopher Fifty, Ronald G. Junkins, Dennis Duan, Aniketh Iger, Jerry W. Liu, Ehsan Amid, Sebastian Thrun, Christopher Ré, 8 Oct 2024, Restructuring Vector Quantization with the Rotation Trick, https://arxiv.org/abs/2410.06424 https://github.com/cfifty/rotation_trick
Quantization Granularity
Quantization granularity refers to which parts, segments, or structures of the model's weights are quantized together with shared quantization parameters. For example, granularity levels may be (a per-tensor versus per-channel comparison is sketched after this list):
- Layerwise quantization
- Vector quantization
- Block quantization
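As a rough illustration of how granularity changes the number of quantization parameters, the C++ sketch below contrasts a single per-tensor scale with per-channel scales for a weight matrix (the names and the symmetric INT8 choice are illustrative):

// Sketch contrasting per-tensor and per-channel scales for a weight matrix
// stored as [channels][width]; finer granularity means more scale parameters.
#include <vector>
#include <cmath>
#include <algorithm>

// One scale for the whole tensor.
float per_tensor_scale(const std::vector<std::vector<float>>& w) {
    float max_abs = 0.0f;
    for (const auto& row : w)
        for (float v : row) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs / 127.0f;
}

// One scale per channel (row): each channel adapts to its own range.
std::vector<float> per_channel_scales(const std::vector<std::vector<float>>& w) {
    std::vector<float> scales;
    for (const auto& row : w) {
        float max_abs = 0.0f;
        for (float v : row) max_abs = std::max(max_abs, std::fabs(v));
        scales.push_back(max_abs / 127.0f);
    }
    return scales;
}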
Research papers on granularity of quantization include:
- Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, Ed H. Chi, 25 Aug 2020 (v2), Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, https://arxiv.org/abs/2002.08530
- Lianwei Yang, Zhikai Li, Junrui Xiao, Haisong Gong, Qingyi Gu, 13 Jun 2024, MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction, https://arxiv.org/abs/2406.09229
- Minghai Qin, 27 Aug 2024, The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study, https://arxiv.org/abs/2408.15301
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
Layerwise Quantization
Layerwise quantization is quantization done at the granularity of layers, where each layer has its own quantization parameters. This can be a special case of mixed-precision quantization (i.e., per-layer precision), but it is also possible to use the same precision in every layer with different per-layer parameters.
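Conceptually, each layer carries its own quantization parameters, and possibly its own bit width. A minimal C++ sketch of per-layer calibration, with illustrative names and a simple symmetric max-abs scale:

// Sketch of layer-wise quantization parameters: each layer keeps its own
// scale, zero point, and (for mixed precision) its own bit width.
#include <vector>
#include <cmath>
#include <cstdint>
#include <algorithm>

struct LayerQuantParams {
    int bits;            // e.g., 8 for most layers, 4 for less sensitive ones
    float scale;         // per-layer scale factor
    int32_t zero_point;  // per-layer zero point (0 for symmetric quantization)
};

LayerQuantParams calibrate_layer(const std::vector<float>& layer_weights, int bits) {
    float max_abs = 0.0f;
    for (float v : layer_weights) max_abs = std::max(max_abs, std::fabs(v));
    int qmax = (1 << (bits - 1)) - 1;        // e.g., 127 for 8-bit, 7 for 4-bit
    return { bits, max_abs / qmax, 0 };      // symmetric: zero point is 0
}

// Each layer is calibrated independently, so layers can differ in both
// parameters and precision, for example:
//   params.push_back(calibrate_layer(layer0_weights, 8));
//   params.push_back(calibrate_layer(layer1_weights, 4));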
Research papers on layerwise quantization:
- Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 8 Apr 2024, Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators, https://arxiv.org/abs/2404.05368 (Quantization of weights and activations on a CNN with a method to identify the optimal per-layer bitwidth for quantization.)
- Shuchang Zhou, Yuxin Wu, Zekun Ni, et al. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, http://arxiv.org/abs/1606.06160
- Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2016. Trained ternary quantization. arXiv:1612.01064 http://arxiv.org/abs/1612.01064
- C Tang, H Zhai, K Ouyang, Z Wang, Y Zhu, 2022, Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach, https://dl.acm.org/doi/pdf/10.1145/3503161.3548001
- Gihwan Kim, Jemin Lee, Sihyeong Park, Yongin Kwon, Hyungshin Kim, 26 Jul 2024, Mixed Non-linear Quantization for Vision Transformers, https://arxiv.org/abs/2407.18437 Code: https://gitlab.com/ones-ai/mixed-non-linear-quantization
- Shachar Gluska, Mark Grobman, 15 Dec 2020, Exploring Neural Networks Quantization via Layer-Wise Quantization Analysis, https://arxiv.org/abs/2012.08420
- Chen Tang, Kai Ouyang, Zhi Wang, Yifei Zhu, Yaowei Wang, Wen Ji, Wenwu Zhu, 5 Mar 2023 (v5), Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance, https://arxiv.org/abs/2203.08368 https://github.com/1hunters/LIMPQ
- Chen Tang, Haoyu Zhai, Kai Ouyang, Zhi Wang, Yifei Zhu, Wenwu Zhu, 21 Apr 2022, Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and Adaptive Inference Approach, https://arxiv.org/abs/2204.09992
- Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu, 26 Jun 2024 (v2), Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels, https://arxiv.org/abs/2406.17415 Code: https://github.com/RazvanDu/LayerwiseQuant
- Bernard Ryhede Bengtsson, Joel Bengs, 2024, Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM, MSc Thesis, Lund University, Sweden, https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9174462&fileOId=9174463
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
Blockwise Quantization
Blockwise quantization is a fine-grained type of quantization where each "block" of data has its own quantization parameters, such as a per-block scale factor.
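As a rough illustration, the C++ sketch below splits a weight vector into fixed-size blocks and computes a separate symmetric INT8 scale for each block (the block size of 64 and all names are illustrative):

// Sketch of blockwise INT8 quantization: the weights are split into fixed-size
// blocks and each block gets its own scale, computed from its own value range.
#include <vector>
#include <cstddef>
#include <cstdint>
#include <cmath>
#include <algorithm>

struct BlockQuantized {
    std::vector<int8_t> data;      // quantized values, in block order
    std::vector<float> scales;     // one scale per block
    size_t block_size;
};

BlockQuantized quantize_blockwise(const std::vector<float>& w, size_t block_size = 64) {
    BlockQuantized out;
    out.block_size = block_size;
    for (size_t start = 0; start < w.size(); start += block_size) {
        size_t end = std::min(start + block_size, w.size());
        float max_abs = 0.0f;
        for (size_t i = start; i < end; ++i)
            max_abs = std::max(max_abs, std::fabs(w[i]));
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        out.scales.push_back(scale);
        for (size_t i = start; i < end; ++i)
            out.data.push_back((int8_t)std::round(w[i] / scale));
    }
    return out;
}

Research papers on blockwise quantization include: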
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
- Yanshu Wang, Wenyang He, Tong Yang, 24 May 2024, Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information, https://arxiv.org/abs/2405.17470
- Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu, 21 Jul 2024 (v2), Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization, https://arxiv.org/abs/2403.12422 Code: https://github.com/thu-ml/Jetfire-INT8Training
- Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang, 15 Apr 2024 (v4), CBQ: Cross-Block Quantization for Large Language Models, https://arxiv.org/abs/2312.07950
- Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li, 27 Oct 2023 (v3), Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators, https://arxiv.org/abs/2310.12182
- Sebastian Eliassen, Raghavendra Selvan, 16 Jan 2024