Aussie AI
Integer Arithmetic
-
Last Updated 28 November, 2024
-
by David Spuler, Ph.D.
Replacing floating-point calculations with integer arithmetic is a well-known optimization. In AI, everyone thinks of quantization, which is the most common use of integer arithmetic, but it is not the only place where integer arithmetic optimizations can be applied.
Quantization: A list of AI quantization techniques that involve integer arithmetic includes (see the sketch after this list):
- Quantization: overview, binary, ternary, 2-bit (INT2), 3-bit (INT3), 4-bit (INT4), 5-bit (INT5), 6-bit (INT6), 7-bit (INT7), 8-bit (INT8), INT16, INT32
- Integer-only arithmetic quantization
- Fixed-point quantization
- Block floating-point quantization
- Logarithmic power-of-two quantization (bitshift quantization)
- Dyadic quantization
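Several of these techniques boil down to mapping floating-point weights onto a small integer range with a scale factor. As a minimal sketch (not taken from any particular paper, with hypothetical helper names), symmetric INT8 quantization and dequantization might look like this in C++:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Symmetric INT8 quantization sketch: each weight w maps to
    // round(w / scale), clamped to [-127, 127].
    std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float& scale) {
        float maxabs = 0.0f;
        for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
        scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
        std::vector<int8_t> q(weights.size());
        for (size_t i = 0; i < weights.size(); ++i) {
            int v = (int)std::lround(weights[i] / scale);
            q[i] = (int8_t)std::clamp(v, -127, 127);
        }
        return q;
    }

    // Dequantization: multiply the integer back by the scale factor.
    inline float dequantize_int8(int8_t q, float scale) {
        return q * scale;
    }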
A fully integer-only implementation of quantization uses integer arithmetic not just in the MatMuls, but also in all of the other Transformer components (an illustrative sketch follows this list):
- Integer Softmax
- Integer activation functions (e.g. RELU)
- Integer normalization
- Integer positional encoding
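As an illustration of what "integer-only" means for these components, here is a minimal sketch of two pieces: an integer RELU, and an integer square root (Newton's method) of the kind that can replace the floating-point square root in a normalization denominator. These are illustrative fragments, not the exact algorithms of any cited paper, which typically use more careful fixed-point scaling:

    #include <cstdint>

    // Integer RELU: already trivially integer-only.
    inline int32_t relu_int(int32_t x) { return x > 0 ? x : 0; }

    // Integer square root via Newton's method (floor of sqrt(n)),
    // usable in an integer-only normalization denominator.
    int32_t isqrt(int64_t n) {
        if (n <= 0) return 0;
        int64_t x = n;
        int64_t y = (x + 1) / 2;
        while (y < x) {
            x = y;
            y = (x + n / x) / 2;
        }
        return (int32_t)x;
    }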
Non-Quantization Integers: A list of non-quantization AI optimization techniques that involve integer arithmetic includes (see the bitshift sketch after this list):
- Bitshift-add networks
- Add-as-integer networks
- Bitwise neural networks
- Weightless Neural Networks (WNNs)
- XNOR networks (see also binary quantization)
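The common thread in several of these methods is replacing multiplication with cheaper integer operations. For example, if a weight has been quantized to a signed power of two (as in logarithmic/bitshift quantization and bitshift-add networks), the multiplication in a dot product becomes a shift. A minimal sketch, with a hypothetical weight representation:

    #include <cstdint>

    // Hypothetical representation of a weight quantized to +/- 2^k.
    struct Pow2Weight {
        int8_t exponent;  // k, so |w| = 2^k (assumed non-negative here)
        int8_t sign;      // +1 or -1
    };

    // x * w computed with a shift instead of a multiply.
    inline int32_t shift_multiply(int32_t x, Pow2Weight w) {
        int32_t shifted = x << w.exponent;
        return (w.sign >= 0) ? shifted : -shifted;
    }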
Integer Arithmetic
Research papers on integer arithmetic in AI models:
- Radha Gulhane, 2024, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication, Master's Thesis, Computer Science and Engineering, The Ohio State University, https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=osu1713381834648517&disposition=inline
- Z Zou, C Zhang, S Chen, H Kou, B Liu, March 2024, Integer Arithmetic-Based and Activation-Aware GELU Optimization for Vision Transformer, 2024 Conference of Science and Technology for Integrated Circuits (CSTIC), 17-18 March 2024, https://ieeexplore.ieee.org/abstract/document/10531966/
- Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 19 Apr 2024, decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points, https://arxiv.org/abs/2404.12759 Code: https://github.com/bytedance/decoupleQ (Decouple parameters into integer and floating-point parts for more accurate quantization at low bitwidths.)
- Xing Hu, Yuan Chen, Dawei Yang, Sifan Zhou, Zhihang Yuan, Jiangyong Yu, Chen Xu, 28 May 2024, I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models, https://arxiv.org/abs/2405.17849 Code: https://anonymous.4open.science/r/I-LLM-F242/
- Ruiqi Sun, Yinchen Ni, Xin He, Jie Zhao, An Zou, 1 Feb 2024, ONE-SA: Enabling Nonlinear Operations in Systolic Arrays for Efficient and Flexible Neural Network Inference, https://arxiv.org/abs/2402.00395
- Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, Muhammad Shafique, Apr 2023, SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers, https://arxiv.org/abs/2304.03986 Code: https://github.com/albertomarchisio/SwiftTron
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Mohammadreza Tayaranian, Seyyed Hasan Mozafari, James J. Clark, Brett Meyer, Warren Gross, 2 Feb 2024, Faster Inference of Integer SWIN Transformer by Removing the GELU Activation, https://arxiv.org/abs/2402.01169 (Replace GELU with RELU.)
- Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
- Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
- Donghyeon Yi, Seoyoung Lee, Jongho Kim, Junyoung Kim, Sohmyung Ha, Ik Joon Chang, Minkyu Je, 22 Nov 2024, FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration, https://arxiv.org/abs/2411.14733
End-to-End Integer Arithmetic
Integers everywhere. That's the goal of end-to-end integer arithmetic for inference in a Transformer. Storing the weights and activations as integers is the realm of integer-only-arithmetic quantization, but other components must also be processed with integers to achieve end-to-end integer-only inference, such as the activation functions, normalization, and Softmax components.
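One recurring building block in end-to-end integer pipelines is integer requantization between layers: the INT32 accumulator from an integer MatMul has to be rescaled back to INT8 without touching floating point. The dyadic approach (as in HAWQ-V3, cited below) approximates the rescale factor as a rational number b / 2^c, so rescaling needs only an integer multiply, a rounding add, and a right shift. A minimal sketch with illustrative parameter names, not taken from any specific implementation:

    #include <cstdint>

    // Dyadic requantization sketch: rescale an INT32 accumulator to INT8
    // using a rational factor b / 2^c (b and c chosen offline to approximate
    // the floating-point scale ratio; assumes c >= 1).
    inline int8_t requantize_dyadic(int32_t acc, int32_t b, int32_t c) {
        int64_t scaled = (int64_t)acc * b;                    // integer multiply
        int64_t rounded = (scaled + (1LL << (c - 1))) >> c;   // round, then divide by 2^c
        if (rounded > 127) rounded = 127;                     // clamp to INT8 range
        if (rounded < -128) rounded = -128;
        return (int8_t)rounded;
    }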
Research papers on end-to-end integer arithmetic:
- J Zhong, Z Liu, X Chen, Apr 2023, Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, https://arxiv.org/abs/2304.10891 (Mainly focused on 8-bit integer arithmetic for machine vision Transformers.)
- Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Integers only in quantized weights and activations with INT4 or INT8, but also uses integers for batch normalization and residual connection components, too.)
- Y. Lin, Y. Li, T. Liu et al., “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI, 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
- Peng Peng, Mingyu You, Weisheng Xu, and Jiaxin Li. Fully integer-based quantization for mobile convolutional neural network inference. Neurocomputing, 432:194–205, 2021, https://www.sciencedirect.com/science/article/abs/pii/S0925231220319354 (Quantizes with INT4, but not only weights, but also has integer batch normalization.)
- Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer, I-BERT: Integer-only BERT Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5506-5518, 2021, https://arxiv.org/abs/2101.01321, https://proceedings.mlr.press/v139/kim21d.html (I-BERT uses quantization, but also has integer arithmetic for GELU, Softmax, and Layer Normalization.)
- Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., Keutzer, K., HAWQ: Hessian AWare Quantization of neural networks with mixed-precision. In The IEEE International Conference on Computer Vision (ICCV), October 2019. https://ieeexplore.ieee.org/document/9009512, https://arxiv.org/abs/1905.03696 (Early paper that isn't quite end-to-end with integers.)
- Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda, Dec 2022, Training Integer-Only Deep Recurrent Neural Networks https://arxiv.org/abs/2212.11791 (Integer-only version of RNNs called iRNN, with integer-only layer normalization, integer-only attention, and piecewise linear approximation for integer-only activation functions such as tanh and sigmoid.)
- R Yin, Y Li, A Moitra, P Panda, Sep 2023, MINT: Multiplier-less Integer Quantization for Spiking Neural Networks, https://arxiv.org/abs/2305.09850
- Shuo Huai, Di Liu, Xiangzhong Luo, Hui Chen, Weichen Liu, Ravi Subramaniam, 2023, Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient In-Memory Acceleration, ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference, January 2023, Pages 234–239, https://doi.org/10.1145/3566097.3567856, https://dl.acm.org/doi/abs/10.1145/3566097.3567856
- Z Zhang, B He, Z Zhang, 2023, Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization, Proceedings of Machine Learning and Systems 5 pre-proceedings (MLSys 2023) mlsys2023, https://proceedings.mlsys.org/paper_files/paper/2023/hash/023560744aae353c03f7ae787f2998dd-Abstract-mlsys2023.html, PDF: https://proceedings.mlsys.org/paper_files/paper/2023/file/023560744aae353c03f7ae787f2998dd-Paper-mlsys2023.pdf (Integer-only-arithmetic quantization with integer-only versions of Softmax, LayerNorm, and GELU.)
- Eyyüb Sari, Vanessa Courville, Vahid Partovi Nia, Feb 2022, iRNN: Integer-only Recurrent Neural Network, https://arxiv.org/abs/2109.09828
- J Bartels, A Hagihara, L Minati, An Integer-Only Resource-Minimized RNN on FPGA for Low-Frequency Sensors in Edge-AI, 2023, IEEE Sensors Journal, Volume 23, Issue 15, 01 August 2023, https://ieeexplore.ieee.org/abstract/document/10161725/, PDF: https://ieeexplore.ieee.org/iel7/7361/4427201/10161725.pdf
- Lin, Y., Zhang, T., Sun, P., Li, Z., and Zhou, S. FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1173–1179, 2022. https://arxiv.org/abs/2111.13824
- A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
- Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
- David Spuler, March 2024, Chapter 53. Arithmetic Optimization Research, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
Integer Dot Product
The dot product of two vectors containing integers is also an integer.
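This means the core MatMul/GEMM computations can accumulate entirely in integer registers, typically widening INT8 operands into an INT32 (or larger) accumulator to avoid overflow. A minimal sketch:

    #include <cstddef>
    #include <cstdint>

    // Integer dot product: INT8 inputs, INT32 accumulation, integer result.
    int32_t dot_product_int8(const int8_t* a, const int8_t* b, size_t n) {
        int32_t sum = 0;
        for (size_t i = 0; i < n; ++i) {
            sum += (int32_t)a[i] * (int32_t)b[i];
        }
        return sum;
    }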
Research papers on integer-based vector dot products:
- Nils Kohl, Stephen F. McCormick, Rasmus Tamstorf, 30 Jun 2023, Multigrid Methods using Block Floating Point Arithmetic, https://arxiv.org/abs/2307.00124
- Mario Drumond, Tao Lin, Martin Jaggi, Babak Falsafi, 2 Dec 2018 ( v4), Training DNNs with Hybrid Block Floating Point, NeurIPS, https://arxiv.org/abs/1804.01526 PDF: https://proceedings.neurips.cc/paper/2018/file/6a9aeddfc689c1d0e3b9ccc3ab651bc5-Paper.pdf
- Yeong Foong Choo, Brian L. Evans, Alan Gatherer, 25 Oct 2017 ( v2), Complex Block Floating-Point Format with Box Encoding For Wordlength Reduction in Communication Systems, https://arxiv.org/abs/1705.05217 (Use of BFP for audio sampling in 2017.)
- Simla Burcu Harma, Ayan Chakraborty, Babak Falsafi, Martin Jaggi, Yunho Oh, 2023, Accuracy Boosters: Epoch-Driven Mixed-Mantissa Block Floating Point for DNN Training, ML for Computer Architecture and Systems (MLArchSys), ISCA 2023, https://openreview.net/pdf?id=nfmfqzQ4Mwl (Mixed precision version of BFP with per-block bit sizes, and integer arithmetic for dot product, but FP32 for other operations.)
- Wikipedia, April 2024 (accessed), Block floating point https://en.wikipedia.org/wiki/Block_floating_point
- Chhabra, Arun; Iyer, Ramesh (December 1999). "TMS320C55x A Block Floating Point Implementation on the TMS320C54x DSP" (PDF) (Application report). Digital Signal Processing Solutions. Texas Instruments. SPRA610. Archived (PDF) from the original on 2018-07-11. Retrieved 2018-07-11. https://web.archive.org/web/20180711175625/http://www.eeng.dcu.ie/~ee206/pdf/block_flt_pt.pdf
- Elam, David; Iovescu, Cesar (September 2003). "A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP" (PDF) (Application report). TMS320C5000 Software Applications. Texas Instruments. SPRA948. Archived (PDF) from the original on 2018-07-11. Retrieved 2015-11-01. https://www.ti.com/lit/an/spra948/spra948.pdf
- Wilkinson, James Hardy (1963). Rounding Errors in Algebraic Processes (1 ed.). Englewood Cliffs, NJ, USA: Prentice-Hall, Inc. MR 0161456. https://books.google.com.au/books?id=yFogU9Ot-qsC&redir_esc=y
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Microsoft, “MX pytorch emulation library,” https://github.com/microsoft/microxcaling, 2023.
More AI Research
Read more about: