Aussie AI

End-to-End Integer Arithmetic

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Integers everywhere. That's the goal of end-to-end integer arithmetic for inference in a Transformer. Storing the weights and activations as integers is the realm of integer-only-arithmetic quantization, but all of the other components must also be computed with integer operations to achieve end-to-end integer-only inference.

Integer arithmetic is much faster than floating-point arithmetic, for both sequential and parallel computation. Replacing floating-point calculations with integer arithmetic is a well-known optimization, and converting AI engines to use integer-only arithmetic has been an ongoing area of research.

In AI, quantization is what everyone thinks of first, and it is the most common use of integer arithmetic. Integer quantization has moved from research into the mainstream, with commonly used sizes being 16-bit, 8-bit, and even 4-bit integers (see Chapter 44). However, integer quantization does not typically perform all arithmetic in integers, but often converts back and forth between integers and floating-point.
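
For illustration, here is a minimal sketch of how a typical INT8 scheme mixes the two: the dot product itself runs in integer arithmetic with a 32-bit accumulator, but the inputs are quantized from floats and the result is scaled back to a float at the end. This assumes simple symmetric per-tensor quantization with a floating-point scale, and the function names are illustrative only.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor INT8 quantization of one value (illustrative sketch).
    int8_t quantize(float x, float scale) {
        int q = (int)std::lround(x / scale);
        if (q > 127) q = 127;     // clamp to the INT8 range
        if (q < -128) q = -128;
        return (int8_t)q;
    }

    // INT8 dot product: integer multiply-adds into an INT32 accumulator,
    // followed by a floating-point "dequantize" step at the end. That final
    // conversion is the back-and-forth that end-to-end integer inference
    // aims to remove.
    float int8_dot_product(const std::vector<int8_t>& a, float scale_a,
                           const std::vector<int8_t>& b, float scale_b) {
        int32_t acc = 0;
        for (size_t i = 0; i < a.size(); i++)
            acc += (int32_t)a[i] * (int32_t)b[i];   // integer multiply-add
        return acc * scale_a * scale_b;             // convert back to float
    }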

The extension to “integer-only arithmetic quantization” remains an active area of research, with considerable ongoing work to create “integer-only engines” for faster AI. A full implementation needs integer arithmetic not just in the weights and MatMuls, but also in the other Transformer components, such as:

  • Integer Softmax
  • Integer activation functions (e.g. RELU is easy; see the sketch after this list)
  • Integer normalization
  • Integer positional encoding
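
RELU is the easy case because it only needs a comparison against zero, which works directly on quantized integer activations (assuming a symmetric scheme where 0.0 maps to integer 0). Softmax and normalization are harder, since they need integer approximations of the exponential, division, and square-root operations, as in several of the papers below. A minimal sketch of integer RELU on INT8 activations, with illustrative function names:

    #include <algorithm>
    #include <cstdint>

    // Integer RELU: max(x, 0) with no floating-point operations,
    // assuming 0.0 quantizes to the integer value 0.
    inline int8_t int_relu(int8_t x) {
        return x > 0 ? x : 0;
    }

    // Clipped RELU6-style variant, where six_q is assumed to be the
    // quantized integer representation of 6.0 under the activation scale.
    inline int8_t int_relu6(int8_t x, int8_t six_q) {
        return std::min(std::max(x, (int8_t)0), six_q);
    }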

Non-Quantization Integer Research: Quantization is not the only place where integer arithmetic optimizations can be used in AI engines. A list of other AI optimization techniques that involve integer arithmetic includes the following (a sketch of the add-as-integer idea appears after the list):

  • Bitshift-add networks
  • Add-as-integer networks
  • Bitwise neural networks
  • Weightless Neural Networks (WNNs)
  • XNOR networks (see also binary quantization)
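
As one example, the add-as-integer idea approximates a floating-point multiplication using a single integer addition on the raw IEEE 754 bit patterns, exploiting the fact that a float's bit pattern roughly encodes its logarithm. A rough sketch is below; it is valid only for positive normal floats, has a small relative error, and the helper name approx_multiply is illustrative.

    #include <cstdint>
    #include <cstring>

    // Approximate float multiply via one integer addition ("add-as-integer").
    // Adds the bit patterns and subtracts 0x3F800000 (the bit pattern of 1.0f)
    // to correct the doubled exponent bias. Valid for positive normal floats;
    // signs, zeros, and denormals need extra handling in a real implementation.
    inline float approx_multiply(float a, float b) {
        int32_t ia, ib;
        std::memcpy(&ia, &a, sizeof ia);    // reinterpret bits, no conversion
        std::memcpy(&ib, &b, sizeof ib);
        int32_t ic = ia + ib - 0x3F800000;  // integer add replaces the FP multiply
        float c;
        std::memcpy(&c, &ic, sizeof c);
        return c;                           // approximately a * b
    }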

Research papers on end-to-end integer networks:

  1. J Zhong, Z Liu, X Chen, Apr 2023, Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, https://arxiv.org/abs/2304.10891 (Mainly focused on 8-bit integer arithmetic for machine vision Transformers.)
  2. Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, 2021, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Integers only in quantized weights and activations with INT4 or INT8, but also uses integers for batch normalization and residual connection components, too.)
  3. Y. Lin, Y. Li, T. Liu et al., 2020, Towards fully 8-bit integer inference for the transformer model, in Proc. of IJCAI, 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
  4. Peng Peng, Mingyu You, Weisheng Xu, and Jiaxin Li. 2021, Fully integer-based quantization for mobile convolutional neural network inference, Neurocomputing, 432:194–205, 2021, https://www.sciencedirect.com/science/article/abs/pii/S0925231220319354 (Quantizes with INT4, but not only weights, but also has integer batch normalization.)
  5. Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer, 2021, I-BERT: Integer-only BERT Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5506-5518, 2021, https://arxiv.org/abs/2101.01321, https://proceedings.mlr.press/v139/kim21d.html (I-BERT uses quantization, but also has integer arithmetic for GELU, Softmax, and Layer Normalization.)
  6. Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., Keutzer, K., 2019, HAWQ: Hessian AWare Quantization of neural networks with mixed-precision, In The IEEE International Conference on Computer Vision (ICCV), October 2019. https://ieeexplore.ieee.org/document/9009512, https://arxiv.org/abs/1905.03696 (Early paper that isn't quite end-to-end with integers.)
  7. Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda, Dec 2022, Training Integer-Only Deep Recurrent Neural Networks, https://arxiv.org/abs/2212.11791 (Integer-only version of RNNs called iRNN, with integer-only layer normalization, integer-only attention, and piecewise linear approximation for integer-only activation functions such as tanh and sigmoid.)
  8. R Yin, Y Li, A Moitra, P Panda, Sep 2023, MINT: Multiplier-less Integer Quantization for Spiking Neural Networks, https://arxiv.org/abs/2305.09850
  9. Shuo Huai, Di Liu, Xiangzhong Luo, Hui Chen, Weichen Liu, Ravi Subramaniam, 2023, Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient In-Memory Acceleration, ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference, January 2023, Pages 234–239, https://doi.org/10.1145/3566097.3567856, https://dl.acm.org/doi/abs/10.1145/3566097.3567856
  10. Z Zhang, B He, Z Zhang, 2023, Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization, Proceedings of Machine Learning and Systems 5 pre-proceedings (MLSys 2023) mlsys2023, https://proceedings.mlsys.org/paper_files/paper/2023/hash/023560744aae353c03f7ae787f2998dd-Abstract-mlsys2023.html, PDF: https://proceedings.mlsys.org/paper_files/paper/2023/file/023560744aae353c03f7ae787f2998dd-Paper-mlsys2023.pdf (Integer-only-arithmetic quantization with integer-only versions of Softmax, LayerNorm, and GELU.)
  11. Eyyüb Sari, Vanessa Courville, Vahid Partovi Nia, Feb 2022, iRNN: Integer-only Recurrent Neural Network, https://arxiv.org/abs/2109.09828
  12. J Bartels, A Hagihara, L Minati, 2023, An Integer-Only Resource-Minimized RNN on FPGA for Low-Frequency Sensors in Edge-AI, IEEE Sensors Journal, Volume 23, Issue 15, 01 August 2023, https://ieeexplore.ieee.org/abstract/document/10161725/, PDF: https://ieeexplore.ieee.org/iel7/7361/4427201/10161725.pdf
  13. Lin, Y., Zhang, T., Sun, P., Li, Z., and Zhou, S., 2022, FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1173–1179, 2022. https://arxiv.org/abs/2111.13824
  14. A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf

For more research papers on end-to-end integer arithmetic in AI engines, see https://www.aussieai.com/research/integer#end2end.

 
