Sum of Two Bitshifts Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The downside of logarithmic quantization is that there are relatively few unique weights, which limits precision even when the number of bits is maximized with a large scaling factor. An alternative implementation uses two bitshift operations and an addition (i.e., “shift-and-add” operations), so that the two highest set bits of the quantized integer weight are used. This improves model precision at the cost of more computation, and assumes that two integer shifts plus an integer addition together cost less than a single integer multiplication. An early mention of this “sums of powers of two” method is Marchesi et al. (1993).
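
For illustration, here is a minimal C++ sketch of the shift-and-add idea, assuming each weight has already been quantized offline to its two highest set bits plus a sign (the ShiftPairWeight structure and shift_add_mul function are hypothetical names for this sketch, not from any particular library):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical storage format: each quantized weight keeps the exponents
    // of its two highest set bits, plus a sign flag.
    struct ShiftPairWeight {
        int8_t shift1;  // exponent of the highest power of two
        int8_t shift2;  // exponent of the second power of two, or -1 if unused
        bool negative;  // true if the weight is negative
    };

    // Replace w * x with (x << shift1) + (x << shift2): two shifts and an add.
    inline int32_t shift_add_mul(const ShiftPairWeight& w, int32_t x) {
        int32_t result = x << w.shift1;
        if (w.shift2 >= 0)
            result += x << w.shift2;
        return w.negative ? -result : result;
    }

    int main() {
        // Weight 20 = 16 + 4 = 2^4 + 2^2, so the stored shifts are 4 and 2.
        ShiftPairWeight w = { 4, 2, false };
        printf("%d\n", shift_add_mul(w, 3));  // prints 60 (same as 20 * 3)
        return 0;
    }

For example, a weight of 22 (binary 10110) would be quantized to 20 (2^4 + 2^2), keeping only its two highest set bits; this is a closer approximation than rounding to a single power of two, which would give 16.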

Research papers on sum-of-two-bitshifts quantization:

  1. Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So, Xuehai Qian, Yanzhi Wang, and Xue Lin, 2021, Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), IEEE, Seoul, Korea (South), 208–220, https://doi.org/10.1109/HPCA51647.2021.00027
  2. H. You, X. Chen, Y. Zhang, C. Li, S. Li, Z. Liu, Z. Wang, and Y. Lin, 2020, ShiftAddNet: A Hardware-Inspired Deep Network, In NeurIPS, https://arxiv.org/abs/2010.12785
  3. Michele Marchesi, Gianni Orlandi, Francesco Piazza, and Aurelio Uncini, 1993, Fast neural networks without multipliers, IEEE Transactions on Neural Networks, 4(1):53–62, https://ieeexplore.ieee.org/document/182695
  4. Robert Eisele, 2010, Optimizing integer multiplication, blog post, April 29th, 2010, https://www.xarg.org/2010/04/optimizing-integer-multiplication/
  5. Yuhang Li, Xin Dong, and Wei Wang, 2020, Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks, In International Conference on Learning Representations (ICLR), https://arxiv.org/abs/1909.13144

See also more sum-of-two-bitshifts quantization papers at https://www.aussieai.com/research/quantization#logarithmic.

 
