Binary Quantization
Book Excerpt from "Generative AI in C++" by David Spuler, Ph.D.
The extreme of quantization is to encode floating-point weights down to a single bit. This is binary quantization (or "binarization"), where there are only two possible weight values: either 0 and 1, or alternatively -1 and +1. This compresses the model by a factor of 32 in space compared to 32-bit floats, and reduces the inference computations to simpler arithmetic.
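To see where the factor-of-32 saving comes from, here is a minimal C++ sketch, not taken from the book's code, that binarizes a weight vector by sign and packs 32 one-bit weights into each 32-bit word; the function name and the sign-threshold rule are illustrative assumptions.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Binarize float weights by sign (+1 if >= 0, else -1) and pack
    // 32 one-bit weights into each 32-bit word (bit set = +1, clear = -1).
    // Storing 32 weights per word is the source of the 32x space saving
    // relative to 32-bit floats. (Illustrative sketch only.)
    std::vector<uint32_t> pack_binary_weights(const std::vector<float>& w)
    {
        std::vector<uint32_t> packed((w.size() + 31) / 32, 0u);
        for (std::size_t i = 0; i < w.size(); ++i) {
            if (w[i] >= 0.0f) {
                packed[i / 32] |= (1u << (i % 32));
            }
        }
        return packed;
    }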
The downside of binary quantization is the loss of accuracy. Hence, binary networks haven't seen widespread industry adoption. However, there is a continual stream of research papers attempting to improve them.
The attraction of binary quantization is that its runtime efficiency is hard to beat. With only two weight values, multiplication by a floating-point weight reduces to a simple addition (for a 1 weight) and a null test (for a 0 weight). Alternatively, with binary weights of -1 and +1, a -1 weight becomes a subtraction and a +1 weight an addition, which is usually further optimized using a sign-bit manipulation.
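As a concrete illustration, here is a hedged C++ sketch of a vector dot product under both weight encodings; the function names and the unpacked byte-per-weight representation are assumptions made for clarity, not the book's implementation.

    #include <cstddef>

    // Dot product with 0/1 binary weights: the floating-point multiply
    // disappears, leaving a null test (skip for 0) and an addition (for 1).
    float dot_binary_01(const float* x, const unsigned char* w, std::size_t n)
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (w[i]) sum += x[i];   // weight 1: add; weight 0: skip
        }
        return sum;
    }

    // Dot product with -1/+1 binary weights: +1 is an addition and -1 a
    // subtraction (often further optimized with sign-bit tricks).
    float dot_binary_pm1(const float* x, const signed char* w, std::size_t n)
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            sum += (w[i] > 0) ? x[i] : -x[i];
        }
        return sum;
    }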
Binary quantization is not the only way to use single bits in AI models. There are also other incarnations of binary neural network architectures that use only bitwise operations, such as XNOR networks and Weightless Neural Networks (WNNs).
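As a rough sketch of the XNOR idea (illustrative only, and assuming C++20 for std::popcount): when both activations and weights are bit-packed -1/+1 values, a 32-element dot product collapses into an XNOR followed by a population count.

    #include <bit>       // std::popcount (C++20)
    #include <cstdint>

    // XNOR-network style dot product: activations and weights are both
    // binarized to -1/+1 and bit-packed (bit 1 = +1, bit 0 = -1).
    // The dot product over 32 elements is then:
    //     dot = (#sign agreements) - (#disagreements) = 2*popcount(XNOR) - 32
    int xnor_dot32(uint32_t x_bits, uint32_t w_bits)
    {
        uint32_t agree = ~(x_bits ^ w_bits);      // XNOR: 1 where signs agree
        return 2 * std::popcount(agree) - 32;
    }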
Research papers on binary quantization:
- H. Yang, M. Fritzsche, C. Bartz, and C. Meinel (2017), Bmxnet: An open-source binary neural network implementation based on mxnet, CoRR, vol. abs/1705.09864, 2017, https://arxiv.org/abs/1705.09864, Code: https://github.com/hpi-xnor
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi (2016), Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, Springer, 525–542, https://arxiv.org/abs/1603.05279
- B. McDanel, S. Teerapittayanon, and H. Kung (2017), Embedded binarized neural networks, arXiv preprint arXiv:1709.02260, 2017, https://arxiv.org/abs/1709.02260
- Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio (2016), Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1, Feb 2016, https://arxiv.org/abs/1602.02830
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio (2016), Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 4114–4122, https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
- Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio (2016), Neural Networks with Few Multiplications, Feb 2016, https://arxiv.org/abs/1510.03009v1
- Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David (2015), BinaryConnect: Training deep neural networks with binary weights during propagations, In NeurIPS, pages 3123–3131, 2015, https://arxiv.org/abs/1511.00363
- Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos (2017). Deep learning with low precision by half-wave Gaussian quantization. In CVPR, pages 5918–5926, 2017, https://arxiv.org/abs/1702.00953 (Has binary weights, 2-bit activations)
- Yefei He, Zhenyu Lou, Luoming Zhang, Weijia Wu, Bohan Zhuang, and Hong Zhou (2022). Bivit: Extremely compressed binary vision transformer. arXiv preprint arXiv:2211.07091, 2022. https://arxiv.org/abs/2211.07091 (Softmax-aware binarization)
- Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, and Yashar Mehdad (2022). Bit: Robustly binarized multi-distilled transformer. arXiv preprint arXiv:2205.13016, 2022. https://arxiv.org/abs/2205.13016, Code: https://github.com/facebookresearch/bit
- Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides (2017). Local binary convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 19–28, 2017. https://arxiv.org/abs/1608.06049
- Zechun Liu, Zhiqiang Shen, Marios Savvides, and KwangTing Cheng (2020). Reactnet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, pages 143–159. Springer, 2020. https://arxiv.org/abs/2003.03488
- Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder (2019). Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems 32, pages 7533–7544. 2019. https://arxiv.org/abs/1906.02107, Code: https://github.com/plumerai/rethinking-bnn-optimization
- Xiaofan Lin, Cong Zhao, and Wei Pan (2017). Towards accurate binary convolutional neural network. Advances in Neural Information Processing Systems, 30, 2017. https://arxiv.org/abs/1711.11294 (Uses multiple single-bit weights combined to create a multi-binary quantization method.)
- Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat (2023), Binarized Neural Machine Translation, Feb 2023, https://arxiv.org/abs/2302.04907
- Kota Ando, Kodai Ueyoshi, Kentaro Orimo, Haruyoshi Yonekawa, Shimpei Sato, Hiroki Nakahara, Masayuki Ikebe (2017), BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS, Proc. Symp. VLSI Circuits, pp. C24-C25, Jun. 2017. https://ieeexplore.ieee.org/document/8008533
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou (2016), DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv:1606.06160, 2016. https://arxiv.org/abs/1606.06160 (Has binary weights, 2-bit activations)
- R. Andri, L. Cavigelli, D. Rossi and L. Benini (2016), YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights, Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), pp. 236-241, Jul. 2016. https://arxiv.org/abs/1606.05487v1
- R. Ding, Z. Liu, R. D. Blanton, and D. Marculescu (2018). Quantized deep neural networks for energy efficient hardware-based inference. In IEEE Asia and South Pacific Design Automation Conference, pages 1–8, 2018. https://ieeexplore.ieee.org/document/8297274 (Survey and evaluation of various quantized DNN models in 2018, including binarized and light models, on chosen datasets.)
- Taylor Simons and Dah-Jye Lee (2019), A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
- Y Shang, Z Yuan, Q Wu, Z Dong (2023), PB-LLM: Partially Binarized Large Language Models, Sep 2023, arXiv preprint arXiv:2310.00034, https://browse.arxiv.org/pdf/2310.00034.pdf, Code: https://github.com/hahnyuan/BinaryLLM (Hybrid partial binarization.)
See more papers on binary quantization at: https://www.aussieai.com/research/quantization#binary