Aussie AI

Low Bit Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


An interesting property of quantization with a very low number of bits (one or two) is that it can achieve zero-multiplication inference.

Binary quantization: 1-bit binary quantization replaces multiplication with addition, subtraction, or a sign flip. If the weights are 0 and 1, then "multiplication" by 1 becomes an addition, and multiplication by 0 becomes a null test. If the weights are +1 and -1, which is more common, then it is a sign test followed by an addition or a subtraction, or simply a sign flip. These operations are often optimized further with bitwise arithmetic, since each weight occupies a single bit. Binary quantization is very fast, but has well-known problems with model accuracy.

Ternary quantization: Similarly, ternary quantization with weights -1, 0, and +1 can be implemented as a null test followed by an addition or a subtraction, with no multiplication. However, ternary quantization also has problems with model accuracy.

2-bit quantization: The four possible weight values could be implemented by zero, one, or two additions instead of a multiplication. For example, with the weight set {-1, 0, 1, 2}, weight 0 needs no operation, weights +1 and -1 need a single addition or subtraction, and weight 2 needs two additions. This type of 2-bit quantization does not receive as much attention in the literature.

See Chapter 44 (Advanced Quantization) for more information about these low-bit quantization techniques and their research papers.

 
