Aussie AI

Low Bit Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


An interesting property of quantization with a very low number of bits (one or two) is that it can achieve zero-multiplication inference.

Binary quantization: 1-bit binary quantization replaces multiplication with addition, subtraction, or a sign flip. If the weights are 0 and 1, then "multiplication" by 1 becomes an addition, and multiplication by 0 becomes a null test. If the weights are +1 and -1, which is more common, then it is a sign test followed by an addition or a subtraction, or simply a sign flip. These operations are often optimized further with bitwise arithmetic, since each weight occupies a single bit. Binary quantization is very fast, but has well-known problems with model accuracy.

Ternary quantization: Similarly, ternary quantization with weights -1, 0, and +1 can be implemented as a null test followed by an addition or a subtraction, with no multiplication. However, ternary quantization also has problems with model accuracy.

2-bit quantization: The four possible weight values could be implemented by zero, one, or two additions instead of a multiplication. For example, with the weight set {-1, 0, 1, 2}, weight 0 needs no operation, weights +1 and -1 need a single addition or subtraction, and weight 2 needs two additions. This type of 2-bit quantization does not receive as much attention in the literature.

See Chapter 44 (Advanced Quantization) for more information about these low-bit quantization techniques and their research papers.

 
