Aussie AI
Floating-Point Quantization
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Floating-Point Quantization
The precision levels for different types of floating-point quantization include:
- FP16 quantization: This uses 16-bit floating-point numbers instead of 32-bit numbers, with the IEEE standard 5-bit exponent and 10-bit mantissa. Commonly used.
- BF16 quantization: This uses the “brain float 16” format, which has a larger 8-bit exponent but a smaller 7-bit mantissa. Used commercially, especially with Google TPUs.
- FP8 quantization: 8-bit floating-point. Mostly a research area.
- FP4 quantization: 4-bit floating-point. Occasionally used in research papers. (The bit layouts of these formats are tabulated in the sketch after this list.)
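To make the bit splits concrete, here is a minimal C++ sketch that tabulates the sign, exponent, and mantissa widths of each format, with FP32 included as the baseline. The FP8 (E4M3 and E5M2) and FP4 (E2M1) layouts shown are the sub-variants most often seen in research papers, which is an assumption since the list above does not pin them down.

// Minimal sketch: bit layouts of the floating-point formats discussed above.
#include <cstdio>

struct FloatFormat {
    const char* name;
    int sign_bits;
    int exponent_bits;
    int mantissa_bits;  // explicit fraction bits (excludes the implicit leading 1)
};

int main() {
    const FloatFormat formats[] = {
        { "FP32 (baseline)",    1, 8, 23 },
        { "FP16 (IEEE half)",   1, 5, 10 },
        { "BF16 (brain float)", 1, 8, 7  },
        { "FP8 E4M3",           1, 4, 3  },  // assumed FP8 variant
        { "FP8 E5M2",           1, 5, 2  },  // assumed FP8 variant
        { "FP4 E2M1",           1, 2, 1  },  // assumed FP4 variant
    };
    for (const FloatFormat& f : formats) {
        std::printf("%-20s sign=%d exponent=%d mantissa=%d total=%d bits\n",
                    f.name, f.sign_bits, f.exponent_bits, f.mantissa_bits,
                    f.sign_bits + f.exponent_bits + f.mantissa_bits);
    }
    return 0;
}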
Note that there's no such thing as FP32 quantization. Why?
The most straightforward quantization is to reduce 32-bit floating-point (4 bytes) to 16-bit floating-point (2 bytes). This halves the memory storage requirements and does not suffer much reduction in model accuracy. All operations in MatMuls are still done in floating-point arithmetic.
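As a concrete illustration of the storage halving, the following is a minimal sketch of FP16 quantization of a weight vector. It assumes a C++23 toolchain that provides the optional std::float16_t type in the <stdfloat> header; the function names are illustrative only, not from any particular library.

// Minimal sketch of FP16 post-training quantization of a weight vector.
// Assumes C++23 std::float16_t is available (it is optional in the standard).
#include <stdfloat>
#include <vector>
#include <cstddef>
#include <cstdio>

// Quantize: narrow each 4-byte float to a 2-byte float (lossy rounding).
std::vector<std::float16_t> quantize_fp16(const std::vector<float>& weights) {
    std::vector<std::float16_t> out(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        out[i] = static_cast<std::float16_t>(weights[i]);
    }
    return out;
}

// Dot product kernel: arithmetic is still floating-point (widened to FP32 here).
float dot_fp16(const std::vector<std::float16_t>& a,
               const std::vector<std::float16_t>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<float>(a[i]) * static_cast<float>(b[i]);
    }
    return sum;
}

int main() {
    std::vector<float> weights = { 0.1f, -2.5f, 3.14159f, 0.0001f };
    std::vector<std::float16_t> w16 = quantize_fp16(weights);
    std::printf("FP32 storage: %zu bytes, FP16 storage: %zu bytes\n",
                weights.size() * sizeof(float),
                w16.size() * sizeof(std::float16_t));
    std::printf("dot(w, w) in FP16 = %f\n", dot_fp16(w16, w16));
    return 0;
}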
The classic form of floating-point quantization is often abbreviated as FP16. There is also “bfloat16”, which uses a different representation of numbers. Quantization from high-precision 32-bit floating-point weights (usually abbreviated “FP32” or “float32”) to lower-precision 16-bit floating-point (usually abbreviated “FP16” or “float16”) can yield significant benefits, often with only a minor loss of accuracy.
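The “different representation” in bfloat16 is essentially the top 16 bits of an FP32 value: the same sign bit and 8-bit exponent, with the 23-bit mantissa truncated to 7 bits. Here is a minimal C++20 sketch of that conversion, using round-to-nearest-even and omitting NaN/infinity special cases for brevity; the raw uint16_t storage and the function names are illustrative choices.

// Minimal sketch: FP32 <-> BF16 conversion, storing BF16 as a raw uint16_t.
#include <bit>
#include <cstdint>
#include <cstdio>

uint16_t fp32_to_bf16(float x) {
    uint32_t bits = std::bit_cast<uint32_t>(x);
    // Round to nearest even before discarding the low 16 mantissa bits.
    // (NaN inputs would need a special case, omitted here.)
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding) >> 16);
}

float bf16_to_fp32(uint16_t b) {
    // Widen back by re-attaching 16 zero mantissa bits.
    return std::bit_cast<float>(static_cast<uint32_t>(b) << 16);
}

int main() {
    float x = 3.14159265f;
    uint16_t b = fp32_to_bf16(x);
    std::printf("FP32 %.8f -> BF16 0x%04X -> FP32 %.8f\n",
                x, static_cast<unsigned>(b), bf16_to_fp32(b));
    return 0;
}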
An even more reduced size is FP8 quantization, which uses 8-bit floating-point numbers. FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Weirdly, there's even FP4 quantization with a compressed representation of floating-point numbers in 4 bits.
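For completeness, here is a hedged sketch of one FP8 variant. The E5M2 format (1 sign bit, 5 exponent bits, 2 mantissa bits) can be derived from FP16 by the same truncation trick that derives BF16 from FP32; the other common variant, E4M3, needs exponent re-biasing and is not shown. This again assumes C++23 std::float16_t support and ignores NaN/infinity special cases.

// Minimal research-style sketch: FP16 <-> FP8 (E5M2) conversion.
// E5M2 keeps the FP16 sign and 5-bit exponent and truncates the mantissa
// from 10 bits to 2, so it is the upper 8 bits of the FP16 bit pattern.
#include <stdfloat>
#include <bit>
#include <cstdint>
#include <cstdio>

uint8_t fp16_to_fp8_e5m2(std::float16_t x) {
    uint16_t bits = std::bit_cast<uint16_t>(x);
    // Round to nearest even before discarding the low 8 mantissa bits.
    // (NaN/infinity inputs would need special cases, omitted here.)
    uint16_t rounding = static_cast<uint16_t>(0x7Fu + ((bits >> 8) & 1u));
    return static_cast<uint8_t>((bits + rounding) >> 8);
}

std::float16_t fp8_e5m2_to_fp16(uint8_t b) {
    // Widen back by re-attaching 8 zero mantissa bits.
    return std::bit_cast<std::float16_t>(static_cast<uint16_t>(b << 8));
}

int main() {
    std::float16_t x = static_cast<std::float16_t>(1.375f);
    uint8_t q = fp16_to_fp8_e5m2(x);
    std::printf("FP16 %f -> FP8 E5M2 0x%02X -> FP16 %f\n",
                static_cast<float>(x), static_cast<unsigned>(q),
                static_cast<float>(fp8_e5m2_to_fp16(q)));
    return 0;
}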