Aussie AI

Floating-Point Quantization

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.

The precision levels for different types of floating-point quantization include:

FP16 quantization: This uses 16-bit floating-point numbers instead of 32-bit numbers, with the IEEE standard 5-bit exponent and 10-bit mantissa. Commonly used.
BF16 quantization: This uses the “brain float 16” format, which has a larger 8-bit exponent but a smaller 7-bit mantissa. Used commercially, especially with Google TPUs.
FP8 quantization: 8-bit floating-point. Mostly a research area.
FP4 quantization: 4-bit floating-point. Occasionally used in research papers.
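
The bit widths in the list above translate directly into weight storage. As a rough sketch of that arithmetic, the helper below (a hypothetical name, not from the book) computes the bytes needed for a given number of weights, assuming the weights are packed tightly and ignoring any extra metadata such as scaling factors:

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): bytes needed to store
// num_weights weights at bits_per_weight bits each, assuming tight
// packing and no per-tensor metadata.
inline std::uint64_t weight_storage_bytes(std::uint64_t num_weights,
                                          unsigned bits_per_weight) {
    // Round up to a whole number of bytes.
    return (num_weights * bits_per_weight + 7) / 8;
}
```

For an assumed model with 7 billion weights, this gives 28 billion bytes at 32 bits, 14 billion at 16 bits (FP16 or BF16), 7 billion at 8 bits, and 3.5 billion at 4 bits.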

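Note that there's no such thing as FP32 quantization. Why? Because FP32 is usually the starting point: models are typically trained and stored with 32-bit floating-point weights, so keeping them in FP32 reduces nothing.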

The most straightforward quantization is to reduce 32-bit floating-point (4 bytes) to 16-bit floating-point (2 bytes). This halves the memory needed to store the weights, usually with little reduction in model accuracy. All operations in MatMuls are still done in floating-point arithmetic.
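
To make the 4-byte to 2-byte reduction concrete, here is a minimal sketch of a bit-level FP32 to FP16 conversion and back. The function names are hypothetical, and the code makes simplifying assumptions that are flagged in the comments: values too small for a normal FP16 number are flushed to zero, and values too large become infinity. A production engine would more likely rely on hardware or compiler support for half-precision types.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: convert FP32 to FP16 bits (1 sign, 5 exponent, 10 mantissa).
// Simplification: results below the smallest normal FP16 value flush to zero.
inline std::uint16_t float_to_fp16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));            // safe type-punning
    std::uint32_t sign = (bits >> 16) & 0x8000u;     // sign moved to bit 15
    std::uint32_t exp32 = (bits >> 23) & 0xFFu;      // biased FP32 exponent
    std::uint32_t mant = bits & 0x007FFFFFu;         // 23-bit mantissa
    if (exp32 == 0xFFu) {                            // infinity or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mant ? 0x0200u : 0u));
    }
    std::int32_t e = static_cast<std::int32_t>(exp32) - 127 + 15;  // re-bias
    if (e >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u); // overflow
    if (e <= 0) return static_cast<std::uint16_t>(sign);            // flush to zero
    std::uint32_t half = (static_cast<std::uint32_t>(e) << 10) | (mant >> 13);
    std::uint32_t rem = mant & 0x1FFFu;              // the 13 discarded bits
    // Round to nearest, ties to even; a carry may bump the exponent.
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) ++half;
    return static_cast<std::uint16_t>(sign | half);
}

// Hypothetical helper: expand FP16 bits back to FP32 (always exact).
inline float fp16_to_float(std::uint16_t h) {
    std::uint32_t sign = (static_cast<std::uint32_t>(h) & 0x8000u) << 16;
    std::uint32_t e = (h >> 10) & 0x1Fu;
    std::uint32_t mant = h & 0x03FFu;
    if (e == 0) {                                    // zero or FP16 subnormal
        float v = std::ldexp(static_cast<float>(mant), -24);
        return sign ? -v : v;
    }
    std::uint32_t bits = (e == 31)
        ? (sign | 0x7F800000u | (mant << 13))        // infinity or NaN
        : (sign | ((e - 15 + 127) << 23) | (mant << 13));
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Applied to each weight, `float_to_fp16` produces the 2-byte values that get stored, and `fp16_to_float` recovers an approximation of the original weight when FP32 arithmetic is needed.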

The classic form of floating-point quantization is often abbreviated as FP16. There is also “bfloat16”, which keeps the 8-bit exponent width of FP32 and shortens the mantissa instead. Quantization from high-precision 32-bit floating-point weights (usually abbreviated “FP32” or “float32”) to lower-precision 16-bit floating-point (usually abbreviated “FP16” or “float16”) can yield significant benefits, often with little loss of accuracy.
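
Because bfloat16 has the same sign and 8-bit exponent layout as FP32, a bfloat16 value is effectively the upper 16 bits of the FP32 bit pattern. The sketch below illustrates this under that assumption; the function names are hypothetical, and it shows both plain truncation and a round-to-nearest-even variant:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: FP32 to bfloat16 by dropping the low 16 mantissa bits.
inline std::uint16_t float_to_bf16_truncate(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: FP32 to bfloat16 with round-to-nearest-even.
inline std::uint16_t float_to_bf16_round(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bool is_nan = (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
    if (is_nan) {
        // Keep NaN a NaN by forcing a mantissa bit that survives the shift.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    std::uint32_t lsb = (bits >> 16) & 1u;   // lowest bit that will be kept
    bits += 0x7FFFu + lsb;                   // rounding bias; ties go to even
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: bfloat16 back to FP32 by padding with 16 zero bits.
inline float bf16_to_float(std::uint16_t b) {
    std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

The conversion needs no exponent re-biasing, unlike FP16. The trade-off is precision: with only 7 mantissa bits, a value such as 3.14159 is stored as 3.140625.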

An even more reduced size is FP8 quantization, which uses 8-bit floating-point numbers. FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Weirdly, there's even FP4 quantization with a compressed representation of floating-point numbers in 4 bits.
