Types of Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Quantization is usually separated into two main categories:

  • Post-Training Quantization (PTQ). A model is quantized after it has been trained, typically to speed up inference.
  • Quantization-Aware Training (QAT). Quantization is applied during model training, so that the weights adapt to the reduced precision.

An important distinction is whether “weights” or “activations” are quantized.

  • Weights quantized: weight-only quantization.
  • Weights and activations quantized: weight-and-activation quantization.

Weights are the static parameters stored in the model file. Activations are the dynamic values computed during inference: the vectors of numbers that represent the outputs of neurons after their activation functions. Much of the early quantization research focused on weight-only quantization, where the weights might be quantized to FP16, INT8, or smaller, but the activation computations would still be done in FP32.
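As a concrete illustration, here is a minimal C++ sketch of weight-only INT8 quantization. The function names and the simple max-absolute scaling scheme are assumptions for illustration, not a specific library's API: the weights are stored as INT8 with one FP32 scale factor, and the dot product dequantizes each weight on the fly, so the arithmetic stays in FP32.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Quantize FP32 weights to INT8 with one max-absolute scale factor,
    // mapping the largest-magnitude weight to 127 (illustrative scheme).
    std::vector<int8_t> quantize_weights_int8(const std::vector<float>& w,
                                              float& scale)
    {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
        std::vector<int8_t> q(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            q[i] = (int8_t)std::lround(w[i] / scale);
        return q;
    }

    // Weight-only inference step: weights are INT8, activations stay FP32.
    float dot_product_weight_only(const std::vector<int8_t>& qw, float scale,
                                  const std::vector<float>& act)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < qw.size(); ++i)
            sum += (scale * qw[i]) * act[i];  // dequantize, then FP32 multiply-add
        return sum;
    }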

Quantization granularity specifies which sets of numbers are quantized together, and which groups of model parameters share the same quantization parameters, such as the scale factor. Common granularity levels include (a sketch contrasting per-tensor and per-channel scale factors follows the list):

  • Per-layer quantization
  • Per-tensor quantization
  • Per-channel quantization
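For example, here is a hedged C++ sketch of the difference between one scale factor for a whole tensor versus one scale factor per output channel. It assumes INT8 quantization, a row-major weight matrix, and max-absolute scaling; the helper names are illustrative.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Per-tensor granularity: one scale factor for the whole weight matrix.
    float per_tensor_scale(const std::vector<float>& w)
    {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        return maxabs / 127.0f;  // INT8 range assumed
    }

    // Per-channel granularity: one scale factor per output channel
    // (i.e., per row of the row-major weight matrix).
    std::vector<float> per_channel_scales(const std::vector<float>& w,
                                          int rows, int cols)
    {
        std::vector<float> scales(rows);
        for (int r = 0; r < rows; ++r) {
            float maxabs = 0.0f;
            for (int c = 0; c < cols; ++c)
                maxabs = std::max(maxabs, std::fabs(w[r * cols + c]));
            scales[r] = maxabs / 127.0f;
        }
        return scales;
    }

Finer granularity tracks the value distribution of each channel more closely, which usually improves accuracy, at the cost of storing and applying more scale factors.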

Quantization algorithms are also distinguished by their scaling method: the “scale factor” by which floating-point numbers are mapped to a smaller range of numbers. Several types include (the symmetric and asymmetric cases are sketched after the list):

  • Uniform scaling (uniform quantization)
  • Uniform affine quantization
  • Symmetric uniform quantization
  • Non-uniform quantization
  • Power-of-two quantization
  • Asymmetric quantization
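As an illustration of the two most common uniform schemes, here is a minimal C++ sketch; the INT8 ranges and clamping choices are assumptions for illustration. Symmetric quantization maps zero to zero, while asymmetric (affine) quantization adds a zero-point offset so that a skewed value range can use the full set of quantized values.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Symmetric uniform quantization: zero maps to zero; range [-127, 127].
    int8_t quantize_symmetric(float x, float scale)
    {
        long q = std::lround(x / scale);
        return (int8_t)std::clamp(q, -127L, 127L);
    }

    // Asymmetric (affine) uniform quantization: a zero-point offset shifts
    // the mapping so a skewed range [min, max] can use all 256 INT8 values.
    int8_t quantize_affine(float x, float scale, int zero_point)
    {
        long q = std::lround(x / scale) + (long)zero_point;
        return (int8_t)std::clamp(q, -128L, 127L);
    }

    // Dequantization reverses the affine mapping.
    float dequantize_affine(int8_t q, float scale, int zero_point)
    {
        return scale * (float)((int)q - zero_point);
    }

Power-of-two quantization further restricts the scale factor to a power of two, so that the scaling division can be replaced by a bitshift.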

As was discussed briefly at the start of this chapter, there are several different technical types of quantization based on the data types used. The major categories are:

  • Floating-point quantization
  • Integer quantization
  • Mixed-precision quantization (or simply “mixed quantization”): Refers to more finely grained quantization where different parts of the model are quantized to different bit widths, as sketched below.
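A minimal sketch of what a mixed-precision configuration might look like in C++; the Precision enum, the per-layer layout, and the policy of keeping the first and last layers at higher precision are all illustrative assumptions, not a prescribed scheme.

    #include <vector>

    // Hypothetical per-layer precision table for mixed-precision quantization:
    // each layer is assigned its own bit width for weights and activations.
    enum class Precision { FP32, FP16, INT8, INT4 };

    struct LayerConfig {
        Precision weights;
        Precision activations;
    };

    // Illustrative policy: keep the (often sensitive) first and last layers
    // at higher precision, and quantize the middle layers more aggressively.
    std::vector<LayerConfig> make_mixed_config(int num_layers)  // num_layers >= 2
    {
        std::vector<LayerConfig> cfg(num_layers, { Precision::INT4, Precision::INT8 });
        cfg.front() = { Precision::INT8, Precision::FP16 };
        cfg.back()  = { Precision::INT8, Precision::FP16 };
        return cfg;
    }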

And some more quantization terminology:

  • Extreme quantization: Usually refers to binary quantization (1-bit quantization).
  • Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
  • Fake quantization (or “simulated quantization”): Refers somewhat unkindly to integer quantization whose main goal is the storage space reduction of a smaller model file, where the actual arithmetic is still performed as floating-point multiplications, rather than the “real quantization” of integer-only-arithmetic quantization. Equivalently, this is weight-only quantization to an integer type, where the activations are still computed in FP32; the weight-only dot product sketched earlier in this section is exactly this case, and an integer-only contrast follows below.
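For contrast with the “fake” weight-only dot product sketched earlier, here is a minimal sketch of integer-only arithmetic, assuming the activations have also been quantized to INT8 with their own scale factor; the helper name and scaling scheme are illustrative.

    #include <cstdint>
    #include <vector>

    // "Real" integer-only dot product: INT8 x INT8 multiplications
    // accumulate in INT32, with one floating-point rescale at the end.
    float dot_product_integer_only(const std::vector<int8_t>& qw, float wscale,
                                   const std::vector<int8_t>& qact, float ascale)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < qw.size(); ++i)
            acc += (int32_t)qw[i] * (int32_t)qact[i];
        return (wscale * ascale) * (float)acc;  // combined scale applied once
    }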

 
