Types of Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Quantization is usually separated into two main categories:

  • Post-Training Quantization (PTQ). A model is quantized after it has been trained, typically to speed up inference.
  • Quantization-Aware Training (QAT). Quantization is applied during model training, so that the weights adapt to the reduced precision.

An important distinction is whether “weights” or “activations” are quantized.

  • Weights quantized: weight-only quantization.
  • Weights and activations quantized: weight-and-activation quantization.

Weights are the static parameters stored in the model file. Activations are the dynamic values computed during inference: the vectors of numbers that represent the outputs of neurons after their activation functions. Much of the early quantization research focused on weight-only quantization, where the weights might be quantized to FP16, INT8, or smaller, but the activation computations would still be done in FP32.
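As a concrete illustration, here is a minimal C++ sketch of weight-only INT8 quantization. The function names and the simple max-absolute scaling scheme are assumptions for illustration, not a specific library's API: the weights are stored as INT8 with one FP32 scale factor, and the dot product dequantizes each weight on the fly, so the arithmetic stays in FP32.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Quantize FP32 weights to INT8 with one max-absolute scale factor,
    // mapping the largest-magnitude weight to 127 (illustrative scheme).
    std::vector<int8_t> quantize_weights_int8(const std::vector<float>& w,
                                              float& scale)
    {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
        std::vector<int8_t> q(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            q[i] = (int8_t)std::lround(w[i] / scale);
        return q;
    }

    // Weight-only inference step: weights are INT8, activations stay FP32.
    float dot_product_weight_only(const std::vector<int8_t>& qw, float scale,
                                  const std::vector<float>& act)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < qw.size(); ++i)
            sum += (scale * qw[i]) * act[i];  // dequantize, then FP32 multiply-add
        return sum;
    }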

Quantization granularity specifies which sets of numbers are quantized together, and which groups of model parameters share the same quantization parameters, such as the scale factor. Common granularity levels include (a sketch contrasting per-tensor and per-channel scale factors follows the list):

  • Per-layer quantization
  • Per-tensor quantization
  • Per-channel quantization
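For example, here is a hedged C++ sketch of the difference between one scale factor for a whole tensor versus one scale factor per output channel. It assumes INT8 quantization, a row-major weight matrix, and max-absolute scaling; the helper names are illustrative.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Per-tensor granularity: one scale factor for the whole weight matrix.
    float per_tensor_scale(const std::vector<float>& w)
    {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        return maxabs / 127.0f;  // INT8 range assumed
    }

    // Per-channel granularity: one scale factor per output channel
    // (i.e., per row of the row-major weight matrix).
    std::vector<float> per_channel_scales(const std::vector<float>& w,
                                          int rows, int cols)
    {
        std::vector<float> scales(rows);
        for (int r = 0; r < rows; ++r) {
            float maxabs = 0.0f;
            for (int c = 0; c < cols; ++c)
                maxabs = std::max(maxabs, std::fabs(w[r * cols + c]));
            scales[r] = maxabs / 127.0f;
        }
        return scales;
    }

Finer granularity tracks the value distribution of each channel more closely, which usually improves accuracy, at the cost of storing and applying more scale factors.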

Quantization algorithms are also distinguished by their scaling method: the “scale factor” by which floating-point numbers are mapped to a smaller range of numbers. Several types include (the symmetric and asymmetric cases are sketched after the list):

  • Uniform scaling (uniform quantization)
  • Uniform affine quantization
  • Symmetric uniform quantization
  • Non-uniform quantization
  • Power-of-two quantization
  • Asymmetric quantization
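As an illustration of the two most common uniform schemes, here is a minimal C++ sketch; the INT8 ranges and clamping choices are assumptions for illustration. Symmetric quantization maps zero to zero, while asymmetric (affine) quantization adds a zero-point offset so that a skewed value range can use the full set of quantized values.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Symmetric uniform quantization: zero maps to zero; range [-127, 127].
    int8_t quantize_symmetric(float x, float scale)
    {
        long q = std::lround(x / scale);
        return (int8_t)std::clamp(q, -127L, 127L);
    }

    // Asymmetric (affine) uniform quantization: a zero-point offset shifts
    // the mapping so a skewed range [min, max] can use all 256 INT8 values.
    int8_t quantize_affine(float x, float scale, int zero_point)
    {
        long q = std::lround(x / scale) + (long)zero_point;
        return (int8_t)std::clamp(q, -128L, 127L);
    }

    // Dequantization reverses the affine mapping.
    float dequantize_affine(int8_t q, float scale, int zero_point)
    {
        return scale * (float)((int)q - zero_point);
    }

Power-of-two quantization further restricts the scale factor to a power of two, so that the scaling division can be replaced by a bitshift.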

As was discussed briefly at the start of this chapter, there are several different technical types of quantization based on the data types used. The major categories are:

  • Floating-point quantization
  • Integer quantization
  • Mixed-precision quantization (or simply “mixed quantization”): Refers to more finely grained quantization where different parts of the model are quantized to different bit widths, as sketched below.
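A minimal sketch of what a mixed-precision configuration might look like in C++; the Precision enum, the per-layer layout, and the policy of keeping the first and last layers at higher precision are all illustrative assumptions, not a prescribed scheme.

    #include <vector>

    // Hypothetical per-layer precision table for mixed-precision quantization:
    // each layer is assigned its own bit width for weights and activations.
    enum class Precision { FP32, FP16, INT8, INT4 };

    struct LayerConfig {
        Precision weights;
        Precision activations;
    };

    // Illustrative policy: keep the (often sensitive) first and last layers
    // at higher precision, and quantize the middle layers more aggressively.
    std::vector<LayerConfig> make_mixed_config(int num_layers)  // num_layers >= 2
    {
        std::vector<LayerConfig> cfg(num_layers, { Precision::INT4, Precision::INT8 });
        cfg.front() = { Precision::INT8, Precision::FP16 };
        cfg.back()  = { Precision::INT8, Precision::FP16 };
        return cfg;
    }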

And some more quantization terminology:

  • Extreme quantization: Usually refers to binary quantization (1-bit quantization).
  • Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
  • Fake quantization (or “simulated quantization”): Refers somewhat unkindly to integer quantization whose main goal is the storage space reduction of a smaller model file, where the actual arithmetic is still performed as floating-point multiplications, rather than the “real quantization” of integer-only-arithmetic quantization. Equivalently, this is weight-only quantization to an integer type, where the activations are still computed in FP32; the weight-only dot product sketched earlier in this section is exactly this case, and an integer-only contrast follows below.
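For contrast with the “fake” weight-only dot product sketched earlier, here is a minimal sketch of integer-only arithmetic, assuming the activations have also been quantized to INT8 with their own scale factor; the helper name and scaling scheme are illustrative.

    #include <cstdint>
    #include <vector>

    // "Real" integer-only dot product: INT8 x INT8 multiplications
    // accumulate in INT32, with one floating-point rescale at the end.
    float dot_product_integer_only(const std::vector<int8_t>& qw, float wscale,
                                   const std::vector<int8_t>& qact, float ascale)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < qw.size(); ++i)
            acc += (int32_t)qw[i] * (int32_t)qact[i];
        return (wscale * ascale) * (float)acc;  // combined scale applied once
    }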

 
