Aussie AI

32. Quantization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Being so many different sizes in a day is very confusing.”

— Lewis Carroll, Alice in Wonderland, 1865.

What is Quantization?

Quantization is an extremely popular method of model compression with a huge number of research papers, and it has been implemented in many modern inference engines. Generally, quantization has been very successful at reducing both inference compute time and storage space without a large hit to model accuracy, often achieving near floating-point accuracy.

Quantization changes model weights from a large, high-precision range (e.g. 32-bit floating-point) to a smaller range of numbers represented by a smaller data type. For example, quantization of 32-bit floating-point weights could reduce the data size to 16-bit floating-point, 16-bit integers, or even smaller types such as 8-bit integers (bytes) or single bits (binary numbers).

Smaller and faster. Quantization has many variants, but it generally pursues several goals:

  • Model compression — smaller models on disk and in memory.
  • Faster inference — reduced memory accesses and/or faster arithmetic.
  • Arithmetic reduction — lower-precision floating-point arithmetic or integer-only arithmetic.

A quantized model has the same number of weights, but they are stored in smaller data types. Every weight is still used in the inference computation, but the computations become more efficient.

Data sizes. Quantization can be used with either floating-point computations or integer arithmetic, and there are also some “mixed precision” types of quantization. The goal for most types is a combination of memory reduction (smaller models in memory) and faster computations.

A typical floating-point quantization would reduce weights from 32-bit floats (FP32) to 16-bit half-precision floating-point (FP16) or Google's “brain float” 16-bit (BF16). In either case, 16-bit floating-point arithmetic is faster and also saves memory, which also improves speed when inference is memory-bound (as is typical). There is also 8-bit floating-point (FP8) quantization, although it is primarily seen in research papers rather than industry.

Integer types. Integer quantization has many variations. Standard 32-bit floats can be quantized to 32-bit integers (INT32 quantization), which does not improve memory usage, but changes to integer arithmetic. Smaller integer types such as 16-bit short int (INT16 quantization) or 8-bit bytes (INT8 quantization) can be used. In fact, there are also low-bit integer quantizations of any number of bits from 16 down to 1 bit (called “binary quantization”).

You might assume that integer quantization would result in integer arithmetic. Not so fast! Most integer quantization to date has stored the weights as integers, but converted them back to floating-point (“de-quantized” in the jargon) at various points. Hence, there is a mixture of integer and floating-point arithmetic during inference.
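
As a rough illustration, here is a minimal C++ sketch (not from any particular engine) of symmetric INT8 quantization of a weight vector and the matching de-quantization back to float, assuming a simple absolute-maximum scale factor:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Quantize FP32 weights to INT8 using one symmetric scale factor.
    float quantize_int8(const std::vector<float>& w, std::vector<int8_t>& q) {
        float absmax = 0.0f;
        for (float x : w) absmax = std::max(absmax, std::fabs(x));
        float scale = (absmax == 0.0f) ? 1.0f : absmax / 127.0f;
        q.resize(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            q[i] = (int8_t)std::lround(w[i] / scale);  // round to nearest integer
        return scale;  // the scale factor is needed later to de-quantize
    }

    // De-quantize back to FP32 (the step that keeps the arithmetic in floating-point).
    void dequantize_int8(const std::vector<int8_t>& q, float scale, std::vector<float>& w) {
        w.resize(q.size());
        for (size_t i = 0; i < q.size(); ++i)
            w[i] = q[i] * scale;
    }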

However, there is a hot area of recent research that focuses on “integer-only arithmetic quantization” whereby de-quantization back to float is avoided by having all the other Transformer components also work in integer mode. This is actually quite difficult to achieve in practice for all of the many different components (e.g. activation functions, normalization, Softmax, etc.), and is a relatively recent optimization.

Types of Quantization

Quantization is usually separated into two main categories:

  • Post-Training Quantization (PTQ). This is where a pre-trained model is quantized for faster inference.
  • Quantization-Aware Training (QAT). This is the use of quantization during model training.

An important distinction is whether “weights” or “activations” are quantized.

  • Weights quantized — weight-only quantization.
  • Weights and activations quantized — weight-and-activation quantization.

Weights are the static parameters in the model file. Activations are the dynamic values computed into vectors at runtime, representing the results of the neurons after their activation functions. Much of the early quantization research focused on weight-only quantization, where the weights might be quantized to FP16, INT8, or smaller, but the activation computations would still be done in FP32.
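
For example, a weight-only quantized vector dot product might store its weights as INT8 but de-quantize each one on the fly, with the activations and the accumulator remaining in FP32. This is a simplified sketch under those assumptions, not production kernel code:

    #include <cstddef>
    #include <cstdint>

    // Dot product with INT8 weights and FP32 activations (weight-only quantization).
    // Each weight is de-quantized to float before the multiply-accumulate.
    float dot_product_w8a32(const int8_t* weights, float scale,
                            const float* activations, size_t n) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            float w = weights[i] * scale;   // de-quantize the weight
            sum += w * activations[i];      // FP32 multiply-accumulate
        }
        return sum;
    }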

Quantization granularity specifies which numbers are going to be quantized, and which sets of model parameters share the same quantization parameters (e.g. scale factors). Common levels of granularity include:

  • Per-layer quantization
  • Per-tensor quantization
  • Per-channel quantization

The scaling algorithm, or “scale factor,” by which floating-point numbers are mapped to a smaller range of numbers is another distinguishing feature of a quantization algorithm. Several types include (see the sketch after this list):

  • Uniform scaling (uniform quantization)
  • Uniform affine quantization
  • Symmetric uniform quantization
  • Non-uniform quantization
  • Power-of-two quantization
  • Asymmetric quantization
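
As a rough sketch of the difference between symmetric and asymmetric (affine) uniform scaling, the code below computes INT8 quantization parameters both ways. The exact formulas vary between frameworks, so treat this as illustrative only:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct QuantParams {
        float scale;      // step size between adjacent quantized values
        int zero_point;   // integer code that represents the real value 0.0
    };

    // Symmetric: the range [-absmax, +absmax] maps onto [-127, +127], zero_point = 0.
    QuantParams symmetric_params(float absmax) {
        QuantParams p;
        p.scale = (absmax == 0.0f) ? 1.0f : absmax / 127.0f;
        p.zero_point = 0;
        return p;
    }

    // Asymmetric (affine): the range [minval, maxval] maps onto [0, 255],
    // with a zero-point chosen so that 0.0 is exactly representable.
    QuantParams affine_params(float minval, float maxval) {
        QuantParams p;
        float range = maxval - minval;
        p.scale = (range == 0.0f) ? 1.0f : range / 255.0f;
        p.zero_point = std::clamp((int)std::lround(-minval / p.scale), 0, 255);
        return p;
    }

Uniform quantization uses equally spaced quantization steps (a constant scale), whereas non-uniform quantization uses unevenly spaced levels.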

As was discussed briefly at the start of this chapter, there are several different technical types of quantization based on the data types used. The major categories are:

  • Floating-point quantization
  • Integer quantization
  • Mixed-precision quantization (or simply “mixed quantization”): Refers to finer-grained quantization where different parts of the model are quantized to different numbers of bits.

And some more quantization terminology:

  • Extreme quantization: Usually refers to binary quantization (1-bit quantization).
  • Low-bit quantization: Usually means binary, ternary, or at most 4-bit quantization.
  • Fake quantization (or “simulated quantization”): Refers somewhat unkindly to integer quantization whose main goal is storage space reduction from a smaller model size, but where the actual arithmetic is still performed as floating-point multiplication, rather than the “real quantization” of integer-only-arithmetic quantization. Equivalently, this is weight-only quantization to an integer type, where the activations are still computed in FP32.

Floating-Point Quantization

The precision levels for different types of floating-point quantization include:

  • FP16 quantization: This uses 16-bit floating-point numbers instead of 32-bit numbers, with the IEEE standard 5-bit exponent. Commonly used.
  • BF16 quantization: This uses “brain float 16” format which has a larger 8-bit exponent, but a smaller mantissa. Used commercially, especially with Google TPUs.
  • FP8 quantization: 8-bit floating-point. Mostly a research area.
  • FP4 quantization: 4-bit floating-point. Occasionally used in research papers.

Note that there's no such thing as FP32 quantization. Why?

The most straightforward quantization is to reduce 32-bit floating-point (4 bytes) to 16-bit floating-point (2 bytes). This halves the memory storage requirements, and does not suffer much reduction in model accuracy. All operations in MatMuls are still done in floating-point arithmetic.

The classic form of floating-point quantization is often abbreviated as FP16. There is also “bfloat16”, which uses a different representation of numbers. Quantization from high-precision 32-bit floating-point weights (usually abbreviated “FP32” or “float32”) to lower-precision 16-bit floating-point (usually abbreviated “FP16” or “float16”) can yield significant benefits, often without much loss of accuracy.
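
Conversion from FP32 to BF16 is particularly simple because BF16 keeps the same 8-bit exponent as FP32 and simply drops the bottom 16 mantissa bits. Here is a minimal truncation sketch (real implementations usually add round-to-nearest-even); FP16 conversion is more involved because of its smaller 5-bit exponent:

    #include <cstdint>
    #include <cstring>

    // Truncate an FP32 value to BF16 by keeping only its top 16 bits.
    uint16_t fp32_to_bf16_truncate(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));   // safe type-punning via memcpy
        return (uint16_t)(bits >> 16);          // sign, 8-bit exponent, 7 mantissa bits
    }

    // Expand a BF16 value back to FP32 by zero-filling the low 16 bits.
    float bf16_to_fp32(uint16_t h) {
        uint32_t bits = ((uint32_t)h) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }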

An even more reduced size is FP8 quantization, which uses 8-bit floating-point numbers. FP8 quantization hasn't caught on in the AI industry as much as FP16 or integer quantization methods, but there are plenty of papers. Weirdly, there's even FP4 quantization with a compressed representation of floating-point numbers in 4 bits.

Integer Quantization

Integer quantization of AI models is a long-standing area of research, with much literature. Industry practice routinely uses INT32, INT16, or INT8 for quantization. Surprisingly, even INT4 does very well, with an eight-fold size reduction while often retaining accuracy above 90% of the original 32-bit model.

The main high-accuracy integer quantization subtypes include:

  • 32-bit integer quantization (INT32): Not any smaller than FP32, so this doesn't save space, but allows integer arithmetic operations.
  • 16-bit integer quantization (INT16): The weights are all “short” types. Smaller and faster.
  • 8-bit integer quantization (INT8): This uses a single 8-bit byte per weight. This level of quantization is commonly used, such as in open source models. Weights are scaled to either -128 to 127 (signed), or 0 to 255 (unsigned bytes).

Low-bit quantization types include:

  • 4-bit quantization (INT4): Another popular size for quantization is a “nibble” (4 bits), giving 2^4=16 distinct weight values. This seems quite drastic, but it works surprisingly well given its low bitwidth, and is commonly used (see the packing sketch after this list).
  • 3-bit quantization (INT3): This uses 2^3=8 distinct weights.
  • 2-bit quantization (INT2): There are 4 distinct weights. Not commonly used.
  • Ternary quantization: This is quantization with 3 weight values, usually -1, 0, and +1. It needs 2 bits of storage, but uses only 3 of the 4 possible values. Suffers accuracy problems.
  • Binary quantization: This is 1-bit quantization with 2 possible weights (usually 0 and 1, or -1 and +1). Not highly accurate.
  • 0-bit quantization: Good luck with this algorithm.
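
Low-bit quantization also requires bit packing, because C++ has no native 4-bit type. A common trick is to pack two 4-bit weight codes into one byte; this is a simplified sketch of packing and unpacking (scale-factor handling is omitted):

    #include <cstdint>

    // Pack two unsigned 4-bit weight codes (0..15) into a single byte.
    uint8_t pack_nibbles(uint8_t lo, uint8_t hi) {
        return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
    }

    // Unpack the two 4-bit codes again.
    void unpack_nibbles(uint8_t packed, uint8_t& lo, uint8_t& hi) {
        lo = packed & 0x0F;
        hi = (packed >> 4) & 0x0F;
    }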

Some of the different types of advanced integer quantization algorithms include:

  • Integer-only-arithmetic quantization. This refers to quantization where the actual arithmetic performed during inference is all integer multiplications. This is distinct from the rather unkindly named “fake quantization”, which is quantization where the integers are “de-quantized” back to floating-point before using floating-point multiplication in inference calculations. Integer-only-arithmetic quantization aims to improve both speed and space, whereas integer quantization without integer-only arithmetic still reduces model size and storage space, but improves execution speed less fully (latency is still somewhat improved due to reduced memory-related activity).
  • Logarithmic Bitshift Quantization (Power-of-Two Quantization). This is where the weights are all powers of 2, so that faster integer bitshifts are used instead of integer multiplication. A generalization is “Sum of Two Bitshifts Quantization” which uses multiple bitshifts added together.
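
For instance, if a weight is a power of two, multiplying an integer activation by that weight reduces to a left bitshift by the exponent. This is a minimal sketch of the idea with assumed helper types (sign handling shown, scale factors omitted):

    #include <cstdint>

    // Weight stored in power-of-two form: value = sign * 2^exponent.
    struct Pow2Weight {
        int8_t exponent;   // shift count
        int8_t sign;       // +1 or -1
    };

    // Multiply an integer activation by a power-of-two weight using a bitshift.
    int32_t pow2_multiply(int32_t activation, Pow2Weight w) {
        // Shift via unsigned to avoid undefined behavior for negative activations.
        int32_t shifted = (int32_t)((uint32_t)activation << w.exponent);
        return (w.sign < 0) ? -shifted : shifted;
    }

    // "Sum of two bitshifts" generalization: value = sign * (2^e1 + 2^e2).
    int32_t sum2_pow2_multiply(int32_t activation, int e1, int e2, int sign) {
        int32_t shifted = (int32_t)(((uint32_t)activation << e1)
                                  + ((uint32_t)activation << e2));
        return (sign < 0) ? -shifted : shifted;
    }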

Integer-Only-Arithmetic Quantization

Integer-only-arithmetic quantization is integer quantization where only integer arithmetic is performed during inference. It is a common but false assumption that this is true of all integer quantization algorithms. Several types of integer quantization may store weights as quantized integers, but then de-quantize them back to floating-point at various points (even for weight multiplication in some algorithms). Methods that strictly restrict arithmetic to avoid floating-point operations are more precisely named “integer-only-arithmetic quantization algorithms”.

Even these integer-only quantization algorithms may still have floating-point computations in some components of the Transformer. Methods that also fully quantize non-linear components to integers, such as Softmax and normalization components, are called “end-to-end integer Transformers.”
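
In contrast with the weight-only sketch earlier, an integer-only-arithmetic kernel keeps the whole inner loop in integers, typically multiplying INT8 values into an INT32 accumulator, with any re-scaling applied once to the result. A simplified sketch, not drawn from any specific engine:

    #include <cstddef>
    #include <cstdint>

    // Integer-only dot product: INT8 weights and INT8 activations,
    // accumulated in INT32 with no floating-point inside the loop.
    int32_t dot_product_int8(const int8_t* weights, const int8_t* activations, size_t n) {
        int32_t acc = 0;
        for (size_t i = 0; i < n; ++i)
            acc += (int32_t)weights[i] * (int32_t)activations[i];
        return acc;   // the caller re-scales this once (e.g. with a fixed-point multiplier)
    }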

Uncommon Quantization Types

There are a few other obscure types of quantization, which live happy and meaningful lives on the creaking shelves of university libraries, buried deep inside musty volumes of research journals.

  • Stochastic quantization: This is a method of intentionally introducing some non-determinism and randomness into quantization algorithms with the goal of increased inference accuracy.
  • Dyadic quantization: This is an uncommon quantization method using dyadic numbers, which are a mathematical representation of numbers as rational quotients where the numerator is an integer, but the denominator is always a power of two (allowing bitshifts); see the sketch after this list.
  • Fixed-point quantization: Uses fixed-point number formats rather than floating-point.
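
For example, multiplying by the dyadic number 3/8 (numerator 3, denominator 2^3) needs only one integer multiplication and one right bitshift, which is also the basic trick behind fixed-point arithmetic. A small illustrative sketch:

    #include <cstdint>

    // Multiply x by the dyadic rational (numerator / 2^shift) using integers only.
    // Assumes an arithmetic right shift for negative values (two's-complement targets).
    int32_t dyadic_multiply(int32_t x, int32_t numerator, int shift) {
        int64_t product = (int64_t)x * numerator;   // widen to avoid overflow
        return (int32_t)(product >> shift);         // divide by 2^shift via bitshift
    }

For instance, dyadic_multiply(x, 3, 3) computes approximately x * 3/8, with the result truncated to an integer.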

 

Next: Chapter 33. Pruning

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs
