Aussie AI
What is Quantization?
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
What is Quantization?
Quantization is an extremely popular method of model compression that has been the subject of a huge number of research papers and has been implemented in many modern inference engines. Generally, quantization has been very successful at reducing both inference compute time and storage space without a huge hit to model accuracy, often achieving near floating-point accuracy.
Quantization changes model weights from a large, high-precision range (e.g. 32-bit floating-point) to a smaller range of numbers by using a smaller data type. For example, quantization of 32-bit floating-point weights could reduce the data size to 16-bit floating-point, 16-bit integers, or even smaller types such as 8-bit integers (bytes) or single bits (binary numbers).
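To make this concrete, here is a minimal C++ sketch of symmetric INT8 quantization using a single scale factor. The function name and the simple one-scale scheme are illustrative assumptions; production engines typically use per-channel scales, zero-points, and calibration data.

    // Minimal sketch of symmetric INT8 weight quantization (illustrative only).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Quantize FP32 weights to INT8 using a single shared scale factor.
    std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float& scale_out)
    {
        float maxabs = 0.0f;
        for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
        float scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;  // map [-maxabs, +maxabs] to [-127, +127]
        scale_out = scale;

        std::vector<int8_t> q(weights.size());
        for (size_t i = 0; i < weights.size(); ++i) {
            int v = (int)std::lround(weights[i] / scale);
            v = std::clamp(v, -127, 127);   // clamp to the INT8 range
            q[i] = (int8_t)v;
        }
        return q;
    }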
Smaller and faster. Quantization has many variants, but generally pursues several goals:
- Model compression — smaller models on disk and in memory.
- Faster inference — reduced memory accesses and/or faster arithmetic.
- Arithmetic reduction — lower-precision floating-point arithmetic or integer-only arithmetic.
A quantized model has the same number of weights, but they are stored in smaller data types. Every weight is still used in the inference computation, but the arithmetic is performed in a more efficient form.
Data sizes. Quantization can be used with either floating-point computations or integer arithmetic, and there are also some “mixed precision” types of quantization. The goal for most types is a combination of memory reduction (smaller models in memory) and faster computations.
A typical floating-point quantization would reduce weights from 32-bit floats (FP32) to 16-bit half-precision floating-point (FP16) or Google's “brain float” 16-bit (BF16). In either case, 16-bit floating-point arithmetic is faster and saves memory, which further improves speed when inference is memory-bound (as is typical). There is also 8-bit floating-point (FP8) quantization, although it is primarily seen in research papers rather than industry.
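As a rough sketch of how the BF16 conversion works, the 16-bit brain float format keeps the sign, exponent, and top mantissa bits of an FP32 number, so a simple (if naive) conversion just keeps the high 16 bits of the bit pattern. The helper functions below are illustrative; real implementations usually add round-to-nearest-even rather than plain truncation.

    // Minimal sketch of FP32 <-> BF16 conversion by truncation.
    #include <cstdint>
    #include <cstring>

    uint16_t fp32_to_bf16(float f)
    {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the float's bit pattern
        return (uint16_t)(bits >> 16);         // keep sign, exponent, and top 7 mantissa bits
    }

    float bf16_to_fp32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;     // restore as FP32 with zeroed low mantissa bits
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }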
Integer types. Integer quantization has many variations. Standard 32-bit floats can be quantized to 32-bit integers (INT32 quantization), which does not reduce memory usage but switches to integer arithmetic. Smaller integer types such as 16-bit short integers (INT16 quantization) or 8-bit bytes (INT8 quantization) can also be used. In fact, there are low-bit integer quantizations of any number of bits, from 16 down to a single bit (called “binary quantization”).
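As an illustration of the extreme low-bit case, here is a minimal sketch of one common binary quantization scheme, which keeps only the sign of each weight plus a single shared scale (the mean absolute value). The function name and the particular scheme are assumptions for illustration.

    // Minimal sketch of 1-bit ("binary") quantization: each weight is reduced to
    // its sign, with one shared scale factor preserved for the whole vector.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    void quantize_binary(const std::vector<float>& weights,
                         std::vector<int8_t>& signs,   // +1 or -1 per weight
                         float& scale)                 // shared magnitude (mean absolute value)
    {
        double sum = 0.0;
        signs.resize(weights.size());
        for (size_t i = 0; i < weights.size(); ++i) {
            signs[i] = (weights[i] >= 0.0f) ? +1 : -1;
            sum += std::fabs(weights[i]);
        }
        scale = weights.empty() ? 0.0f : (float)(sum / weights.size());
    }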
You might assume that integer quantization would result in integer arithmetic. Not so fast! Typical integer quantization to date has stored integer weights, but they were usually converted back to floating-point (“de-quantized” in the jargon) at various points. Hence, there was a mixture of integer and floating-point arithmetic during inference.
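A minimal sketch of this mixed arithmetic is shown below, assuming INT8-quantized weights and floating-point activations: each weight is de-quantized back to float inside the inner loop, so the multiply-accumulate is still floating-point. The function name is illustrative.

    // Minimal sketch of the "de-quantize" step during inference: INT8 weights
    // are converted back to float, so the arithmetic is still floating-point.
    #include <cstddef>
    #include <cstdint>

    float dot_product_dequant(const int8_t* qweights, float scale,
                              const float* activations, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            float w = scale * (float)qweights[i];  // de-quantize the weight to float
            sum += w * activations[i];             // floating-point multiply-accumulate
        }
        return sum;
    }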
However, there is a hot area of recent research that focuses on “integer-only arithmetic quantization” whereby de-quantization back to float is avoided by having all the other Transformer components also work in integer mode. This is actually quite difficult to achieve in practice for all of the many different components (e.g. activation functions, normalization, Softmax, etc.), and is a relatively recent optimization.
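As a rough sketch of the integer-only idea, the vector dot product below accumulates INT8 products into an INT32 accumulator and then “requantizes” the result back to INT8 using a pre-computed fixed-point multiplier and shift instead of a floating-point scale. The function and parameter names are illustrative, and real engines derive the multiplier and shift from calibrated layer scales.

    // Minimal sketch of integer-only arithmetic: INT8 x INT8 multiply-accumulate
    // into an INT32 accumulator, then integer requantization back to INT8.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    int8_t dot_product_int_only(const int8_t* qweights, const int8_t* qactivations, size_t n,
                                int32_t multiplier, int shift)  // fixed-point output scale
    {
        int32_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += (int32_t)qweights[i] * (int32_t)qactivations[i];  // integer MAC
        }
        // Requantize: scale the INT32 accumulator down to INT8 using integer ops only.
        int64_t scaled = ((int64_t)acc * multiplier) >> shift;
        return (int8_t)std::clamp<int64_t>(scaled, -128, 127);
    }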