Aussie AI
Integer Quantization
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Integer Quantization
Integer quantization of AI models is a long-standing area of research with an extensive literature. Industry practice routinely uses INT32, INT16, or INT8 for quantization. Surprisingly, even INT4 does very well, giving an eight-fold size reduction while often retaining accuracy above 90% of the original 32-bit model.
The main high-accuracy integer quantization subtypes include:
- 32-bit integer quantization (INT32): Not any smaller than FP32, so this doesn't save space, but allows integer arithmetic operations.
- 16-bit integer quantization (INT16): The weights are all “short” types. Smaller and faster.
- 8-bit integer quantization (INT8): This uses single 8-bit bytes for quantization. This level of quantization is commonly used, such as in open source models. Weights are scaled to either -128 to 127 (signed bytes) or 0 to 255 (unsigned bytes); see the sketch after this list.
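For illustration, here is a minimal C++ sketch of symmetric per-tensor INT8 quantization with a single scaling factor. The structure and function names are assumptions for this example, not code from any particular library.

// Minimal sketch of symmetric per-tensor INT8 quantization.
// Names and structure are illustrative, not library code.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> data;  // quantized weights in [-127, +127]
    float scale;               // multiply by this to recover approximate FP32
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    // Use the largest magnitude to set the per-tensor scale.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
    q.data.reserve(weights.size());
    for (float w : weights) {
        int v = static_cast<int>(std::lround(w / q.scale));
        v = std::clamp(v, -127, 127);  // clamp to the signed 8-bit range
        q.data.push_back(static_cast<int8_t>(v));
    }
    return q;
}

float dequantize(const QuantizedTensor& q, size_t i) {
    return q.data[i] * q.scale;  // approximate original FP32 weight
}

Dequantizing is just a multiply by the stored scale, so each weight loses at most half a quantization step in the round trip.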
Low-bit quantization types include:
- 4-bit quantization (INT4): Another popular size for quantization is a “nibble” (4 bits), giving 2^4=16 distinct weight values. This seems quite drastic, but it works surprisingly well given its low bitwidth and is commonly used; a packing sketch appears after this list.
- 3-bit quantization (INT3): This uses 2^3=8 distinct weights.
- 2-bit quantization (INT2): There are 4 distinct weights. Not commonly used.
- Ternary quantization: This is quantization with 3 distinct weights, usually -1, 0, and +1. It still requires 2 bits of storage but uses only 3 of the 4 possible values, and it suffers from accuracy problems.
- Binary quantization: This is 1-bit quantization with 2 possible weights (usually 0 and 1, or -1 and +1). Not highly accurate.
- 0-bit quantization: Good luck with this algorithm.
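One practical detail with INT4 is storage: two 4-bit weights are usually packed into each byte. The sketch below assumes the weights have already been mapped to the integers 0..15, and packs them low-nibble first; both choices are illustrative assumptions.

// Minimal sketch of packing two 4-bit quantized weights into each byte.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t> pack_int4(const std::vector<uint8_t>& nibbles) {
    std::vector<uint8_t> packed((nibbles.size() + 1) / 2, 0);
    for (size_t i = 0; i < nibbles.size(); ++i) {
        uint8_t n = nibbles[i] & 0x0F;   // keep only the low 4 bits
        if (i % 2 == 0)
            packed[i / 2] |= n;          // even index -> low nibble
        else
            packed[i / 2] |= (n << 4);   // odd index -> high nibble
    }
    return packed;
}

uint8_t unpack_int4(const std::vector<uint8_t>& packed, size_t i) {
    uint8_t byte = packed[i / 2];
    return (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
}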
Some of the different types of advanced integer quantization algorithms include:
- Integer-only-arithmetic quantization: This refers to quantization where all of the actual arithmetic performed during inference consists of integer multiplications. This is distinct from the rather unkindly named “fake quantization,” where the integers are “dequantized” back to floating-point and floating-point multiplication is used in the inference calculations. Integer-only-arithmetic quantization aims to improve both speed and space, whereas integer quantization without integer-only arithmetic still reduces model size and storage space, but improves execution speed less fully (latency is still somewhat improved due to reduced memory-related activity). The contrast is sketched in code after this list.
- Logarithmic Bitshift Quantization (Power-of-Two Quantization): This is where the weights are all powers of 2, so that faster integer bitshifts are used instead of integer multiplications. A generalization is “Sum of Two Bitshifts Quantization,” which uses multiple bitshifts added together. A power-of-two sketch also appears below.
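To make the integer-only versus “fake quantization” distinction concrete, the following sketch contrasts the two styles for a dot product over INT8 values. It assumes one floating-point scale per vector; the function names are hypothetical.

// Contrast between "fake quantization" and integer-only arithmetic
// for a dot product over INT8 weights and activations (illustrative).
#include <cstddef>
#include <cstdint>

// Fake quantization: dequantize each element to float, then multiply in FP32.
float dot_fake_quant(const int8_t* w, const int8_t* x,
                     float w_scale, float x_scale, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += (w[i] * w_scale) * (x[i] * x_scale);  // FP multiply per element
    return sum;
}

// Integer arithmetic: accumulate in INT32, apply the scales once at the end.
float dot_integer_only(const int8_t* w, const int8_t* x,
                       float w_scale, float x_scale, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
    return acc * (w_scale * x_scale);  // single rescaling step
}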
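And here is a rough sketch of power-of-two quantization, where each weight is stored as a sign plus a shift count so that multiplying by the weight becomes a bitshift. The encoding, the shift range, and the assumption that weight magnitudes are at most 1 are all illustrative simplifications.

// Minimal sketch of power-of-two (logarithmic) quantization.
#include <cmath>
#include <cstdint>

struct Pow2Weight {
    int8_t sign;    // +1 or -1
    uint8_t shift;  // weight magnitude is 2^(-shift), e.g. shift 3 => 0.125
};

// Quantize a float weight (assumed magnitude <= 1) to the nearest power of two.
Pow2Weight quantize_pow2(float w, uint8_t max_shift = 7) {
    Pow2Weight q;
    q.sign = (w < 0.0f) ? -1 : +1;
    float mag = std::fabs(w);
    int shift = (mag > 0.0f)
        ? static_cast<int>(std::lround(-std::log2(mag)))
        : max_shift;                       // treat zero as the smallest value
    if (shift < 0) shift = 0;
    if (shift > max_shift) shift = max_shift;
    q.shift = static_cast<uint8_t>(shift);
    return q;
}

// Multiply an integer activation by the quantized weight using a shift
// instead of an integer multiplication.
int32_t apply_pow2_weight(int32_t activation, Pow2Weight q) {
    int32_t shifted = activation >> q.shift;  // divide by 2^shift
    return (q.sign < 0) ? -shifted : shifted;
}

In practice the shift counts themselves would be packed into a few bits per weight, which is where the space savings come from.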