Aussie AI
Standardized Bit Representations
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
There's nothing magical about the choice of how many bits to use for the exponent versus the mantissa. In the early days, there were many variations, but they were mostly standardized by the IEEE 754 standard.
32-bit Floating-Point Numbers:
The most common type of floating-point number is 32 bits, such as the C++ “float” type. Other than the sign bit, there are 31 bits to split between the exponent and the mantissa, and the standard method is:
- Standard FP32 (IEEE 754). Usually a “float” in C++, or a “single precision” number. Standard 32-bit floating-point is represented in binary as: 1 sign bit, 8 exponent bits, and 23 mantissa bits (plus an implied prefix '1' mantissa bit that isn't actually stored, so it's really 24 bits of mantissa values). The exponent is stored with offset 127.
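To make this layout concrete, here is a minimal C++ sketch that pulls apart the sign, exponent, and mantissa fields of a float by copying its bits into a 32-bit integer. The field widths and the 127 offset follow the FP32 description above; the function name is just for illustration.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Split an IEEE 754 "float" into its three bit fields:
// 1 sign bit, 8 exponent bits (offset 127), 23 stored mantissa bits.
void print_fp32_fields(float f)
{
    uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the bits safely

    unsigned sign     = (bits >> 31) & 0x1u;    // top bit
    unsigned exponent = (bits >> 23) & 0xFFu;   // next 8 bits
    unsigned mantissa = bits & 0x7FFFFFu;       // low 23 bits (implicit leading 1 not stored)

    std::printf("%g: sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                f, sign, exponent, (int)exponent - 127, mantissa);
}

int main()
{
    print_fp32_fields(1.0f);   // sign=0, exponent=127 (unbiased 0), mantissa=0
    print_fp32_fields(-2.5f);  // sign=1, exponent=128 (unbiased 1), mantissa=0x200000
}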
16-bit Floating-Point Numbers: With the “half” float types, there are 16 bits. There are a few common 16-bit floating-point representations in use. The main ones are:
- Half-precision (FP16). This is the standard 16-bit floating-point number, also sometimes called “float16”. Annoyingly, there is no standard “short float” or other widely used predefined type in C++, although the C++23 standard adds one, so this may be changing soon. The most common IEEE 754-standardized version of the FP16 type uses 1 sign bit, 5 exponent bits, and 10 stored mantissa bits (plus the implicit mantissa bit makes 11 bits). The exponent is stored with offset 15.
- Bfloat16 (brain float 16 or BF16). This is a different 16-bit floating-point numeric format, originally proposed by the Google Brain division, specifically for use in AI applications. Bfloat16 has 1 sign bit, 8 exponent bits with offset 127 (like FP32), and 8 mantissa bits (7 stored, 1 implicit). It is like FP32 but with the two lowermost bytes just thrown away, so conversion between bfloat16 and FP32 is simpler than converting from FP32 to FP16.
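To make the “throw away the two lowermost bytes” point concrete, here is a minimal C++ sketch of a truncating FP32-to-bfloat16 conversion and back. It simply keeps the upper 16 bits of the float; production conversions typically also apply round-to-nearest-even, which is omitted here for clarity, and the type alias and function names are illustrative rather than any standard API.

#include <cstdint>
#include <cstring>

using bfloat16_t = uint16_t;  // illustrative alias: 16 raw bits, not a standard C++ type

// FP32 to bfloat16 by truncation: keep the sign bit, the 8 exponent bits,
// and the top 7 mantissa bits; discard the low 16 bits of the mantissa.
bfloat16_t fp32_to_bf16_truncate(float f)
{
    uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return (bfloat16_t)(bits >> 16);  // upper two bytes only
}

// Bfloat16 back to FP32: the 16 bits become the upper half; the discarded
// mantissa bits are simply zero.
float bf16_to_fp32(bfloat16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f = 0.0f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

Because bfloat16 keeps the full 8-bit exponent, this truncation preserves the dynamic range of FP32 and only loses mantissa precision.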
8-bit Floating-Point Numbers (FP8): The use of FP8 mainly appears in quantization research papers, but its usage is increasing in industry. The most common layout has 1 sign bit, 4 exponent bits, and 3 mantissa bits (which makes 4 bits with an implied extra mantissa bit), often called E4M3. The other type of FP8, often called E5M2, has 1 sign bit, 5 exponent bits, and 2 stored mantissa bits (3 bits total). Interestingly, the NVIDIA H100 GPU supports both of these FP8 formats.
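As a rough sketch of how the first (E4M3-style) layout decodes, the following C++ function interprets an 8-bit value as 1 sign bit, 4 exponent bits with an assumed offset of 7, and 3 mantissa bits. Real FP8 formats, including NVIDIA's E4M3, differ in how they treat special values such as NaN and the top exponent code, so this is an illustrative decode only; the function name is hypothetical.

#include <cmath>
#include <cstdint>

// Illustrative decode of an 8-bit float with 1 sign bit, 4 exponent bits
// (assumed offset 7), and 3 stored mantissa bits. Special values (NaN/Inf)
// vary between real FP8 formats and are ignored here.
float fp8_e4m3_decode(uint8_t v)
{
    int sign     = (v >> 7) & 0x1;
    int exponent = (v >> 4) & 0xF;
    int mantissa = v & 0x7;

    float value;
    if (exponent == 0) {
        // Subnormal: no implicit leading 1, fixed exponent of 1 - offset.
        value = std::ldexp((float)mantissa / 8.0f, 1 - 7);
    } else {
        // Normal: implicit leading 1, biased exponent.
        value = std::ldexp(1.0f + (float)mantissa / 8.0f, exponent - 7);
    }
    return sign ? -value : value;
}

The E5M2 variant decodes analogously, with 5 exponent bits and a 2-bit mantissa, trading precision for a wider exponent range.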