Floating Point Introduction

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Floating-point numbers in AI engines are typically stored in 32 bits as the single-precision C++ “float” type, which behind the scenes is really just a 32-bit pattern that can be inspected and manipulated like a 32-bit integer (see the sketch after this list). The main floating-point types that you already know from C++ programming are:

  • Single-precision floating-point — 32-bit float (FP32)
  • Double-precision floating-point — 64-bit double (FP64)
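
You can see the “integer behind the scenes” directly. Here's a minimal sketch, assuming a C++20 compiler for std::bit_cast, that reinterprets the 32 bits of a float as an unsigned integer without changing them:

    #include <bit>       // std::bit_cast (C++20)
    #include <cstdint>
    #include <cstdio>

    int main() {
        float f = 1.0f;
        // Same 32 bits, viewed as an unsigned integer (no value conversion).
        std::uint32_t bits = std::bit_cast<std::uint32_t>(f);
        std::printf("float %g has bit pattern 0x%08X\n",
                    f, static_cast<unsigned>(bits));  // prints 0x3F800000
        return 0;
    }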

The smaller 16-bit floating-point types, which are rarely used in everyday C++ coding but are important for AI, include:

  • Half-precision IEEE type — 16-bit “short float” (FP16)
  • Half-precision Bfloat16 type — 16-bit “Brain float” (BF16)

If only there were really a “short float” type in C++. The BF16 type is the non-IEEE 16-bit float format from Google Brain. Note that C++23 adds standardized support for both of these 16-bit types, via std::float16_t and std::bfloat16_t in the <stdfloat> header, as sketched below.
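
As a quick sketch of the C++23 support: the types std::float16_t and std::bfloat16_t are optional and guarded by feature-test macros, so this example assumes a compiler and standard library that actually provide them:

    #include <cstdio>
    #include <stdfloat>   // C++23: std::float16_t, std::bfloat16_t (optional types)

    int main() {
    #if defined(__STDCPP_FLOAT16_T__) && defined(__STDCPP_BFLOAT16_T__)
        std::float16_t  h = 3.140625f16;   // IEEE half-precision (FP16)
        std::bfloat16_t b = 3.140625bf16;  // Google Brain float (BF16)
        std::printf("FP16: %zu bytes, value %f; BF16: %zu bytes, value %f\n",
                    sizeof(h), static_cast<double>(h),
                    sizeof(b), static_cast<double>(b));
    #else
        std::printf("No standard 16-bit float types on this compiler.\n");
    #endif
        return 0;
    }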

Which type of floating-point number should you use in your AI engine? That's where things get tricky, because there are many wrinkles in the choice between 32-bit and 16-bit floating-point, and it's not always clear which size is best. FP32 is the most common size for basic Transformer inference, but FP16 is a good choice for quantization of models, because the weights are compressed to half the size while retaining good accuracy. And BF16 has proven very effective for GPU-accelerated algorithms.

Hardware accelerators also differ in which formats and sizes they support for their parallel operations. And there are various software problems with portably coding 16-bit floating-point data types in C++, along with variable hardware support for 16-bit operations across platforms (see the sketch below).
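
One common workaround when no native 16-bit type is available is to store BF16 values as plain 16-bit integers and convert via bit manipulation, since BF16 is simply the top 16 bits of an FP32 value. Here's a minimal sketch assuming C++20 for std::bit_cast; the helper names fp32_to_bf16 and bf16_to_fp32 are just illustrative, and production code would usually use round-to-nearest-even rather than truncation:

    #include <bit>        // std::bit_cast (C++20)
    #include <cstdint>
    #include <cstdio>

    // FP32 -> BF16 by truncation: keep sign, 8-bit exponent, and top 7 mantissa bits.
    static std::uint16_t fp32_to_bf16(float f) {
        return static_cast<std::uint16_t>(std::bit_cast<std::uint32_t>(f) >> 16);
    }

    // BF16 -> FP32: pad the discarded low 16 mantissa bits with zeros.
    static float bf16_to_fp32(std::uint16_t b) {
        return std::bit_cast<float>(static_cast<std::uint32_t>(b) << 16);
    }

    int main() {
        float x = 3.14159f;
        std::uint16_t b = fp32_to_bf16(x);
        std::printf("FP32 %.6f -> BF16 0x%04X -> FP32 %.6f\n",
                    x, static_cast<unsigned>(b), bf16_to_fp32(b));
        return 0;
    }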

Less importantly, there are also some other floating-point sizes, both bigger and smaller:

  • Quarter-precision type — 8-bit floating-point (FP8)
  • Quadruple-precision type — 128-bit “quad” floating-point (FP128)

FP8 is mainly seen in research papers, and hasn't really caught on for quantization (8-bit integers are typically used instead). The bigger sizes FP64 and FP128 aren't really needed to make your model work accurately, so their significant extra cost in speed and size isn't worthwhile for only a small perplexity gain in most use cases.

 
