FP16 Problems in C++

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

I already mentioned that there's no standard half-precision type in C++, although that will eventually be fixed once compilers catch up with the C++23 standard. Here are some of the attempts at a 16-bit type:

  • __fp16 — only supported on ARM architectures.
  • _Float16 — not portably supported.
  • short float — doesn't seem to exist (I'm just wishful-thinking!).
  • std::float16_t — defined in the C++23 standard, in the new <stdfloat> header.
  • std::bfloat16_t — also defined in C++23's <stdfloat> (see the sketch after this list).

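For the curious, here's a minimal sketch of what the two C++23 types look like, assuming a compiler and standard library that already ship the <stdfloat> header and its f16/bf16 literal suffixes (support was still patchy at the time of writing):

    #include <cstdio>
    #include <stdfloat>   // C++23: std::float16_t and std::bfloat16_t

    int main()
    {
        std::float16_t  h = 1.5f16;    // IEEE 754 half-precision (binary16)
        std::bfloat16_t b = 2.5bf16;   // bfloat16 ("brain float")

        // Arithmetic directly between the two 16-bit types is ill-formed,
        // so widen both to FP32 first.
        float sum = static_cast<float>(h) + static_cast<float>(b);
        std::printf("%f\n", sum);      // prints 4.000000
        return 0;
    }
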
So, as of writing, if you want to code a 16-bit float in a portable way with C++, there's an ugly hack: short int. You store the raw 16 bits of each half-precision value in a 16-bit integer type, and do all the conversions yourself.
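
A bare-bones sketch of that hack might look like the following, where "myfp16" is just a made-up name and std::uint16_t is the more explicit modern spelling of a 16-bit unsigned short:

    #include <cstdint>
    #include <vector>

    typedef std::uint16_t myfp16;      // raw FP16 bit pattern, not a real number

    // A model weight vector stored in half precision: half the memory of FP32,
    // but every arithmetic operation needs an explicit conversion (see below).
    std::vector<myfp16> weights(1024);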

A less fixable obstacle is that converting between FP32 and FP16 is not easy, because their exponent sizes differ: FP32 has 8 exponent bits, but FP16 has only 5, so the exponent has to be re-biased and the overflow and underflow cases handled. It's fiddly to code, and not very efficient.
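
To give a feel for the fiddliness, here's a rough sketch of an FP32-to-FP16 conversion (the function name is just illustrative) that simply truncates the mantissa, and ignores rounding, subnormals, and NaN propagation; a production version has to handle all of those:

    #include <cstdint>
    #include <cstring>

    // Convert an FP32 value to raw FP16 bits: re-bias the exponent (127 -> 15),
    // truncate the 23-bit mantissa down to 10 bits, and crudely clamp the
    // out-of-range cases.
    std::uint16_t fp32_to_fp16(float f)
    {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));                  // raw FP32 bits
        std::uint16_t sign = std::uint16_t((bits >> 16) & 0x8000u);  // sign to bit 15
        int exponent = int((bits >> 23) & 0xFFu) - 127 + 15;   // re-bias exponent
        std::uint32_t mantissa = bits & 0x007FFFFFu;

        if (exponent <= 0)  return sign;                        // too small: flush to zero
        if (exponent >= 31) return std::uint16_t(sign | 0x7C00u);  // too big: +/- infinity
        return std::uint16_t(sign | (exponent << 10) | (mantissa >> 13));
    }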

The alternative idea is to use “bfloat16” (BF16), which has the same layout as the upper-most two bytes of FP32 (the same 8-bit exponent, but only 7 stored mantissa bits). Converting is just a bitshift of 16 places, or some byte shuffling, so it's much faster than converting to FP16.
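
For comparison, here's a sketch of truncating BF16 conversions (again, the function names are made up); a round-to-nearest version needs a little extra rounding logic, but it's still only a handful of integer operations:

    #include <cstdint>
    #include <cstring>

    // FP32 -> BF16: keep only the upper 16 bits of the FP32 representation.
    std::uint16_t fp32_to_bf16(float f)
    {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        return std::uint16_t(bits >> 16);      // truncate the low mantissa bits
    }

    // BF16 -> FP32: put the 16 bits back into the upper half, zero the rest.
    float bf16_to_fp32(std::uint16_t h)
    {
        std::uint32_t bits = std::uint32_t(h) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }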

However, BF16 isn't high precision. With 8 mantissa bits (7 stored, 1 implicit), that's only about 2-3 decimal digits, because 8/3.3 is roughly 2.4, and 3.3 is log2(10), in case you were wondering. But it's not much worse than FP16, which is only about 3-4 decimal digits using 11 binary mantissa bits (10 stored, 1 implicit), since 11/3.3 is roughly 3.3.
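
As a quick illustration of the precision gap, this little self-contained test round-trips 1.003 through a truncating FP32-to-BF16 conversion and loses the third decimal place (assuming the usual IEEE 754 layout for float):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        float x = 1.003f;

        std::uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        bits &= 0xFFFF0000u;                  // keep only the BF16 portion
        float as_bf16;
        std::memcpy(&as_bf16, &bits, sizeof(as_bf16));

        std::printf("FP32: %f, BF16 round-trip: %f\n", x, as_bf16);
        // Prints: FP32: 1.003000, BF16 round-trip: 1.000000
        return 0;
    }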
