Aussie AI

Representing Special Numbers

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

We've already discussed how zero is handled specially, with its wonderful dichotomy of positive and negative zero. The full list of special floating-point numbers is:

  • Zero
  • Negative zero
  • +Inf (positive infinity)
  • -Inf (negative infinity)
  • NaN (Not a Number)
  • Denormalized numbers (subnormal numbers)

Whereas zero is represented by an exponent of all 0s, the special numbers Inf and NaN are represented by an exponent of all 1s. This means that the huge number 2^+128 is not actually representable, because its exponent bit pattern is reserved for these special values. And honestly, that's fine, because if 2^+128 isn't infinity, then I don't know what it is.

Infinity: Inf is represented by all 1s in the exponent, but all 0s in the mantissa. And if the sign bit is 1, then it's -Inf (negative infinity).

Not-a-Number: NaN also has all 1s for the exponent, but any other pattern of the mantissa bits means NaN. This means that there are many versions of NaN, for all variations of the mantissa bits, except when all mantissa bits are 0 (which means Inf). Also, if the sign bit is set, then the same patterns are also NaN (a kind of “negative NaN”, but that distinction is rarely used).

Denormalized numbers: Apparently, the designers of the floating-point standards think there's a “huge” difference between 2^-126 (the smallest normal float) and zero. So, they decided to “smooth” it out a little by using some special numbers called “denormalized numbers” (also called “subnormal numbers”).

The standard does this by getting rid of the “implicit” mantissa bit. For one special exponent value, all 0s, the standard changes the meaning to consider the implicit hidden mantissa bit to be a leading 0, rather than a leading 1.

Hence, the mantissa can represent fractions less than 1.0, such as 0.1101 rather than only 1.1101 (in binary). An exponent of all 0s therefore never represents -127. Instead, it represents the special value zero (or negative zero) if all the mantissa bits are 0s, or a tiny denormalized number if any of the mantissa bits are set. And even though an exponent of all 0s should represent -127, we pretend that it is -126, one higher, for the denormalized numbers, for “smoothness” reasons that I leave as an exercise to the reader, mainly because I don't understand it. Note that denormalized numbers can also be tiny negatives if the sign bit is set.

Denormalized numbers are all very, very tiny, being less than 2^-126, so this feature of floating-point standardization is more useful for high-precision scientific calculations at NASA or SpaceX, rather than for AI engines. In fact, here's the news about denormalized numbers and AI:

    We don't use denormalized numbers.

In fact, we hate them, because they make our FPU run slow. So, really, the slowness of our AI engines is the fault of the FPU hardware engineers, as we've long suspected. Fortunately, there's a way to turn denormalized numbers off and run faster, which is discussed below.

To summarize and/or to further confuse things, the exponent has two special cases: all 0s and all 1s. If the exponent bits are all 0s, the number is either zero (or negative zero) or a denormalized number (a tiny positive or negative). If the exponent bits are all 1s, then the number is Inf or NaN (or negative Inf/NaN).
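These two special exponent cases can be checked directly from the bits. Here is a sketch, again assuming the 32-bit IEEE 754 float layout; the classify function and Kind enum are our own names, not standard C++:

```cpp
#include <cstdint>
#include <cstring>

enum class Kind { Zero, Denormal, Normal, InfOrNaN };

// Classify a float by its raw exponent field, as summarized above.
Kind classify(float f) {
    uint32_t u = 0;
    std::memcpy(&u, &f, sizeof u);
    uint32_t exponent = (u >> 23) & 0xFFu;    // 8 exponent bits
    uint32_t mantissa = u & 0x7FFFFFu;        // 23 mantissa bits
    if (exponent == 0)                        // all 0s: zero or denormal
        return (mantissa == 0) ? Kind::Zero : Kind::Denormal;
    if (exponent == 0xFFu)                    // all 1s: Inf or NaN
        return Kind::InfOrNaN;                // (mantissa==0 is Inf, else NaN)
    return Kind::Normal;
}
```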

Testing for Special Values: The C++ standard has a number of fast routines to test a floating-point number. Some of the useful ones in <cmath> include:

  • std::isinf()
  • std::isnan()
  • std::isnormal()
  • std::isfinite()

For more general analysis of floats, std::fpclassify() in <cmath> returns a code that matches special enum values: FP_INFINITE, FP_NAN, FP_NORMAL, FP_SUBNORMAL, FP_ZERO. Unfortunately, these functions cannot distinguish positive from negative infinity, or detect negative zero. You'll need to add a call to the std::signbit function (available since C++11), which returns true if a floating-point number has its sign bit set. There is also a std::copysign function to copy the sign from one float to another, which can be used for sign bit manipulations. Alternatively, define your own bitwise macro tricks for these inspections.

 
