Aussie AI
Representing Special Numbers
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Representing Special Numbers
We've already discussed how zero is handled specially, and has a wonderful dichotomy. The full list of special floating-point numbers is:
- Zero
- Negative zero
+Inf
(positive infinity)-Inf
(negative infinity)NaN
(Not a Number)- Denormalized numbers (subnormal numbers)
Whereas zero is represented by the exponent being all 0s, the special numbers Inf
and NaN
are
represented by the exponent with all 1s.
So, this means that the huge number 2^+128
is not actually represented,
but reserved for these special values.
And honestly, that's fine, because if 2^+128
isn't infinity, then I don't know what it is.
Infinity:
Inf
is represented by all 1s in the exponent, but all 0s in the mantissa.
And if the sign bit is 1, then it's -Inf
(negative infinity).
Not-a-Number:
NaN
also has all 1s for the exponent, but any other pattern of the mantissa bits means NaN
.
This means that there are many versions of NaN
, for all variations of the mantissa bits,
except when all mantissa bits are 0 (which means Inf
).
Also, if the sign bit is set, then the same patterns are also NaN
(a kind of “negative NaN
”, but
that distinction is rarely used).
Denormalized numbers:
Apparently, the designers of the floating-point standards
think there's a “huge” difference between 2^-127
and zero.
So, they decided to “smooth” it out a little
by using some special numbers called “denormalized numbers” (also called “subnormal numbers”).
The standard does this by getting rid of the “implicit” mantissa bit. For one special exponent value, all 0s, the standard changes the meaning to consider the implicit hidden mantissa bit to be a leading 0, rather than a leading 1.
Hence, the mantissa can represent fractions less than 1.0
, such as
0.1101
rather than only 1.1101
(in binary).
The special exponent with all 0s therefore never represents -127
, but represents the special value zero (or negative zero) if all the mantissa bits are 0s,
or a tiny denormalized number if any of the mantissa bits are set.
And even though the exponent with all 0s should represent -127
,
we pretend that it is -126
, one less,
for the denormalized numbers, for “smoothness” reasons
that I leave as an exercise to the reader, mainly because I don't understand it.
Note that denormalized numbers can also be tiny negatives if the sign bit is set.
Denormalized numbers are all very, very tiny, being less than 2^-126
, so this feature of floating-point standardization
is more useful for high-precision scientific calculations at NASA or SpaceX, rather than for AI engines.
In fact, here's the news about denormalized numbers and AI:
-
We don't use denormalized numbers.
In fact, we hate them, because they make our FPU run slow. So, really, the slowness of our AI engines is the fault of the FPU hardware engineers, as we've long suspected. Fortunately, there's a way to turn denormalized numbers off and run faster, which is discussed below.
To summarize and/or to further confuse things, the exponent has two special cases: all 0s and all 1s.
If the exponent bits are all 0s, the number is either zero (or negative zero) or a denormalized number (a tiny positive or negative).
If the exponent bits are all 1s, then the number is Inf
or NaN
(or negative Inf
/NaN
).
Testing for Special Values:
The C++ standard has a number of fast routines to test a floating-point number.
Some of the useful ones in <cmath>
include:
- std::isinf()
- std::isnan()
- std::isnormal()
- std::isfinite()
For more general analysis of floats,
std::fpclassify()
in <cmath>
returns a code that matches special enum values: FP_INFINITE
, FP_NAN
,
FP_NORMAL
, FP_SUBNORMAL
, FP_ZERO
.
Unfortunately, it's hard to distinguish positive and negative infinity,
or to detect negative zero using these functions.
You'll need to add a call to the “std::signbit
” function (since C++11 for float
arguments or C++23 for double
),
which returns true
if a floating-point number has the sign bit on.
There also a “std::copysign
” function to copy the sign from one float
to another,
which can be used for sign bit manipulations.
Alternatively, define your own bitwise macro tricks for these inspections.
• Next: • Up: Table of Contents |
The new AI programming book by Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++ |