Aussie AI
Floating Point Arithmetic
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Whatever floats your boat.”
— Anonymous.
Floating Point Arithmetic
Floating-point numbers in AI engines are typically stored in 32 bits for the single-precision C++ “float” type, and behind the scenes the value is just a 32-bit pattern that can be treated like an integer.
The main floating-point types that you already know from C++ programming are:
- Single-precision floating-point — 32-bit “float” (FP32)
- Double-precision floating-point — 64-bit “double” (FP64)
The smaller 16-bit floating-point numbers that are never used in everyday C++ coding, but are important for AI, include:
- Half-precision IEEE type — 16-bit “short float” (FP16)
- Half-precision Bfloat16 type — 16-bit “Brain float” (BF16)
If only there was really a “short float” type in C++.
The BF16 type is the non-IEEE 16-bit float version from Google Brain.
Note that there is new standardized support for these 16-bit types in C++23.
Which type of floating-point number should you use in your AI engine? That's when things get tricky, because there are many wrinkles in the choice between 32-bit and 16-bit floating-point. It's not always clear which floating-point size is the best to use. FP32 is the most common size used in basic Transformer inference, but FP16 is a good choice for quantization of models, because they are compressed to half the size and retain good accuracy. And BF16 has been very effective in terms of GPU-accelerated algorithms.
Some hardware accelerators support different formats and sizes for their parallel operations. And there are various software problems with portably coding 16-bit floating-point data types in C++, along with variable hardware support for 16-bit operations across platforms.
Less importantly, there are also some other floating-point sizes, both bigger and smaller:
- Quarter-precision type — 8-bit floating-point (FP8)
- Quadruple-precision type — 128-bit “quad” floating-point (FP128)
FP8 is mainly seen in research papers, and hasn't really caught on for quantization (8-bit integers are typically used instead). The bigger sizes FP64 and FP128 aren't really needed to make your model work accurately, so their significant extra cost in speed and size isn't worthwhile for only a small perplexity gain in most use cases.
Bit Representations of Floating-Point Numbers
Standardized bit patterns are used to represent floating-point numbers in a kind of scientific notation. There are three types of bits:
- Sign bit
- Exponent bits
- Mantissa bits
Firstly, there's one bit for the sign, indicating whether the whole number is positive or negative. Then the remaining bits are split up between the “exponent” (i.e. the “power”) and the “mantissa” (also called the “digits” or the “significand” or the “fraction”).
In a standard 32-bit “float” type used in AI, there is:
- 1 sign bit
- 8 exponent bits
- 23 mantissa bits
How does that even make a number? Well, it's like scientific notation, if you are familiar with that. The exponent is the power and the mantissa is the digits.
Let's pretend computers use decimal digits. If it were base-10 storage, the decimal number 1234 would be stored as:
- “0” for the sign bit — because non-negative.
- “3” in the exponent — the power is 10^3 = 1000.
- “1234” as the mantissa — the digits make the fraction “1.234”.
This would represent +1.234x10^3 (which hopefully equals 1234). That's how it would work for a decimal version.
But, as you know, silicon beasts are not decimal. A floating-point number is actually stored in binary, in a kind of base-two “binary scientific notation” numbering scheme. So, conceptually, 1234 would be stored with a power-of-two exponent representing the largest power-of-two that fits, which would be 1024, because 2^10=1024, so the exponent has to store the power “10” (ten), which is 1010 in binary. And the 1234 would be converted to whatever the heck 1234/1024 is when you represent that in binary 0's and 1's, and remove the decimal point (which is implicitly “floating,” you see?).
It's more complicated than this, of course.
That's what standards are for!
The exponent bits are actually stored with an “offset” number (also called a “bias”), which differs by the size of the exponent bits. And there are also some special bit patterns for particular numbers, such as zero or “NaN” (not-a-number).
Clear as mud? Don't you wish someone could go back in time and invent a base-10 computer?
Standardized Bit Representations
There's nothing magical about the choices of how many exponent versus mantissa bits. In the early days, there were many variations, but then they were mostly standardized by the IEEE 754 standard.
32-bit Floating-Point Numbers: The most common type of floating-point is 32 bits, such as the C++ “float” type. Other than the sign bit, there are 31 bits to split between the two other types, and the standard method is:
- Standard FP32 (IEEE 754). Usually a “float” in C++, or “single precision” number. Standard 32-bit floating-point is represented in binary as: 1 sign bit, 8 exponent bits, and 23 mantissa bits (plus an implied prefix '1' mantissa bit that isn't actually stored, so it's really 24 bits of mantissa values). The exponent is stored with offset 127.
16-bit Floating-Point Numbers: With the “half” float types, there are 16 bits. There are a few common representations of floating-point numbers in different numbers of bits. The main ones are:
- Half-precision (FP16). This is the standard 16-bit floating-point number, also sometimes called “float16”. Annoyingly, there is no standard “short float” or other widely used predefined type in C++, although the C++23 standard adds one, so this may be changing soon. The most common IEEE 754-standardized version of the FP16 type uses 1 sign bit, 5 exponent bits, and 10 stored mantissa bits (plus the implicit mantissa bit makes 11 bits). The exponent is stored with offset 15.
- Bfloat16 (brain float 16 or BF16). This is a different 16-bit floating-point numeric format, originally proposed by the Google Brain division, specifically for use in AI applications. Bfloat16 has 1 sign bit, 8 exponent bits with offset 127 (like FP32), and 8 mantissa bits (7 stored, 1 implicit). It is like FP32 but with the two lowermost bytes just thrown away, so conversion between bfloat16 and FP32 is simpler than converting from FP32 to FP16.
8-bit Floating-Point (FP8). The use of FP8 mainly appears in quantization research papers, but its usage is increasing within industry. There is usually 1 sign bit, 4 exponent bits, and 3 mantissa bits (which makes 4 bits with an implied extra mantissa bit). The other type of FP8 is 1 sign bit, 5 exponent bits, and 2 stored mantissa bits (3 bits total). Interestingly, the NVIDIA H100 GPU supports both of these FP8 formats.
FP16 Problems in C++
I already mentioned how there's not a standard half-precision type in C++, although that is fixable in the future, once compilers have implemented the C++23 standard. Here are some of the attempts at a 16-bit type:
- __fp16 — only supported by ARM architecture.
- _Float16 — not portably supported.
- short float — doesn't seem to exist (I'm just wishful-thinking!).
- std::float16_t — defined in the C++23 standard.
- std::bfloat16_t — defined in the C++23 standard.
So, as of writing, if you want to code a 16-bit float in a portable way with C++, there's an ugly hack: short int.
A less fixable obstacle is that converting between FP32 and FP16 is not easy because their exponent bit sizes are different. So, it's fiddly to code, and not very efficient. The alternative idea is to use “bfloat16” (BF16), which is the upper-most two bytes of FP32. Converting is just a bitshift by 16 places or playing with bytes, so it's faster than FP16. However, BF16 isn't high precision. With 8 mantissa bits (7 stored, 1 implicit), that's only about 2-3 decimal digits, because 8/3.3 is about 2.4, and 3.3 is log2(10), in case you were wondering. But it's not much worse than FP16, which is only about 3 decimal digits using 11 binary mantissa bits.
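To illustrate why the BF16 conversion is cheap, here's a minimal sketch of FP32-to-BF16 truncation (just keeping the top 16 bits) and back again. The function names are made up for this example, and real libraries usually add round-to-nearest rather than plain truncation:

#include <cstdint>
#include <cstring>

// FP32 to BF16 by keeping only the top 16 bits (truncation)
uint16_t fp32_to_bf16(float f)
{
    uint32_t u = 0;
    std::memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);   // Sign, 8 exponent bits, top 7 mantissa bits
}

// BF16 back to FP32 by padding the low 16 bits with zeros
float bf16_to_fp32(uint16_t b)
{
    uint32_t u = ((uint32_t)b) << 16;
    float f = 0.0f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}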
Representing Zero
The sign bit, exponent, and mantissa can represent a lot of numbers, but not zero. We cannot just set all the mantissa bits to zero, because that's not zero, which is rather strange. There's an implicit extra “1” bit, so having all the mantissa bits clear isn't 0.0000, it's 1.0000. The mantissa always starts with a “1” and there's literally no way to represent 0.0000. Also, the exponent can represent -127 to +128, but setting the exponent to 0 also isn't zero, because 2^0 is 1. And 2^-127 is very small and does get us very close to zero, but it's also not zero.
With sudden horrifying insight, we realize: There's no way to represent zero!
The solution is that the IEEE 754 standard designers decided to treat all bits zero as being really zero. All bits zero in the exponent is 0, but then subtracting the 127 offset means that it is -127 (the smallest exponent). So, if we clear all the exponent and mantissa bits to zeros, the number should be 1.0x2^-127, but we can all pretend it's actually zero. Then we can do some pretend coding, ahem, I mean microcoding, so that all our Floating-Point Units (FPUs) pretend it's zero, too.
Negative zero. Weirdly, there are two zeros: normal zero and negative zero. The IEEE 754 standard allows two different bit patterns to mean zero, depending on the sign bit. If we clear all the exponent and mantissa to zero, then the sign bit zero means zero, but the sign bit set to “1” means “negative zero”.
I'm not really sure what negative zero even means! But sometimes when you work with floats, a 0.000 number will get printed with a “-” in front of it. Maybe it's negative zero, or maybe it's a tiny negative number with hidden digits at the 15th decimal place.
Fortunately, most of the arithmetic operations treat negative zero the same as zero. The C++ compiler handles it automatically. Adding negative zero does nothing, and multiplying by negative zero is also zero. But here's one of the gotchas if you're being tricky with the bits of a 32-bit floating-point number by pretending it's a 32-bit integer: testing for zero isn't one integer comparison, it's two!
Representing Special Numbers
We've already discussed how zero is handled specially, and has a wonderful dichotomy. The full list of special floating-point numbers is:
- Zero
- Negative zero
- +Inf (positive infinity)
- -Inf (negative infinity)
- NaN (Not a Number)
- Denormalized numbers (subnormal numbers)
Whereas zero is represented by the exponent being all 0s, the special numbers Inf and NaN are represented by the exponent with all 1s. So, this means that the huge number 2^+128 is not actually represented, but reserved for these special values. And honestly, that's fine, because if 2^+128 isn't infinity, then I don't know what it is.
Infinity: Inf is represented by all 1s in the exponent, but all 0s in the mantissa. And if the sign bit is 1, then it's -Inf (negative infinity).
Not-a-Number: NaN also has all 1s for the exponent, but any other pattern of the mantissa bits means NaN. This means that there are many versions of NaN, for all variations of the mantissa bits, except when all mantissa bits are 0 (which means Inf). Also, if the sign bit is set, then the same patterns are also NaN (a kind of “negative NaN”, but that distinction is rarely used).
Denormalized numbers: Apparently, the designers of the floating-point standards think there's a “huge” difference between 2^-127 and zero. So, they decided to “smooth” it out a little by using some special numbers called “denormalized numbers” (also called “subnormal numbers”).
The standard does this by getting rid of the “implicit” mantissa bit. For one special exponent value, all 0s, the standard changes the meaning to consider the implicit hidden mantissa bit to be a leading 0, rather than a leading 1. Hence, the mantissa can represent fractions less than 1.0, such as 0.1101 rather than only 1.1101 (in binary). The special exponent with all 0s therefore never represents -127, but represents the special value zero (or negative zero) if all the mantissa bits are 0s, or a tiny denormalized number if any of the mantissa bits are set. And even though the exponent with all 0s should represent -127, we pretend that it is -126, one higher, for the denormalized numbers, for “smoothness” reasons that I leave as an exercise to the reader, mainly because I don't understand it. Note that denormalized numbers can also be tiny negatives if the sign bit is set.
Denormalized numbers are all very, very tiny, being less than 2^-126, so this feature of floating-point standardization is more useful for high-precision scientific calculations at NASA or SpaceX, rather than for AI engines.
In fact, here's the news about denormalized numbers and AI: We don't use denormalized numbers. In fact, we hate them, because they make our FPU run slow. So, really, the slowness of our AI engines is the fault of the FPU hardware engineers, as we've long suspected. Fortunately, there's a way to turn denormalized numbers off and run faster, which is discussed below.
To summarize and/or to further confuse things, the exponent has two special cases: all 0s and all 1s. If the exponent bits are all 0s, the number is either zero (or negative zero) or a denormalized number (a tiny positive or negative). If the exponent bits are all 1s, then the number is Inf or NaN (or negative Inf/NaN).
Testing for Special Values:
The C++ standard has a number of fast routines to test a floating-point number.
Some of the useful ones in <cmath> include:
- std::isinf()
- std::isnan()
- std::isnormal()
- std::isfinite()
For more general analysis of floats, std::fpclassify() in <cmath> returns a code that matches special enum values: FP_INFINITE, FP_NAN, FP_NORMAL, FP_SUBNORMAL, FP_ZERO.
Unfortunately, it's hard to distinguish positive and negative infinity, or to detect negative zero, using these functions alone. You'll need to add a call to the “std::signbit” function (available since C++11), which returns true if a floating-point number has the sign bit set. There is also a “std::copysign” function to copy the sign from one float to another, which can be used for sign bit manipulations. Alternatively, define your own bitwise macro tricks for these inspections.
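As a quick illustration of these library routines, here's a minimal sketch (the helper name is made up) that classifies a float, using std::signbit to tell apart the positive and negative variants:

#include <cmath>
#include <cstdio>

void aussie_describe_float(float f)   // Illustrative helper, not a library function
{
    if (std::isnan(f)) {
        std::printf("NaN\n");
    }
    else if (std::isinf(f)) {
        std::printf(std::signbit(f) ? "-Inf\n" : "+Inf\n");
    }
    else if (f == 0.0f) {
        std::printf(std::signbit(f) ? "negative zero\n" : "zero\n");
    }
    else if (!std::isnormal(f)) {
        std::printf("denormalized (subnormal)\n");
    }
    else {
        std::printf("normal number\n");
    }
}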
Underflow and Overflow
Underflow is when a tiny floating-point number becomes so small that we can only represent it as zero. This can be a very tiny positive or negative number. Note that a negative number with a huge magnitude (near negative infinity) isn't underflow; that's actually negative overflow. Underflow refers to tiny fractions.
Generally, underflow isn't a problem for AI, because a number that low isn't going to affect the results. Similarly, I don't think an AI engine needs to worry much about subnormal/denormalized tiny numbers either. If the probability of a word appearing is 2^-127 (or 2^-126 for denormalized), well, it might as well be zero anyway. If we're using Bfloat16 for 16-bit processing, it still has an 8-bit exponent, so the lowest value is almost the same number (about 2^-127). If we've quantized the network to FP16 (also 16-bit but with a 5-bit exponent), then the lowest probability we can represent is around 2^-15, which is also a tiny probability.
Generally speaking, AI engines don't tend to worry about underflow in floating-point, and “pruning” of low weight values is actually a common optimization. If a floating-point calculation underflows, it should just go harmlessly to zero. More concerning would be integer underflow, which is a different issue of large negatives wrapping around to positives. Floating-point underflow is better behaved.
Overflow is when a number gets so large that it cannot be represented in floating-point. Note that there are two types of overflow: positive overflow and negative overflow.
The exponent is the problem for overflow. When the number is larger than the highest exponent power, then it's either a very large positive or a very large-magnitude negative number. For an 8-bit exponent, that means 2^+127 (because +128 is reserved for the special Inf/NaN numbers). For a 5-bit exponent in FP16, this means about 2^+15, so the largest FP16 value is 65,504, which is, coincidentally, also a good salary to request at your next performance review.
Overflow can be a problem for AI engines, but usually only in the low-bit quantized models, rather than the usual FP32 calculations. We don't realistically want a probability or weight anywhere near the huge numbers (positive or negative), but arithmetic computations can sometimes go too high. One of the advantages of “normalization layers” is that it reduces the chance of this occurring, although they're mainly used for other reasons related to accuracy. When overflow occurs, it could become a special floating-point number (NaN or Inf), or an integer number might toggle over to negative (e.g. if integer-only-arithmetic quantized).
FTZ and DAZ CPU Modes
In many CPUs, the need to handle overflow, underflow and denormalized values is a cause of inefficiency. The CPU can do floating-point computations faster if it can ignore those situations. This would be in violation of the IEEE 754 standard, but sometimes you have to sacrifice greatness for speed.
There are two commonly used modifications to CPUs that speed up floating-point arithmetic, by ignoring underflow and tiny numbers:
Flush-To-Zero (FTZ). This mode means that when the results are “subnormal” they are “flushed” to zero instead of calculating the correct “denormalized” result. Since these denormalized numbers are tiny, this isn't a concern in AI engines.
Denormals-Are-Zero (DAZ). This is similar to FTZ, but allows treating any denormalized floating-point inputs as a zero input.
Both these modes, FTZ and DAZ, are only relevant to very tiny numbers, well below the resolution that AI engines need to worry about, so you can totally enable them, provided we can figure out how to do so. CPUs with support for the FTZ and DAZ modes include x86 CPUs and ARM Cortex cores, and likely other processors. Google TPUs don't support FTZ/DAZ because they operate on bfloat16 floating-point numbers.
Enabling FTZ and DAZ. Finding details on how to enable FTZ and DAZ is quite hard! (If only there was an AI engine to help me search the internet. Oh, wait, nevermind!) For command-line options, it seems to be “-ftz” on Linux/Mac or “/Qftz” on Windows. To control these modes dynamically in C++ code, you need to modify the MXCSR x86-64 CPU control register at runtime to set (or clear) the bits corresponding to FTZ and DAZ.
Some of the primitives available to do so via GCC intrinsics include:
- __builtin_ia32_ldmxcsr
- __builtin_ia32_stmxcsr
- _mm_getcsr
- _mm_setcsr
In MSVS, there are preprocessor macros for FTZ in <xmmintrin.h> and for DAZ in <pmmintrin.h> header files. These control the FTZ and DAZ bits in the MXCSR, which is a CPU register with flags to control the CPU and the FPU. The C++ snippet to enable these modes looks like:

#include <xmmintrin.h>
#include <pmmintrin.h>

void aussie_float_enable_FTZ_DAZ(bool ftz, bool daz)
{
    if (ftz) {  // FTZ mode
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    }
    else {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    }
    if (daz) {  // DAZ mode
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }
    else {
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);
    }
}
These intrinsics for FTZ and DAZ are dynamic C++ calls. You can also disable these modes in C++, or switch back-and-forth between them dynamically. The MXCSR values are per-thread, so these modes must be set at the start of every new thread.
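For example, a worker thread might enable both modes as its first action; a small sketch using the aussie_float_enable_FTZ_DAZ function above:

#include <thread>

void aussie_inference_worker()
{
    aussie_float_enable_FTZ_DAZ(true, true);   // Per-thread MXCSR setting
    // ... run this thread's share of the tensor computations ...
}

int main()
{
    std::thread t(aussie_inference_worker);    // Each new thread sets its own modes
    t.join();
}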
Negative Zero
Floating-point representations have two zeros: positive zero (the usual “0.0f” one) and negative zero (“-0.0f”). Note that there's no negative zero in integers, but only in floating-point types, because integers use two's complement in C++. Usually, you don't have to worry about negative zero float values, because all of the floating-point operations treat zero and negative zero as equal. Negative zero is not less than positive zero, but is equal instead. For example, the “==” and “!=” operators should correctly handle both zeros as the same, and testing “f==0.0f” will succeed for zero and negative zero. Normal C++ operations on float types will automatically handle negative zero for you, such as “<” treating the two zeros as equal, not less-than. This happens at the cost of some inefficiency.
Detecting Negative Zero. Testing for negative zero is not easy. Unfortunately, you cannot use the std::fpclassify function because it returns FP_ZERO for both positive and negative zero. Here are some fast macros for 32-bit floats that look at the bits by pretending it's an unsigned 32-bit integer:

#define AUSSIE_FLOAT_TO_UINT(f) (*(unsigned int*)&f)
#define AUSSIE_FLOAT_IS_POSITIVE_ZERO(f) \
    ((AUSSIE_FLOAT_TO_UINT(f)) == 0)         // All 0s
#define AUSSIE_FLOAT_IS_NEGATIVE_ZERO(f) \
    ((AUSSIE_FLOAT_TO_UINT(f)) == (1u<<31))  // Sign bit only
Note that these macros only work for float variables, not constants, because the address-of “&” operator gets a compilation error for floating-point constants (e.g. 0.0f or -0.0f). Also, these only work for 32-bit float types, and comparable macros are needed for 64-bit double or 128-bit long double types.
Pitfall: Bitwise tricks on negative zero. There are some pitfalls with negative zero if you are trying to subvert the normal floating-point number representations and do bitwise operations on them (as I just did above!).
For example, if you're doing bitwise tests on a float, you may still need to test for two values of zero, such as using one or both of the above zero testing macros. For magnitude comparisons of float types via their underlying bits, there's also a problem. Whereas positive zero is all-bits-zero and will equal integer zero or unsigned integer zero, negative zero has the uppermost bit set (the sign bit), so it will be a negative integer or a very large unsigned number. Hence, negative zero will sort as less than positive zero if using signed integer tests, or will sort as massively greater than many numbers if using unsigned integers for testing.
The problem with negative zero also means that doing any bitwise comparisons will fail. You cannot just compare the underlying integers for equality against each other, nor can you use byte-wise testing. For example, using memcmp for equality testing a float vector will occasionally fail for float values where positive zero compares against negative zero, leading to insidious bugs.
Optimization by Suppressing Negative Zero. Since negative zero introduces an inefficiency into basic float operations (e.g. == or != with 0.0), can we block it for a speedup? Are there any settings that fix the CPU or the compiler to ignore negative zero? The FTZ and DAZ modes are mainly for subnormal numbers, not negative zero. I'm not aware of any hardware CPU modes specifically for disallowing negative zeros, and I wonder whether they would actually be a de-optimization anyway, by forcing the FPU to explicitly check for negative zeros. Apparently, FTZ might help avoid negative zero in some computations, but I'm not sure it covers 100% of cases. There is a GCC flag “-ffast-math” (which includes “-fno-signed-zeros”) that lets the compiler assume the sign of zero doesn't matter in software.
AI Engines and Negative Zero. Can we speed up the floating-point computations of our AI engine by blocking all floating-point negative zeros? Then the FPU or GPU can assume there's only one type of zero, and run faster.
One situation with a lot of zeros is the “sparsity” methods from unstructured pruning (i.e. magnitude pruning or movement pruning), although hopefully these would be pruned to the normal zero, rather than negative zero. However, if a pruned model is doing any “zero skipping” routine that tests “f==0.0f” on a weight, there's an unnecessary hidden test for negative zero. We could either run in a negative-zero-disabled mode, or use our own bitwise test for floating-point zero as all-bits-zero (i.e. using the unsigned integer trick).
Another point about negative zero in AI engines is that weights are static, so you can certainly pre-process the model file to ensure that none of the weights are negative zero. Then, during runtime inference, you can assume that all float weight values are free of negative zero. Alternatively, the training method can avoid negative zeros in its computations by running in such a mode, or it could explicitly check for negative zero as a final safety check.
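Here's a minimal sketch of that pre-processing pass over a weight array (the function name is hypothetical). The comparison matches both zeros, and re-assigning +0.0f stores the canonical positive zero:

void aussie_scrub_negative_zeros(float weights[], int n)
{
    for (int i = 0; i < n; i++) {
        if (weights[i] == 0.0f) {   // True for both +0.0f and -0.0f
            weights[i] = 0.0f;      // Overwrite with canonical positive zero
        }
    }
}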
What about zero values at runtime? The main float computation is the vector of logits, which represents probabilities of words (conceptually). Can we guarantee that it never contains a negative zero, and thereby speed up analysis? Can we ensure that vector dot products never compute negative zero? Or is there a way to have the RELU activation function fix any negative zero values that are computed? These optimizations are examined in later chapters.
Getting to the Bits in C++
The basic 32-bit floating-point number in C++ is a float with a size of 4 bytes. How can you manipulate the bits in a floating-point value, using the 32-bit float type? You cannot use any of the C++ bitwise operators on floating-point numbers, as they only work for integers. The trick is to convert it to an unsigned integer (32-bit) with the same bits, and then use the integer bitwise operations. The obvious way to convert a float to unsigned is casting:

float f = 3.14f;
unsigned int u = (unsigned)f;   // Fail!
Nope. That doesn't get to the bits, because it does a proper conversion between floating-point numbers and integers, which is usually what you want when you aren't thinking about bits (i.e. all normal people).
To get to the bits in C++, we have to trick the compiler into thinking that it's already got an unsigned integer with pointer type casts:
unsigned int u = *(unsigned int*)(&f); // Tricky!
That's a bit old-school for type casting. Here's the modern way with reinterpret_cast:

unsigned int u = *reinterpret_cast<unsigned int*>(&f);
Once we have the bits, then we can twiddle the bits of our unsigned integer to our heart's content. When we're finished, we can do the same trick in reverse to re-create a floating-point number:
f = *(float *)(&u);                  // Floating again...
f = *reinterpret_cast<float*>(&u);   // Trendy version
And here's a timely reminder that it's important to use an “unsigned” type in C++ for the bit faking code, because the “>>” right-shift operator has implementation-defined behavior on negative values.
Other Methods: Type casts aren't the only way in C++. There's also a trick involving “union” structures, and you can also directly copy the bits to a differently typed variable using “memcpy” or “bcopy”. It seems to me that this type cast trick should be the fastest way, because a good compiler should convert the address-of, reinterpret_cast and indirection sequence into a simple variable copy, especially with the “reinterpret_cast” hint. However, I haven't actually benchmarked the speed of the different methods.
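For completeness, here are sketches of the memcpy method, plus C++20's std::bit_cast, which is the officially blessed way to do this without aliasing concerns:

#include <cstring>
#include <bit>   // std::bit_cast (C++20)

unsigned int float_bits_memcpy(float f)
{
    unsigned int u = 0;
    std::memcpy(&u, &f, sizeof u);   // Compilers optimize this to a simple register move
    return u;
}

unsigned int float_bits_bitcast(float f)
{
    return std::bit_cast<unsigned int>(f);   // C++20; no pointer tricks needed
}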
Pitfalls and Portability
Bitwise manipulation of float data is not the most portable code in the world. Let's examine some of the possible pitfalls in using these techniques.
Bitwise zero testing: If you've gone to the trouble to access the bits of a floating-point number, you might as well use them. Obviously, testing for “0.0” is a common requirement, so let's make it faster:

#define FLOAT_IS_ZERO(f) \
    ((*reinterpret_cast<unsigned int*>(&f)) == 0u)   // Bug!
Oops! We forgot about negative zero. There are two zeros in floating-point, depending on the sign bit, and it's hard to test it efficiently with bitwise operations (e.g. mask the sign bit or shift left first).
Strict aliasing rule. An important point about all this is that most of it is platform-dependent, and officially “undefined behavior”. Some of it is standardized by IEEE 754, but many variations are possible. Another issue is that there's a “strict aliasing rule” that makes many of these tricks officially non-standard methods. Accessing a floating-point number as if it's an unsigned number is a technical violation of this rule. The “reinterpret_cast” method is probably less likely to run afoul of this problem, but it's still not guaranteed. Anyway, the union trick and the use of memcpy don't really strike me as being particularly more portable, although memcpy might be less likely to be optimized wrongly by a compiler making wrong assumptions. Some additional risk mitigations are warranted, such as adding a lot of unit tests of even the most basic arithmetic operations. However, you're still not officially covered against an over-zealous optimizer that might rely on there being no aliases allowed.
Byte sizes. Another much simpler portability issue is checking the byte sizes of data types, which can vary across platforms. Most of this bit-fiddling stuff relies on particular 16-bit and 32-bit layouts. It doesn't hurt to add some self-tests to your code so you don't get bitten on a different platform, or even by a different set of compiler options:
yassert(sizeof(int) == 4);
yassert(sizeof(short int) == 2);
yassert(sizeof(float) == 4);
yassert(sizeof(unsigned int) == 4);
Also note that for this to work well, both types must be the same size. So, this would be a useful code portability check if it worked:
#if sizeof(float) != sizeof(unsigned int)   // Fails!
#error Big blue bug
#endif
This macro preprocessor trick doesn't work because sizeof isn't allowed in a preprocessor expression, because the preprocessing phase precedes the syntax analysis. A better version uses a “static_assert” statement, which does compile-time checking in a more powerful way.
static_assert(sizeof(float) == sizeof(unsigned), "Bug!");
Floating-Point Builtin Functions
The alternative to directly accessing the bits as an unsigned integer is to use the existing C++ functions. There are various existing functions for bitwise manipulation of floating-point numbers, in two categories: standard C++ library functions and compiler-specific intrinsics.
C++ has standard functions for the manipulation of floating-point numbers, and their bitwise representations.
- std::signbit — Portably test the sign bit of a floating-point number.
- std::copysign — Portably copies the sign bit from one float, merging it with the value of another (i.e. the other's exponent and mantissa).
There are also various compiler-specific “intrinsics” or “builtins” to manipulate floating-point numbers. For Microsoft Visual Studio C++, these are in <intrin.h>, and there are also versions for GCC and other compilers.
- frexp — Get the mantissa and exponent.
- ldexp — Bitshifting by an integer shift-count.
- scalbn — Also an integer bitshift on a float.
- logb — Extracts the exponent.
- ilogb — Extracts the exponent to an integer.
- modf — Splits into whole and fractional parts.
- fma — Fused multiply-add on float (Microsoft intrinsic).
- remainder — Get the fractional part of a floating-point number (Microsoft intrinsic).
- _fcvt — Low-level convert float to string (Microsoft intrinsic).
For many of the listed functions, there are additional versions for different floating-point data types, such as float, double and long double. For example, “frexp” will split a double type into its significand (fractional part) and exponent integer, but there's also “frexpf” for 32-bit float types, and “frexpl” for long double types.
Floating-Point Bit Tricks for AI
Once you've got the bits into an unsigned integer, what can you do?
Assuming you're willing to throw the standards documents to the curb, you can do quite a lot. The bits can be directly manipulated in non-obvious ways to speed up some types of floating-point arithmetic with integer bitwise arithmetic on the underlying bits. Examples of floating-point bit manipulations used to optimize neural networks include:
- Sign bit flipping: this can be used for fast non-multiplication binarized networks with floating-point computations.
- Exponent bit manipulations: bitshifting float values in logarithmic quantization can be implemented as integer addition on the exponent bits of a float.
- Add-as-integer networks: This method simply adds the underlying bit representations together as integers, to create a type of multiplication-free neural network. Weirdly, this simple trick implements an approximate multiplication algorithm known as Mitchell's algorithm.
- Fast log2 computation on float types using the exponent bits directly.
The first step is to extract the bit patterns. Let's assume it's a standard 32-bit float type with 1 sign bit, 8 exponent bits, and 23 stored mantissa bits. You can get the different bits:
int signbit  = (u >> 31);
int exponent = ((u >> 23) & 255);     // Fail!
int mantissa = (u & ((1 << 23) - 1));
Nice try, but that's only 2 out of 3. The exponent is wrong here! The bits are correct, but it's not the right number. We have to subtract the “offset” (or “bias”) of the exponent, which is 127 for an 8-bit exponent. This is correct:
int exponent = ( (u >> 23) & 255 ) - 127; // Correct!
Note that the sign bit and mantissa can be stored as unsigned (i.e. positive or zero), but the exponent must be a signed integer, even though it is extracted from the bits of an unsigned int. For a fraction like decimal 0.25 (i.e. a quarter), this is equal to 2^-2, so the exponent is -2. With an 8-bit exponent and offset 127, the range of the exponent is -127 to +128. Note that the sign bit in a float specifies the overall sign of the whole number, and is not the sign of the exponent.
Here are some macro versions of the above bit extractions:
#define AUSSIE_FLOAT_SIGN(f) \
    ((*(unsigned *)&(f)) >> 31u)   // Leftmost bit
#define AUSSIE_FLOAT_EXPONENT(f) \
    ((int)(((((*(unsigned*)&(f))) >> 23u) & 255) - 127))
#define AUSSIE_FLOAT_MANTISSA(f) \
    ((*(unsigned*)&(f)) & 0x007fffffu)   // Right 23 bits
Note that these macros don't work for constants, but give a compilation error such as “l-value required”. This is because the “&” address-of operator trick being used needs a variable, not a constant. I don't see an easy way around it for bitwise trickery. If you dislike bits for some strange reason, here's a simple way to define the sign bit macro using the “<” operator, which also works on constants:
#define AUSSIE_FLOAT_SIGN(f) ( (f) < 0.0f) // Sign test
Example: Add-as-int Approximate Multiply
The add-as-integer method suggested by Mogami (2020) simply adds the integer bit representations of two floating-point variables, as if they are integers. It's quite surprising that this has any useful meaning, but it's actually a type of approximate multiplication called Mitchell's algorithm. Here's what the C++ code looks like on 32-bit float types:
float aussie_add_as_int_mogami(float f1, float f2)
{
    // Add as integer, Mogami (2020)
    int c = *(int*)&(f1) + *(int*)&(f2) - 0x3f800000;
    return *(float*)&c;
}
The magic number 0x3f800000 is (obviously) equal to “127<<23” and its purpose is to fix up the offset of the exponent. Otherwise, there are two offsets of 127 combined. (Is there a faster way? It's annoying to waste a whole addition operation on what's just an adjustment.)
Note that this algorithm is one exceptional case where we don't want to use unsigned integer types when tweaking bit representations. This trick needs the temporary variable of type “int” and the pointers to be “int*” so that it can correctly handle the sign bits of the two floating-point numbers. This add-as-integer algorithm is not restricted to 32-bit float data. It should also work for 16-bit floating-point numbers in both float16 and bfloat16 formats, provided the magic number is changed to a different bitshift count and an offset of 15 (not 127) for 5-bit exponents.
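As a quick sanity check (a sketch using the aussie_add_as_int_mogami function above): multiplying 3.0 by 2.0 this way is exact, because Mitchell's approximation is exact whenever one operand's mantissa fraction is zero, while other operand pairs can be off by up to roughly 11%:

#include <cstdio>

int main()
{
    float x = aussie_add_as_int_mogami(3.0f, 2.0f);
    std::printf("3.0 * 2.0 ~= %g\n", x);   // Prints 6 (exact for this pair)
    float y = aussie_add_as_int_mogami(1.5f, 1.5f);
    std::printf("1.5 * 1.5 ~= %g\n", y);   // Prints 2 (true answer is 2.25)
}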
Example: Float Bitshift via Integer Addition
This is another surprising bitwise trick on floating-point numbers. You cannot use the standard bitshift operators on float types in C++, so you cannot easily speed up floating-point multiplication via bitshifts in the same way as for integers. Bitshifts are a fast way of doing an integer multiplication by a power-of-two (e.g. “x<<1” is the same as “x*2”). Note that it also doesn't work to convert the float to its unsigned int bit version and shift it using integer bitshift operators.
On some platforms, there are some builtin special functions such as ldexp and scalbn for doing bitshifting on float data. The ldexp function accepts an integer power, and then bitshifts a floating-point number by this many places. The ldexp function is for double types, ldexpf is for float, and ldexpl is for long double types. The scalbn set of functions appears to be almost identical to the ldexp functions. There is also a reverse function “frexp” which extracts the significand (fraction) and the power-of-two for a floating-point argument.
Although we can't bitshift floating-point values, there is an intriguing alternative optimization using integer arithmetic directly: addition. The suggestion in the DenseShift paper (Li et al., 2023) is to simply add the shift count to the exponent bits using integer addition.
Here's some example C++ code that works for 32-bit floating-point numbers:
float aussie_float_bitshift_add_int(float f1, int bits)
{
    // Bitshift float by adding int to exponent bits
    // FP32 = 1 sign bit, 8 exponent, 23 mantissa
    unsigned int u = *(unsigned int*)&f1;  // Get the bits
    if (u == 0) return f1;                 // Special case, don't change
    u += (bits << 23);                     // Add shift count to exponent
    return *(float*)&u;                    // Convert back to float
}
How does it work? Well, it makes a certain kind of sense. The exponent in a floating-point representation is a power-of-two, and we are bitshifting, which is increasing the number by a power-of-two. Hence, we can increase the power-of-two by adding 1 to the exponent, and it also works for adding numbers more than 1.
Note that this code also works for bitshift of a negative count (e.g. bitshift of -1 subtracts from the exponent and thereby halves the number) or zero (unchanged).
However, this exponent-addition trick can fail if the resulting number overflows or underflows the exponent range (roughly -127 to +128 for FP32). This method has thereby improved the performance of floating-point multiplication by changing it to integer addition. The idea works provided we are multiplying by a power-of-two, which is done in logarithmic quantization. However, it's a little tricky in that special formats like zero (and NaN) are problematic for this algorithm. I had to add the test “u==0” which slows things down (maybe there's a better way?). Also, this approach can theoretically overflow the exponent bits, messing up the sign bit, but that's only if the float is very big or very tiny. Checking for all these wrinkles will slow down the code.
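For instance, here's how the bitshift-by-addition function above can multiply or divide by powers of two (a small sketch):

#include <cstdio>

int main()
{
    float w = 3.0f;
    std::printf("%g\n", aussie_float_bitshift_add_int(w, 1));    // 6 (times 2^1)
    std::printf("%g\n", aussie_float_bitshift_add_int(w, 3));    // 24 (times 2^3)
    std::printf("%g\n", aussie_float_bitshift_add_int(w, -1));   // 1.5 (divide by 2)
    std::printf("%g\n", aussie_float_bitshift_add_int(0.0f, 5)); // 0 (special case)
}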
Example: Log2 of Floating-Point is the Exponent
The log2 function for float types is a non-linear function that is quite expensive to compute. We already computed log2 of an integer with low-level bit fiddling methods based on a count-leading-zeros algorithm in the previous chapter. There's also a different bitwise trick for log2 of floating-point numbers. This method computes the truncated integer version of the log2 algorithm (e.g. for use in logarithmic power-of-two quantization). There's a very easy way: the base-2 logarithm is the exponent! It's sitting right there, already calculated, hidden in plain sight amongst the 32 bits of your friendly float variables. Here's some C++ code to extract it:
int ilog2_exponent(float f)   // Log2 for 32-bit float
{
    unsigned int u = *(unsigned int*)&f;
    int iexp = ((u >> 23) & 255);   // 8-bit exponent
    iexp -= 127;                    // Remove the "offset"
    return iexp;
}
Alternatively, for greater portability and probably extra speed, too, there are some standardized builtin C++ functions available across various platforms (including Linux and Microsoft) that can extract the exponent: frexp, ldexp, ilogb, and scalbn are some that come to mind.
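For example, std::ilogb returns the same truncated base-2 logarithm portably (a small sketch):

#include <cmath>
#include <cstdio>

int main()
{
    std::printf("%d\n", std::ilogb(1024.0f));   // 10
    std::printf("%d\n", std::ilogb(6.0f));      // 2 (6 = 1.5 * 2^2)
    std::printf("%d\n", std::ilogb(0.25f));     // -2
}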
References on Floating-Point
- Eric Sakk (2018), Understanding Floating-Point Numbers, Concepts in Computer Systems (Volume 2), 7 June 2018, https://www.amazon.com/dp/1983093025/
- Sean Eron Anderson (2005), Bit Twiddling Hacks, Stanford University, https://graphics.stanford.edu/~seander/bithacks.html
- T. Mogami (2020), Deep neural network training without multiplications, In Beyond BackPropagation WS at 34th Conference on Neural Information Processing Systems, 2020, https://arxiv.org/abs/2012.03458 (Uses integer addition of the bits of an IEEE 754 floating-point representation to perform approximate floating-point multiplication.)
- Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé, Serge Torres (2018), Handbook of Floating-Point Arithmetic, Birkhäuser, 2018, https://link.springer.com/book/10.1007/978-3-319-76526-6, Contents: https://cds.cern.ch/record/1315760/files/9780817647049_TOC.pdf
- Wonyeol Lee, Rahul Sharma, Alex Aiken (2016), Verifying Bit-Manipulations of Floating-Point, Stanford University, USA, https://theory.stanford.edu/~aiken/publications/papers/pldi16b.pdf
- Xinlin Li, Bang Liu, Rui Heng Yang, Vanessa Courville, Chao Xing, Vahid Partovi Nia (2023), DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization, Oct 2023, https://arxiv.org/abs/2208.09708 (Uses integer addition on the sign and exponent bits of IEEE 754 floating-point to perform bitshifts on floats to perform power-of-two number quantization on 32-bit floats.)
- Mostafa Elhoushi, Zihao Chen, Farhan Shafiq, Ye Henry Tian, Joey Yiwei Li (2021), DeepShift: Towards Multiplication-Less Neural Networks, July 2021, https://arxiv.org/abs/1905.13298 (Bitwise shifting and sign bit manipulation.)