Aussie AI

Floating Point Bits

  • Last Updated 29 November, 2024
  • by David Spuler, Ph.D.

Floating point numbers are typically stored in 32 bits for single-precision (e.g. the C++ "float" type), and behind the scenes it's really just 32 bits that can be viewed as a 32-bit integer. The main floating point types are:

  • Single-precision floating point (FP32): 32-bits (e.g. C++ "float")
  • Double-precision floating point (FP64): 64-bits (e.g. C++ "double" type)
  • Half-precision IEEE type (FP16): 16-bits (if only there was a "short float" type in C++!)
  • Half-precision Bfloat16 type ("Brain float 16"): 16-bits (a non-IEEE version from Google Brain)

And there's some less common ones:

  • Quarter-precision type (FP8): 8-bit floating point
  • Quadruple-precision type (FP128): 128-bit massive floating point.

And then things get tricky, because it's not always clear which is the best to use. FP32 is the most common size used in Transformer inference, but FP16 is a good choice for quantization of models (compressed to half the size and retains good accuracy). FP8 is mainly seen in research papers, and hasn't really caught on for quantization (8-bit integer quantization is typically used instead). The biggest sizes FP64 and FP128 aren't really needed, so their significant extra cost in speed and size isn't worthwhile for a small perplexity gain in most use cases.

Even in the choice between 32-bit and 16-bit floating point there are many wrinkles. Some hardware accelerators support different formats and sizes for their parallel operations. And there are various software problems with portably coding 16-bit floating-point data types in C++, along with variable hardware support for 16-bit operations across platforms.

Bit Representations of Floating Point Numbers

Standardized bit patterns are used to represent floating-point numbers in a kind of binary scientific notation. There's 1 bit for the sign, indicating whether the whole number is positive or negative. The remaining bits are split up between the "power" or "exponent", and the "digits", also called the "mantissa", "significand", or "fraction". There are various schemes for this trade-off between exponent and mantissa bits (see below).

How about an example to make it clear as mud? If it were stored in base 10, the number 1234 would have a 0 for the sign bit, a "3" in the exponent, and "1234" in the mantissa, representing +1.234x10^3 (which hopefully equals 1234). But it's not decimal: it's actually stored in binary, in a kind of base-2 scientific numbering scheme. So conceptually, the exponent for 1234 stores the largest power-of-2 that fits, which is 1024=2^10, so the exponent has to store "10" (1010 in binary). And the mantissa stores whatever the heck 1234/1024 is when you represent that in binary 0's and 1's, with the point removed (the point is implicitly "floating", you see?).

It's more complicated than this, of course. That's what standards are for! The exponent bits are actually stored with an "offset" number (also called a "bias"), which differs according to the size of the exponent field. And there are also some special bit patterns for particular numbers, such as zero or "NaN" (not-a-number).

Don't you wish someone could go back in time and invent a base-10 computer?

Common Floating-Point Types

There are a few common representations of floating point numbers in different numbers of bits. Some of them are:

  • Standard FP32 (IEEE754). Usually a "float" in C++, or "single precision" number. Standard 32-bit floating point is represented in binary as: 1 sign bit, 8 exponent bits, and 23 mantissa bits (plus an implied prefix '1' mantissa bit that isn't actually stored, so it's really 24 bits of mantissa values). The exponent is stored with offset 127.
  • Half-precision (FP16). This is a 16-bit floating point number, also sometimes called "float16". Annoyingly, there is no standard "short float" or other widely used predefined type in C++, although the C++23 standard adds one, so this may be changing soon. The most common IEEE754-standardized version of the FP16 type uses 1 sign bit, 5 exponent bits, and 10 stored mantissa bits (plus the implicit mantissa bit makes 11 bits). The exponent is stored with offset 15.
  • Bfloat16 (brain float 16): This is a different 16-bit floating-point numeric format, originally proposed by the Google Brain division. Bfloat16 has 1 sign bit, 8 exponent bits with offset 127 (like FP32), and 8 mantissa bits (7 stored, 1 implicit). It is like FP32 but with the two lowermost bytes simply thrown away, so conversion between bfloat16 and FP32 is simple.

Conversion: Getting to the Bits in C++

The basic 32-bit floating point number is a "float" in C++, with a size of 4 bytes. How can you manipulate the bits in a floating point value, using the 32-bit "float" type? You cannot use any of the C++ bitwise operators on floating point numbers, as they only work for integers.

The trick is to convert it to an unsigned integer (32-bit) with the same bits, and then use the integer bitwise operations. The obvious way to convert a float to unsigned is:

    float f = 3.14;
    unsigned int u = (unsigned)f;  // Fail!

Nope. That doesn't get to the bits, because it does a proper conversion between floating-point numbers and integers, which is usually what you want when you aren't thinking about bits (i.e. all normal people). To get to the bits in C++, we have to trick the compiler into thinking that it's already got an unsigned integer with type casts:

    unsigned int u = *(unsigned int*)(&f);  // Tricky!

That's a bit old-school for type casting. Here's the modern way with reinterpret_cast.

    unsigned int u = * reinterpret_cast<unsigned int*>(&f);  // Fancy!

Once we have the bits, then we can twiddle the bits of our unsigned integer to our heart's content. When we're finished, we can do the same trick in reverse to re-create a floating point number:

    f = *(float *)(&u);   // Floating again...
    f = * reinterpret_cast<float*> (&u);  // Better version

Other Methods: Type casts aren't the only way in C++. There's also a trick involving "union" structures, and you can also directly copy the bits to a differently typed variable using "memcpy" or "bcopy". It seems to me that this type cast trick should be the fastest way, because a good compiler should convert the address-of, reinterpret_cast and indirection sequence into a simple variable copy, especially with the "reinterpret_cast" hint, but I haven't actually checked the speed of the different methods.
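For reference, here's roughly what those two alternatives look like (a sketch only, assuming as elsewhere that "float" and "unsigned int" are both 4 bytes; I haven't benchmarked these against the cast versions):

    #include <cstring>  // for memcpy

    // Union trick: write the float member, then read the unsigned member
    unsigned int float_bits_union(float f)
    {
        union { float f; unsigned int u; } converter;
        converter.f = f;
        return converter.u;
    }

    // memcpy trick: copy the raw bytes into an unsigned int
    unsigned int float_bits_memcpy(float f)
    {
        unsigned int u = 0;
        memcpy(&u, &f, sizeof u);  // Compilers typically optimize this to a single register move
        return u;
    }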

Pitfalls and Portability: Note it's important to use an "unsigned" type in C++ for the bit faking code, because the ">>" right-shift operator has implementation-defined (and historically unportable) behavior on negative signed values.

An important point about all this is that most of it is platform-dependent, and officially "undefined behavior". Some of it is standardized by IEEE 754, but many variations are possible. Another issue is the "strict aliasing rule", which makes many of these tricks officially non-standard: accessing a floating point number as if it were an unsigned integer is a technical violation of this rule. The "reinterpret_cast" method is probably less likely to fall foul of this problem, but it's still not guaranteed. The union trick and the use of memcpy don't strike me as being much more portable either, although memcpy might be less likely to be mis-optimized by a compiler making aliasing assumptions. Some additional risk mitigations are warranted, such as adding a lot of unit tests of even the most basic arithmetic operations. However, you're still not officially covered against an over-zealous optimizer that relies on there being no aliases.

Another much simpler issue is checking the byte sizes of data types, which can vary across platforms. Most of this bit-fiddling stuff relies on particular 16-bit and 32-bit layouts. It doesn't hurt to add some self-tests to your code so you don't get bitten on a different platform, or even by a different set of compiler options:

   yassert(sizeof(int) == 4);
   yassert(sizeof(short int) == 2);
   yassert(sizeof(float) == 4);
   yassert(sizeof(unsigned int) == 4);

Also note that for this to work, both types must be the same size. It's tempting to write a preprocessor check for this:

   #if sizeof(float) != sizeof(unsigned int)
   #error Big blue bug
   #endif

Unfortunately, this old-school macro preprocessor trick doesn't actually work: the preprocessor evaluates "#if" conditions before it knows anything about types, so it cannot compute "sizeof" (the unrecognized identifiers are just treated as zero, or you get an error). The right tool is a "static_assert" statement, which does this compile-time checking in a more powerful way.
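For example, a one-line compile-time check along these lines (the error message text is just illustrative):

    static_assert(sizeof(float) == sizeof(unsigned int),
        "float and unsigned int must be the same size for these bit tricks");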

Extracting Floating-Point Bits

Once you've got the bits into an unsigned integer, what can you do?

The first step is to extract the bit patterns. Let's assume it's a standard 32-bit float type with 1 sign bit, 8 exponent bits, and 23 stored mantissa bits. You can get the different bits:

   int signbit = (u >> 31);                 // Top bit
   int exponent = ( (u >> 23) & 255 );      // Fail!
   int mantissa = ( u & ((1 << 23) - 1) );  // Bottom 23 bits

Nice try, but that's only 2 out of 3. The exponent is wrong here! The bits are correct, but it's not the right number. We have to subtract the "offset" (or "bias") of the exponent, which is 127 for an 8-bit exponent. This is correct:

   int exponent = ( (u >> 23) & 255 ) - 127;  // Correct!

Note that the sign bit and mantissa are unsigned (i.e. positive or zero), but the exponent must be a signed integer, even though it is extracted from the bits of an unsigned int. For a fraction like 0.25, which equals 2^-2, the exponent is -2. With an 8-bit exponent field and an offset of 127, the stored values 0 to 255 give exponents from -127 to +128, although the two extremes are reserved for special cases (zero and denormals at one end, infinity and NaN at the other), so the usable exponent range for normal numbers is -126 to +127. The sign bit specifies the overall sign of the whole number, not the sign of the exponent.

Here are some macro versions of the above bit extractions:

    #define YAPI_FLOAT_SIGN(f)      ( (*(unsigned *)&(f)) >> 31u)   // Leftmost bit
    #define YAPI_FLOAT_EXPONENT(f)  (int)( ((( (*(unsigned*)&(f)) )>> 23) & 255 ) - 127 ) 
    #define YAPI_FLOAT_MANTISSA(f)  ((*(unsigned*)&(f)) & 0x007fffffu)  // Rightmost 23 bits

Note that these macros don't work on constants, but give a compilation error such as "l-value required". This is because the "&" address-of trick needs a variable (an l-value), not a constant, and I don't see an easy way around it.

Here's an even simpler way to define the sign bit macro using only the "<" operator, which also works on constants:

    #define YAPI_FLOAT_SIGN(f)  ( (f) < 0.0f)   // Sign test
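Here's a quick sanity check of these macros on a concrete value (a small sketch, assuming the macro definitions above and <cstdio> are in scope):

    float x = 1234.0f;   // 1234 = 1.205078125 * 2^10
    printf("sign=%d exponent=%d mantissa=0x%06x\n",
        (int)YAPI_FLOAT_SIGN(x), YAPI_FLOAT_EXPONENT(x), YAPI_FLOAT_MANTISSA(x));
    // Prints: sign=0 exponent=10 mantissa=0x1a4000
    // i.e. 1234 = (1 + 1720320/2^23) * 2^10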

Floating Point Intrinsic Functions

Note that there are various standard functions, "intrinsics", and "builtins" to manipulate floating point numbers. Many of them live in the standard <cmath> header, and there are also compiler-specific intrinsics (e.g. in <intrin.h> for Microsoft Visual Studio C++, plus various GCC builtins). For example, "frexp" will split the number into its significand (fractional part) and the exponent integer. There's also "frexpf" for 32-bit floats, and "frexpl" for long double types. Another example is "std::signbit" to test the sign bit portably, and "std::copysign" will copy the sign of one float onto the value of another. There are also the "logb" and "ilogb" functions to extract the exponent. (Some other notable builtins for floating point operations are "ldexp" and "scalbn" for bitshifting of float exponents, and "modf" for splitting whole and fractional parts.)
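Here's a small sketch of a few of these in action (standard <cmath> functions only; the values in the comments are what I'd expect on an IEEE 754 platform):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        float f = 1234.0f;
        int e = 0;
        float frac = std::frexp(f, &e);  // frac in [0.5, 1), with f == frac * 2^e
        printf("%g = %g * 2^%d\n", f, frac, e);                      // 1234 = 0.602539 * 2^11
        printf("ilogb = %d\n", std::ilogb(f));                       // 10 (significand in [1,2))
        printf("signbit(-0.0f) = %d\n", (int)std::signbit(-0.0f));   // 1 (detects negative zero)
        printf("copysign = %g\n", std::copysign(3.0f, -1.0f));       // -3
        return 0;
    }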

Example: FP16 is annoying in C++!

I already mentioned how there's not a standard half-precision type in C++, although that is fixable in the future, once compilers have implemented the C++23 standard. Here are some of the attempts at a 16-bit type:

  • "__fp16": only supported by ARM architecture.
  • "_Float16": not portably supported.
  • "short float": doesn't seem to exist (I'm just wishful-thinking).
  • "std::float16_t": defined in the C++23 standard.
  • "std::bfloat16_t": defined in the C++23 standard.

So, as of writing, if you want to code a 16-bit float in a portable way with C++, there's an ugly hack: store the raw 16 bits in a "short int" (or better, an "unsigned short") and handle the float semantics yourself.

Annoying times double: A less fixable annoyance is that converting between FP32 and FP16 is not easy because their exponent bit sizes are different. So it's fiddly to code, and not very efficient.

The alternative idea is to use bfloat16, which is just the uppermost 2 bytes of FP32, so converting is a simple 16-place bitshift. However, with 8 mantissa bits (7 stored, 1 implicit), that's only about 2-3 decimal digits of precision (because 8/3.3 is about 2.4, and 3.3 is log2(10), in case you're wondering). But it's not much worse than FP16, which gives about 3-4 decimal digits from its 11 binary mantissa bits.
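Here's a minimal sketch of what the bfloat16 conversion looks like (truncation only; production converters usually do round-to-nearest-even, and since there's no standard bfloat16 type yet, this just stores the bits in a plain uint16_t):

    #include <cstdint>
    #include <cstring>

    // Truncate FP32 to bfloat16: keep the top 16 bits (sign, 8 exponent bits, 7 mantissa bits)
    uint16_t fp32_to_bf16_truncate(float f)
    {
        uint32_t u = 0;
        memcpy(&u, &f, sizeof u);
        return (uint16_t)(u >> 16);
    }

    // Convert back: put the 16 bits in the top half, zero the lost 16 mantissa bits
    float bf16_to_fp32(uint16_t b)
    {
        uint32_t u = ((uint32_t)b) << 16;
        float f = 0.0f;
        memcpy(&f, &u, sizeof f);
        return f;
    }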

Uncommon Floating Point Bit Tricks

All floating point numbers are stored as a sequence of bits, and these are processed in standardized ways to do addition or multiplication. But it's all really just a bunch of integers underneath.

Bits are bits. The underlying floating point bits are not magical, although they have clearly defined meanings in fancy IEEE standards documents, and these bits can be directly manipulated in non-obvious ways. Examples of floating point bit manipulations used to optimize neural networks include:

  • Sign bit flipping: this can be used for fast non-multiplication binarized networks with floating point computations.
  • Exponent bit manipulations: bitshifting float values in logarithmic quantization can be implemented as integer addition on the exponent bits of a float.
  • Add-as-integer networks: This method simply adds the underlying bit representations together as integers, to create a type of multiplication-free neural network. Weirdly, this simple trick implements an approximate multiplication algorithm known as Mitchell's algorithm.

Example: Add-as-Integer Approximate Multiply

The add-as-integer method suggested by Mogami (2020) simply adds the integer bit representation of two floating point variables, as if they are integers. It's quite surprising that this has any useful meaning, but it's actually a type of approximate multiplication called Mitchell's algorithm. Here's what the C++ code looks like on 32-bit floats:

    float yapi_approx_multiply_add_as_int_mogami(float f1, float f2)   // Add as integer
    {
        int c = *(int*)&f1 + *(int*)&f2 - 0x3f800000;  // Mogami (2020)
        return *(float*)&c;
    }

The magic number 0x3f800000 is equal to "127<<23", and its purpose is to fix up the "offset" of the exponent. Otherwise, there would be two offsets of 127 combined. (Is there a faster way? It's annoying to waste a whole extra addition operation on what's just an adjustment.)

Note that this algorithm is one exceptional case where we don't want to use unsigned types when tweaking bit representations. This trick needs the temporary variable of type "int" and the pointers to be "int*" so that it can correctly handle the sign bits of the two floating-point numbers.

This add-as-integer algorithm is not restricted to 32-bit floats. It should also work for 16-bit floating point numbers in both FP16 and bfloat16 formats, provided the magic number is adjusted: for FP16 it becomes 15<<10 (5-bit exponent with offset 15, 10 stored mantissa bits), and for bfloat16 it becomes 127<<7 (8-bit exponent with offset 127, 7 stored mantissa bits).

Example: Floating-Point Bitshift via Integer Addition

This is another surprising bitwise trick on floating point numbers. You cannot perform the standard bitshift operators on floats in C++, so you cannot easily speed up floating-point multiplication via bitshifts in the same way as for integers. Bitshifts are a fast way of doing a multiplication by a power-of-two (e.g. "x<<1" is the same as "x*2"). Note that it also doesn't work to convert the float to its unsigned int bit version and shift it.

There are standard library functions for this: "ldexp" accepts an integer power and effectively bitshifts a floating-point number by that many places (i.e. scales it by a power-of-two). The ldexp function is for double types, ldexpf is for float, and ldexpl is for long double types. The "scalbn" set of functions is almost identical to the "ldexp" functions. There is also a reverse function, "frexp", which extracts the significand (fraction) and the power-of-two exponent from a floating-point argument. See the list of bitwise builtin functions for more.
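For example, scaling by a power-of-two without touching any bits directly, using only standard <cmath> calls (a small sketch):

    #include <cmath>

    float x = 3.0f;
    float y = std::ldexp(x, 4);     // 3 * 2^4 = 48
    float z = std::scalbn(x, -1);   // 3 * 2^-1 = 1.5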

But there is an intriguing method using integer arithmetic. The suggestion in the DenseShift paper (Li et al., 2023) is to simply add the shift count to the exponent bits using integer addition.

Here's some example C++ code that works for 32-bit floating-point numbers:

    float yapi_float_bitshift_add_integer(float f1, int bitstoshift)
    {
        // Bitshift on 32-bit float by adding integer to exponent bits
        // FP32 = 1 sign bit, 8 exponent bits, 23 mantissa bits
        // NOTE: This can overflow into the sign bit if all 8 exponent bits are '1' (i.e. 255)
        unsigned int u = *(unsigned int*)&f1;  // Get to the bits as an integer
        if (u == 0) return f1;  // Special case (zero), don't change it...
        u += (bitstoshift << 23);  // Add shift count to the exponent bits...
        return *(float*)&u;  // Convert back to float
    }

How does it work? Well, it makes a certain kind of sense. The exponent in a floating-point representation is a power-of-two, and a bitshift scales a number by a power-of-two. Hence, we can double the number by adding 1 to its exponent bits, or multiply by 2^k by adding k. Note that this code also works for a negative shift count (e.g. a shift of -1 subtracts 1 from the exponent and thereby halves the number) and for a zero shift (unchanged).

This code has thereby improved the performance of floating-point multiplication by changing it to integer addition. The idea works provided we are multiplying by a power-of-two, which is done in logarithmic quantization. However, it's a little tricky in that special formats like zero (and NaN) are problematic for this algorithm. I had to add the test "u==0" which slows things down (maybe there's a better way?). Also, this approach can theoretically overflow the exponent bits, messing up the sign bit, but that's only if the float is very big or very tiny. Checking for all these wrinkles will slow down the code.
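A quick usage sketch, assuming the function above (the values are chosen so the results are exact):

    float a = yapi_float_bitshift_add_integer(3.0f, 2);     // 3 * 2^2 = 12
    float b = yapi_float_bitshift_add_integer(12.0f, -1);   // 12 * 2^-1 = 6
    float c = yapi_float_bitshift_add_integer(0.0f, 5);     // 0 (special case, unchanged)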

Example: Log2 of Floating-Point is the Exponent

The log2 function for float types is a non-linear function that is quite expensive to compute. But if you only care about the truncated integer version of the base-2 logarithm, which is exactly what's needed in logarithmic power-of-two quantization, then there's a very easy way. The base-2 logarithm is the exponent! It's sitting right there, already calculated, hidden in plain sight amongst the 32 bits of your friendly float variables.

Here's some C++ code to extract it:

    int ilog2_exponent(float f)  // Log2 for 32-bit float
    {
        unsigned int u = *(unsigned int*)&f;
        int iexponent = ((u >> 23) & 255);  // 8-bit exponent (above 23-bit mantissa)
        iexponent -= 127;  // Remove the "offset"
        return iexponent;
    }

Alternatively, for greater portability and probably extra speed too, there are some standardized builtin C++ functions available across various platforms (including Linux and Microsoft): frexp, ldexp, ilogb, and scalbn are some that come to mind; see the full list of useful bitwise builtin functions.

Fast Log2 of an Integer: The above example code works for float types, but for a faster log2 of integer types, there are various other low-level bit fiddling methods. There's a long history of bitwise intrinsic functions and hardware support for "find first set" ("ffs"), which finds the position of the lowest set bit, and the closely related "clz" (count leading zeros) function. Note that they count from opposite ends: FFS counts from the rightmost (least significant) bit, whereas CLZ counts zero bits from the leftmost (most significant) bit, so an integer log2 is based on CLZ rather than FFS (for a 32-bit value, the truncated log2 is 31 minus the leading zero count). To code this in C++, there's the Microsoft _BitScanReverse and GCC's __builtin_clz to choose from, amongst other builtin bitwise intrinsics. Note that on the x86 architecture, these are implemented via a single instruction, and there are two candidates: BSR (Bit Scan Reverse) or LZCNT (leading zero count).
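Here's roughly what that looks like in code (a sketch only; __builtin_clz is GCC/Clang, _BitScanReverse is MSVC, and both are undefined for a zero input, so the caller must check):

    #include <cstdint>
    #ifdef _MSC_VER
    #include <intrin.h>
    #endif

    // Integer floor(log2(x)) via count-leading-zeros; precondition: x != 0
    int ilog2_int(uint32_t x)
    {
    #ifdef _MSC_VER
        unsigned long index = 0;
        _BitScanReverse(&index, x);    // Index of the highest set bit
        return (int)index;
    #else
        return 31 - __builtin_clz(x);  // GCC/Clang builtin
    #endif
    }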

