Aussie AI

Floating-Point Bit Tricks for AI

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Once you've got the bits into an unsigned integer, what can you do?

Assuming you're willing to throw the standards documents to the curb, you can do quite a lot. The bits can be manipulated directly in non-obvious ways, replacing some types of floating-point arithmetic with faster integer bitwise operations on the underlying representation. Examples of floating-point bit manipulations used to optimize neural networks include:

  • Sign bit flipping: flipping the sign bit enables fast multiplication-free binarized networks that still operate on floating-point values.
  • Exponent bit manipulations: bitshifting float values in logarithmic quantization can be implemented as integer addition on the exponent bits of a float.
  • Add-as-integer networks: this method simply adds the underlying bit representations together as integers, creating a type of multiplication-free neural network. Weirdly, this simple trick implements Mitchell's approximate multiplication algorithm.
  • Fast log2 computation on float types using the exponent bits directly.

The first step is to extract the bit patterns. Let's assume a standard 32-bit float type with 1 sign bit, 8 exponent bits, and 23 stored mantissa bits, with the bits already copied into an unsigned integer u. You can extract the three fields:

   int signbit = (u >> 31);
   int exponent = ( (u >> 23) & 255 );  // Fail!
   int mantissa = ( u & ((1 << 23) - 1 ));

Nice try, but that's only 2 out of 3. The exponent is wrong here! The bits are correct, but it's not the right number. We have to subtract the “offset” (or “bias”) of the exponent, which is 127 for an 8-bit exponent. This is correct:

   int exponent = ( (u >> 23) & 255 ) - 127;  // Correct!

Note that the sign bit and mantissa can be stored as unsigned (i.e. positive or zero), but the exponent must be a signed integer, even though it is extracted from the bits of an unsigned int. For a fraction like decimal 0.25 (i.e. a quarter), this is equal to 2^-2, so the exponent is -2. Subtracting the bias of 127 from the 8 stored bits (0 to 255) gives unbiased exponents from -127 to +128, but the two extreme values are reserved (all-zeros for zero and subnormals, all-ones for infinity and NaN), so normal numbers have exponents from -126 to +127. Note that the sign bit in a float specifies the overall sign of the whole number, and is not the sign of the exponent.

Here are some macro versions of the above bit extractions:

    #define AUSSIE_FLOAT_SIGN(f)     \
      ((*(unsigned *)&(f)) >> 31u)   // Leftmost bit
    #define AUSSIE_FLOAT_EXPONENT(f) \
      ((int)(((((*(unsigned*)&(f)))>> 23u) & 255) - 127)) 
    #define AUSSIE_FLOAT_MANTISSA(f) \
      ((*(unsigned*)&(f)) & 0x007fffffu) // Right 23 bits

Note that these macros don't work for constants, but instead give a compilation error such as “l-value required”. This is because the “&” address-of operator trick needs a variable with an address, not a constant. I don't see an easy way around it for bitwise trickery.

If you dislike bits for some strange reason, here's a simple way to define the sign bit macro using the “<” operator, which also works on constants (though it disagrees with the bitwise version on negative zero, whose sign bit is set even though the comparison is false):

    #define AUSSIE_FLOAT_SIGN(f)  ( (f) < 0.0f)  // Sign test

 
