Aussie AI

Example: Float Bitshift via Integer Addition

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


This is another surprising bitwise trick on floating-point numbers. You cannot apply the standard bitshift operators to float types in C++, so you cannot easily speed up floating-point multiplication via bitshifts in the same way as for integers.

Bitshifts are a fast way of performing an integer multiplication by a power-of-two (e.g. “x<<1” is the same as “x*2”). Note that it also doesn't work to simply convert the float to its unsigned int bit representation and shift that using the integer bitshift operators.

On some platforms, there are built-in functions such as ldexp and scalbn for doing bitshift-like scaling on float data. The ldexp function accepts an integer power, and bitshifts a floating-point number by that many places. The ldexp function is for double types, ldexpf is for float, and ldexpl is for long double types. The scalbn set of functions is almost identical to the ldexp functions (scalbn scales by powers of FLT_RADIX, which is 2 on standard binary platforms). There is also a reverse function “frexp” which extracts the significand (fraction) and the power-of-two from a floating-point argument.

Although we can't bitshift floating-point values, there is an intriguing alternative optimization using integer arithmetic directly: addition. The suggestion in the DenseShift paper (Li et al., 2023) is to simply add the shift count to the exponent bits using integer addition.

Here's some example C++ code that works for 32-bit floating-point numbers:

    #include <cstring>  // for memcpy

    float aussie_float_bitshift_add_int(float f1, int bits)
    {
        // Bitshift float by adding int to exponent bits
        // FP32 = 1 sign bit, 8 exponent bits, 23 mantissa bits
        unsigned int u;
        memcpy(&u, &f1, sizeof u);  // Get the bits (avoids strict-aliasing issues)
        if (u == 0) return f1;  // Special case zero: don't change
        u += (unsigned int)bits << 23;  // Add shift count to exponent
                                        // (unsigned shift avoids UB for negative counts)
        memcpy(&f1, &u, sizeof f1); // Convert back to float
        return f1;
    }

How does it work? Well, it makes a certain kind of sense. The exponent in a floating-point representation is a power-of-two, and bitshifting multiplies the number by a power-of-two. Hence, we can double the number by adding 1 to the exponent, and the same logic works for adding counts greater than 1.

Note that this code also works for a negative bitshift count (e.g. a bitshift of -1 subtracts 1 from the exponent and thereby halves the number) or a zero count (unchanged). However, this exponent-addition trick can fail if the result overflows or underflows the exponent range of normal FP32 numbers (about -126 to +127).

This method thereby improves the performance of floating-point multiplication by changing it to integer addition. The idea works provided we are multiplying by a power-of-two, which is exactly what logarithmic quantization requires. However, it's a little tricky in that special values like zero (and NaN or Inf) are problematic for this algorithm. I had to add the test “u==0”, which slows things down (maybe there's a better way?). Also, this approach can theoretically overflow the exponent bits into the sign bit, but only if the float is very big or very tiny. Checking for all these wrinkles will slow down the code.
