Aussie AI
Example: Float Bitshift via Integer Addition
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
This is another surprising bitwise trick on floating-point numbers. You cannot apply the standard bitshift operators to float types in C++, so you cannot easily speed up floating-point multiplication via bitshifts the way you can for integers. Bitshifts are a fast way of doing an integer multiplication by a power-of-two (e.g. “x<<1” is the same as “x*2”). Note that it also doesn't work to convert the float to its unsigned int bit representation and shift it using the integer bitshift operators.
On some platforms, there are builtin special functions such as ldexp and scalbn for doing bitshifting on float data. The ldexp function accepts an integer power, and then bitshifts a floating-point number by this many places. The ldexp function is for double types, ldexpf is for float, and ldexpl is for long double types. The scalbn set of functions is almost identical to the ldexp functions.
There is also a reverse function, “frexp”, which extracts the significand (fraction) and the power-of-two exponent of a floating-point argument.
Although we can't bitshift floating-pointer values, there is an intriguing alternative optimization using integer arithmetic directly: addition. The suggestion in the DenseShift paper (Li et al., 2023) is to simply add the shift count to the exponent bits using integer addition.
Here's some example C++ code that works for 32-bit floating-point numbers:
float aussie_float_bitshift_add_int(float f1, int bits)
{
    // Bitshift float by adding int to exponent bits
    // FP32 = 1 sign bit, 8 exponent bits, 23 mantissa bits
    unsigned int u = *(unsigned int*)&f1;  // Get the bits
    if (u == 0) return f1;  // Special case zero, don't change
    u += (bits << 23);      // Add shift count to exponent
    return *(float*)&u;     // Convert back to float
}
How does it work? Well, it makes a certain kind of sense. The exponent in a floating-point representation is a power-of-two, and a bitshift by one place scales a number by two. Hence, we can double the number by adding 1 to the exponent bits, and adding n multiplies the number by 2^n.
Note that this code also works for a bitshift by a negative count (e.g. a bitshift of -1 subtracts from the exponent and thereby halves the number) or by zero (unchanged).
However, this exponent-addition trick can fail if the result overflows or underflows the FP32 exponent range (about -126 to +127 for normalized numbers).
This method has thereby improved the performance of floating-point multiplication by changing it to integer addition. The idea works provided we are multiplying by a power-of-two, which is the case in logarithmic quantization.
However, it's a little tricky in that special values like zero (and NaN) are problematic for this algorithm. I had to add the test “u==0”, which slows things down (maybe there's a better way?). Also, this approach can theoretically overflow the exponent bits, messing up the sign bit, but only if the float is very big or very tiny. Checking for all these wrinkles will slow down the code.