Aussie AI

Floating Point Bits

  • Last Updated 29 November, 2024
  • by David Spuler, Ph.D.

Floating point numbers are typically stored in 32 bits for single-precision (e.g. the C++ "float" type), and behind the scenes it's really just 32 bits that can be viewed as a 32-bit integer. The main floating point types are:

  • Single-precision floating point (FP32): 32-bits (e.g. C++ "float")
  • Double-precision floating point (FP64): 64-bits (e.g. C++ "double" type)
  • Half-precision IEEE type (FP16): 16-bits (if only there was a "short float" type in C++!)
  • Half-precision Bfloat16 type ("Brain float 16"): 16-bits (a non-IEEE version from Google Brain)

And there's some less common ones:

  • Quarter-precision type (FP8): 8-bit floating point
  • Quadruple-precision type (FP128): 128-bit massive floating point.

And then things get tricky, because it's not always clear which is the best to use. FP32 is the most common size used in Transformer inference, but FP16 is a good choice for quantization of models (compressed to half the size and retains good accuracy). FP8 is mainly seen in research papers, and hasn't really caught on for quantization (8-bit integer quantization is typically used instead). The biggest sizes FP64 and FP128 aren't really needed, so their significant extra cost in speed and size isn't worthwhile for a small perplexity gain in most use cases.

Even in the choice between 32-bit and 16-bit floating point there are many wrinkles. Some hardware accelerators support different formats and sizes for their parallel operations. And there are various software problems with portably coding 16-bit floating-point data types in C++, along with variable hardware support for 16-bit operations across platforms.

Bit Representations of Floating Point Numbers

Standardized bit patterns are used to represent floating-point numbers in a kind of binary scientific notation. There's 1 bit for the sign, indicating whether the whole number is positive or negative. The remaining bits are split up between the "power" or "exponent", and the "digits", also called the "mantissa", "significand", or "fraction". There are various schemes for this trade-off between exponent and mantissa bits (see below).

How about an example to make it clear as mud? If it were stored in base 10, the number 1234 would have a 0 for the sign bit, a "3" in the exponent, and "1234" in the mantissa, representing +1.234x10^3 (which hopefully equals 1234). But it's not decimal: it's actually stored in binary, in a kind of base-2 scientific numbering scheme. So conceptually, the exponent for 1234 stores the largest power-of-2 that fits, which is 1024=2^10, so the exponent has to store "10" (1010 in binary). And the mantissa stores whatever the heck 1234/1024 is when you represent that in binary 0's and 1's, with the point removed (the point is implicitly "floating", you see?).

It's more complicated than this, of course. That's what standards are for! The exponent bits are actually stored with an "offset" number (also called a "bias"), which differs according to the size of the exponent field. And there are also some special bit patterns for particular numbers, such as zero or "NaN" (not-a-number).

Don't you wish someone could go back in time and invent a base-10 computer?

Common Floating-Point Types

There are a few common representations of floating point numbers in different numbers of bits. Some of them are:

  • Standard FP32 (IEEE754). Usually a "float" in C++, or "single precision" number. Standard 32-bit floating point is represented in binary as: 1 sign bit, 8 exponent bits, and 23 mantissa bits (plus an implied prefix '1' mantissa bit that isn't actually stored, so it's really 24 bits of mantissa values). The exponent is stored with offset 127.
  • Half-precision (FP16). This is a 16-bit floating point number, also sometimes called "float16". Annoyingly, there is no standard "short float" or other widely used predefined type in C++, although the C++23 standard adds one, so this may be changing soon. The most common IEEE754-standardized version of the FP16 type uses 1 sign bit, 5 exponent bits, and 10 stored mantissa bits (plus the implicit mantissa bit makes 11 bits). The exponent is stored with offset 15.
  • Bfloat16 (brain float 16): This is a different 16-bit floating-point numeric format, originally proposed by the Google Brain division. Bfloat16 has 1 sign bit, 8 exponent bits with offset 127 (like FP32), and 8 mantissa bits (7 stored, 1 implicit). It is like FP32 but with the two lowermost bytes simply thrown away, so conversion between bfloat16 and FP32 is simple.

Conversion: Getting to the Bits in C++

The basic 32-bit floating point number is a "float" in C++, with a size of 4 bytes. How can you manipulate the bits in a floating point value, using the 32-bit "float" type? You cannot use any of the C++ bitwise operators on floating point numbers, as they only work for integers.

The trick is to convert it to an unsigned integer (32-bit) with the same bits, and then use the integer bitwise operations. The obvious way to convert a float to unsigned is:

    float f = 3.14;
    unsigned int u = (unsigned)f;  // Fail!

Nope. That doesn't get to the bits, because it does a proper conversion between floating-point numbers and integers, which is usually what you want when you aren't thinking about bits (i.e. all normal people). To get to the bits in C++, we have to trick the compiler into thinking that it's already got an unsigned integer with type casts:

    unsigned int u = *(unsigned int*)(&f);  // Tricky!

That's a bit old-school for type casting. Here's the modern way with reinterpret_cast.

    unsigned int u = * reinterpret_cast<unsigned int*>(&f);  // Fancy!

Once we have the bits, then we can twiddle the bits of our unsigned integer to our heart's content. When we're finished, we can do the same trick in reverse to re-create a floating point number:

    f = *(float *)(&u);   // Floating again...
    f = * reinterpret_cast<float*> (&u);  // Better version

Other Methods: Type casts aren't the only way in C++. There's also a trick involving "union" structures, and you can also directly copy the bits to a differently typed variable using "memcpy" or "bcopy". It seems to me that this type cast trick should be the fastest way, because a good compiler should convert the address-of, reinterpret_cast and indirection sequence into a simple variable copy, especially with the "reinterpret_cast" hint, but I haven't actually checked the speed of the different methods.
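For reference, here's roughly what those two alternatives look like (a sketch only, assuming as elsewhere that "float" and "unsigned int" are both 4 bytes; I haven't benchmarked these against the cast versions):

    #include <cstring>  // for memcpy

    // Union trick: write the float member, then read the unsigned member
    unsigned int float_bits_union(float f)
    {
        union { float f; unsigned int u; } converter;
        converter.f = f;
        return converter.u;
    }

    // memcpy trick: copy the raw bytes into an unsigned int
    unsigned int float_bits_memcpy(float f)
    {
        unsigned int u = 0;
        memcpy(&u, &f, sizeof u);  // Compilers typically optimize this to a single register move
        return u;
    }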

Pitfalls and Portability: Note it's important to use an "unsigned" type in C++ for the bit faking code, because the ">>" right-shift operator has implementation-defined (and historically unportable) behavior on negative signed values.

An important point about all this is that most of it is platform-dependent, and officially "undefined behavior". Some of it is standardized by IEEE 754, but many variations are possible. Another issue is the "strict aliasing rule", which makes many of these tricks officially non-standard: accessing a floating point number as if it were an unsigned integer is a technical violation of this rule. The "reinterpret_cast" method is probably less likely to fall foul of this problem, but it's still not guaranteed. The union trick and the use of memcpy don't strike me as being much more portable either, although memcpy might be less likely to be mis-optimized by a compiler making aliasing assumptions. Some additional risk mitigations are warranted, such as adding a lot of unit tests of even the most basic arithmetic operations. However, you're still not officially covered against an over-zealous optimizer that relies on there being no aliases.

Another much simpler issue is checking the byte sizes of data types, which can vary across platforms. Most of this bit-fiddling stuff relies on particular 16-bit and 32-bit layouts. It doesn't hurt to add some self-tests to your code so you don't get bitten on a different platform, or even by a different set of compiler options:

   yassert(sizeof(int) == 4);
   yassert(sizeof(short int) == 2);
   yassert(sizeof(float) == 4);
   yassert(sizeof(unsigned int) == 4);

Also note that for this to work, both types must be the same size. It's tempting to write a preprocessor check for this:

   #if sizeof(float) != sizeof(unsigned int)
   #error Big blue bug
   #endif

Unfortunately, this old-school macro preprocessor trick doesn't actually work: the preprocessor evaluates "#if" conditions before it knows anything about types, so it cannot compute "sizeof" (the unrecognized identifiers are just treated as zero, or you get an error). The right tool is a "static_assert" statement, which does this compile-time checking in a more powerful way.
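For example, a one-line compile-time check along these lines (the error message text is just illustrative):

    static_assert(sizeof(float) == sizeof(unsigned int),
        "float and unsigned int must be the same size for these bit tricks");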

Extracting Floating-Point Bits

Once you've got the bits into an unsigned integer, what can you do?

The first step is to extract the bit patterns. Let's assume it's a standard 32-bit float type with 1 sign bit, 8 exponent bits, and 23 stored mantissa bits. You can get the different bits:

   int signbit = (u >> 31);                 // Top bit
   int exponent = ( (u >> 23) & 255 );      // Fail!
   int mantissa = ( u & ((1 << 23) - 1) );  // Bottom 23 bits

Nice try, but that's only 2 out of 3. The exponent is wrong here! The bits are correct, but it's not the right number. We have to subtract the "offset" (or "bias") of the exponent, which is 127 for an 8-bit exponent. This is correct:

   int exponent = ( (u >> 23) & 255 ) - 127;  // Correct!

Note that the sign bit and mantissa are unsigned (i.e. positive or zero), but the exponent must be a signed integer, even though it is extracted from the bits of an unsigned int. For a fraction like 0.25, which equals 2^-2, the exponent is -2. With an 8-bit exponent field and an offset of 127, the stored values 0 to 255 give exponents from -127 to +128, although the two extremes are reserved for special cases (zero and denormals at one end, infinity and NaN at the other), so the usable exponent range for normal numbers is -126 to +127. The sign bit specifies the overall sign of the whole number, not the sign of the exponent.

Here are some macro versions of the above bit extractions:

    #define YAPI_FLOAT_SIGN(f)      ( (*(unsigned *)&(f)) >> 31u)   // Leftmost bit
    #define YAPI_FLOAT_EXPONENT(f)  (int)( ((( (*(unsigned*)&(f)) )>> 23) & 255 ) - 127 ) 
    #define YAPI_FLOAT_MANTISSA(f)  ((*(unsigned*)&(f)) & 0x007fffffu)  // Rightmost 23 bits

Note that these macros don't work on constants, but give a compilation error such as "l-value required". This is because the "&" address-of trick needs a variable (an l-value), not a constant, and I don't see an easy way around it.

Here's an even simpler way to define the sign bit macro using only the "<" operator, which also works on constants:

    #define YAPI_FLOAT_SIGN(f)  ( (f) < 0.0f)   // Sign test
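Here's a quick sanity check of these macros on a concrete value (a small sketch, assuming the macro definitions above and <cstdio> are in scope):

    float x = 1234.0f;   // 1234 = 1.205078125 * 2^10
    printf("sign=%d exponent=%d mantissa=0x%06x\n",
        (int)YAPI_FLOAT_SIGN(x), YAPI_FLOAT_EXPONENT(x), YAPI_FLOAT_MANTISSA(x));
    // Prints: sign=0 exponent=10 mantissa=0x1a4000
    // i.e. 1234 = (1 + 1720320/2^23) * 2^10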

Floating Point Intrinsic Functions

Note that there are various standard functions, "intrinsics", and "builtins" to manipulate floating point numbers. Many of them live in the standard <cmath> header, and there are also compiler-specific intrinsics (e.g. in <intrin.h> for Microsoft Visual Studio C++, plus various GCC builtins). For example, "frexp" will split the number into its significand (fractional part) and the exponent integer. There's also "frexpf" for 32-bit floats, and "frexpl" for long double types. Another example is "std::signbit" to test the sign bit portably, and "std::copysign" will copy the sign of one float onto the value of another. There are also the "logb" and "ilogb" functions to extract the exponent. (Some other notable builtins for floating point operations are "ldexp" and "scalbn" for bitshifting of float exponents, and "modf" for splitting whole and fractional parts.)
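Here's a small sketch of a few of these in action (standard <cmath> functions only; the values in the comments are what I'd expect on an IEEE 754 platform):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        float f = 1234.0f;
        int e = 0;
        float frac = std::frexp(f, &e);  // frac in [0.5, 1), with f == frac * 2^e
        printf("%g = %g * 2^%d\n", f, frac, e);                      // 1234 = 0.602539 * 2^11
        printf("ilogb = %d\n", std::ilogb(f));                       // 10 (significand in [1,2))
        printf("signbit(-0.0f) = %d\n", (int)std::signbit(-0.0f));   // 1 (detects negative zero)
        printf("copysign = %g\n", std::copysign(3.0f, -1.0f));       // -3
        return 0;
    }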

Example: FP16 is annoying in C++!

I already mentioned how there's not a standard half-precision type in C++, although that is fixable in the future, once compilers have implemented the C++23 standard. Here are some of the attempts at a 16-bit type:

  • "__fp16": only supported by ARM architecture.
  • "_Float16": not portably supported.
  • "short float": doesn't seem to exist (I'm just wishful-thinking).
  • "std::float16_t": defined in the C++23 standard.
  • "std::bfloat16_t": defined in the C++23 standard.

So, as of writing, if you want to code a 16-bit float in a portable way with C++, there's an ugly hack: store the raw 16 bits in a "short int" (or better, an "unsigned short") and handle the float semantics yourself.

Annoying times double: A less fixable annoyance is that converting between FP32 and FP16 is not easy because their exponent bit sizes are different. So it's fiddly to code, and not very efficient.

The alternative idea is to use bfloat16, which is just the uppermost 2 bytes of FP32, so converting is a simple 16-place bitshift. However, with 8 mantissa bits (7 stored, 1 implicit), that's only about 2-3 decimal digits of precision (because 8/3.3 is about 2.4, and 3.3 is log2(10), in case you're wondering). But it's not much worse than FP16, which gives about 3-4 decimal digits from its 11 binary mantissa bits.
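Here's a minimal sketch of what the bfloat16 conversion looks like (truncation only; production converters usually do round-to-nearest-even, and since there's no standard bfloat16 type yet, this just stores the bits in a plain uint16_t):

    #include <cstdint>
    #include <cstring>

    // Truncate FP32 to bfloat16: keep the top 16 bits (sign, 8 exponent bits, 7 mantissa bits)
    uint16_t fp32_to_bf16_truncate(float f)
    {
        uint32_t u = 0;
        memcpy(&u, &f, sizeof u);
        return (uint16_t)(u >> 16);
    }

    // Convert back: put the 16 bits in the top half, zero the lost 16 mantissa bits
    float bf16_to_fp32(uint16_t b)
    {
        uint32_t u = ((uint32_t)b) << 16;
        float f = 0.0f;
        memcpy(&f, &u, sizeof f);
        return f;
    }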

Uncommon Floating Point Bit Tricks

All floating point numbers are stored as a sequence of bits, and these are processed in standardized ways to do addition or multiplication. But it's all really just a bunch of integers underneath.

Bits are bits. The underlying floating point bits are not magical, although they have clearly defined meanings in fancy IEEE standards documents, and these bits can be directly manipulated in non-obvious ways. Examples of floating point bit manipulations used to optimize neural networks include:

  • Sign bit flipping: this can be used for fast non-multiplication binarized networks with floating point computations.
  • Exponent bit manipulations: bitshifting float values in logarithmic quantization can be implemented as integer addition on the exponent bits of a float.
  • Add-as-integer networks: This method simply adds the underlying bit representations together as integers, to create a type of multiplication-free neural network. Weirdly, this simple trick implements an approximate multiplication algorithm known as Mitchell's algorithm.

Example: Add-as-Integer Approximate Multiply

The add-as-integer method suggested by Mogami (2020) simply adds the integer bit representation of two floating point variables, as if they are integers. It's quite surprising that this has any useful meaning, but it's actually a type of approximate multiplication called Mitchell's algorithm. Here's what the C++ code looks like on 32-bit floats:

    float yapi_approx_multiply_add_as_int_mogami(float f1, float f2)   // Add as integer
    {
        int c = *(int*)&f1 + *(int*)&f2 - 0x3f800000;  // Mogami (2020)
        return *(float*)&c;
    }

The magic number 0x3f800000 is equal to "127<<23", and its purpose is to fix up the "offset" of the exponent. Otherwise, there would be two offsets of 127 combined. (Is there a faster way? It's annoying to waste a whole extra addition operation on what's just an adjustment.)

Note that this algorithm is one exceptional case where we don't want to use unsigned types when tweaking bit representations. This trick needs the temporary variable of type "int" and the pointers to be "int*" so that it can correctly handle the sign bits of the two floating-point numbers.

This add-as-integer algorithm is not restricted to 32-bit floats. It should also work for 16-bit floating point numbers in both FP16 and bfloat16 formats, provided the magic number is adjusted: for FP16 it becomes 15<<10 (5-bit exponent with offset 15, 10 stored mantissa bits), and for bfloat16 it becomes 127<<7 (8-bit exponent with offset 127, 7 stored mantissa bits).

Example: Floating-Point Bitshift via Integer Addition

This is another surprising bitwise trick on floating point numbers. You cannot perform the standard bitshift operators on floats in C++, so you cannot easily speed up floating-point multiplication via bitshifts in the same way as for integers. Bitshifts are a fast way of doing a multiplication by a power-of-two (e.g. "x<<1" is the same as "x*2"). Note that it also doesn't work to convert the float to its unsigned int bit version and shift it.

There are standard library functions for this: "ldexp" accepts an integer power and effectively bitshifts a floating-point number by that many places (i.e. scales it by a power-of-two). The ldexp function is for double types, ldexpf is for float, and ldexpl is for long double types. The "scalbn" set of functions is almost identical to the "ldexp" functions. There is also a reverse function, "frexp", which extracts the significand (fraction) and the power-of-two exponent from a floating-point argument. See the list of bitwise builtin functions for more.
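For example, scaling by a power-of-two without touching any bits directly, using only standard <cmath> calls (a small sketch):

    #include <cmath>

    float x = 3.0f;
    float y = std::ldexp(x, 4);     // 3 * 2^4 = 48
    float z = std::scalbn(x, -1);   // 3 * 2^-1 = 1.5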

But there is an intriguing method using integer arithmetic. The suggestion in the DenseShift paper (Li et al., 2023) is to simply add the shift count to the exponent bits using integer addition.

Here's some example C++ code that works for 32-bit floating-point numbers:

    float yapi_float_bitshift_add_integer(float f1, int bitstoshift)
    {
        // Bitshift on 32-bit float by adding integer to exponent bits
        // FP32 = 1 sign bit, 8 exponent bits, 23 mantissa bits
        // NOTE: This can overflow into the sign bit if all 8 exponent bits are '1' (i.e. 255)
        unsigned int u = *(unsigned int*)&f1;  // Get to the bits as an integer
        if (u == 0) return f1;  // Special case (zero), don't change it...
        u += (bitstoshift << 23);  // Add shift count to the exponent bits...
        return *(float*)&u;  // Convert back to float
    }

How does it work? Well, it makes a certain kind of sense. The exponent in a floating-point representation is a power-of-two, and a bitshift scales a number by a power-of-two. Hence, we can double the number by adding 1 to its exponent bits, or multiply by 2^k by adding k. Note that this code also works for a negative shift count (e.g. a shift of -1 subtracts 1 from the exponent and thereby halves the number) and for a zero shift (unchanged).

This code has thereby improved the performance of floating-point multiplication by changing it to integer addition. The idea works provided we are multiplying by a power-of-two, which is done in logarithmic quantization. However, it's a little tricky in that special formats like zero (and NaN) are problematic for this algorithm. I had to add the test "u==0" which slows things down (maybe there's a better way?). Also, this approach can theoretically overflow the exponent bits, messing up the sign bit, but that's only if the float is very big or very tiny. Checking for all these wrinkles will slow down the code.
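A quick usage sketch, assuming the function above (the values are chosen so the results are exact):

    float a = yapi_float_bitshift_add_integer(3.0f, 2);     // 3 * 2^2 = 12
    float b = yapi_float_bitshift_add_integer(12.0f, -1);   // 12 * 2^-1 = 6
    float c = yapi_float_bitshift_add_integer(0.0f, 5);     // 0 (special case, unchanged)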

Example: Log2 of Floating-Point is the Exponent

The log2 function for float types is a non-linear function that is quite expensive to compute. But if you only care about the truncated integer version of the base-2 logarithm, which is exactly what's needed in logarithmic power-of-two quantization, then there's a very easy way. The base-2 logarithm is the exponent! It's sitting right there, already calculated, hidden in plain sight amongst the 32 bits of your friendly float variables.

Here's some C++ code to extract it:

    int ilog2_exponent(float f)  // Log2 for 32-bit float
    {
        unsigned int u = *(unsigned int*)&f;
        int iexponent = ((u >> 23) & 255);  // 8-bit exponent (above 23-bit mantissa)
        iexponent -= 127;  // Remove the "offset"
        return iexponent;
    }

Alternatively, for greater portability and probably extra speed too, there are some standardized builtin C++ functions available across various platforms (including Linux and Microsoft): frexp, ldexp, ilogb, and scalbn are some that come to mind; see the full list of useful bitwise builtin functions.

Fast Log2 of an Integer: The above example code works for float types, but for a faster log2 of integer types, there are various other low-level bit fiddling methods. There's a long history of bitwise intrinsic functions and hardware support for "find first set" ("ffs"), which finds the position of the lowest set bit, and the closely related "clz" (count leading zeros) function. Note that they count from opposite ends: FFS counts from the rightmost (least significant) bit, whereas CLZ counts zero bits from the leftmost (most significant) bit, so an integer log2 is based on CLZ rather than FFS (for a 32-bit value, the truncated log2 is 31 minus the leading zero count). To code this in C++, there's the Microsoft _BitScanReverse and GCC's __builtin_clz to choose from, amongst other builtin bitwise intrinsics. Note that on the x86 architecture, these are implemented via a single instruction, and there are two candidates: BSR (Bit Scan Reverse) or LZCNT (leading zero count).
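Here's roughly what that looks like in code (a sketch only; __builtin_clz is GCC/Clang, _BitScanReverse is MSVC, and both are undefined for a zero input, so the caller must check):

    #include <cstdint>
    #ifdef _MSC_VER
    #include <intrin.h>
    #endif

    // Integer floor(log2(x)) via count-leading-zeros; precondition: x != 0
    int ilog2_int(uint32_t x)
    {
    #ifdef _MSC_VER
        unsigned long index = 0;
        _BitScanReverse(&index, x);    // Index of the highest set bit
        return (int)index;
    #else
        return 31 - __builtin_clz(x);  // GCC/Clang builtin
    #endif
    }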

