Aussie AI
Overview of Arithmetic Optimizations
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
We take arithmetic operations for granted in C++ now, but all of them were once coded by other programmers, first as software algorithms and later as hardware microcode. There are numerous research papers on how to perform addition and multiplication fast, let alone the more complex schemes for division and square root.
Just when you thought you could completely ignore all this obscure theory, along comes the nasty multiplication bottleneck in your AI engine. Is CPU multiplication really as fast as it can be? What about addition? What about floating-point? Is there a faster way to do multiplication than what's in your GPU today?
Surprisingly, the answer has been a resounding “Yes!”, as shown by Google's 2018 invention of the “Brain float 16” (bfloat16) format. This is an alternative 16-bit floating-point representation, with different arithmetic properties, which Google parallelized in its Tensor Processing Unit (TPU) chips. Hence, I politely suggest that you might want to dust off all these old papers and put on your thinking cap. Some of the main research areas include:
- Faster algorithms for multiplication arithmetic (mostly hardware acceleration algorithms for chip designers)
- Alternatives to using arithmetic multiplication (e.g. bitshifting, logarithms, and other fancy stuff; see Multiplication-free inference in Chapter 51)
- Approximate multiplication algorithms (faster ways to multiply two numbers by allowing errors)
- Matrix algebra theory (e.g. factorization/decomposition) applied to making tensor algorithms faster
- Advanced number systems using alternative arithmetic methods (see Chapter 55)
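As a concrete taste of the bfloat16 format mentioned above, here is a minimal C++ sketch (the helper names are my own, not a standard API) that treats bfloat16 as simply the top 16 bits of an IEEE 754 float32. Real conversions normally use round-to-nearest-even rather than the plain truncation shown here:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // bfloat16 keeps the sign, the full 8-bit exponent, and the top 7 mantissa bits.
    static uint16_t to_bfloat16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));   // type-pun safely via memcpy
        return (uint16_t)(bits >> 16);          // truncate the low 16 mantissa bits
    }

    static float bfloat16_to_float(uint16_t b) {
        uint32_t bits = (uint32_t)b << 16;      // low mantissa bits become zero
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }

    int main() {
        float x = 3.14159f;
        printf("%f -> %f (bfloat16 round trip)\n", x, bfloat16_to_float(to_bfloat16(x)));
        return 0;
    }

Because bfloat16 keeps the same 8-bit exponent as float32, it preserves the dynamic range while giving up mantissa precision, which is why it works well for AI workloads.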
Approximations of these arithmetic operations might be a way to go even faster in either sequential or parallel processing. One example is already implemented in x86 CPUs with the FTZ (flush-to-zero) and DAZ (denormals-are-zero) floating-point modes (see Chapter 9), which dispense with the handling of subnormal numbers, one of the less important parts of the IEEE 754 floating-point standard. Another example is the many papers on 8-bit floating-point quantization (FP8 quantization).
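On x86, the FTZ and DAZ modes can be enabled per-thread from C++ via the SSE intrinsics. A minimal sketch, assuming a compiler with the SSE headers available (the function name is my own):

    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

    void enable_fast_float_modes() {
        // FTZ: results that would be subnormal are flushed to zero.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        // DAZ: subnormal inputs are treated as zero.
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }

These settings affect only the calling thread's floating-point behavior, trading strict IEEE 754 subnormal handling for speed.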
More ideas for research: could there be anything more? Well, I'm going to throw out a few ideas of my own:
- Disabling negative-zero in floating-point would be helpful.
- Floating-point representations that don't use an offset (bias) for the exponent. Adding to the exponent field is like bitshifting, i.e., multiplying by a power of two (see the sketch after this list).
- Add-as-integer approximate multiplication (Mogami, 2020), which multiplies two floats by adding their integer bit patterns, would also be faster if we didn't need to adjust for the exponent's offset (also sketched after this list).
- RELU built into a fast hardware operation via bitwise-AND with the negated sign bit (a software version is sketched after this list).
- Bitshift operators on floating-point numbers.
- Logarithmic Number System (LNS) arithmetic operation speedups, especially LNS addition.
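To illustrate the bitshifting idea, here is a minimal C++ sketch (the function name is illustrative) that multiplies a normal float by a power of two by adding to the exponent field of its bit pattern. It ignores zeros, subnormals, infinities, NaNs, and exponent overflow:

    #include <cstdint>
    #include <cstring>

    // "Bitshift" a float: add k to the 8-bit exponent field to multiply by 2^k.
    static float float_shift(float x, int k) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        bits += (uint32_t)k << 23;     // exponent field starts at bit 23
        float y;
        std::memcpy(&y, &bits, sizeof(y));
        return y;                      // e.g. float_shift(3.0f, 2) == 12.0f
    }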
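And here is a sketch of the add-as-integer trick, again with an illustrative function name: adding the integer bit patterns of two positive normal floats and subtracting the offset encoding of 1.0 (0x3F800000) approximates their product, with a modest relative error (a few percent in typical cases):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Add-as-integer approximate multiply (after Mogami, 2020), for positive normal floats.
    static float approx_multiply(float a, float b) {
        int32_t ia, ib;
        std::memcpy(&ia, &a, sizeof(ia));
        std::memcpy(&ib, &b, sizeof(ib));
        int32_t ic = ia + ib - 0x3F800000;   // subtract the offset encoding of 1.0f
        float c;
        std::memcpy(&c, &ic, sizeof(c));
        return c;
    }

    int main() {
        printf("exact=%f approx=%f\n", 2.5f * 3.0f, approx_multiply(2.5f, 3.0f));
        return 0;
    }

The exponents add correctly because one copy of the offset cancels out, while the mantissa fields add as an approximation of mantissa multiplication, which is where the error comes from.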
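Finally, a branch-free software version of the RELU-via-sign-bit idea (hardware could do the same masking directly in the datapath); the function name is mine:

    #include <cstdint>
    #include <cstring>

    // Branch-free ReLU: build an all-ones/all-zeros mask from the sign bit
    // and AND it with the bit pattern, so negative inputs collapse to +0.0f.
    static float relu_bitwise(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        uint32_t sign = bits >> 31;    // 1 for negative, 0 for non-negative
        uint32_t mask = sign - 1u;     // all-ones if non-negative, all-zeros if negative
        bits &= mask;
        float y;
        std::memcpy(&y, &bits, sizeof(y));
        return y;
    }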
It's hard to predict what other little parallelization tricks might be hidden in all these old papers, when you consider that they were almost all written for sequential execution.