Aussie AI
10. Arithmetic Optimizations
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“In mathematics you don’t understand things.
You just get used to them.”
— John von Neumann
The biggest problem with AI engines is that they do too much arithmetic. There is no shortage of electrons being applied to multiplication in vector dot products and matrix multiplication. The methods of optimization to fix this bottleneck are basically:
- Do some other type of arithmetic instead.
- Do fewer multiplications.
Alternative forms of arithmetic include bitwise shifting or addition. The ways to do fewer multiplications tend to involve higher-level algorithmic changes to the model, such as pruning or quantization.
There are two basic ways that arithmetic computations can be sped up whilst retaining the same results:
- Single operator improvements
- Expression-level optimizations (multiple operators)
Some of the methods of speeding up arithmetic come from the theory of compiler optimization (e.g. strength reduction, sub-expression elimination). Hence, the compiler will often automatically perform these types of optimizations (when the optimizer is invoked). To some extent, this makes these transformations redundant. Even so, good programming practice is to avoid situations where these optimizations are needed on a large scale. The compiler does not look at the program as a whole and can miss some “obvious” optimizations.
Operator Strength Reduction
Individual operations in C++ can be optimized in several ways. The general term is “strength reduction” because a stronger operator with high computation complexity is “reduced” to an equivalent operator that is simpler and faster. Strength reduction is a technique used in automatic optimization by compilers, but can also be used by programmers to improve algorithms.
The main “strong” operations that we're trying to avoid are:
- Floating-point arithmetic (even addition)
- Multiplication
- Division
- Remainder (% operator)
- Math functions (e.g. sqrtf or expf)
Congratulations! The typical AI engine only has 3 out of 5.
Strength reduction has particular relevance to AI engines because the main bottleneck is floating-point multiplication. Many of the research papers on speedups are about replacing the floating-point multiplication operation with something simpler, like addition or integer arithmetic.
Some of the general approaches in regard to strength reduction include:
- Bitwise operations (e.g. bitshifts can replace multiplication)
- Prefer addition over multiplication (multiplication is slower).
- Avoid division and modulo/remainder operators (they're the worst!)
- Use integer arithmetic rather than floating-point (where possible)
- Use float single-precision arithmetic, not double-precision.
- Approximate arithmetic (e.g. for math functions)
Bitshift for multiplication: The canonical example that everybody knows is that shift operators can replace multiplications by a power of two. But this works only for integers, not for floating-point numbers. Note that there's a whole class of AI engines that relies on this integer multiply-vs-bitshift optimization, called “logarithmic quantization” or “power-of-two quantization.” Here's a dummy example of integer multiplication:
y = x * 4;
This can be more efficiently coded as a left bitshift:
y = x << 2;
Bug alert! If you're making this code change, you're likely to introduce some bugs. The “<<” and “*” operators have different precedence levels, so make sure you add extra parentheses. Also, consider whether you need to use an “unsigned” type when switching to a bitwise operator.
Right shift for division: The use of bitshifting works for division, too (but only for unsigned):
y = x / 4;
y = x >> 2; // faster (x must be unsigned)
Bitwise remainder calculations: The arithmetic modulus operator (remainder) can also be optimized for power-of-two operands (but only on integers):
y = x % 512; // Remainder (mod)
y = x & 511u; // Bitwise-AND
And here's another one with integer relative comparisons versus bitwise-and, although this one might not necessarily be faster:
if (x >= 512) // Integer comparison
if (x & ~511u) // Bitwise-AND of the complement (unsigned)
Avoiding multiplication: There are some simple cases even with the most basic operators that have multiple options:
y = x * 2;
y = x + x; // Addition
y = x << 1; // Shift
Automatic Strength Reduction: In theory, a C++ compiler could know which form will be faster on its target platform, and perform all of these optimizations automatically when compiling the program. Optimizers probably do some of them, but they cannot do them all.
Intrinsic Functions: Other more advanced types of strength reduction involve avoiding costly primitives, such as mathematical functions. For example, there are bitwise arithmetic tricks to quickly compute the integer log2 function.
GPU Strength Reduction: One final note is that when doing AI coding work, we aren't as concerned about which C++ operator works the best. The more important concern is which operation is most efficient on the GPU or other hardware acceleration (e.g. AVX-512 on a CPU).
Finally, note that these optimizations are local optimizations, and the same ideas apply globally to the entire AI engine architecture. There's been a lot of research trying to change all of the arithmetic in model inference from multiplication to bitshifting (see logarithmic power-of-two quantization in Chapter 44) or addition (see logarithmic number system models in Chapter 52, or “adder networks” and other “zero multiplication models” in Chapter 51).
Avoid % Remainder Operations
One common use of the remainder operator is modulo arithmetic, such as the wraparound array implementation of a queue abstract data type, where the value of a variable is cyclically counted from 0 up to N-1, and then back to 0.
The most common idiom for coding this is:
x = (x + 1) % N;
However, the % operator is expensive, and in this case it is not really needed. The following code sequence performs the same task more efficiently:
if (x == N - 1) x = 0; else x++;
This can also be written more concisely, but not necessarily more efficiently, as an expression with the “?:
” ternary operator:
(x == N - 1) ? (x = 0) : (x++);
Another example of a clever avoidance of % is when the operand is similar to the usual byte or word size. For example, consider this remainder:
x % 256
This can be more efficiently coded with bitwise-and using:
x & 255
But this can be even more efficiently coded as a type cast:
(unsigned char) x
The conversion to this “unsigned char” type will be efficiently implemented by grabbing a byte out of a word. Unfortunately, this method is not portable to all obscure systems, as it relies on an “overflow” being handled harmlessly, and on “unsigned char” always containing 8 bits.
Reciprocal Multiplication
Division is a slow operation, whether in a CPU or a GPU. Multiplication is often significantly faster than division, and in some cases a division can be replaced by a multiplication using the reciprocal. A case in point is floating-point division by a constant. For example, consider the division:
f = g / 100.0;
This can be replaced by the multiplication:
f = g * 0.01; // Reciprocal
If the divisor is a symbolic constant, it is possible to replace the symbolic constant with a hard-coded constant (or another symbolic constant). However, it is more convenient to replace the constant with an explicit reciprocal calculation. For example, consider the code:
f = g / DIVISOR;
This can be rewritten as:
f = g * (1.0 / DIVISOR);
The compiler should calculate the reciprocal using “constant folding” at compile-time. Note that the parentheses around the division expression matter: without them, the expression is evaluated left-to-right as (g * 1.0) / DIVISOR, and the compiler cannot legally reassociate floating-point arithmetic to fold the constant (unless fast-math options are enabled). So keep the parentheses and make life easier for the optimizer (these poor critters need a break every now and then).
If the divisor is a complex expression, the compiler might not automate the use of a reciprocal. Here's the slow version of division by a scale factor:
v[i] /= sqrtf(3.14159f);
Here's the faster way using the reciprocal of the constant:
v[i] *= 1.0f / sqrtf(3.14159f);
And we really should pre-calculate this constant using constant folding and a static
variable:
static const float scalefactor = 1.0f / sqrtf(3.14159f);
v[i] *= scalefactor;
Integer Arithmetic
Real arithmetic is slow compared to integer arithmetic. Hence, it is favorable to replace real arithmetic by equivalent integer arithmetic. Real arithmetic can be replaced by integer arithmetic when only limited precision is required (e.g. 1-3 decimal places). To do this, work in integer units that are 10, 100 or 1000 times larger (for 1, 2 and 3 decimal places) so that the decimal places appear as the lower digits of the integers.
Converting the scaled integer back into its true integer and fractional parts is quite simple. To get the fractional part, calculate the number modulo 10, 100 or 1000 (using the % operator). To get the true integer part, divide by 10, 100 or 1000; remember that integer division truncates the fractional part.
A good example: when working in dollars and cents, do all calculations in terms of cents (an integer). Then, when printing it out, convert to dollars and cents using:
cents = value % 100;
dollars = value / 100;
However, note that this is now using two of the worst integer operators: remainder and division. The hierarchy of cost for integer operations is similar to floating-point: integer addition and subtraction are faster than multiplication, but division is still the worst, even for integers.
There appears to be little that can be done to replace integer division with multiplication. Multiplying by the reciprocal would change an integer operation into a floating-point operation, and would probably increase execution time. A power-of-two integer division can be done via the “>>” right bitshift operator, provided that the value cannot be negative and uses an unsigned type.
Expression Transformations
Expression-level types of arithmetic improvements on an expression with multiple operations include:
- Constant folding (compile-time precomputation of constant expressions)
- Common subexpression elimination (only computing things once in expressions)
- Algebraic identities in computations
- Type consistency (avoid conversions)
Common Subexpression Elimination
Common subexpression elimination (CSE) is avoiding the recomputation of the same expression twice. There are many cases where the same computation appears multiple times in a single expression, or across the control flow of a program. Compiler optimizers attempt to automatically detect such cases and reuse the first computation.
In a complicated expression, there are often repeated sub-expressions. These are inefficient as they require the computer to calculate the same value twice or more. To save time, calculate the sub-expression first and store it in a temporary variable. Then replace the sub-expression with the temporary variable. For example:
x = (i * i) + (i * i);
With a temporary variable, this becomes:
temp = i * i;
x = temp + temp;
Note that this attempt to be concise is incorrect:
x = (temp = i * i) + temp; // Bug
This may fail because of its reliance on the order of evaluation of the + operator. It is not actually guaranteed in C++ that the operands of + are evaluated left-to-right.
Common sub-expressions do not occur only in single expressions. It often happens that a program computes the same thing in subsequent statements. For example, consider the code sequence:
if (x > y && x > 10) {
    // ...
}
if (x > y && y > 10) {
    // ...
}
The Boolean condition “x > y” need only be calculated once:
temp = (x > y);
if (temp && x > 10) {
    // ...
}
if (temp && y > 10) {
    // ...
}
Algebraic Identities
The calculations in some complicated expressions can be reduced by transforming the expression into another equivalent form. The aim when using algebraic identities is to group the operations differently, to reduce the total number of arithmetic operations. Care must be taken to ensure that the new expression has equivalent meaning. For example, the short-circuiting of the logical operators can cause differences. Some useful algebraic identities are:
2 * x == x + x == x << 1
a * x + a * y == a * (x + y)
-x + -y == -(x + y)
There are also Boolean algebraic identities that can be used to perform fewer logical operations:
(a && b) || (a && c) == a && (b || c)
(a || b) && (a || c) == a || (b && c)
!a && !b == !(a || b)
!a || !b == !(a && b)
Float Family Loyalty
Hidden unnecessary C++ type conversions are a common source of extra inefficiency.
The main type in a Transformer is usually “float” (32-bit), rather than “double” (64-bit).
Avoid unnecessary type conversion code in two ways:
- Don't mix float and double
- Don't mix float and int
Mixing float and int tends to be something professional C++ programmers are aware of, after having been burned a few times, and it doesn't occur that often by accident. However, inadvertently mixing float and double is difficult to avoid, and sneaks into your code all the time.
For example, here's some C++ code that looks perfectly correct:
float scalefactor = sqrt(2.0) * 3.14159;
You know this isn't real AI code because it doesn't have 27 decimal places for pi, which we've memorized by rote. AI engines don't really need anywhere near that much precision, but it looks good for the boss.
The above code is also a small slug, because it may be unnecessarily using “double” size arithmetic, although the compiler might fix it with constant folding (but emit a warning anyway).
Here's the corrected code:
float scalefactor = sqrtf(2.0f) * 3.14159f;
Note that this example shows there are two places where an “f” suffix is needed to signify that float arithmetic is required:
- Numeric constants (i.e. “2.0f” specifying a 32-bit float constant, rather than “2.0”, which is a 64-bit double constant).
- Standard C++ functions (i.e. “sqrtf” returns float, whereas “sqrt” returns double).
Without the suffix “f”, the default is double type constants and double arithmetic functions.
A lot of C++ compilers will warn about these type conversions losing precision,
so if you aim for warning-free compilation as a quality goal, you'll also fix most of these wasteful hidden type conversions.