Aussie AI

Code Optimizations

  • Last Updated 3 December, 2024
  • by David Spuler, Ph.D.

Code efficiency has been a research area since the very first computers. Many transformations to code are known to improve its speed or memory usage, while retaining its functionality. These ideas can be applied to model inference algorithms to improve their speed, alongside the many architectural inference optimizations.

General Programming Optimizations for Inference

The inference algorithm is just code. Many of the standard code optimizations are applicable to AI inference engines, although this is becoming less true over time as GPU accelerators far outpace other optimizations. General coding optimizations include changes to arithmetic, control flow, and architectural structure.

Architectural optimization techniques may include:

Algorithm-level optimizations can include:

Arithmetic optimizations may include:

Loop optimizations may include:

Low-level coding optimizations include:

  • Common subexpression elimination (see computation reuse)
  • Function inlining or macros (avoiding function call overhead)
  • Flattening function call hierarchies
  • Pointer arithmetic versus arrays and table lookup
  • Contiguous memory data storage (reducing paging)
  • Bitwise operations

Compiler-Automated Optimizations

In order to optimize code, it's important to know what sorts of optimizations your compiler is doing automatically. Compilers have been doing optimizations for literally 50 years, and the state-of-the-art is quite amazing, with an extensive body of research theory. Some of the main automated compiler optimizations include:

  • Constant folding
  • Common subexpression elimination
  • Redundant assignment removal
  • Strength reduction
  • Algebraic optimizations
  • Loop optimizations (some)

If you make simple changes to your code based on some of the obvious things above, it's probably not going to give you a speedup, because the compiler has already done them for you.

However, there's a limit to what compilers can do. They certainly can't make architectural changes, and there are also many mid-level algorithmic changes that cannot be automated.

Function calls inside expressions are a good example of code changes that might need to be manually optimized. When the compiler sees a function call used in arithmetic, it isn't always able to know what that function is going to do, and has to be conservative by avoiding possibly incorrect optimizations.
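
For example, here's a minimal sketch, assuming a hypothetical function expensive_score() that the programmer knows has no side effects, but whose definition the compiler cannot see:

    float expensive_score(float x);    // Defined in another file; opaque to the compiler

    float slow_version(float x)
    {
        // The compiler may not merge these two calls, since it cannot
        // prove the function has no side effects.
        return expensive_score(x) + 0.5f * expensive_score(x);
    }

    float fast_version(float x)
    {
        float s = expensive_score(x);  // Manually hoist the call into a local
        return s + 0.5f * s;           // Only one call remains
    }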

Efficient C++ Language Features

Most of the development of C and then C++ has been focused on efficiency, so almost every feature of C++ could be listed here. However, there's some guidance about language features that seem complex but are typically implemented very efficiently by the C++ compiler (with many kudos due to the designers of the C++ language and its compilers!).

  • Switch beats else-if. The switch statement has been optimized to the nth degree over the years, and almost certainly beats anything you can code. If you're testing a set of constant values, don't use else-if (see the sketch after this list).
  • Virtual functions. These seem complicated and possibly slow, but they're actually carefully designed to be very fast to call. There's lots of painful work for compiler designers to get them to compile correctly, but their runtime efficiency is great for programmers. The implementation is effectively a small lookup table of function pointers. It adds a couple of extra assembler instructions before the function call, and the overhead of calling a function will dwarf that cost. So don't use virtual functions if you don't need them, but also don't be afraid of using them.
  • References and pass-by-reference. These are basically implemented as pointers, so they are very fast. See more details on references versus pointers below.
  • Compiler intrinsics. There are lots of built-in functions called "intrinsics" which are typically implemented in assembly code by compiler writers. Functions such as memset and memcpy are very fast.
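
For example, here's a minimal sketch of the switch-versus-else-if point above (the function name and values are just placeholders); a decent compiler may turn the switch into a jump table or a balanced branch tree:

    int token_category(int c)           // else-if chain: tested one by one
    {
        if (c == 0) return 1;
        else if (c == 1) return 2;
        else if (c == 2) return 4;
        else return 0;
    }

    int token_category_switch(int c)    // switch: compiler chooses the dispatch
    {
        switch (c) {
        case 0: return 1;
        case 1: return 2;
        case 2: return 4;
        default: return 0;
        }
    }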

Constant Folding

Constant folding is the name of the optimization where the compiler automatically merges constant expressions. For example, assume your code has:

    x = 3 + 4;

The compiler should "fold" the two constants (3 and 4) by doing the "+" operation at compile-time, and then automatically generate code without the "+" operation. Effectively, it should execute as if you'd written:

    x = 7;

So, how good are the compilers? The answer is that they're getting pretty amazing, but it's not always clear. Here are some cases related to constant folding:

  • sizeof is a constant. Any use of the sizeof operator in expressions should be treated like an integer constant. If your code says "8*sizeof(int)", then it's hopefully folded to give "32" at compile-time (on a platform with 4-byte int).
  • Named constants. If you declare "const int n=3", then hopefully all subsequent uses of "n" will be folded as if they said "3".
  • Named macro constants. These are trivially handled, since the preprocessor substitutes the literal text before compilation.
  • Type casts: Hopefully your compiler should propagate types correctly while doing constant folding.
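
Here's a small sketch of those cases, assuming a reasonable optimizer and a typical platform with 4-byte int:

    #define BITS_PER_BYTE 8                       // Macro constant: trivially substituted
    const int n = 3;                              // Named constant: folded where used
    int bits = BITS_PER_BYTE * (int)sizeof(int);  // Hopefully folds to 32 at compile-time
    float arr[n * 2];                             // "n * 2" folds to an array size of 6
    float half = (float)1 / 2;                    // Type-cast folding: should fold to 0.5f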

What about some harder cases? Consider this code with an intrinsic math function:

   const float scalefactor = sqrtf(2.0f * 3.14159f);

Can the compiler fold this at compile-time? Surely it should do "2.0f*3.14159f" correctly. But what about the sqrtf calculation? It's theoretically possible, but I'm far from certain. I'd be inclined to declare it as a global constant, or a local "static" constant, so as to ensure it only gets pre-calculated at most once:

   static const float scalefactor = sqrtf(2.0f * 3.14159f);

Now, since it's declared as "static" (and "const"), hopefully the executable code would only compute it once at program startup, even if it wasn't fully folded. Another faster alternative that ensures it's not calculated even once is to work out the formula yourself, and just put in the numeric result as a hard-coded floating point constant.

   static const float scalefactor = 2.506627216f; // sqrtf(2.0f * 3.14159f);

Research on Constant Folding: Many papers have been published on compiler-automation of constant folding, such as:

Common Subexpression Elimination

Common subexpression elimination (CSE) is avoiding the recomputation of the same expression twice. There are many cases where the same computation appears multiple times in a single expression, or across the control flow of a program. Compiler optimizers attempt to automatically detect such cases and reuse the first computation.
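
As a trivial sketch of the idea (the variable names are just placeholders), a manual version of CSE looks like this:

    // Before: the subexpression "v[i] * scale" is computed twice
    out1 = v[i] * scale + bias1;
    out2 = v[i] * scale + bias2;

    // After: compute it once, store it in a local, and reuse it
    float common = v[i] * scale;
    out1 = common + bias1;
    out2 = common + bias2;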

Research on Common Subexpression Elimination: Papers on CSE include:

Strength Reduction

Strength reduction is the name for optimizations that reduce the "strength" (i.e. computation complexity) of arithmetic operators. Some of the general approaches include:

  • Avoid division and modulo/remainder operators (they're the worst!)
  • Use integer arithmetic rather than floating point (where possible)
  • Use float arithmetic, not double-precision.

Bitshift for multiplication: The canonical example that everybody knows is that shift operators can replace multiplications by a power of two. Note that there's a whole class of AI engines that rely on this multiply-vs-bitshift optimization. It's called "logarithmic quantization" or "power-of-two quantization". Here's a dummy example of multiplication:

   y = x * 4;

This can be more efficiently done as a left bitshift:

   y = x << 2;

Bug alert! If you're making this code change, you're likely to introduce some bugs. The "<<" and "*" operators have different precedence levels, so make sure you add extra parentheses if the shift appears inside a larger expression. Also, consider whether you need to use an "unsigned" type when switching to a bitwise operator.
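
For example, here's a minimal illustration of the precedence trap (the variables are just placeholders):

    y = x * 4 + 1;      // Multiplication binds tighter than "+"
    y = x << 2 + 1;     // Bug! Parses as x << (2 + 1), i.e. x * 8
    y = (x << 2) + 1;   // Correct: parentheses restore the intended meaning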

Right shift for division: Bitshifting works for division, too (but only reliably for unsigned or non-negative values):

   y = x / 4;  
   y = x >> 2;  // faster

Multiply by reciprocal: Division is one of the worst operations. An algebraic optimization is that floating point division is equivalent to multiplication by its reciprocal. This is the slow division by a scale factor:

    v[i] /= sqrtf(3.14159f);

Here's the faster way using the reciprocal of the constant:

    v[i] *= 1.0f / sqrtf(3.14159f);

And we really should pre-calculate this constant using constant folding and a static variable:

    static const float scalefactor = 1.0f / sqrtf(3.14159f);
    v[i] *= scalefactor;

Bitwise remainder calculations: The arithmetic modulus operator (remainder) can also be optimized for power-of-two operands:

   y = x % 512;    // Remainder (mod)
   y = x & 511u;   // Bitwise-AND

And here's another one with integer relative comparisons versus bitwise-and, although this one might not necessarily be faster:

   if (x >= 512)
   if (x & ~511u)  // Bitwise-AND of the complement (unsigned)

Avoiding multiplication: There are some simple cases even with the most basic operators that have multiple options:

    y = x * 2;
    y = x + x;   // Addition
    y = x << 1;  // Shift

Automatic Strength Reduction: In theory, C++ compilers could know what will be faster on the target platform, and perform all of these optimizations automatically when compiling the program. The optimizers probably do some of them, but they cannot do them all.

Intrinsic Functions: Other more advanced types of strength reduction involve avoiding costly primitives, such as mathematical functions. For example, there are bitwise arithmetic tricks to quickly compute the integer log2 function.
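
As a minimal sketch (not necessarily the fastest version), an integer log2 can be computed by repeated shifting, or via a count-leading-zeros intrinsic on compilers that provide one:

    // Integer log2 by repeated halving (undefined for x == 0)
    static inline unsigned int ilog2(unsigned int x)
    {
        unsigned int result = 0;
        while (x >>= 1) ++result;   // Count how many times x can be halved
        return result;
    }

    // On GCC/Clang, counting leading zeros is usually faster:
    //   unsigned int ilog2_fast(unsigned int x) { return 31u - __builtin_clz(x); }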

GPU Strength Reduction: One final note is that when doing AI coding work, we aren't as concerned about which C++ operator works the best. The more important concern is which operation is most efficient in the GPU or in other hardware acceleration (e.g. AVX-512 on CPU).

Finally, note that these optimizations are local optimizations, and the same ideas apply globally to the entire architecture. There's been a lot of research trying to change all of the arithmetic in model inference from multiplication to bitshifting (see logarithmic power-of-two quantization) or addition (see logarithmic number system models, "adder networks" and "zero multiplication models").

Research on Strength Reduction: Research papers on code optimization by reducing strong operators to be weak and helpless include:

Precalculation

Precalculation or precomputation is a code optimization where results are partially or fully calculated ahead of time. This method is similar to caching and computation reuse and lookup tables, but refers to calculations being performed long before they are needed, often at program startup or compile-time. Like caching, this method trades extra space for time.

Typical precalculations are those where the results are computed at program initialization or compile-time. The best methods generate the results at compile-time, and are simply loaded as data, such as numeric constants or pre-initialized data arrays. Examples of this include constant folding and reciprocal multiplication for division.

One method for precomputation of larger amounts of data in an array or lookup table is to perform the initialization dynamically at program startup. An array can be populated with the required results, before the main logic of the program begins. (Or alternatively, the idea of "lazy evaluation" says to run the precomputation only when the program first needs the data.)
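
Here's a minimal sketch of both styles, using a hypothetical square-root lookup table (the table name, size, and contents are just illustrative):

    #include <math.h>

    static float g_sqrt_table[1024];            // Precomputed lookup table
    static bool g_sqrt_table_ready = false;

    void init_sqrt_table()                      // Eager style: call once at program startup
    {
        for (int i = 0; i < 1024; i++)
            g_sqrt_table[i] = sqrtf((float)i);
        g_sqrt_table_ready = true;
    }

    float fast_sqrt_small(int i)                // Lazy style: build the table on first use
    {                                           // (not thread-safe as written)
        if (!g_sqrt_table_ready)
            init_sqrt_table();
        return g_sqrt_table[i];                 // Assumes 0 <= i < 1024
    }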

A possible alternative is to calculate all this data offline before program startup, and store the results in a binary data file. This data file can then be loaded into an array at program startup, without needing to perform any of the arithmetic computations. Whether this is beneficial depends on the cost of the computations versus the cost of file loading.

The logical extension of the precomputation method for a large number of numeric results is to write special C++ code that performs these calculations, but then outputs the results into a text file in the exact format of a C++ source code file (rather than a data file), declaring a global array initialized with the numeric values. Yes, it's a C++ program that creates part of a C++ program, which is almost like your AI has become self-aware, only not quite. This auto-generated C++ code can be compiled and linked into your C++ program, and used like a global array of data in other parts of the program.
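
Here's a rough sketch of what such a generator program might look like, again using a hypothetical square-root table (the file name, array name, and table size are illustrative; a real version would emit whatever values your engine needs):

    #include <stdio.h>
    #include <math.h>

    int main()
    {
        FILE *fp = fopen("sqrt_table.cpp", "w");   // Generated C++ source file
        if (!fp) return 1;
        fprintf(fp, "// Auto-generated file -- do not edit\n");
        fprintf(fp, "float g_sqrt_table[1024] = {\n");
        for (int i = 0; i < 1024; i++)
            fprintf(fp, "    %.9ff,\n", sqrt((double)i));  // One float literal per line
        fprintf(fp, "};\n");
        fclose(fp);
        return 0;
    }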

The benefit is that this auto-generated code method does not even require the time cost of startup initialization for any computations or file loading. The data is auto-loaded by the linker-loader during executable file instantiation (i.e. when the user starts the app). The only downsides for the user are that the size of the executable program increases, which means more disk space usage, and that application program startup may take longer and it will use more memory (regardless of whether it ever needs this precomputed data). Also, various offline tasks take longer for the software developers, such as compilation and linking for tests.

This self-generating code idea is quite workable for table lookups of approximations for FP16 numbers (16-bit half-precision floating point), because the lookup table needs to contain 2^16=65536 numbers. This is probably about a 200k C++ source file in plain text, and it creates linked data of about 65536 times 4 bytes, or roughly 256k of space usage (or half that space if you also store the computed values as 16-bit numbers rather than 32-bit floats or integers).

And one final point, you could try to precompute an entire AI model into C++ source code, and build it into an EXE file. In theory, you could save the cost of the model loading phase of inference. But it'll be a multi-gigabyte executable file, even if you can build it, and you probably can't anyway, because it'll likely be so large as to crash your C++ compiler. Good luck with that! (A better plan is to look into using an ML compiler.)

Research on Precalculation Optimizations: Papers about precomputation:

Padding

Padding is the placement of extra dummy values into an array or sequence. Typically, padding is formed with zero bytes, but it doesn't always have to be. Although the use of padding seems redundant, there are times when it is faster to process padding bytes than to do something special to avoid it. A particularly good example arises with GPUs. If the size of your vector to process with hardware acceleration is a little smaller than the GPU parallel size capacity, it's faster to send it some extra dummy data and just run the normal GPU instructions, rather than try to process the data in a non-GPU algorithm.

On the other hand, it's still better to avoid padding if you can, and judicious management of vector sizes with GPU capacities can often fix this by avoiding the cases where "odd" vector sizes occur. For more, see zero padding removal.
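
As a simple illustration of the idea, here's a sketch of zero-padding a vector up to a multiple of a (hypothetical) hardware tile size before handing it to an accelerated kernel:

    #include <vector>

    // Pad a vector with zeros up to the next multiple of the tile size,
    // so the accelerated kernel can process it in full tiles.
    void pad_to_multiple(std::vector<float> &v, size_t tile_size)
    {
        size_t remainder = v.size() % tile_size;
        if (remainder != 0)
            v.resize(v.size() + (tile_size - remainder), 0.0f);  // Append zero padding
    }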

Padding Research: Research papers on padding as a positive optimization include:

Keeping It In the Float Family

Hidden unnecessary C++ type conversions are a common source of extra inefficiency. The main type in a Transformer is usually "float" (32-bit), rather than "double" (64-bit). Avoid unnecessary type conversion code in two ways:

  • Don't mix float and double
  • Don't mix float and int

Mixing float and int tends to be something programmers are aware of, and doesn't occur that often by accident.

However, inadvertently mixing float and double is difficult to avoid, and sneaks into your code all the time. For example, here's some C++ code that looks perfectly correct:

    float scalefactor = sqrt(2.0) * 3.14159;

But this may be unnecessarily using "double" size arithmetic, although the compiler might fix it with constant folding. Here's the corrected code:

    float scalefactor = sqrtf(2.0f) * 3.14159f;

Note that there are two changes with an "f" suffix to signify that float arithmetic is required: on numeric constants (e.g. "2.0f" specifying a 32-bit float, rather than "2.0", which is a 64-bit double constant), and on the intrinsic functions (i.e. "sqrtf" rather than "sqrt"). Without the suffix "f", the default is double type constants and double arithmetic functions.

Efficiency of Pointers vs References

C++ allows two ways to indirectly refer to an object without needing to create a whole new copy: pointers and references. The syntax is either "*" or "&" for their declarations.

    MyVector *myptr = &mv;  // Pointer to mv object
    MyVector &myref = mv;   // Reference to mv object

Pointers and references are more efficient than spinning up a whole new copy of the object, especially when the underlying object is complicated. And when you have a function call, you should definitely avoid passing in a whole object by value.

    void processit(MyVector v)  // Slow
    {
        ....
    }

This is inefficient because the whole MyVector object will get copied, via whatever copy constructor you have defined, which is slow. And if you haven't defined a copy constructor, then the compiler uses a default member-wise copy, which is not only slow, but also rarely what you want, and often a bug.

The faster reference version is to use a "const" reference (or a non-const if you're modifying it inside the function):

    void processit(const MyVector & v)  // Reference argument
    {
        ....
    }

The pointer version is:

    void processit(MyVector * v)  // Pointer argument
    {
        ....
    }

Which is faster in C++: pointers or references? The short answer is that there's no real difference, because references are implemented as pointers by the compiler behind the scenes. The two functions above are not going to be significantly different in terms of speed.

The slightly longer answer is that references can be faster because there's no null case. A reference must always be referring to an object for the duration of its scope. The C++ compiler ensures that references cannot occur without an object:

    MyVector &v;          // Cannot do this
    MyVector &v = NULL;   // Nor this
    MyVector &v = 0;      // Nor this

A reference must be initialized from an object, and you cannot set references equal to pointers, because you actually have to de-reference the pointer with the "*" operator, which crashes if it's a null pointer:

    MyVector &v = myptr;  // Disallowed
    MyVector &v = *myptr; // Works if non-null

There's no way in C++ to get a zero value into a reference variable (we hope). Hence, references are always referring to something and they cannot be equivalent to the null pointer.

References are slightly faster: The guarantee of an object for a reference fixes all those null pointer core dumps, and also relieves the programmer of the burden of testing for null references. The compiler enforces this guarantee at compile-time, so there's no hidden null check being done at run-time, making it efficient. So there's a minor speed improvement from using references, by not having to add safety checks for "ptr!=NULL" throughout the function call hierarchy.

Pointers can be better than references if you need a "null" situation to occur. For example, you're processing an object that may or may not exist, and you need the pointer to be allowed to be "NULL" if there's no object. This should occur rarely, and references should be preferred in many cases.

Lazy Evaluation

Research papers on lazy evaluation:

General Programming Optimizations for Inference

There aren't a lot of research papers specifically on non-architectural coding algorithms applied to model inference, although perhaps there should be more. Some research on using general coding optimizations for neural network models:

  • V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural networks on CPUs, In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766 (This paper explores some general code optimizations in relation to CPU and GPU execution, including lazy evaluation, loop unrolling, parallel accumulators, and in-memory partitioning of data for hardware acceleration.)
  • Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohli, Learning to superoptimize programs. In International Conference on Learning Representations (ICLR) (2017). https://arxiv.org/abs/1611.01787
  • Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. From research to production and back: Ludicrously fast neural machine translation. In Proc. of WNGT, 2019. https://www.aclweb.org/anthology/D19-5632/, Code: https://github.com/marian-nmt/marian
  • Z Guo, Z He, Y Zhang, 2023, Mira: A Program-Behavior-Guided Far Memory System, PDF: https://cseweb.ucsd.edu/~yiying/Mira-SOSP23.pdf (Interesting memory management methods.)

Books and Articles on C++ Coding Efficiency

Research on Code Optimizations

More AI Research

Read more about: