Aussie AI
Code Optimizations
Last Updated 3 December, 2024
by David Spuler, Ph.D.
Code efficiency has been a research area since the very first computers. Many transformations to code are known to improve its speed or memory usage, while retaining its functionality. These ideas can be applied to model inference algorithms to improve their speed, including many of the architectural inference optimizations.
General Programming Optimizations for Inference
The inference algorithm is just code. Many of the standard code optimizations are applicable to AI inference engines, although this is becoming less true over time as GPU accelerators far outpace other optimizations. General coding optimizations include changes to arithmetic, control flow, and architectural structure.
Architectural optimization techniques may include:
- Hardware acceleration
- Parallelization (e.g. vectorization)
- Pipelining
- Precalculation (trading space for speed)
- Data partitioning for hardware acceleration
- Multi-threading
- Integer-only architectures
Algorithm-level optimizations can include:
- Kernel fusion
- Kernel fission
- Common case first (and calculated faster)
- Caching and re-using results
- Hashing
- Bloom filters (a combination of hashing and bitflags)
- Lazy evaluation (conditional computation)
- Parallel accumulators (allowing fine-grained parallelization; see the sketch after this list)
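As an illustration of the parallel accumulators idea, here is a minimal sketch (an assumed example, not taken from any particular engine): splitting a dot product into four independent running sums breaks the dependency chain between iterations, so the compiler's vectorizer or the CPU's out-of-order hardware can overlap the multiply-add operations.

// Minimal sketch of parallel accumulators for a dot product (illustrative only).
float dot_product_4acc(const float* a, const float* b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {   // four independent accumulators
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)             // leftover elements
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

Note that the four sums may accumulate floating-point rounding slightly differently from a single sequential sum, which is usually acceptable in inference kernels.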
Arithmetic optimizations may include:
- Integer operations instead of floating-point (see quantization)
- Arithmetic operation reduction (see zero-multiplication inference algorithms)
- Bitwise operations replacing multiplication (e.g. bitshifts are faster than multiplication, as used in logarithmic quantization)
- Approximate arithmetic
- Strength reduction (using less computationally "strong" operations)
- Constant folding (compile-time precomputation of constant expressions)
- Common subexpression elimination (only computing things once in expressions)
- Type consistency (avoiding implicit type conversions)
Loop optimizations may include:
- Loop unrolling (repeating the loop body to reduce loop test overhead)
- Loop fusion (combining the bodies of two loops; see also kernel operator fusion, which is mainly about loops; a sketch appears after this list)
- Loop parallelization (mainly done in GPU hardware nowadays, but the software must prepare the data)
- Loop early exit (see inference early exit optimizations)
- Loop perforation (randomly skipping a subset of loop iterations, not a joke)
- Loop tiling (processing sub-parts of contiguous data in separate loops)
- Loop reversal (going backwards, and yet, still making progress)
- Loop fission (opposite of loop fusion, splitting a single loop body into two loops)
- Loop distribution (splitting two sub-parts of a loop body into two separate loops)
- Loop code motion (moving or "hoisting" loop-invariant calculations from the loop body to pre-loop initialization)
- Loop reordering (changing the arrangements of inner/outer nested loops)
- Loop interchange (switching the inner and outer loop iterators of nested loops)
- Loop coalescing (flattening nested loops)
- Loop splitting (splitting out subportions of the iteration range)
- Loop peeling (unrolling the first few iterations)
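As a simple illustration of loop fusion (a made-up example, not from any particular engine), two separate passes over a vector, adding a bias and then applying ReLU, can be merged into one loop, halving the number of memory traversals:

// Before: two loops, two passes over memory.
void bias_then_relu(float* v, const float* bias, int n)
{
    for (int i = 0; i < n; i++) v[i] += bias[i];
    for (int i = 0; i < n; i++) v[i] = (v[i] > 0.0f) ? v[i] : 0.0f;
}

// After: one fused loop, a single pass over memory.
void bias_relu_fused(float* v, const float* bias, int n)
{
    for (int i = 0; i < n; i++) {
        float x = v[i] + bias[i];
        v[i] = (x > 0.0f) ? x : 0.0f;
    }
}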
Low-level coding optimizations include:
- Common subexpression elimination (see computation reuse)
- Function inlining or macros (avoiding function call overhead)
- Flattening function call hierarchies
- Pointer arithmetic versus arrays and table lookup (see the sketch after this list)
- Contiguous memory data storage (reducing paging)
- Bitwise operations
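For example, here is a sketch of the classic indexed loop versus its pointer-arithmetic equivalent (illustrative only; modern optimizers often generate near-identical code for both, so it's worth benchmarking before committing to the less readable form):

// Indexed version: array subscripting in the loop body.
void scale_indexed(float* v, int n, float c)
{
    for (int i = 0; i < n; i++)
        v[i] *= c;
}

// Pointer-arithmetic version: walk a pointer to the end of the array.
void scale_pointer(float* v, int n, float c)
{
    for (float* end = v + n; v != end; ++v)
        *v *= c;
}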
Compiler-Automated Optimizations
In order to optimize code, it's important to know what sorts of optimizations your compiler is doing automatically. Compilers have been doing optimizations for literally 50 years, and the state-of-the-art is quite amazing, with an extensive body of research theory. Some of the main automated compiler optimizations include:
- Constant folding
- Common subexpression elimination
- Redundant assignment removal
- Strength reduction
- Algebraic optimizations
- Loop optimizations (some)
If you make simple changes to your code using the obvious techniques above, it usually won't give you a speedup, because the compiler has already done them for you.
However, there's a limit to what compilers can do. They certainly can't make architectural changes, and there are also many mid-level algorithmic changes that cannot be automated.
Function calls inside expressions are a good example of code changes that might need to be manually optimized. When the compiler sees a function call used in arithmetic, it isn't always able to know what that function is going to do, and has to be conservative by avoiding possibly incorrect optimizations.
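For example, here is a minimal sketch (with a hypothetical function name) of the kind of manual optimization a compiler may be unable to make: unless it can prove the function has no side effects, it must call it twice.

float lookup_scale(int key);   // hypothetical function, assumed expensive and side-effect free
// The compiler must be conservative and usually emits two calls here:
float slow_version(int key)  { return lookup_scale(key) + lookup_scale(key); }
// Manual optimization: call it once and reuse the result.
float fast_version(int key)  { float t = lookup_scale(key); return t + t; }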
Efficient C++ Language Features
Most of the development of C and then C++ has been focused on efficiency, so almost every feature of C++ could be listed here. However, there's some guidance about language features that seem complex but are typically implemented very efficiently by the C++ compiler (with many kudos due to the designers of the C++ language and its compilers!).
- Switch beats else-if. The switch statement has been optimized to the nth degree over the years, and almost certainly beats anything you can code by hand. If you're testing a set of constant values, don't use else-if (a small example appears after this list).
- Virtual functions. These seem complicated and possibly slow, but they're actually carefully designed to be very fast to call. There's a lot of painful work for compiler designers to get them to compile correctly, but their runtime efficiency is great for programmers. The implementation is effectively a small lookup table of function pointers. It costs a couple of extra assembler instructions before the function call, and the overhead of calling a function will dwarf that cost. So don't use virtual functions if you don't need them, but also don't be afraid of using them.
- References and pass-by-reference. These are basically implemented as pointers, so they are very fast. See more details on references versus pointers below.
- Compiler intrinsics. There are lots of built-in functions called "intrinsics", which are typically implemented in assembly code by the compiler writers. Functions such as memset and memcpy are very fast.
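As a small example of the switch advice above (an illustrative sketch with made-up names), a switch over a set of constant values typically compiles to a jump table, which beats a chain of else-if comparisons:

enum Op { OP_ADD, OP_SUB, OP_MUL };
// A switch over constants typically compiles to a jump table.
int eval(Op op, int a, int b)
{
    switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
    }
    return 0;   // unreachable for valid opcodes
}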
Constant Folding
Constant folding is the name of the optimization where the compiler automatically merges constant expressions. For example, assume your code has:
x = 3 + 4;
The compiler should "fold" the two constants (3 and 4) by doing the "+" operation at compile-time, and then automatically generate code without the "+" operation. Effectively, it should execute as if you'd written:
x = 7;
So, how good are the compilers? The answer is that they're getting pretty amazing, but it's not always clear. Here are some cases of constant folding:
- sizeof is a constant. Any use of the sizeof operator in expressions should be treated like an integer. If your code says "8*sizeof(int)", then it's hopefully folded to give "32" at compile-time.
- Named constants. If you declare "const int n=3", then hopefully all subsequent uses of "n" will be folded as if they said "3".
- Named macro constants. These are trivially handled, because the preprocessor substitutes the literal value before the optimizer even sees the code.
- Type casts: Hopefully your compiler should propagate types correctly while doing constant folding.
What about some harder cases? Consider this code with an intrinsic math function:
const float scalefactor = sqrtf(2.0f * 3.14159f);
Can the compiler fold this at compile-time? Surely it should do "2.0f*3.14159f" correctly. But what about the sqrtf calculation? It's theoretically possible, but I'm far from certain. I'd be inclined to declare it as a global constant, or a local "static" constant, so as to ensure it only gets pre-calculated at most once:
static const float scalefactor = sqrtf(2.0f * 3.14159f);
Now, since it's declared as "static" (and "const"), hopefully the executable code would only compute it once at program startup, even if it wasn't fully folded. Another faster alternative, which ensures it's not calculated even once at runtime, is to work out the formula yourself and just put in the numeric result as a hard-coded floating-point constant.
static const float scalefactor = 2.506627216f; // sqrtf(2.0f * 3.14159f);
Research on Constant Folding: Many papers have been published on compiler-automation of constant folding, such as:
- Allen, F. E., and Cocke, J. 1972. A catalogue of optimizing transformations. In Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, N.J., pp. 1–30. PDF: https://www.clear.rice.edu/comp512/Lectures/Papers/1971-allen-catalog.pdf
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Paper with extensive coverage of numerous compiler auto-optimizations of program code.)
Common Subexpression Elimination
Common subexpression elimination (CSE) avoids recomputing the same expression twice. There are many cases where the same computation appears multiple times in a single expression, or across the control flow of a program. Compiler optimizers attempt to automatically detect such cases and reuse the first computation.
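Here is a small illustrative sketch of the manual version of this optimization (an assumed example): the repeated subexpression is computed once into a temporary and then reused.

#include <cmath>
// The subexpression sqrtf(x*x + y*y) appears twice; compute it once.
void cse_example(float x, float y, float a, float b, float* out1, float* out2)
{
    float dist = sqrtf(x * x + y * y);   // common subexpression, evaluated once
    *out1 = dist + a;
    *out2 = dist + b;
}

A compiler can often do this automatically, but only if it can prove that the subexpression has no side effects and that its inputs haven't changed in between.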
Research on Common Subexpression Elimination: Papers on CSE include:
- Allen, F. E., and Cocke, J. 1972. A catalogue of optimizing transformations. In Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, N.J., pp. 1–30. PDF: https://www.clear.rice.edu/comp512/Lectures/Papers/1971-allen-catalog.pdf
- D Monniaux, C Six, 2021, Simple, light, yet formally verified, global common subexpression elimination and loop-invariant code motion, Proceedings of the 22nd ACM SIGPLAN/SIGBED, https://dl.acm.org/doi/abs/10.1145/3461648.3463850, https://arxiv.org/pdf/2105.01344
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Paper with extensive coverage of numerous compiler auto-optimizations of program code.)
- David Spuler, March 2024, Common Subexpression Elimination, in Generative AI in C++, https://www.aussieai.com/book/ch10-common-subexpression-elimination
Strength Reduction
Strength reduction is the name for optimizations that reduce the "strength" (i.e. computation complexity) of arithmetic operators. Some of the general approaches include:
- Avoid division and modulo/remainder operators (they're the worst!)
- Use integer arithmetic rather than floating point (where possible)
- Use float arithmetic, not double-precision.
Bitshift for multiplication: The canonical example that everybody knows is that shift operators can replace multiplications by a power of two. Note that there's a whole class of AI engines that rely on this multiply-vs-bitshift optimization. It's called "logarithmic quantization" or "power-of-two quantization". Here's a dummy example of multiplication:
y = x * 4;
This can be more efficiently done as a left bitshift:
y = x << 2;
Bug alert! If you're making this code change, you're likely to introduce some bugs. The "<<" and "*" operators have different precedence levels, so make sure you add more parentheses. Also, consider whether you need to use "unsigned" type when switching to a bitwise operator.
Right shift for division: The use of bitshifting works for division, too (but only for unsigned or known non-negative values):
y = x / 4;
y = x >> 2; // faster
Multiply by reciprocal: Division is one of the worst operations. An algebraic optimization is that floating point division is equivalent to multiplication by its reciprocal. This is the slow division by a scale factor:
v[i] /= sqrtf(3.14159);
Here's the faster way using the reciprocal of the constant:
v[i] *= 1.0f / sqrtf(3.14159);
And we really should pre-calculate this constant using constant folding and a static variable:
static const float scalefactor = 1.0f / sqrtf(3.14159);
v[i] *= scalefactor;
Bitwise remainder calculations: The arithmetic modulus operator (remainder) can also be optimized for power-of-two operands:
y = x % 512; // Remainder (mod)
y = x & 511u; // Bitwise-AND
And here's another one with integer relative comparisons versus bitwise-and, although this one might not necessarily be faster:
if (x >= 512)
if (x & ~511u) // Bitwise-AND of the complement (unsigned)
Avoiding multiplication: There are some simple cases even with the most basic operators that have multiple options:
y = x * 2;
y = x + x; // Addition
y = x << 1; // Shift
Automatic Strength Reduction: In theory, C++ compilers know what will be fastest on their target platform, and could perform all of these optimizations automatically when compiling the program. The optimizers probably do some of them, but they cannot do them all.
Intrinsic Functions: Other more advanced types of strength reduction involve avoiding costly primitives, such as mathematical functions. For example, there are bitwise arithmetic tricks to quickly compute the integer log2 function.
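For example, here is a sketch of one such bit trick (assuming a GCC or Clang compiler, which provide the __builtin_clz count-leading-zeros intrinsic): the integer log2 of a positive number is the position of its highest set bit.

#include <cassert>
// Integer log2 via count-leading-zeros; x must be nonzero.
static inline int ilog2(unsigned int x)
{
    assert(x != 0);
    return 31 - __builtin_clz(x);   // index of the highest set bit
}

(On MSVC the equivalent intrinsic is _BitScanReverse, and C++20 adds the portable std::bit_width in the <bit> header.)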
GPU Strength Reduction: One final note is that when doing AI coding work, we aren't as concerned about which C++ operator works the best. The more important concern is which operation is most efficient in the GPU or other non-GPU hardware acceleration (e.g. AVX-512 on CPU).
Finally, note that these optimizations are local optimizations, and the same ideas apply globally to the entire architecture. There's been a lot of research trying to change all of the arithmetic in model inference from multiplication to bitshifting (see logarithmic power-of-two quantization) or addition (see logarithmic number system models, "adder networks" and "zero multiplication models").
Research on Strength Reduction: Research papers on code optimization by reducing strong operators to be weak and helpless include:
- Allen, F. E., and Cocke, J. 1972. A catalogue of optimizing transformations. In Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, N.J., pp. 1–30. PDF: https://www.clear.rice.edu/comp512/Lectures/Papers/1971-allen-catalog.pdf
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Paper with extensive coverage of numerous compiler auto-optimizations of program code.)
- Eric LaForest, March 19, 2010, Survey of Loop Transformation Techniques, ECE 1754, http://fpgacpu.ca/writings/SurveyLoopTransformations.pdf
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- C. Hakert, K. -H. Chen and J. -J. Chen, 2024, FLInt: Exploiting Floating Point Enabled Integer Arithmetic for Efficient Random Forest Inference, 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), Valencia, Spain, 2024, pp. 1-2, doi: 10.23919/DATE58400.2024.10546851, https://ieeexplore.ieee.org/abstract/document/10546851
- J. Kim, S. Kim, K. Choi and I. -C. Park, 2024, Hardware-Efficient SoftMax Architecture With Bit-Wise Exponentiation and Reciprocal Calculation, IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2024.3443270, https://ieeexplore.ieee.org/abstract/document/10640134
- David Spuler, March 2024, Operator Strength Reduction, in Generative AI in C++, https://www.aussieai.com/book/ch10-operator-strength-reduction
- David Spuler, March 2024, Loop Iterator Strength Reduction, in Generative AI in C++, https://www.aussieai.com/book/ch15-loop-iterator-strength-reduction
Precalculation
Precalculation or precomputation is a code optimization where results are partially or fully calculated ahead of time. This method is similar to caching and computation reuse and lookup tables, but refers to calculations being performed long before they are needed, often at program startup or compile-time. Like caching, this method trades extra space for time.
Typical precalculations are those where the results are computed at program initialization or compile-time. The best methods generate the results at compile-time, and are simply loaded as data, such as numeric constants or pre-initialized data arrays. Examples of this include constant folding and reciprocal multiplication for division.
One method for precomputation of larger amounts of data in an array or lookup table is to perform the initialization dynamically at program startup. An array can be populated with the required results, before the main logic of the program begins. (Or alternatively, the idea of "lazy evaluation" says to run the precomputation only when the program first needs the data.)
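Here is a minimal sketch of the lazy variant (with an assumed table size and assumed contents, purely for illustration): the lookup table is filled the first time any caller needs it.

#include <cmath>
// Lazily-initialized lookup table of square roots for small integers.
static float g_sqrt_table[1024];
static bool g_sqrt_table_ready = false;
float sqrt_lookup(int i)             // assumes 0 <= i < 1024
{
    if (!g_sqrt_table_ready) {       // one-time lazy initialization
        for (int k = 0; k < 1024; k++)
            g_sqrt_table[k] = sqrtf((float)k);
        g_sqrt_table_ready = true;
    }
    return g_sqrt_table[i];
}

A real engine would need to make the initialization thread-safe, for example with std::call_once and a std::once_flag.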
A possible alternative is to calculate all this data offline before the program runs, and store the results in a binary data file. This data file can then be loaded into an array at program startup, without needing to perform any of the arithmetic computations. Whether this is beneficial depends on the cost of the computations versus the cost of file loading.
The logical extension of the precomputation method for a large number of numeric results is to write special C++ code that performs these calculations, but then outputs the results into a text file in the exact format of a C++ source code file (rather than a data file), that declares a global array name and the numeric values. Yes, it's a C++ program that creates part of a C++ program, which is almost like your AI has become self-aware, only not quite. This auto-generated C++ code can be compiled and linked into your C++ program, and used like a global array of data in other parts of the program.
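As a rough sketch of this idea (with made-up file names, table size, and contents), a small offline generator might compute a table of GELU activation values and emit them as a compilable C++ source file:

#include <cstdio>
#include <cmath>
// Offline generator: writes a C++ source file containing a precomputed table.
int main()
{
    const int N = 1024;                        // assumed table size
    FILE* fp = fopen("gelu_table.cpp", "w");   // hypothetical output file
    if (!fp) return 1;
    fprintf(fp, "// Auto-generated file -- do not edit\n");
    fprintf(fp, "const float g_gelu_table[%d] = {\n", N);
    for (int i = 0; i < N; i++) {
        float x = (i - N / 2) / 64.0f;         // assumed mapping of table index to input value
        float g = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
        fprintf(fp, "  %.9gf,\n", g);
    }
    fprintf(fp, "};\n");
    fclose(fp);
    return 0;
}

The generated gelu_table.cpp file is then compiled and linked into the main program like any other source file.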
The benefit is that this auto-generated code method does not even require the time cost of startup initialization for any computations or file loading. The data is auto-loaded by the linker-loader during executable file instantiation (i.e. when the user starts the app). The only downsides for the user are that the size of the executable program increases, which means more disk space usage, and that application program startup may take longer and it will use more memory (regardless of whether it ever needs this precomputed data). Also, various offline tasks take longer for the software developers, such as compilation and linking for tests.
This self-generating code idea is quite workable for table lookups of approximations for FP16 numbers (16-bit half-precision floating point), because the lookup table needs to contain 2^16=65536 numbers. This is probably about a 200KB C++ source file in plain text, and creates linked data of about 65,536 times 4 bytes, or roughly 256KB of space (or half that if you also store the results as 16-bit numbers rather than 32-bit floats or integers).
And one final point, you could try to precompute an entire AI model into C++ source code, and build it into an EXE file. In theory, you could save the cost of the model loading phase of inference. But it'll be a multi-gigabyte executable file, even if you can build it, and you probably can't anyway, because it'll likely be so large as to crash your C++ compiler. Good luck with that! (A better plan is to look into using an ML compiler.)
Research on Precalculation Optimizations: Papers about precomputation:
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982 (Precomputing weight distributions.)
- Tae Jun Ham; Yejin Lee; Seong Hoon Seo; Soosung Kim; Hyunji Choi; Sung Jun Jung; Jae W. Lee, 2021, ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/9499860/, https://taejunham.github.io/data/elsa_isca21.pdf (Precomputations involve the key and value matrices.)
- Lianming Huang, Shangyu Wu, Yufei Cui, Ying Xiong, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 24 May 2024, RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference, https://arxiv.org/abs/2405.15198 (Early exit classifiers built with pre-computation using a retrieval database.)
- Hong-Yi Wang, and Tian-Sheuan Chang, 2022, Row-wise Accelerator for Vision Transformer, https://arxiv.org/pdf/2205.03998.pdf
- David Spuler, March 2024, Chapter 35. Lookup Tables & Precomputation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Wang, J., Chen, K., Chen, G., Shou, L., McAuley, J.: Skipbert: Efficient inference with shallow layer skipping. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7287–7301 (2022) https://aclanthology.org/2022.acl-long.503/ (Skips early layers of a model via precomputed lookup tables based on detecting known token n-grams in the prompt.)
- Nils Graef, 12 Mar 2024 (v3), Transformer tricks: Precomputing the first layer, https://arxiv.org/abs/2402.13388 Code: https://github.com/OpenMachine-ai/transformer-tricks (Because the first layer only depends on the embeddings, it can be precomputed.)
- David Spuler, March 2024, Lookup Table Precomputation, in Generative AI in C++, https://www.aussieai.com/book/ch13-lookup-table-precomputation
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
Padding
Padding is the placement of extra dummy values into an array or sequence. Typically, padding is formed with zero bytes, but it doesn't always have to be. Although the use of padding seems redundant, there are times when it is faster to process padding bytes than to do something special to avoid it. A particularly good example arises with GPUs. If the size of your vector to process with hardware acceleration is a little smaller than the GPU parallel size capacity, it's faster to send it some extra dummy data and just run the normal GPU instructions, rather than try to process the data in a non-GPU algorithm.
On the other hand, it's still better to avoid padding if you can, and judicious management of vector sizes with GPU capacities can often fix this by avoiding the cases where "odd" vector sizes occur. For more, see zero padding removal.
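A minimal sketch of the padding idea (illustrative only, with an assumed tile-size parameter): round the vector length up to a multiple of the hardware's preferred width and zero-fill the tail, so the accelerated kernel can run full-width without a special-case path for the remainder.

#include <vector>
#include <algorithm>
// Pad a vector with zeros up to the next multiple of the tile size (tile must be > 0).
std::vector<float> pad_to_multiple(const std::vector<float>& v, size_t tile)
{
    size_t padded = ((v.size() + tile - 1) / tile) * tile;   // round up
    std::vector<float> out(padded, 0.0f);                    // zero padding
    std::copy(v.begin(), v.end(), out.begin());
    return out;
}

Zero padding is safe for operations such as dot products and sums, since the extra elements contribute nothing to the result.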
Padding Research: Research papers on padding as a positive optimization include:
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf (Padding as an optimization technique on p.42 and p.73.)
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Covers the use of padding in arrays for tiling optimizations in review of many optimizations.)
- Eric LaForest, March 19, 2010, Survey of Loop Transformation Techniques, ECE 1754, http://fpgacpu.ca/writings/SurveyLoopTransformations.pdf (Padding is one method for loop optimizations.)
- Nan Yang, Laicheng Zhong, Fan Huang, Dong Yuan, Wei Bao, Feb 2023, Random Padding Data Augmentation, https://arxiv.org/abs/2302.08682
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
Keeping It In the Float Family
Hidden unnecessary C++ type conversions are a common source of extra inefficiency. The main type in a Transformer is usually "float" (32-bit), rather than "double" (64-bit). Avoid unnecessary type conversion code in two ways:
- Don't mix float and double
- Don't mix float and int
Mixing float and int tends to be something programmers are aware of, and doesn't happen that often by accident.
However, inadvertently mixing float and double is difficult to avoid, and it sneaks into your code all the time. For example, here's some C++ code that looks perfectly correct:
float scalefactor = sqrt(2.0) * 3.14159;
But this may be unnecessarily using "double" size arithmetic, although the compiler might fix it with constant folding. Here's the corrected code:
float scalefactor = sqrtf(2.0f) * 3.14159f;
Note that there are two changes with an "f" suffix to signify that float arithmetic is required: on the numeric constants (e.g. "2.0f" specifying a 32-bit float, rather than "2.0", which is a 64-bit double constant), and on the intrinsic functions (i.e. "sqrtf" rather than "sqrt"). Without the "f" suffix, the defaults are double-type constants and double-precision arithmetic functions.
Efficiency of Pointers vs References
C++ allows two ways to indirectly refer to an object without needing to create a whole new copy: pointers and references. The syntax is either "*" or "&" for their declarations.
MyVector *myptr = &mv; // Pointer to mv object
MyVector &myref = mv; // Reference to mv object
Pointers and references are more efficient than spinning up a whole new copy of the object, especially when the underlying object is a complicated object. And when you have a function call, you should definitely avoid sending in a whole object.
void processit(MyVector v) // Slow
{ .... }
This is inefficient because the whole MyVector object will get copied, via whatever copy constructor you have defined, which is slow. And if you haven't defined a copy constructor, then the compiler uses default bitwise copy of a structure, which is not only slow, but also rarely what you want, and often a bug.
The faster reference version is to use a "const" reference (or a non-const if you're modifying it inside the function):
void processit(const MyVector & v) // Reference argument
{ .... }
The pointer version is:
void processit(MyVector * v) // Pointer argument
{ .... }
Which is faster in C++: pointers or references? The short answer is "no difference", because references are implemented as pointers by the compiler behind the scenes. The two functions above are not going to differ significantly in speed.
The slightly longer answer is that references can be faster because there's no null case. A reference must always be referring to an object for the duration of its scope. The C++ compiler ensures that references cannot occur without an object:
MyVector &v; // Cannot do this
MyVector &v = NULL; // Nor this
MyVector &v = 0; // Nor this
A reference must be initialized from an object, and you cannot set references equal to pointers, because you actually have to de-reference the pointer with the "*" operator, which crashes if it's a null pointer:
MyVector &v = myptr; // Disallowed
MyVector &v = *myptr; // Works if non-null
There's no way in C++ to get a zero value into a reference variable (we hope). Hence, references are always referring to something and they cannot be equivalent to the null pointer.
References are slightly faster: The guarantee of an object for a reference fixes all those null pointer core dumps, and also relieves the programmer of the burden of testing for null references. The compiler provides this guarantee at compile-time, so there's no hidden null check being done at run-time, making it efficient. So there's a minor speed improvement from using references, by not having to add safety checks for "ptr!=NULL" throughout the function call hierarchy.
Pointers can be better than references if you need a "null" situation to occur. For example, you're processing an object that may or may not exist, and you need the pointer to be allowed to be "NULL" if there's no object. This should occur rarely, and references should be preferred in many cases.
Lazy Evaluation
Research papers on lazy evaluation:
- S Leroux, S Bohez, C De Boom, E De Coninck, 2016, Lazy evaluation of convolutional filters, https://arxiv.org/abs/1605.08543
- V Vanhoucke, A Senior, MZ Mao, 2011, Improving the speed of neural networks on CPUs, Google Research, https://research.google/pubs/pub37631.pdf
- J Park, DY Kim, YH Moon, 2022, Lazy Net: Lazy Entry Neural Networks for Accelerated and Efficient Inference 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), https://ieeexplore.ieee.org/abstract/document/9953031
- David Spuler, March 2024, Chapter 13. Algorithm Speedups, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural networks on CPUs, In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766 (This paper explores some general code optimizations in relation to CPU and GPU execution, including lazy evaluation, loop unrolling, parallel accumulators, and in-memory partitioning of data for hardware acceleration.)
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- David Spuler, March 2024, Lazy Evaluation, in Generative AI in C++, https://www.aussieai.com/book/ch13-lazy-evaluation
Research on General Programming Optimizations for Inference
There aren't a lot of research papers specifically on non-architectural coding algorithms applied to model inference, although perhaps there should be more. Some research on using general coding optimizations for neural network models:
- V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural networks on CPUs, In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766 (This paper explores some general code optimizations in relation to CPU and GPU execution, including lazy evaluation, loop unrolling, parallel accumulators, and in-memory partitioning of data for hardware acceleration.)
- Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohli, Learning to superoptimize programs. In International Conference on Learning Representations (ICLR) (2017). https://arxiv.org/abs/1611.01787
- Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. From research to production and back: Ludicrously fast neural machine translation. In Proc. of WNGT, 2019. https://www.aclweb.org/anthology/D19-5632/, Code: https://github.com/marian-nmt/marian
- Z Guo, Z He, Y Zhang, 2023, Mira: A Program-Behavior-Guided Far Memory System, PDF: https://cseweb.ucsd.edu/~yiying/Mira-SOSP23.pdf (Interesting memory management methods.)
Books and Articles on C++ Coding Efficiency
- Fengqi You, Apr 2023, Cornell University Computational Optimization Open Textbook, https://optimization.cbe.cornell.edu/index.php?title=Main_Page
- Agner Fog, 2023, Optimizing software in C++: An optimization guide for Windows, Linux, and Mac platforms, PDF: https://www.agner.org/optimize/optimizing_cpp.pdf
- Kurt Guntheroth, 2016, Optimized C++: Proven Techniques for Heightened Performance, O'Reilly Media, https://www.amazon.com/dp/1491922060
- Dov Bulka and David Mayhew, 1999, Efficient C++: Performance Programming Techniques, https://www.amazon.com//dp/0201379503
- Fedor G. Pikus, 2021, The Art of Writing Efficient Programs: An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples, Packt Publishing, https://www.amazon.com/dp/1800208111
- David Spuler, 1992, C++ and C Efficiency: How to Improve Program Speed and Memory Usage, Prentice Hall, https://www.amazon.com/dp/0130965952
- ISO/IEC, Feb 15, 2006, Technical Report on C++ Performance, ISO/IEC TR 18015:2006(E), https://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf (Design of the C++ language from an efficiency perspective, including discussion of virtual functions and other language features.)
- Nicolai M. Josuttis, 2012, The C++ Standard Library: A Tutorial and Reference, Second Edition, Supplementary Chapter, https://www.amazon.com/Standard-Library-Tutorial-Reference-2nd/dp/0321623215, PDF (extra chapter): http://www.cppstdlib.com/cppstdlib_supplementary.pdf (C++ optimizations such as bit sets and user-defined memory allocators.)
- Bjarne Stroustrup, 2013, The Essence of C++ with examples in C++84, C++98, C++11, and C++14, PDF Slides: http://www.staroceans.org/e-book/essenceOfC++.pdf
Research on Code Optimizations
- Y Ma, Y Cao, S Vrudhula, J Seo, 2017, Optimising loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In: Fpga 2017 ‐ Proceedings of the 2017 ACM/SIGDA International Symposium on field‐ programmable gate arrays, Monterey, CA, February 2017 PDF: https://ieeexplore.ieee.org/ielaam/92/8396231/8330049-aam.pdf
- Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, Yuxiong He, 2018, DeepCPU: serving RNN-based deep learning models 10x faster, USENIX ATC '18: Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, July 2018, Pages 951–965, https://dl.acm.org/doi/10.5555/3277355.3277446 https://www.microsoft.com/en-us/research/publication/deepcpu-serving-rnn-based-deep-learning-models-10x-faster/ PDF: https://www.usenix.org/system/files/conference/atc18/atc18-zhang-minjia.pdf (Microsoft DeepCPU paper shows details of code optimizations to parallelize matrix multiplications.)
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/ Code: https://github.com/TenTrans/TenTrans
- Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks Y Ma, Y Cao, S Vrudhula, J Seo, 2017, Slides PDF: https://www.isfpga.org/past/fpga2017/slides/D1_S1_04.pdf
- Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud, March 2021, Accelerating deep neural networks implementation: A survey, https://doi.org/10.1049/cdt2.12016 PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016
- David Spuler, March 2024, Part II: Basic C++ Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Patrik Christen, 13 Oct 2024, Programming of Cellular Automata in C and C++, https://arxiv.org/abs/2410.10022
- Inas Bachiri, September 2024, A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration, DOI:10.13140/RG.2.2.10612.16009, Thesis for: Master in Computer Science, https://www.researchgate.net/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration https://www.researchgate.net/profile/Inas-Bachiri/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration/links/66ed912c6b101f6fa4f3d6ce/A-Literature-Review-on-Combining-Neural-Architecture-Search-and-Compiler-Optimizations-for-Neural-Network-Acceleration.pdf