Performance Tuning Practices

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

How should the huge number of methods for improving program efficiency be applied to a program? The code transformations that improve the program by a significant amount should be tried first, and the smaller optimizations used only when it is important to squeeze out that last bit of extra speed in bottlenecks. Hence, I suggest the following steps for improving the efficiency of a program:

    1. Time your program to get a baseline (i.e. run a full inference query).

    2. Invoke the C++ compiler’s built-in optimizer.

    3. Profile the code and find the “hot spots.”

    4. Consider a better data structure or algorithm.

    5. Use the major code transformations.

    6. Use smaller code transformations, if speed is crucial.

The first step is to measure your code's time cost. Otherwise, how will you know whether anything made it better?
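For example, a minimal timing harness using std::chrono might look like the sketch below; run_inference_query is a hypothetical stand-in for your engine's top-level call, not a real API.

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in for the real engine's top-level query.
    void run_inference_query() { /* ... run a full inference query ... */ }

    int main()
    {
        auto start = std::chrono::steady_clock::now();
        run_inference_query();    // the full query being timed
        auto end = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::printf("Inference time: %lld ms\n", static_cast<long long>(ms));
        return 0;
    }

Run it several times and keep the numbers; every later change is judged against this baseline.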

The next step is easy: turn on your optimizer. All modern C++ compilers have an option to invoke an optimizer on the code. The optimizer, although it may not always yield a major increase in speed, has one very important advantage — the programmer need not change the code. Hence, if a small improvement is desired, the optimizer can often provide it without much effort.
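For instance, with GCC or Clang this usually just means adding -O2 or -O3 to the compile command (MSVC has the equivalent /O2 switch); the exact flags and the resulting speedup vary by compiler and codebase, so re-run your timing baseline afterwards.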

Hardware tuning. The optimizer is not the only way to get instant results:

  • Faster GPU
  • FTZ and DAZ CPU floating-point modes (see the sketch after this list)
  • Overclocking your CPU or GPU (if you must)
  • Linux kernel tweaking
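As an example of the FTZ/DAZ item above, here is a minimal sketch of enabling both modes on x86 via the SSE intrinsics headers. These modes trade strict IEEE 754 handling of denormal numbers for speed, so check that your model's accuracy is unaffected.

    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

    void enable_ftz_daz()
    {
        // FTZ: denormal results of SSE arithmetic are flushed to zero.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        // DAZ: denormal inputs are treated as zero.
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }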

The GPU is a major factor underpinning high performance. You can upgrade by renting a better one, or try overclocking the one you have. Hardware vendors such as NVIDIA have extensive literature comparing the performance of their various chips, along with software tools to test and benchmark the GPUs. Similarly, hardware vendors of CPUs and other specialized AI chips provide documentation and toolsets, typically for free (alas, the chips are not!).

Software tuning. Assuming you're done with all the non-code changes, it's time to examine the C++. You can either start high by looking at the data structures, or start low by optimizing the busiest low-level kernels.

The choice of a better algorithm (usually with different data structures) for a program is not an easy method of program improvement. Simply identifying what would be a better algorithm is a difficult problem! And once identified, the new algorithm must be implemented by the programmer, costing precious man hours. However, this is the best method to achieve an order-of-magnitude increase in the program’s performance. For an AI engine, there are many higher-level optimizations covered in this book (e.g. caching or model quantization come to mind). Pick up the book, open to a random page, and there's probably another optimization there.
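As one small illustration of the caching idea, the sketch below memoizes whole inference responses keyed by the prompt text; run_inference is a hypothetical placeholder rather than a real engine API, and a production cache would also need eviction and thread safety.

    #include <string>
    #include <unordered_map>

    // Hypothetical placeholder for the engine's real inference call.
    std::string run_inference(const std::string& prompt)
    {
        return "response to: " + prompt;   // stub only
    }

    // Repeated identical prompts skip the engine entirely.
    std::string cached_inference(const std::string& prompt)
    {
        static std::unordered_map<std::string, std::string> cache;
        auto it = cache.find(prompt);
        if (it != cache.end()) return it->second;       // cache hit
        std::string response = run_inference(prompt);   // cache miss: do the work
        cache.emplace(prompt, response);
        return response;
    }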

The next step is to profile the C++ code in detail to determine which functions (or statements) account for most of the program’s time; these are the “hot spots” of the program. This identification of costly statements is best achieved by a profiler, although if I had to take a guess, I'd say look at your vector dot product code. Identifying frequently called functions and deeply nested loops is often adequate. Once the hot spots are identified, all efficiency measures, large and small, should be applied to this code. Any improvement to the efficiency of a statement, no matter how small, will improve the overall efficiency greatly if that statement is executed often.
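For reference, a typical hot spot looks like the baseline dot product kernel below: a tight inner loop where a profiler will usually show much of the inference time being spent.

    // Baseline float vector dot product -- the classic inference hot spot.
    float vector_dot_product(const float v1[], const float v2[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

Because a kernel like this is called for every row of every weight matrix, even a small improvement here multiplies across the whole model.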

Once the most costly functions and loops have been optimized, other statements can also be optimized, although the increase in speed will not be as noticeable. Some of the better code transformations to apply are parallelization, loop optimizations (such as vectorization), using pass-by-reference for passing structures or objects to functions, and replacing small functions with macros or inline functions.
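As a small sketch of the last two transformations, the code below contrasts pass-by-value with pass-by-const-reference for a structure, and shows a small function marked inline; the Vector type here is just an illustrative assumption.

    #include <vector>

    struct Vector {
        std::vector<float> data;   // illustrative structure only
    };

    // Small function: 'inline' lets the compiler remove the call overhead.
    inline float squared(float x) { return x * x; }

    // Pass-by-const-reference: no copy of the (possibly large) structure.
    // Compare with 'float sum_squares(Vector v)', which would copy all the data.
    float sum_squares(const Vector& v)
    {
        float sum = 0.0f;
        for (float x : v.data)
            sum += squared(x);
        return sum;
    }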

Make it right first? The speed improvement techniques in C++ can be applied either as the programmer is writing the code, or after the development and debugging of the program. The second approach is often referred to as the “make it right first” rule. However, I believe that the first method is preferable, simply because optimizing your program once it is working is a dangerous practice that often introduces new bugs. Deferring efficiency improvement to the final development stage can also waste programmer time, because the basic algorithms used in the program may then need to be reworked. Using efficiency techniques during the development of the program is a much sounder method of improving efficiency. On the other hand, it's really hard to make an AI engine work right, let alone fast and right, so do whatever you want!

 
