37. Tuning, Profiling & Benchmarking
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Premature optimization is the root of all evil.”
— Donald E. Knuth,
“Structured Programming with go to Statements,”
ACM Computing Surveys, 1974.
Tuning an AI Engine
As with any other C++ application, tuning an AI engine requires timing and profiling of the underlying C++ code. To do so, you'll need a batch interface whereby the prompt query text can be supplied as a command-line argument, or via a text file.
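For example, here is a minimal sketch of such a batch driver; the run_inference() function is a hypothetical stand-in for your engine's real entry point:

#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the engine's real inference call.
std::string run_inference(const std::string& prompt) {
    return "ECHO: " + prompt;   // stub for illustration
}

int main(int argc, char* argv[]) {
    std::string prompt;
    if (argc == 3 && std::string(argv[1]) == "-f") {
        std::ifstream in(argv[2]);      // prompt from a text file
        std::ostringstream ss;
        ss << in.rdbuf();
        prompt = ss.str();
    } else if (argc == 2) {
        prompt = argv[1];               // prompt from the command line
    } else {
        std::fprintf(stderr, "usage: %s \"prompt\" | -f promptfile\n", argv[0]);
        return 1;
    }
    std::printf("%s\n", run_inference(prompt).c_str());
    return 0;
}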
To measure what impact your code optimizations are having on your Transformer engine's performance, you'll need to re-run exactly the same query (or many queries) after each major code change. To isolate the effects of C++ engine code changes, you ideally need everything else to stay exactly the same:
- Hardware (same CPU, same GPU, same settings, etc.)
- Thread and OS settings
- Server load (i.e., avoid other processes running)
- Inference query (exactly the same text)
- Model file
- Configuration settings (e.g., temperature).
To really finesse the engine profiling, you can ensure that it returns exactly the same results, as for regression testing, by managing these code issues:
- Random number seed (e.g., it impacts the top-k decoding algorithm; see the sketch below).
- Time-specific tools (e.g., the time function needs an intercept so its results don't change between runs).
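For example, here is a minimal sketch of fixing the random seed for top-k decoding, so that repeated profiling runs sample exactly the same tokens. This is illustrative only, not the engine's real decoder; the function name sample_top_k and the seed value are assumptions:

#include <algorithm>
#include <random>
#include <vector>

// Fixed seed: repeated profiling runs make identical sampling choices.
// (Seed from std::random_device in production instead.)
static std::mt19937 g_rng(42);

// Hypothetical top-k sampler over a token probability vector.
int sample_top_k(const std::vector<float>& probs, int k) {
    // Find the indices of the k highest-probability tokens.
    std::vector<int> idx(probs.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = (int)i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return probs[a] > probs[b]; });
    // Sample from the renormalized top-k distribution.
    std::vector<float> topk(k);
    for (int i = 0; i < k; i++) topk[i] = probs[idx[i]];
    std::discrete_distribution<int> dist(topk.begin(), topk.end());
    return idx[dist(g_rng)];
}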
The other part is to test your AI engine separately from other parts of the system. Yes, the overall system performance is important, but that is a separate performance analysis from the pure C++ profiling of the Transformer engine. Some of the issues include:
- RAG databases. Test the engine on its query after the retriever has looked up its chunks of text. The full input for a profiling query should be the extra RAG context plus a question.
- Inference cache. Ensure the engine is not bypassed by the caching component. If the exact same query runs super-fast the second time you test it, umm, that's not you.
To test the overall response time to the user, system tuning is required. The responsiveness of the RAG retriever component, the cache hit ratio, and other practical deployment issues are all important for real-world performance. See Chapter 7 for more information on efficient architectures for deploying AI engines.
Performance Tuning Practices
How should the huge number of methods of improving program efficiency be applied to a program? The code transformations that improve the program by a significant amount should be tried first, and the smaller optimizations used only when it is important to squeeze out that last bit of extra speed in bottlenecks. Hence, I suggest the following steps for improving the efficiency of a program:
1. Time your program to get a baseline (i.e. run a full inference query).
2. Invoke the C++ compiler’s built-in optimizer.
3. Profile the code and find the “hot spots.”
4. Consider a better data structure or algorithm.
5. Use the major code transformations.
6. Use smaller code transformations, if speed is crucial.
The first step is to measure your code's time cost. Otherwise, how will you know whether anything made it better?
The next step is easy: turn on your optimizer. All modern C++ compilers have an option to invoke an optimizer on the code. The optimizer, although it may not always yield a major increase in speed, has one very important advantage — the programmer need not change the code. Hence, if a small improvement is desired, the optimizer can often provide it without much effort.
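For example, with GCC or Clang the optimizer is enabled via the -O flags (the best settings vary by project and platform):

g++ -O2 -o engine engine.cpp                 # common safe default
g++ -O3 -march=native -o engine engine.cpp   # more aggressive, tuned to the build machine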
Hardware tuning. The optimizer is not the only way to get instant results:
- Faster GPU
- FTZ and DAZ CPU modes (see the sketch below)
- Overclocking your CPU or GPU (if you must)
- Linux kernel tweaking
The GPU is a major factor underpinning high performance. You can upgrade or rent a better one, or try overclocking the one you have. Hardware vendors such as NVIDIA have extensive literature on the performance comparisons of their various chips, along with software tools to test and benchmark the GPUs. Similarly, hardware vendors of CPUs or other specialized AI chips have documentation and toolsets, typically for free (alas, the chips are not!).
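As a concrete example of the FTZ and DAZ modes listed above, here is a sketch for x86 CPUs using the SSE intrinsics (this is a per-thread setting, so call it in each worker thread; the benefit depends on how often your code actually hits denormalized floats):

#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)

// Enable flush-to-zero (FTZ) and denormals-are-zero (DAZ) so that
// denormalized floats don't trigger slow microcoded CPU paths.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}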
Software tuning. Assuming you're done with all the non-code changes, it's time to examine the C++. You can either start high by looking at the data structures, or start low by optimizing the busiest low-level kernels.
The choice of a better algorithm (usually with different data structures) for a program is not an easy method of program improvement. Simply identifying what would be a better algorithm is a difficult problem! And once identified, the new algorithm must be implemented by the programmer, costing precious man hours. However, this is the best method to achieve an order-of-magnitude increase in the program’s performance. For an AI engine, there are many higher-level optimizations covered in this book (e.g. caching or model quantization come to mind). Pick up the book, open to a random page, and there's probably another optimization there.
The next step is to profile in detail the C++ code to determine which functions (or statements) are accounting for most of the program’s time; these are the “hot spots” of the program. This identification of costly statements is best achieved by a profiler, although if I had to take a guess, I'd say look at your vector dot product code. Identifying frequently called functions and deeply nested loops is often adequate. Once the hot spots are identified, all efficiency measures, large and small, should be applied to this code. Any improvement to the efficiency of a statement, no matter how small, will improve the overall efficiency greatly if that statement is executed often.
Once the most costly functions and loops have been optimized, other statements can also be optimized, although the increase in speed will not be as noticeable. Some of the better code transformations to apply are parallelization, loop optimizations (vectorization), using pass-by-reference for passing structures or objects to functions, and replacing small functions with macros or inline functions.
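For example, here is a sketch of the pass-by-reference transformation (the function names are illustrative):

#include <vector>

// By value: the entire vector is copied on every call.
float sum_by_value(std::vector<float> v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// By const reference: no copy, and the body is unchanged.
float sum_by_ref(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}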
Make it right first? The speed improvement techniques in C++ can be applied either as the programmer is writing the code, or after the development and debugging of the program. The second approach is often referred to as the “make it right first” rule. However, I believe that the first method is preferable simply because optimizing your program once it is working is a dangerous practice, and often introduces new bugs. Deferring efficiency improvement to the final development stage can also waste programmer time in improving the basic algorithms used in a program. Using efficiency techniques during the development of the program is a much sounder method of improving efficiency. On the other hand, it's really hard to make an AI engine work right, let alone fast and right, so do whatever you want!
Tuning Trade-offs
Tuning a program is not always a clear-cut gain. There are numerous other quantities that efficiency may affect:
- Model accuracy versus speed.
- Space versus time-efficiency.
- Robustness of a program.
- Readability and maintainability of a program.
- Portability of a program.
Accuracy of the LLM is the main trade-off in many of the optimizations. This book contains numerous ways to optimize a Transformer engine, some of which 100% retain accuracy (e.g. vectorization), whereas others are effectively approximations of the original model (e.g. quantization).
There is almost always a trade-off between time and space when making programs run faster. Many of the algorithm improvements sacrifice space for extra speed, such as caching and precalculation.
Changing a program for efficiency can introduce extra bugs into a program (although you could argue that it might remove bugs, too). If a piece of code has already been debugged, improving its efficiency may not be worth the risk to the robustness of a program.
Many of the program transformations used for efficiency can reduce the readability of a program. Naturally, this also makes it more difficult for a program to be maintained, and since the major cost in a program’s development cycle is usually maintenance, improving efficiency may not be worth it in the long run.
Perhaps surprisingly, the efficiency of a program can usually be increased significantly without affecting portability. There are some efficiency techniques in this book that use machine-specific methods, such as hardware-acceleration, but there are many generic methods that work across all C++ code.
Almost all of the dangers of improving efficiency are dangers for the programmer. On the other hand, the users of a program will be well pleased by extra responsiveness, and this alone makes efficiency improvement a worthwhile exercise.
Profiling and Benchmarking
Performance profiling is the measurement of time and space efficiency metrics about your C++ program. Profiling is regarded as the right and proper first step when attempting to tune your code. You should run a profiler on a semi-realistic test run of your program under simulated production conditions to generate the performance data. Then you can analyze the data using the profiler's reports to find which functions are chewing up most of the time, or even which specific statements are busiest inside a heavily used function.
Benchmarking is a slightly different concept, and refers to testing the efficiency of certain operations, such as low-level operators, to find a more efficient way to do an operation. For example, if you want to compare multiplication versus addition, you write a program to run these operations a few million times. When changing a program to increase efficiency, you shouldn't assume that a certain operation is clearly faster, but you should benchmark whether the changes have noticeably increased the operation's efficiency (or even decreased it!).
Both profiling and benchmarking require data about CPU and memory usage. Techniques for measuring program efficiency range from the stop-watch method to the use of sophisticated profiler software tools. If no profiler is adequate, the programmer can gain timing information by adding instrumentation statements to the program, although there are many pitfalls in attempting to determine the time taken by a sequence of statements.
The measurement of the memory usage and space-efficiency of a C++ program is a slightly more difficult problem. There are several types of memory: instruction code, static memory, read-only string literals, initialization data, global/static variables, the stack, and the heap. Measuring the memory usage of the stack and heap is somewhat difficult because of their dynamic nature. However, various tools exist to measure the different types of memory, and clever use of C++ programming constructs can also yield reasonable data.
Linux C++ Profilers
When improving a program’s performance, it is useful to know where the speed bottlenecks are. There is a saying that 90% of the time is spent in 10% of the code. Hence, any speed improvement should aim to speed up the functions that are most frequently used. The programmer can often tell where the program is spending most of its time (e.g. where one function is called by all others), but it is useful to have a software tool to analyze exactly where the program is spending its time.
Most implementations of C++ come with a software tool called a “profiler” which is used to examine the performance of the program. There are also a variety of commercial C++ performance profiling tools available to purchase. The most commonly used free profilers on Linux are prof, pixie and gprof.
The prof utility
Under Linux, and other variants of UNIX, the standard C profiling utility is called “prof”. This utility calculates the percentage of time taken by each function. This is valuable information when considering which functions to make more efficient.
To use prof, compile the program with the -p option to the compiler (strictly speaking, the -p option is needed only at the link stage of compilation) and then execute the program. Provided the program terminates normally or via exit, a data file called “mon.out” will be generated. This file contains the data to be used by prof in preparing an execution profile for the program. To examine this profile, type the command:
prof
If your executable is not called a.out, but say, my_prog, the command is:
prof ./my_prog
This command will generate a profile of your program’s execution from which the functions that use the most time can be identified. A sample of part of the output generated by prof is:
%time  seconds  cum %  cum sec  procedure (file)
 42.1   4.4700   42.1     4.47  strcmp (../strcmp.s)
 40.6   4.3100   82.7     8.78  CheckWord (spell1.c)
  5.9   0.6300   88.6     9.41  fgets (../fgets.c)
  4.3   0.4600   92.9     9.87  initialize (spell1.c)
  3.0   0.3200   96.0    10.19  tolower (../conv.c)
  1.5   0.1600   97.5    10.35  read (../read.s)
  1.0   0.1100   98.5    10.46  malloc (../malloc.c)
  0.8   0.0800   99.2    10.54  strlen (../strlen.c)
  0.5   0.0500   99.7    10.59  morecore (../malloc.c)
  0.1   0.0100   99.8    10.60  open (../open.s)
  0.1   0.0100   99.9    10.61  sbrk (../sbrk.s)
  0.1   0.0100  100.0    10.62  fstat (../fstat.s)
Note that the percentages calculated are only approximate because the profiler uses sampling techniques during interrupts and these samples might not provide a fully accurate picture. For example, if the program has a very small and fast function, this function might be completely missed.
The pixie utility
The pixie utility can be used under Linux or UNIX to get more accurate counts of the number of times each statement in a function is executed. Where the prof utility only produces estimates, based on statistical sampling of the program counter at regular intervals throughout the execution of the program, pixie measures the number of times each basic block is executed. A basic block is a sequence of code containing no branches.
The pixie utility is applied to the already generated executable file. There is no need to recompile the executable with the -p option. The command for pixie is simply:
pixie a.out
This will generate a new executable file, “a.out.pixie”, which when executed will generate a data file called “a.out.Counts”. A data file of function addresses called “a.out.Addrs” is also generated. The next step is to run the new executable:
a.out.pixie
After execution, the count file can be examined using either prof or pixstats. One possible command is:
pixstats a.out
The use of the prof command with the -pixie option is:
prof -pixie a.out
Both of these commands will generate a variety of information. prof with the “-pixie” option will generate an ordering of functions based on instruction cycle counts, another based on invocations, and a list of instruction counts for each basic block. pixstats generates a wealth of useful information, including summaries of opcode distributions and register usage.

For more information refer to the Linux manual entries for pixie, pixstats and prof.
Timing C++ Code
There are a number of reasons why it can be useful to time the execution of a program. Timing C++ code can be useful for determining which statements should be optimized, whereas profilers may only indicate which functions are consuming time. Timing code can also determine the relative efficiency of various operations, giving you valuable information about writing code for your machine (e.g. is shifting faster than integer multiplication?).
The time Command. If the full execution time for a program is all that is needed, the Linux time command can be used to measure the time required by a program. There are two versions — a stand-alone utility in /bin and a command built into csh. The command to run is usually:
time a.out
A different executable name can also be used, and command-line arguments can be specified.
Code Instrumentation. If a more detailed speed analysis is needed, it is possible to add C++ self-instrumentation code to your program to monitor its own performance. The basic idea is to use the standard library functions to monitor the time before and after an action.
The most useful function is “clock”, which counts the number of clock ticks since the program began executing. The “time” function, which keeps track of real calendar time, could also be used, but it is not a true indication of processor time on a large multi-user system. The clock function is correct for both single-user and multi-user systems.
The clock function returns a value of type clock_t (typically long or int) that counts the number of clock ticks. This value can be converted to seconds by dividing by the constant CLOCKS_PER_SEC, also declared in <time.h>.
The basic idea of timing C++ code blocks is to call the clock function before and after an operation and examine the difference in the number of clock ticks. The code below examines the relative speed of shift and multiplication operations on int operands.
#include <stdio.h>
#include <time.h>

void profile_shifts() {
    const int MILLION = 1000000;
    const int ITERATIONS = 100 * MILLION;
    int x = 1, y = 2, z = 3;
    clock_t before = clock();
    for (int i = 0; i < ITERATIONS; i++)
        x = y << z;
    printf("%d Shifts took %f seconds\n", ITERATIONS,
           (double)(clock() - before) / CLOCKS_PER_SEC);
    before = clock();
    for (int i = 0; i < ITERATIONS; i++)
        x = y * z;
    printf("%d Multiplications took %f seconds\n", ITERATIONS,
           (double)(clock() - before) / CLOCKS_PER_SEC);
}
clock Portability Pitfall. Note that some implementations on older Unix versions don’t conform to the C++ standard and return the number of clock ticks since the first call to the clock function. This means that a single call to clock at the end of the program would always return zero. Hence, it is more portable to measure the number of clock ticks between two calls to clock, one at the start and one at the end.
Obviously, you can also put the first call to “clock” at the start of the “main” function to avoid this rare glitch. Note that on implementations that are correct, a call at the start of “main” may return a non-zero value due to the overhead of global and static C++ object instantiations (i.e. constructors for global objects), which occurs before entering main.
Clock Tick Integer Division Pitfall. Note that the clock_t type and the CLOCKS_PER_SEC constant are both integers. Hence, here's a bug:
clock_t diff = clock() - before;
double seconds = diff / CLOCKS_PER_SEC;   // Bug!
The problem is that this is integer division, so it truncates to an integer and loses the fractional seconds. You need a typecast to float or double on either side of the division operator:
clock_t diff = clock() - before;
double seconds = diff / (double)CLOCKS_PER_SEC;   // Correct
Clock Tick Overflow Pitfall. The clock function also has a problem with wraparound on some implementations. Because of its high resolution, the number of clock ticks can quickly overflow the maximum value that can be stored in the type clock_t. On one system the clock function will wrap around after only 36 minutes. If the program being timed runs for longer than this period, the use of clock can be misleading. One solution is to use the “time” function rather than “clock” for longer executions, but this usually only has resolution to the nearest second.
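Here is a minimal sketch of the time-based alternative for long runs (note that it measures elapsed wall-clock time, not CPU time):

#include <stdio.h>
#include <time.h>

// Wall-clock timing via time() avoids clock_t wraparound on long runs,
// at the cost of only one-second resolution.
void time_long_run() {
    time_t start = time(NULL);
    // ... long-running inference here ...
    double elapsed = difftime(time(NULL), start);   // seconds elapsed
    printf("Elapsed: %.0f seconds\n", elapsed);
}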
Benchmarking Methods
Benchmark programs attempt to measure how quickly your machine executes certain instructions; this is better suited to examining a single multiplication operation than an entire AI inference operation. You mainly use benchmarking for code that runs in low-level kernels, such as CPU speedups (e.g. AVX intrinsics) or comparisons of different GPU primitives.
Consider benchmarking for timing low-level arithmetic operations on your platform. For example, how would you determine whether the integer multiplication operation x*2 could be more efficiently replaced by x<<1? How can you time these instructions? You obviously cannot just time a single operation of each with the “clock” function, because a single clock tick contains many CPU cycles. So, you have to time thousands or even millions of such operations.
for (int i = 0; i < 100 * MILLION; i++) {
    x << 1;
}
We've already noted one problem: there's extra loop overhead from the for loop's conditional test (the “<” operator) and its incrementer (i++). The loop actually has three operations that are all about the same order-of-magnitude cost (i.e. <, ++, <<). To get at the operator cost, we'd need to subtract out the loop overhead. We could, for example, time an empty loop without any loop body, and subtract that from our final cost.
Null effect problems. Another problem is that we cannot easily time the operators with these statements in the loop body:
x << 1;
x * 2;
The compiler is clever enough to notice that the x<<1 and x*2 statements have no effect in the program above (and gives “null effect” warnings). The built-in optimizer may even remove them completely. So, they won't get timed properly, or at all, even in a loop.
Add volatility? One possible solution is to force the compiler to avoid this optimization on the original expressions by declaring x as a “volatile” variable:
volatile int x = 0;
The volatile qualifier tells the compiler that all accesses to x are important, and that it should not remove any. The intended purpose of volatile is to allow the declaration of addresses for memory-mapped I/O, debugger-modified variables, or variables modified by other programs (e.g. a semaphore modified by another program running concurrently). However, we can use it here to force all accesses to x to occur even if they appear pointless. On the other hand, by doing this, we've lost the ability to see the “real” time cost of these operations when they're running in normal code. Most variables aren't volatile.
Anyway, it doesn't even work properly. Unfortunately, the computations of the << and * operators in x<<1 and x*2 are not being assigned anywhere, so the computations themselves could be optimized out, even though the actual read operations on x must occur because x is volatile. To force the << and * operations to occur, it is necessary to use their result somehow, such as by assigning it to the (volatile) variable x:
x = x << 1;
Although all of the above improvements will enhance the previous version, a far better method of improvement is to time a loop that performs a huge number of the operations. Hence, we have to use something like these assignment expressions inside a loop:
x <<= 1;
x *= 2;
The code given here examines the relative speed of a large number of shift and multiplication operations on int operands:
volatile int x = 0;   // volatile to prevent optimizations
clock_t before = clock();
for (int i = 0; i < ITERATIONS; i++)
    x = x << 1;
printf("%d Shifts took %f seconds\n", ITERATIONS,
       (double)(clock() - before) / CLOCKS_PER_SEC);
before = clock();
for (int i = 0; i < ITERATIONS; i++)
    x = x * 2;
printf("%d Multiplications took %f seconds\n", ITERATIONS,
       (double)(clock() - before) / CLOCKS_PER_SEC);
Loop Unrolling. Unfortunately, the above method of measuring the speed of operations is not completely accurate, because it also includes the loop overhead (incrementing i up to ITERATIONS) and the cost of the assignment of the result to x. The loop overhead can be minimized by placing many operations within the loop, as below:
volatile int x = 0;   // volatile to prevent optimizations
clock_t before = clock();
for (int i = 0; i < ITERATIONS; i++) {
    x = x << 1; x = x << 1; x = x << 1; x = x << 1; x = x << 1;
    x = x << 1; x = x << 1; x = x << 1; x = x << 1; x = x << 1;
    x = x << 1; x = x << 1; x = x << 1; x = x << 1; x = x << 1;
    x = x << 1; x = x << 1; x = x << 1; x = x << 1; x = x << 1;
}
printf("%d Shifts took %f seconds\n", ITERATIONS * 20,
       (double)(clock() - before) / CLOCKS_PER_SEC);
before = clock();
for (int i = 0; i < ITERATIONS; i++) {
    x = x * 2; x = x * 2; x = x * 2; x = x * 2; x = x * 2;
    x = x * 2; x = x * 2; x = x * 2; x = x * 2; x = x * 2;
    x = x * 2; x = x * 2; x = x * 2; x = x * 2; x = x * 2;
    x = x * 2; x = x * 2; x = x * 2; x = x * 2; x = x * 2;
}
printf("%d Multiplications took %f seconds\n", ITERATIONS * 20,
       (double)(clock() - before) / CLOCKS_PER_SEC);
Unfortunately, the assignment operations are needed to prevent the optimizer removing the computations, as discussed above. The only truly effective method of removing the cost of the assignment from the measurement is to time another separate loop, and subtract its time from that of the other loops, as below. This method also automatically accounts for the loop overhead cost, so the multiple operations inside each loop are not needed (and in fact would be incorrect). Our final version of the benchmark program is also made more sophisticated to output the relative magnitude of the two operations:
#include <math.h>
#include <stdio.h>
#include <time.h>

void profile_shifts4() {
    const int MILLION = 1000000;
    const int ITERATIONS = 1000 * MILLION;
    volatile int x = 0;   // volatile to prevent optimizations
    double time1, time2;

    // Time the loop overhead
    clock_t before = clock();
    for (int i = 0; i < ITERATIONS; i++)
        x = 1;
    clock_t loop_cost = clock() - before;   // overhead
    double ovtime = (double)loop_cost / CLOCKS_PER_SEC;
    printf("%d overhead: %f seconds\n", ITERATIONS, ovtime);

    // Shifts
    before = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        x = x << 1;
    }
    time1 = (double)(clock() - before - loop_cost) / CLOCKS_PER_SEC;
    printf("%d Shifts took %f seconds\n", ITERATIONS, time1);

    // Multiplications
    before = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        x = x * 2;
    }
    time2 = (double)(clock() - before - loop_cost) / CLOCKS_PER_SEC;
    printf("%d Multiplications took %f seconds\n", ITERATIONS, time2);

    // Compare both times, and print percentage difference
    const float ACCURACY = 0.00001f;      // maximum error
    if (fabs(time1 - time2) < ACCURACY)   // (almost) equal?
        printf("Shift and multiplications: same time\n");
    else if (time1 < time2) {
        printf("Shifts faster by %5.2f percent\n",
               (time2 - time1) / time2 * 100.0);
    } else {
        printf("Multiplications faster by %5.2f percent\n",
               (time1 - time2) / time1 * 100.0);
    }
}
Limitations of Benchmarking. Benchmarking of C++ using these timing methods is not perfect, but I've always found it useful. There are various reasons why this type of benchmarking may not produce fully correct timing results:
- Hard to account for parallelism (e.g. GPU throughput)
- Single-threaded code is not always a true representation.
- Pipelining speedups often differ in production code (even for sequential CPU code, such as AVX intrinsics).
- Loop overhead is hard to separate from the raw operations (as seen above!)
- Compiler optimizations might modify or even remove the operations being benchmarked.
- Memory cache hit rates are too high because you're running tight code accessing only a few addresses.
- Optimization levels in test mode might not match your production version.
- Debug modes might not match production (e.g. if running in a debugger).
- Pipelining by the CPU of many instructions makes it appear better than reality.
- Unrealistic non-production conditions are being tested.
Compiler optimizations. In this day and age of amazing optimization algorithms, note that on some platforms the benchmarking code above may indicate that shifts and multiplications cost exactly the same. This is most likely an indication that the compiler automatically optimizes any multiplications by powers of two into left shifts. To get the true cost of a multiplication, the expression should be:
x = x * x;
But even this might be optimized algebraically by a compiler. The only way to know for sure what's actually being benchmarked is to examine the assembly language.
Examining Assembly Output
Another way of examining the relative costs of particular operations for a particular compiler is to examine the assembly language produced by the compiler. Many compilers have an option to produce assembly language output. For example, under Linux the command may be:
gcc -S main.cpp
This will produce the assembly language listing for the C++ source file and store it in a new file “main.s” as a human-readable text file. Without the -S option, the assembly output would have been passed to the assembler to create the machine code executable.
GCC also has a “-masm” option that controls the different “dialects” of assembly language (e.g. “intel” or “att”). GCC also has a verbosity control on assembly output via the “-fverbose-asm” and “-fno-verbose-asm” options.
Another way to generate assembly with GCC is the “-save-temps” option. This option tells GCC to save the temporary assembly language file that it used for the real compilation. Hence, this option can be used with the normal compilation mode to both build the code as normal and also output a “.s” assembly file. The advantage of this GCC “-save-temps” option over “-S” is that you don't need to create a separate build path for generating assembly text files.
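For example (assuming GCC; the exact set of temporary files follows GCC's conventions):

gcc -O2 -c -save-temps main.cpp   # builds main.o, leaving main.ii and main.s behind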
Reviewing assembly code. Examining assembly language instructions produced for C++ operations can be very enlightening. For example, you can determine whether the compiler uses a special increment instruction for the ++ operator. Whether or not the compiler is performing various optimizations can also be examined.
Counting the number of assembly instructions is a simple measure and gives a reasonable indication of how efficiently an operation will be performed. A better method is to determine the number of cycles used by each instruction, but this requires a rather more intimate knowledge of the assembly language being used.
Many useful things can be discovered by examining assembly output. For example, does the expression x*2 generate a multiply instruction or a shift instruction (or an addition instruction to do “x+x”)? Does the compiler notice that x=x+1 can be replaced by x++? Is the integer % remainder operator implemented by a sequence of instructions?
Consider the use of the relational operators (e.g. >, <) in expressions such as:
flag = x > y;
This will often produce a sequence of instructions because of the need to assign flag the value either 0 or 1. The instructions may well look like the following pseudo-assembly language:
    LOAD  10($sp)   # Load x (from stack)
    CMP   12($sp)   # Compare with y (on stack)
    BGT   $1        # Branch if greater than
    LOAD  0         # Result of > operation is 0
    JUMP  $2
$1: LOAD  1         # Result of > operation is 1
$2: STORE 14($sp)   # Store in flag (on stack)
However, review the assembler for the similar test in if statements, such as:
if (x > y) ...
For an if statement, the instructions need not be as complex, because there is no need to store the value 0 or 1 anywhere. The assembly language could be similar, using branches without computing a value:
    LOAD 10($sp)    # Load x (from stack)
    CMP  12($sp)    # Compare with y (on stack)
    BLE  $1         # Branch if NOT greater than
    ...             # Code for if statement body
$1: ...             # Statements after if statement
Examining Object Files
The objdump command is another useful tool on Linux for analyzing binary object files. DUMPBIN is the comparable tool on Windows for MSVS (or you can use the LINK command with the “/DUMP” option). These tools get to the assembly language text in reverse, by disassembling the binary instructions that are in the object file, in combination with the various symbolic information.
objdump can be used to examine object files in various ways and there are various useful options. The “-d” and “-D” options provide disassembly, where you can examine a full dump of the assembly code in printable form (as an alternative path to the compiler's “-S” option). The “-h” option shows the headers of the object file and “-g” shows debugging information in the file. There are numerous other options, and the “--help” option can be used to list them all.
The objdump command is part of GNU Binutils, which also includes other useful binary file tools such as the nm, size, strip, and strings utilities.
DUMPBIN also has various options that can be used on the Windows command line. The default is “/SUMMARY”, which gives a summary of the information about the object file. The “/DISASM” option shows the disassembly of the object file in assembly language. Also useful is “/SYMBOLS” to show the symbolic names.
Reducing Build Time
The build phase of a large piece of software like an AI engine is a significant time cost, and can become a bottleneck to the development process. If you're using CI/CD then a new build kicks off every time you commit code. If there's a lot of team members, there's regular commits, and many daily builds. So, the build time becomes an important productivity measure.
In fact, the builds can get too long and leave programmers waiting on the automated acceptance testing results after their commits. Builds can also start queueing up if you're not careful. This can happen if builds are too long, or if the team is so large that there's an endless stream of commits. You might want to instigate a process whereby there are small builds for automated approvals and immediate failure feedback on commits, but a much bigger “nightly build” which runs all the biggest test suites, compiles on multiple platforms, gathers compiler warnings and static analysis results, reports on test coverage computations, and all of the other time-intensive automatic testing.
Reducing Compile Time. Reducing compile-time is a small method of improving the programmer’s use of time. It can be more important than reducing overall build time, because coders are usually doing incremental compiles within their area of focus, rather than a full-blown build. Programmers need to re-compile over and over all day long whenever they're debugging.
Modern C++ compilers are incredible and can crank through huge amounts of source code. Although the speed of compilation largely depends on the ingenuity of the implementor of your compiler, there are a few techniques that can make your programs compile more quickly.
- Turn down the optimization settings.
- Use precompiled headers.
- Block re-included header files (i.e. #ifdef guard macros or “#pragma once”), as shown in the sketch after this list.
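For example, a sketch of a classic include guard (the macro name is illustrative):

// Include guard: the header body is compiled only once per translation unit.
#ifndef MYPROJECT_TENSOR_H
#define MYPROJECT_TENSOR_H

// ... declarations for this header ...

#endif  // MYPROJECT_TENSOR_H

The “#pragma once” directive achieves the same effect on compilers that support it, without the risk of macro name collisions.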
Some compilers support an option called “precompiled headers” whereby the compiler stores the state of internal tables, such as the symbol table, in a data file. Instead of then processing the text in the header files the compiler simply loads the data file and ignores the header files. This saves the compile-time used in processing the declarations in header files.
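For example, with GCC a header can be precompiled as follows (file names are illustrative; GCC automatically uses the .gch file whenever the header is included):

g++ -O2 -x c++-header pch.h           # produces pch.h.gch
g++ -O2 -c -include pch.h main.cpp    # compiles using the precompiled header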
Modularity for Incremental Builds. The best method of reducing compile-time during the testing-debugging phase of program development is to break the program into a large number of small C++ source files, or smaller modularized libraries. In this way, only the files that need to be recompiled into object files are processed in an incremental rebuild, although all object files are still linked in creating the final executable. And the use of multiple files and libraries is also good programming style, which is a bonus.
The method of achieving this automatic incremental rebuilding of object files depends on the environment. Personally, I am addicted to the “make” utility on Linux (e.g. with “makedepend”), whereas MSVS has incremental builds largely automated in the C++ IDE on Windows. You might also prefer a more sophisticated build tool like CMake, Jenkins, or Gradle.
On the other hand, how much time have I wasted debugging a bug fix that didn't work properly, only to find it hadn't been rebuilt properly? Nothing beats a full rebuild.