Benchmarking Methods

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Benchmark programs measure how quickly your machine executes particular instructions, which is more useful for examining a single multiplication operation than an entire AI inference operation. You mainly use benchmarking for code that's running in low-level kernels, such as CPU speedups (e.g. AVX intrinsics) or comparisons of different GPU primitives.

Consider using benchmarking to time low-level arithmetic operations on your platform. For example, how would you determine whether the integer multiplication x*2 could be more efficiently replaced by the shift x<<1?

How can you time these instructions? You obviously cannot just time a single operation of each with the “clock” function, because a single clock tick contains many CPU cycles. So, you have to time thousands or even millions of such operations:

    const int MILLION = 1000000;
    for (int i = 0; i < 100 * MILLION; i++) {
        x << 1;   // shift operation to be timed
    }

We've already noted one problem: there's extra loop overhead from the for loop's conditional test (the “<” operator) and its incrementer (i++). The loop actually has three operations that are all about the same order-of-magnitude cost (i.e. <, ++, <<). To get at the operator cost alone, we'd need to subtract out the loop overhead. We could, for example, try to time an empty loop without any loop body, and subtract that from our final cost.
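A minimal sketch of that idea, reusing the MILLION constant from above:

    clock_t before = clock();
    for (int i = 0; i < 100 * MILLION; i++) {
        // empty body: measures the loop overhead only
    }
    clock_t overhead = clock() - before;

One caveat: an optimizer may delete an empty loop entirely, which is why the final version at the end of this section assigns to a volatile variable inside its overhead loop instead.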

Null effect problems. Another issue is that we cannot easily time the operators with these statements in the loop body:

    x << 1;
    x * 2;

The compiler is clever enough to notice that the x<<1 and x*2 statements have no effect in the program above (and gives “null effect” warnings). The built-in optimizer may even remove them completely. So, they won't get timed properly, or at all, even in a loop.
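As an aside, most compilers only report these warnings when warnings are enabled. A sketch, assuming GCC and a hypothetical file name:

    g++ -Wall -c nulltest.cpp
    # typical warning: "statement has no effect" (GCC) or "expression result unused" (Clang)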

Add volatility? One possible solution is to force the compiler to avoid this optimization on the original expressions by declaring x as a “volatile” variable:

    volatile int x = 0;

The volatile qualifier tells the compiler that all accesses to x are important, and that it should not remove any. The intended purpose of volatile is to allow the declaration of addresses for memory-mapped I/O, debugger-modified variables, or for variables modified by other programs (e.g. a semaphore modified by another program running concurrently). However, we can use it here to force all accesses to x to occur even if they appear pointless.
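For example, here is a hypothetical memory-mapped I/O declaration; the register address is made up for illustration:

    // Hypothetical hardware status register at a made-up address
    volatile int *status_reg = (volatile int *)0x40001000;
    while (*status_reg == 0) {
        // Busy-wait: each read of *status_reg must really occur
    }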

On the other hand, by declaring x as volatile, we've lost the ability to see the “real” time cost of these operations when they're running in normal code. Most variables aren't volatile.

Anyway, it doesn't even work properly. Unfortunately, the computations of the << and * operators in x<<1 and x*2 are not being assigned anywhere, so the computations themselves could be optimized out, even though the actual read operations on x must occur because x is volatile. To force the << and * operations to occur, it is necessary to use their result somehow, such as by assigning it to the (volatile) variable x:

    x = x << 1;

Although all of the above improvements help, a far better approach is to time a loop that performs a huge number of the operations. Hence, we have to use something like these assignment expressions inside a loop:

    x <<= 1;
    x *= 2;

The code given here examines the relative speed of 10,000 shift and multiplication operations on int operands:

    const int ITERATIONS = 10000;
    volatile int x = 0; // volatile to prevent optimizations
    clock_t before = clock();
    for (int i = 0; i < ITERATIONS; i++)
        x = x << 1;
    printf("%d Shifts took %f seconds\n", ITERATIONS,
        (double)(clock() - before) / CLOCKS_PER_SEC);
    before = clock();
    for (int i = 0; i < ITERATIONS; i++)
        x = x * 2;
    printf("%d Multiplications took %f seconds\n", ITERATIONS,
        (double)(clock() - before) / CLOCKS_PER_SEC);

Loop Unrolling. Unfortunately, the above method of measuring the speed of operations is not completely accurate, because it also includes the loop overhead (incrementing i from 0 up to 10,000) and the cost of the assignment of the result to x. The loop overhead can be minimized by placing many operations within the loop, as below:

    volatile int x = 0; // volatile to prevent optimizations
    clock_t before = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
        x = x << 1; x = x << 1; x = x << 1; x = x << 1;
    }
    printf("%d Shifts took %f seconds\n", ITERATIONS * 20,
        (double)(clock() - before) / CLOCKS_PER_SEC);
    before = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
        x = x * 2; x = x * 2; x = x * 2; x = x * 2;
    }
    printf("%d Multiplications took %f seconds\n", ITERATIONS * 20,
        (double)(clock() - before) / CLOCKS_PER_SEC);

Unfortunately, the assignment operations are needed to prevent the optimizer from removing the computations, as discussed above. The only truly effective method of removing the cost of the assignment from the measurement is to time another separate loop, and subtract its time from that of the other loops, as below. This method also automatically accounts for the loop overhead cost, so the multiple unrolled operations inside each loop are no longer needed (and in fact would make the subtraction incorrect). Our final version of the benchmark program is also made more sophisticated, so as to output the relative magnitude of the two operations:

    #include <stdio.h>   // printf
    #include <time.h>    // clock, CLOCKS_PER_SEC
    #include <math.h>    // fabs

    void profile_shifts4()
    {
        const int MILLION = 1000000;
        const int ITERATIONS = 1000 * MILLION;
        volatile int x = 0; // volatile to prevent optimizations
        double time1, time2;

        // Time the loop overhead
        clock_t before = clock();
        for (int i = 0; i < ITERATIONS; i++)
            x = 1;
        clock_t loop_cost = clock() - before; // overhead
        double ovtime = (double)(loop_cost) / CLOCKS_PER_SEC;
        printf("%d overhead: %f seconds\n", ITERATIONS, ovtime);

        // Shifts
        before = clock();
        for (int i = 0; i < ITERATIONS; i++) {
            x = x << 1;
        }
        time1 = (double)(clock() - before - loop_cost) / CLOCKS_PER_SEC;
        printf("%d Shifts took %f seconds\n", ITERATIONS, time1);

        // Multiplications
        before = clock();
        for (int i = 0; i < ITERATIONS; i++) {
            x = x * 2;
        }
        time2 = (double)(clock() - before - loop_cost) / CLOCKS_PER_SEC;
        printf("%d Multiplications took %f seconds\n", ITERATIONS, time2);

        // Compare both times, and print percentage difference
        const float ACCURACY = 0.00001f; // maximum error
        if (fabs(time1 - time2) < ACCURACY) // (almost) equal?
            printf("Shift and multiplications: same time\n");
        else if (time1 < time2) {
            printf("Shifts faster by %5.2f percent\n",
                    (time2 - time1) / time2 * 100.0);
        } 
        else {
            printf("Multiplications faster by %5.2f percent\n",
                (time1 - time2) / time1 * 100.0);
        }
    }
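To run the benchmark, a minimal driver is all that's needed. A sketch, assuming the headers and function above are saved in a file such as bench.cpp (the file name is an assumption):

    int main()
    {
        profile_shifts4(); // run the shift-vs-multiply benchmark
        return 0;
    }

Compile it with the same optimization level as your production build (e.g. -O2), since the optimizer is part of what's being measured.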

Limitations of Benchmarking. Benchmarking of C++ using these timing methods is not perfect, but I've always found it useful. There are various reasons why this type of benchmarking may not give fully accurate timing results:

  • Hard to account for parallelism (e.g. GPU throughput).
  • Single-threaded code is not always a true representation.
  • Pipelining speedups often differ in production code (even for sequential CPU code, such as AVX intrinsics).
  • Loop overhead is hard to separate from the raw operations (as seen above!).
  • Compiler optimizations might modify or even remove the operations being benchmarked.
  • Memory cache hit rates are unrealistically high because tight benchmark code accesses only a few addresses.
  • Optimization levels in test mode might not match your production version.
  • Debug modes might not match production (e.g. if running in a debugger).
  • CPU pipelining of many identical instructions can make an operation appear faster than it is in real code.
  • Unrealistic non-production conditions are being tested.

Compiler optimizations. In this day and age of amazing optimization algorithms, note that on some platforms the benchmarking code above may indicate that shifts and multiplications cost exactly the same. This is most likely an indication that the compiler automatically optimizes any multiplications by powers of two into left shifts. To get the true cost of a multiplication, the expression should be:

    x = x * x;

But even this might be optimized algebraically by a compiler. The only way to know for sure what's actually being benchmarked is to examine the assembly language.
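For example, GCC and Clang both emit assembly with the -S option (the file name continues the earlier assumption):

    g++ -O2 -S bench.cpp    # writes bench.s; check whether x*2 became a shift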
