Aussie AI

13. Algorithm Speedups

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“I would like to die on Mars. Just not on impact.”

— Elon Musk.

 

 

Algorithm Optimization Techniques

AI engines feature a very heavy algorithm running against a monolithic data structure. This chapter presents some of the theory of the general techniques for optimizing algorithms, and subsequent chapters show many ways to use them in your engine.

Changing the underlying algorithms used by the program is often the only real way to gain a large speed increase. In particular, the algorithms and data structures used can often be modified to give a significant speed increase. Is there a better way to do what your program does? Is it doing too much unnecessary calculation? Although much depends on the programmer’s ingenuity, there are some common techniques for improving performance of algorithms.

  • Parallelization and vectorization
  • Precomputation (save time by using space)
  • Recomputation (save space by using time)
  • Caching and computation reuse
  • Greedy algorithms (immediate computation)
  • Skipping algorithms
  • Arithmetic strength reduction
  • Integer arithmetic
  • Change recursion to loops
  • Incremental algorithms
  • Choose a better data structure

The idea of “skipping” computations also has various sub-methods:

  • Lazy algorithms (delay computation until needed)
  • Common case first
  • Simple case first
  • Approximate tests first

Lookup Table Precomputation

Lookup tables are so widely used in AI engines that they're usually abbreviated as LUTs. The aim is to precompute results and replace frequently called costly function evaluations with table lookup (i.e. array references). Note that this use of precalculation is only worthwhile if some calculations are repeated and computing the same result.

As an example, we can replace a call to “sqrtf” with a precalculated table of square roots. In the subsequent calculations where square root is needed, a call to the sqrtf function is replaced by a table lookup.

The precalculation uses two separate functions: one to perform the precalculation, and another to access the values by table lookup. The precalculate function must be called once via a global initialization routine for the class. Alternatively, every call to the square_root function could self-check a static Boolean flag indicating whether the values have been precalculated yet, and call the precalculate function if not, but this is needlessly slower for every access.

Even more efficient is to use “offline precomputation” before your program even runs. This is a more efficient method whereby the data is not precalculated during initialization of the program, but is done earlier in an “offline” mode (e.g. as part of your build process). For example, the precomputed results are either stored to a data file, or converted to a C++ source file that is linked.

Another good example of precalculation is the Boolean functions on characters (e.g. isupper). To improve performance, it is possible to implemented these functions as a precomputed array of 256 bool values, or 256 bytes with 0 if isupper is false, and 1 if isupper is true. Then isupper is evaluated by indexing the character into the precomputed table:

    #define isupper(ch) ( precomputed_array[ch] )

In fact, many C++ compilers implement isupper and other functions in <ctype.h> as a table lookup over the 256 characters (plus an extra one for EOF), with a precalculated single bit flag per function — that is, one bit indicating isupper, another bit for islower, etc.

Lazy Evaluation

The idea of lazy evaluation is a slight amendment to precalculation or data structure augmentation. Full precomputation during program startup can be inefficient if only some of the values are needed.

Lazy evaluation works in a “lazy” manner, by only doing work when asked. Instead of precalculating every result, results are calculated only as needed. To use this method, some way is needed of indicating whether a result is already in the table. When seeking a result, it is necessary to check if the required value is already present. If so, table lookup is used to get the result. If not, the value must be calculated, stored in the table and that entry marked as present.

The precomputation of sqrtf can be modified to become lazy evaluation by adding another array of Boolean flags, indicating which of the square roots have been computed. When calculating a square root, the function checks if it has been computed, and calculates it if not.

    float square_root_lazy_eval(int n)
    {
        static float sqrt_table[NUM_PREC + 1]; // values
        static bool precalc[NUM_PREC + 1];     // flags

        if (!precalc[n]) { // precalculated?
            sqrt_table[n] = sqrtf((float)n); // real sqrt
            precalc[n] = true; // Mark as computed
        }
        return sqrt_table[n];
    }

The use of lazy evaluation is slower than complete precalculation if all of the values are eventually calculated, because of the overhead of checking whether calculation is needed. Also, there's only an efficiency gain for values that are calculated twice or more. However, lazy evaluation can make the program faster overall if not all calculations are needed, but some are needed many times. Any unnecessary calculations are avoided. How lazy!

Source Code Precomputation

The examples of the precomputation of square roots in the previous two sections are not particularly efficient because they must still call the sqrtf function a number of times. A far more efficient alternative is to use C++’s compile-time initialization of arrays to set up the precomputed sqrt_table array inside the C++ source code. Hence, the square_root function becomes a simple lookup into an array variable as follows. Note that the array is declared as “static” so that the initialization occurs at compile-time.

    float square_root_precalc(int n)
    {
        const int NUM_PRECALC = 100; // Precalculate to 100
        static float sqrt_table[] = {
          0.000000f, 1.000000f, 1.414214f, 1.732051f,
          2.000000f, 2.236068f, 2.449490f, 2.645751f,
          2.828427f, 3.000000f, 3.162278f, 3.316625f,
          //... etc .....
        };
        if (n >= NUM_PRECALC) return sqrtf((float)n);
        return sqrt_table[n];
    }

The simplest way to produce the values for the precomputed array is to write another program to produce them. Once the values are produced, this program could be discarded, or it could be left in the build process. The following program was used to produce the declaration of sqrt_table used in the square_root function given above. The output from the following program was copy-pasted into the source code for the program above.

    void generate_sqrt_table()
    {
        const int NUM = 100; // Precalculate to 100
        printf("static float sqrt_table[] = {\n");
        for (int i = 0; i < NUM; i++) {
            printf("%ff", sqrtf((float)i));
            if (i + 1 < NUM)
                printf(", "); // comma after all but last
            if (i % 4 == 3 && i + 1 < NUM)
                printf("\n"); // newline every 4 numbers
        }
        printf("\n};\n"); // finish off declaration
    }

Source code precomputation should always be more efficient than lazy evaluation and run-time precomputation. However, source code precomputation is only applicable when the function can be computed at compile-time (e.g., any “constexpr” function). If the computation involves any variables whose values are known only at run-time, either lazy evaluation or run-time precomputation may be needed.

Incremental Algorithms

It is often easier to modify what has already been done than to start from scratch. This idea can be used to write faster algorithms. However, changing an existing algorithm to use incremental calculations will usually require a total redesign of the algorithm.

A simple example of an incremental algorithm is counting the number of symbols in a hash table. The non-incremental way to count them is to traverse the hash table, counting the number of entries along each hashed chain. The incremental method is to keeping a running count — increment it when a symbol is inserted; decrement it when a symbol is deleted. The incremental method is better if the count will be required many times. If the count is not required, there has been a small amount of unnecessary overhead.

Another good example appears in graphics animation when managing the buffers. When displaying a new screen, it is usually more efficient to change the existing screen buffer than to redraw the whole screen. The idea is to set only those pixels that need to be changed.

For another example, a chess-playing program uses a game tree and the minimax algorithm with a static evaluation function. This function usually analyses the material balance (i.e. how many pieces each side has), along with other chess strategy factors. A simple but inefficient method of computing the material value of a position is to add the values of each piece on the 64 squares. The efficient incremental algorithm is to subtract the value of the piece from a running count whenever any piece is captured by the opponent.

Common Case First

When testing for a number of different conditions, it is best to test the most common case first. If it is true, the other tests are not executed. When using multiple if-else-if statements, place the common case first. For example, consider the binary search function:

    if (key > a[i]) {
        // ...
    }
    else if (key < a[i]) {
        // ...
    }
    else { // equality
        // ...
    }

Equality is least likely of all the three conditions, and hence it goes last. Greater-than and less-than are more common, so they go first.

The idea of common case first also appears in Boolean expressions using && or ||. The short-circuiting of these operators makes them very efficient when the common case is first. For ||, the most likely condition should be placed first (i.e. most likely to be true). For &&, the most unlikely condition should be placed first (i.e. most likely to be false).

Simple Case First

This method is similar to common case first — the idea is to test the simplest condition first. More complicated and time-consuming computations can be avoided if the first test succeeds (or fails, depending on the context). This idea appears in two main situations:

  • if-if construct (nested if statements), and
  • logical operators (&& and ||).

The simplest test should be the first of a pair of nested if statements and should also be the first operand of a && or || operator. In the examples below, the sub-expression “x!=0” is evaluated first because it is the simplest and hence the least expensive to evaluate. This is the nested-if example:

    if (x != 0) {
        if (expensive_fn(x) != 0) {
            // ...
        }
    }

This is the && short-circuiting method:

    if (x != 0 && expensive_fn(x) != 0) {
        // ...
    }

Special Solution of Simple cases

In addition to putting a simple case first, it can also be efficient to solve simple cases differently to the general case. When solving a problem, simple cases can often be solved by specially designed fast functions. These “special solutions” can involve table lookup of precalculated values (e.g. storing the first ten factorials in an array) or just a fast algorithm for small cases (e.g. sorting less than five numbers quickly).

In general, the special solution of simple cases will give some speed increase if the simple cases are fairly common. The advantage of simple case precalculation over full precalculation is flexibility — it is not limited to those values that can be stored in a fixed size table.

The use of table lookup for simple cases for the factorial function is shown below. The use of the method here gives speed increase for all cases, not just the simple ones, because the recursive definition of factorial eventually breaks the problem down to a simple case.

    int factorial_precalc(int n)
    {
        const int NUM_PRECALC = 5; // How many
        static int s_precalc[NUM_PRECALC + 1] = 
            { 1, 1, 2, 6, 24, 120 };

        if (n <= NUM_PRECALC)
            return s_precalc[n];
        else
            return n * factorial_precalc(n - 1);
    }

Approximate Tests

Many algorithms can be improved by avoiding complex calculations with a fast preliminary test that is often successful. This is a special type of common and simple case optimization combined. This method is only worthwhile when avoiding the complicated test is highly probable; if avoiding it is unlikely, the extra simple test reduces efficiency because it adds (slightly) to the run-time cost.

Zero skipping. In an AI engine, a common example is “zero skipping.” A low-cost test of a weight against zero can avoid the complexity of computing vector and matrix operations with that weight.

Bounding Sphere Tests in Ray Tracing. As an example in 3D graphics, to implement a ray tracing algorithm for graphical image rendering, it is necessary to determine whether a ray strikes an object. Since the objects are often complex and more often than not the ray will miss an object by a large amount of space, a simple test can be used to quickly identify rays that are close enough to the object to intersect with it. A good simple test is to determine if the ray intersects with the bounding sphere of an object, as it is relatively efficient to determine this. If the ray does intersect the sphere, the more expensive tests are applied to determine if the ray intersects with the object. If the ray does not intersect with the sphere, the cost of the more expensive tests has been avoided. Interestingly, the simplicity of testing the intersection of a ray with a sphere helps explain why there are so many ray-traced images of spherical objects.

Bounding-box 2D collision detection. The similar idea of a bounding rectangle is useful for collision detection in coding 2D arcade games. Collision detection usually involves testing many pairs of objects in a two-dimensional setting, and the tests are complicated because of the different shapes of the objects. The more complicated tests can be avoided by examining whether the bounding rectangles of each object are intersecting. If they do intersect, then a closer examination of whether the objects have pixels that overlap is carried out.

Rectangle Shapes. For yet another example of using a simple test to avoid complicated tests, consider the problem of a GUI-based drawing program. Typically, the user can select a vertex (e.g. the end of a line segment) by clicking “close” to the vertex. In other words, the user must click the mouse within a specified radius of the point. Hence, when the mouse is clicked, the program must compare the mouse location with all the currently active vertices. The obvious method is to use the distance formula for two points and apply the following test on the x and y coordinates of the mouse and all points:

   const float DISTANCE = 2.0f;
   float diffx = xMouse - xPoint;
   float diffy = yMouse - yPoint;
   float distance = sqrtf( diffx * diffx + diffy * diffy);
   if (distance <= DISTANCE) {
        // clicked! ...
   }

Firstly, the efficiency of this test can be improved simply by avoiding the calculation of the square root. Squaring both sides of the equation gives the equivalent test:

   float distance_squared = diffx * diffx + diffy * diffy;
   if (distance_squared <= DISTANCE * DISTANCE) {
        // clicked! ...
   }

However, the multiplications involved in computing the squares of the two sub-expressions on the left are quite expensive, although the square on the right-hand side will be a compile-time constant. A simple test can be used to avoid the expensive multiplications in most cases. If the difference between either the x or the y coordinates is greater than DISTANCE, then the points cannot be close enough. Although the cost of these tests is quite high because the absolute value of the difference must be found, it should still cost less than two multiplications, and will be more efficient if there are many widely spaced points to be tested. The code using this idea is:

    bool check_point_clicked(int xm, int ym, int xp, int yp)
    {
        const float DISTANCE = 2.0f;
        int xd = xp >= xm ? xp - xm : xm - xp;
        if (xd > DISTANCE)
            return false;
        int yd = yp >= ym ? yp - ym : ym - yp;
        if (yd > DISTANCE)
            return false;
        return xd * xd + yd * yd <= DISTANCE * DISTANCE;
    }

Of course, algorithm improvements are even more effective. The best way of improving the efficiency of this program is to avoid the need for multiplications entirely, by changing the program specifications (!) so that the definition of clicking “close enough” to a vertex with a mouse refers to clicking within a square around the point, instead of a circle. Squares don't need multiplication.

Augmenting Data Structures

An interesting type of caching is where the data is stored inside the main data structure, rather than in a separate cache. Instead of recalculating derivative data every time you need it, a faster way is to store the data in the data structure. This is a form of caching that saves the time of recalculation, which need be done only once. If the data ever changes, the calculations must be redone and stored again. Hence, this method works best where data is unchanging, but can also tolerate modifications.

As an example of augmentation, consider a struct defined to represent a line segment (e.g. in a CAD drawing program). The struct contains four fields, for the x and y coordinates of the start and end points:

    struct line_segment {
        int x1, y1; // Start point
        int x2, y2; // End point
    };

Consider the computation of the length of the line segment, using:

    float flen = sqrtf((y2 - y1) * (y2 - y1) 
                     + (x2 - x1) * (x2 - x1));

If the length is a common calculation, it can be beneficial to cache the length of the line segment as an extra field in the struct:

    struct line_segment {
        int x1, y1; // Start point
        int x2, y2; // End point
        float length; // Length of line segment
    };

Whenever this length is needed during calculation it is immediately available as a field member. However, it is important to be careful that there is no consistency problem (where the length field is not the true length of the line segment). The main danger is that the length field won’t be recalculated every time one of the other fields change.

 

Next: Chapter 14. C++ Memory Optimizations

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs

Generative AI in C++ The new AI programming book by Aussie AI co-founders:
  • AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++