Aussie AI
15. Loop Vectorization
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“I'll be back.”
— The Terminator, 1984.
Sequential vs Parallel Loop Optimizations
Loops are often sources of inefficiency and can be optimized in numerous ways. And the basic algorithms for neural networks are full of loops, with nesting to multiple levels in tensor operations. Increasing the throughput of data through the GPU is one of the main goals of loop optimization.
Not all loop transformations are created equal. Some of them are best for sequential code optimizations, whereas other loop transformations are used to parallelize loops for vectorization.
Loop transformations that are good for both sequential and parallel loop optimization include:
- Loop unrolling — repeat the loop body to reduce loop test overhead and parallelize the loop body.
- Loop peeling — unroll the first few iterations.
- Loop coalescing — flatten nested loops.
- Loop splitting — split out subportions of the iteration range.
- Loop collapsing — another way to flatten nested loops.
- Loop interchange — switch the inner and outer loop iterators of nested loops.
- Loop reordering — change the ranges and arrangements of inner/outer nested loops.
Some loop transformations are mainly for sequential improvements, and are not parallelization in themselves. However, these techniques can sometimes help with parallelization if they enable another followup loop parallelization optimization. Loop transformation optimizations which tend to be good for sequential code optimizations but not parallelization include:
- Loop fusion — combine or “fuse” the bodies of two loops.
- Duff's device — amusing but impractical coding trick for loop unrolling.
- Loop code motion — move or “hoist” loop-invariant calculations from the loop body to pre-loop initialization.
- Loop perforation — randomly skip a subset of loop iterations; it's really a thing.
- Loop sentinel — fake it till you make it.
- Loop iterator strength reduction — change “*” to “+” if you can.
- Loop reversal — going backwards, and yet, still making progress!
Loop optimizations whose main goal is parallelization via vectorization of the loop body include:
- Loop fission — opposite of loop fusion; split a single loop body into two loops.
- Loop tiling — process sub-parts of contiguous data in separate loops.
- Loop distribution — split two sub-parts of a loop body into two simpler separate loops.
Loop Fusion
Loop fusion is a well-known code optimization where two separate loops are merged into a single loop. This does not change the amount of in-loop computation in either loop body, but reduces the loop overhead of the exit test by half. There is also often a benefit from data locality that reduces data movement and temporary data storage, which can also improve overall speed.
Note that loop fusion is not great at vectorization, because complicated loop bodies are actually harder to parallelize. Most of the benefits arise in traditional sequential code execution, which is why its theory dates back many decades. For modern parallel execution on GPUs, loop fusion is often a poor choice, and more benefits may arise from loop fission (the opposite of fusion) and loop vectorization.
Example: Loop Fusion: The general idea is to combine the body of two loops into a single loop. Here is a simplistic example with the (non-fused) loops for initializing two vectors using two sequential loops:
for (i = 0; i < n; i++) v1[i] = 0;
for (i = 0; i < n; i++) v2[i] = 0;
And here is the version with loop fusion:
for (i = 0; i < n; i++) {
    v1[i] = 0;
    v2[i] = 0;
}
Note that the loop fusion version incurs the same number of assignments for initialization, but only half of the loop overhead cost (i.e. half of the “i < n” and “i++” operators have been optimized away). And for the sake of argument, let's pretend we don't know a better way to initialize a vector in C++ like memset or calloc or load-time static variable initialization.
Loop Perforation
The intentional introduction of randomness to code is known as a “stochastic” algorithm. Personally, I'm more familiar with the unintentional introduction of randomness, otherwise known as a “bug,” but now when it happens you can tell your boss that you were adding “stochastic functionality.” Various research has shown that AI models have some resilience to the introduction of randomness. Paradoxically, “stochastic” algorithms can even be beneficial to accuracy when used during training. Another intentional use of random numbers occurs in AI inference, where the top-k decoding algorithm can randomly pick from several candidate output tokens.
Code perforation is an optimization technique that trades accuracy for speed, by randomly (ahem, I mean, stochastically) skipping some computations. Essentially, using loop perforation is similar to an approximation with a random element, but in a generalized way for any iterative code. It's kind of like how teenage children randomly skip their homework.
Loop perforation skips iterations of a loop in a probabilistic manner. Randomly skipping some percentage of the loop bodies doesn't sound like a good plan, but it has its merits. In an AI inference computation, there's so much going on that no-one's going to notice a few missed beats. Apparently it can even be useful. Well, at least it's faster to do nothing.
Example: Loop Perforation: Here is an example of adding loop perforation to a vector dot product computation. This is an incredibly slow version, and is not recommended, but is just to give the idea of skipping a percentage of the iterations:
float aussie_vecdot_perf(float v1[], float v2[], int n, int pc)
{
    // Loop perforation -- vector dot product
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        if ((rand() % 100) + 1 <= pc) {
            // This iteration is perforated...
            continue; // Skip it...
        }
        sum += v1[i] * v2[i];
    }
    return sum;
}
Loop Unrolling
Loop unrolling is a code optimization where the body of a loop is repeated in sequential code. This speeds up the algorithm because the overhead of both the incrementer and the loop iteration test is avoided. In some cases, the entire loop can be unrolled, usually when the loop iterations are finite and known at compile-time. In other cases, partial unrolling repeats the loop body multiple times, so that the loop test only occurs every few iterations.
For an AI engine, loop unrolling is used as an optimization in a few places. It is one of the optimizations used by kernel fusion, along with loop fusion and others. Since many meta-parameters of AI models are finite and fixed numbers (e.g. the “model dimension”), there are many cases where an entire loop can be unrolled and then vectorized into the GPU.
The logical extension of loop unrolling is done by machine learning compilers, at least from a conceptual point of view. These ML compilers unroll the inference loop and the lower-level loops in matrix operations, thereby creating a finite graph representation of the entire inference sequence. If everything is unrolled, there are no loops in the graph (an “acyclic” graph) and it is of finite size. The process of model inference is the propagation of data through the graph. There are many “graph optimizations” that can be made on this graph representation of the AI model.
Example: C++ Loop Unrolling of Vector Dot Product. Here is the basic C++ non-unrolled vector dot product code:
float aussie_vecdot_basic(float v1[], float v2[], int n)
{
    // Basic vector dot product
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += v1[i] * v2[i];
    }
    return sum;
}
If we know the value of n, e.g. that n=5, then we can completely unroll it:
return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2] + v1[3] * v2[3] + v1[4] * v2[4] ;
If we don't know the value of n, we can still unroll multiple iterations. Here's an example of 4-level loop unrolling of vector dot product in C++ by assuming that n is a multiple of 4:
float aussie_vecdot_unroll4(float v1[], float v2[], int n)
{
    // Loop-unrolled Vector dot product
    if (n % 4 != 0) {
        yassert(n % 4 == 0);
        return 0.0; // fail
    }
    float sum = 0.0;
    for (int i = 0; i < n; ) {
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
    }
    return sum;
}
And here's a generalization of that 4-level unrolling with extra code to handle the leftover cases if n is not a multiple of 4. Although the extra cases look messy, they are not actually the main performance bottleneck.
float aussie_vecdot_unroll4b(float v1[], float v2[], int n)
{
    // Better loop-unrolled Vector dot product
    int i = 0;
    float sum = 0.0;
    if (n % 4 != 0) {
        // Handle the extra cases...
        switch (n % 4) {
        case 1:
            sum += v1[i] * v2[i]; i++;
            break;
        case 2:
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            break;
        case 3:
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            break;
        default:
            yassert_not_reached();
            break;
        } // end switch
        // Keep going with rest of the vector
    }
    for (; i < n; ) {
        // Unrolled 4 times...
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
        sum += v1[i] * v2[i]; i++;
    }
    return sum;
}
This code is just an example for explanation. There are various further code optimizations that can be done for production-level efficiency. For parallelization, the loop body should call an intrinsic function to vectorize the method. For an AI engine, we could choose our model dimension and other meta-parameters as multiples of the loop unrolling factor, and thereby avoid ever having any of the “leftover” cases.
For sequential code, we could change it to use pointer arithmetic rather than array indices, we might try replacing the four “i++” operators with a single “i += 4”, change the integer modulo operator (“%”) to a bitwise-and operator test (i.e., use “n & 3” not “n % 4”, which works since 4 is a power-of-two), and it also might be better to use “+” rather than the “+=” operator.
Finally, if we carefully code the leftover cases, the main loop could be unrolled to many more levels than just four.
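To make that concrete, here's an untested sketch of my own (not benchmarked, and the function name is hypothetical) that combines several of these sequential tweaks: pointer arithmetic instead of array indices, one pointer bump of 4 per iteration instead of four “i++” operators, and “+” between the four products instead of four “+=” operators:
float aussie_vecdot_unroll4_ptr(const float* v1, const float* v2, int n)
{
    // Sketch only: assumes (n & 3) == 0, i.e. n is a multiple of 4;
    // leftovers could be handled as in the switch version above.
    float sum = 0.0f;
    const float* pend = v1 + n;
    for (; v1 != pend; v1 += 4, v2 += 4) {
        sum += v1[0] * v2[0]   // Four products joined with "+",
             + v1[1] * v2[1]   // only one "+=" per strip of 4
             + v1[2] * v2[2]
             + v1[3] * v2[3];
    }
    return sum;
}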
Duff's Device for Loop Unrolling
There's a neat coding trick called “Duff's Device” for loop unrolling, which uses a switch with case fallthrough to mimic assembler coding style.
However, it's not great for vectorization as it's likely to confuse the compiler, so may be mostly of theoretical interest.
float aussie_unroll4_duff(float v1[], float v2[], int n)
{
    // Unrolled dot product with Duff's Device
    int i = 0;
    float sum = 0.0;
    switch (n % 4) {
        for (; i < n; ) {
        case 0:
            sum += v1[i] * v2[i]; i++;
        case 3:
            sum += v1[i] * v2[i]; i++;
        case 2:
            sum += v1[i] * v2[i]; i++;
        case 1:
            sum += v1[i] * v2[i]; i++;
        default:
            ;
        } // end for
    } // end switch
    return sum;
}
What's happening here?
My brain hurts looking at this code!
The trick is that the outside switch branches into a case that is inside the body of a for loop. This is not normal everyday coding, because there's a loop inside a switch, and the loop body crosses over several case statements. Also, none of the case statements has a “break” statement; they instead rely on fallthrough semantics. Similarly, the “default” clause is mainly just to avoid getting a spurious compilation warning (i.e. “missing default”), and it also has no “break”, with only a lonely semicolon. Note also that the case labels are written in reverse order from top to bottom (3..2..1), except for 0 at the top.
How does this even work? The first point is that it does. This code performs exactly the correct number of iterations for any value of n (except n==0), and similar versions with an unrolling factor of more than 4 will also work (i.e. if you change “n%4” and add more case constants). The code looks like a hack, but actually uses standardized C++ semantics of case fallthrough and switch multi-way control flow, and should work on all platforms. Branching into the middle of a loop with a switch is valid in C++ provided it doesn't bypass any local variable initialization (hence, don't put “sum” inside the switch). Also, the case fallthrough semantics (i.e. without a “break” ending each “case”) have been standard for C and C++ since inception. Finally, note that this code is buggy for n==0, because it incorrectly does 4 iterations, so it ideally needs a parameter validation assertion at the start.
Bug alert! Note that you cannot tweak the “i++” instruction using the standard idiom:
sum += v1[i] * v2[i++]; // Bug!
The obscure problem is that the “*” operator doesn't guarantee left-to-right evaluation of its operands. The code assumes an evaluation order of: v1[i], v2[i], *, i++, starting from the left. However, the C++ optimizer can legally use this order of operations: v2[i], i++, v1[i], *, which is not what you intended and gets the wrong array element for v1[i]. This code might be unreliable across platforms, or it might work in debug mode but fall over once you turn on high levels of optimization. So, there is an “order of evaluation” pitfall if you put “++” in an operand of the “*” operator or many other binary arithmetic operators.
Is Duff's Device any faster?
The short answer is “not really,” although it looks very appealing (or appalling).
Firstly, note that this trick is not actually very useful for vectorization, because a switch cannot branch into the middle of a vectorized intrinsic (i.e. if you replace the loop body with a SIMD instruction).
Furthermore, although I haven't tested it, I doubt many optimizers will be able to auto-optimize that complex control flow with SIMD instructions.
In sequential code, this method also isn't much faster,
as it doesn't really have any fewer operations than a basic unrolled loop
(i.e. with extra cases handled separately before or after the main loop).
The above example of Duff's Device can be further sped up using pointer arithmetic and “looping down to zero” optimizations,
but so can the other unrolled versions.
However, there is a minor speed advantage in terms of “instruction locality” because the above code is very concise.
The main advantage of Duff's Device is to bamboozle your colleagues.
You can use Duff's Device with any unrolling factor, not just 4 as in the example shown above (e.g. change to 8 by using “n%8” and adding cases for 4, 5, 6, and 7, ordered from 7 down to 1, leaving 0 on top).
Actually, the unrolling factor needn't be a power-of-two.
Make it a prime number for extra bonus points.
If you want more of this kind of coding trickery, also search up Jensen's device and Pigeon's device.
Loop Tiling or Blocking
When you hear about a “tiled MatMul” or a “blocked GEMM,” this is the “tiling” or “blocking” optimization method it refers to. MatMul is matrix multiplication and GEMM is General Matrix Multiplication (i.e. the same thing). Tiling is the optimization that most applies to speeding up matrix or tensor multiplication in AI engines.
This optimization is for two-dimensional or three-dimensional data, such as matrices or tensors. When you hear “tiles” or “blocks,” think squares or rectangles of data. For example, if you have a 512x512 matrix, then a tiled algorithm might act on 16x16 chunks, one at a time. The one-dimensional equivalent, processing sub-parts of an array, is called “strip mining,” “loop sectioning,” or often simply “vectorization.”
In other words, tiling means operating on small subsections of a matrix. If you hear “tiled tensor” that could mean two-dimensional data (i.e. just a fancy name for a matrix), or alternatively it might refer to three-dimensional data, in which case, don't think anything or else your head will hurt.
Loop tiling is a method of executing sub-parts of nested loops in a way that maximizes data locality, increases cache utilization, and improves parallel execution. This is also called “loop blocking” because it processes the data a “block” at a time, although the term “tiling” is more widely used in AI research. The two-dimensional sub-partitions of the data that are square or rectangular are called “tiles” or “blocks”.
The same number of arithmetic operations are performed in a tiled versus non-tiled algorithm. However, there should be fewer loads of the data into memory with tiling. The downside is that tiling introduces additional loop overhead. In fact, rather than flattening nested loops over a 2-D array (e.g. 512x512), tiling often introduces additional levels of nesting! The two small loops that spin through the 16x16 square shape of a single “tile” or “block” are often newly added inner loops. So, loop tiling often adds two new layers of nested loops inside your already-nested loops. It makes you wonder how it can even be faster!
Example: Tiled Matrix Clear:
For these examples, there is a matrix type “ymatrix”:
typedef float ymatrix[ROWS][COLUMNS];
If we forget about memset, here is the simple code to clear a matrix one element at a time in a brute-force nested loop (non-tiled):
void aussie_clear_matrix(ymatrix m)
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLUMNS; j++) {
            m[i][j] = 0.0;
        }
    }
}
Now we decide to add a 4x4 square tile optimization to this code. The result is an extra two levels of nested loops. Here is the basic code which assumes that the row and column dimensions are exact multiples of the tile size, so there's no extra leftover cases to handle:
void aussie_clear_matrix_tiled(ymatrix m)
{
    const int TILEX = 4; // 4x4 tile size
    const int TILEY = 4;
    static_assert(ROWS % TILEX == 0, "Exact X");
    static_assert(COLUMNS % TILEY == 0, "Exact Y");
    for (int i = 0; i < ROWS; i += TILEX) {
        for (int j = 0; j < COLUMNS; j += TILEY) {
            // Do the 4x4 tile...
            for (int tx = i; tx < i + TILEX; tx++) {
                for (int ty = j; ty < j + TILEY; ty++) {
                    m[tx][ty] = 0.0f;
                }
            }
        }
    }
}
Unrolled Tiles. One followup optimization trick with a tiled loop algorithm is to apply loop unrolling to the two inner loops. This avoids the overhead of the two extra inner loops, but retains the data locality benefits of tiling. This optimization results in a fully “unrolled tile” computation without any extra inner loops. In the above example, the two inner loops of a 4x4 tile would be replaced with 16 unrolled computations in sequence. Or for a vectorized version, a fully unrolled tile would be 4 sequential calls to vectorized intrinsics that each do 4 operations in parallel (e.g. AVX intrinsics each do 4 float operations in parallel).
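For illustration, here's a hedged sketch of a fully unrolled 4x4 tile for the matrix-clear example above, using 128-bit SSE intrinsics (each call stores 4 floats). The function name is mine, and it assumes the 4x4 tile fits within the matrix bounds:
#include <immintrin.h> // SSE intrinsics

void clear_tile_4x4(ymatrix m, int i, int j)
{
    // Fully unrolled 4x4 tile: no inner loops at all
    __m128 zero = _mm_setzero_ps();
    _mm_storeu_ps(&m[i][j], zero);     // Tile row 0 (4 floats)
    _mm_storeu_ps(&m[i + 1][j], zero); // Tile row 1
    _mm_storeu_ps(&m[i + 2][j], zero); // Tile row 2
    _mm_storeu_ps(&m[i + 3][j], zero); // Tile row 3
}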
Example: Tiled Matrix Multiplication: Tiling techniques are widely used inside neural network code to improve the efficiency of MatMul's and thereby get better throughput of tensor calculations from a GPU. Matrix multiplication is a good candidate for this optimization because it has O(n^3) arithmetic calculations, but uses only O(n^2) data. Hence, a naive matrix multiplication algorithm that doesn't address locality will re-load the same data into memory many times, whereas a tiled algorithm can reuse the same data more efficiently.
A tiled version of MatMul processes “tiles” or “blocks” of each matrix one at a time (i.e. small square or rectangular sections), with the aim of keeping small parts of the matrix in the memory cache while they are processed. The algorithm progresses across the matrix a tile/block at a time, rather than scanning all the way down one dimension (row or column). The same number of multiplication operations are performed as a non-tiled MatMul, but data locality and cache freshness should improve the overall speed.
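Production kernels have many more tricks, but here's a minimal sketch of a tiled MatMul just to show the loop structure. It assumes square n-by-n matrices stored as flat row-major arrays, a TILE size that divides n exactly, and a zero-initialized output matrix (the function and variable names are mine):
const int TILE = 16; // Tile size (must divide n exactly)

void matmul_tiled(const float* a, const float* b, float* c, int n)
{
    // Note: c must be zero-initialized by the caller
    for (int i0 = 0; i0 < n; i0 += TILE) {
        for (int j0 = 0; j0 < n; j0 += TILE) {
            for (int k0 = 0; k0 < n; k0 += TILE) {
                // One TILE x TILE x TILE sub-MatMul whose
                // operands stay cache-resident...
                for (int i = i0; i < i0 + TILE; i++) {
                    for (int k = k0; k < k0 + TILE; k++) {
                        float aik = a[i * n + k];
                        for (int j = j0; j < j0 + TILE; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}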
Loop Fission
Loop fission is an optimization that is the opposite of loop fusion. Instead of fusing two loops into one, we take one loop and split parts of it into two loops. Loop fission has also been called other names such as “loop splitting” or “loop distribution.”
Loop fission can be more efficient for parallel execution (e.g. vectorization for GPUs), but is often slower for sequential execution. Whereas loop fusion aims to remove the overhead of one of the loops, loop fission tolerates an increased loop overhead in return for simpler loop bodies that can be parallelized. The kernel optimization of “kernel fission” is based on loop fission, and loop fission is one technique used to achieve vectorization for GPUs.
The main reason to use loop fission is hardware acceleration via loop parallelization. A complicated single loop can often run faster if split into two simpler loops, if hardware acceleration can be accessed. This is true even if the two resulting loops must run sequentially, because the iterations of each loop are parallelized, but there's a double benefit if the two whole loops can also run in parallel.
Example: Loop Fission in BatchNorm: A good example arises in part of the code for batch normalization. Each element of the vector needs to have two operations performed on it: subtract the mean (re-centering) and multiply by a variance factor (re-scaling). The naive implementation of the second half of BatchNorm looks like this:
float denom = sqrtf(varc + eps); // Scale factor
for (int i = 0; i < n; i++) {
    // Normalize: re-center and scale
    v[i] = (v[i] - fmean) / denom;
}
This is difficult to hardware accelerate because it's unlikely that there's a combined “subtract-and-then-divide” operation to apply to all elements of a vector in parallel. The first point is that maybe there's an “add-and-then-multiply,” in which case we can use the negative of the additive factor and the reciprocal of the scaling factor. However, assuming there's not, loop fission can be used to split the single complicated loop into two sequential loops.
float negmean = -fmean;          // Use negative for addition
float denom = sqrtf(varc + eps); // std. deviation
float recip = 1.0f / denom;      // reciprocal multiply
// Loop 1: Re-center using mean
aussie_vector_add_scalar(v, n, negmean);
// Loop 2: Re-scale by factor
aussie_vector_multiply_scalar(v, n, recip);
Each of the two loops is now easy to hardware accelerate, because they are both very simple vector operations: “multiply-by-scalar” and “add-scalar.” Every platform is likely to have hardware acceleration APIs for those simpler operations. So, to summarize, we got an explosive boost to hypersonic rocket speed using atomic operations with loop fission. Isn't that just the bomb?
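For reference, here's one plausible scalar implementation of the two helper loops called above (the bodies are my sketch; on a real platform each would wrap a vectorized intrinsic):
void aussie_vector_add_scalar(float v[], int n, float c)
{
    // Loop 1: one simple, easily vectorized operation
    for (int i = 0; i < n; i++) v[i] += c;
}

void aussie_vector_multiply_scalar(float v[], int n, float c)
{
    // Loop 2: one simple, easily vectorized operation
    for (int i = 0; i < n; i++) v[i] *= c;
}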
Loop Reversal
Loop reversal is the optimization of making a loop run backwards. It performs the same arithmetic operations, just in reverse order, so the total computation is unchanged.
The goal is a speedup from “looping down to zero” with a faster loop test, but it is often a de-optimization, even for sequential execution. Typical CPUs rely on ascending memory access patterns for predictive cache pipelining, and reverse array access is a worst case for that.
Loop reversal is also not a useful parallelization method in itself. Vectorization for GPU computation doesn't really work in reverse. However, reversing a loop can sometimes be useful as an initial transformation on nested loops if reversing the inner loop's direction allows another followup loop vectorization technique.
Example: Reversed Vector Dot Product: Loop reversal can be used on vector dot product, as below, but it probably shouldn't be. Here's the basic idea:
float aussie_vecdot_rev(float v1[], float v2[], int n)
{
    float sum = 0.0;
    for (int i = n - 1; i >= 0; i--) {
        sum += v1[i] * v2[i];
    }
    return sum;
}
Note that there are several coding pitfalls to avoid. The loop variable “i” cannot be an “unsigned” or “size_t” type, because the test “i >= 0” would never fail, creating an infinite loop. Also, the reversed loop needs to start at “n-1” and must use “i >= 0” (not “i > 0”) to avoid an off-by-one error. The above code also needs a safety test for “n <= 0”.
Loop Code Motion
Loop code motion is moving loop-invariant code from inside the loop body to the pre-initialization code for the loop. Any computation that yields the same value on every iteration should not be performed inside the loop body. Instead, it should be pre-calculated before the loop, and stored in a temporary variable. This is sometimes called “hoisting” the code out of the loop.
Example: Loop Code Motion: One common example of unnecessary recalculation of loop-invariant values is in the loop test. The code in the Boolean test for the loop is actually part of the loop body.
An example of code that re-calculates the loop limit:
for (i = 0; i < vec.num_elements(); i++) {
    // ...
}
The “num_elements” call is probably loop-invariant, assuming the vector doesn't change size during processing. Maybe the “num_elements” function is declared “inline” and the C++ compiler will fix it anyway. Nevertheless, this is a candidate for loop code motion, using a temporary variable instead:
int n = vec.num_elements(); // Loop-invariant value
for (i = 0; i < n; i++) {
    // ...
}
Loop Distribution
Loop distribution is a type of loop code motion that creates two loops from a single loop containing an “if” statement. The hoisted code is a conditional test. Some early papers in the 1990s called it “loop unswitching.” Some papers use the term “loop distribution” with the different meaning of splitting a loop into two loops, which we call “loop fission.”
The goal of loop distribution is to move an “if” test out of the loop body by creating two separate loops, one on each pathway. This sounds similar to loop fission, but loop distribution is a more general optimization that doesn't require parallelization to get a speed improvement (whereas loop fission does). Instead, loop distribution gets a benefit in ordinary sequential execution because it moves the if-test computation out of the loop body to a once-only pre-initialization test (i.e. “hoisted”). Note that only one of the two loops is executed each time, and these two loops are never executed in parallel, so this technique is not really a type of loop fission.
Example: Loop Distribution: Here's a dummy example of implementing an “add-or-subtract” function using a passed-in Boolean flag.
void aussie_vector_addition_slow(
    float v[], int n, bool do_add, float scalar)
{
    for (int i = 0; i < n; i++) {
        if (do_add) v[i] += scalar; // Add
        else v[i] -= scalar;        // Subtract
    }
}
The problem is that the test “if (do_add)” is computed for every loop iteration, and yet “do_add” is a loop-invariant flag variable. The faster version uses loop distribution to move the if-test into the loop initialization, and then splits the two pathways inside the loop into two separate loops. Here's the faster version:
void aussie_vector_addition_loop_distribution(
    float v[], int n, bool do_add, float scalar)
{
    if (do_add) { // Add scalar
        for (int i = 0; i < n; i++) {
            v[i] += scalar; // Add
        }
    } else { // Subtract scalar
        for (int i = 0; i < n; i++) {
            v[i] -= scalar; // Subtract
        }
    }
}
This example is still far from optimal. For starters, it should be using pointer arithmetic rather than array indices.
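For example, a hedged pointer-arithmetic sketch of the same function (my code, with a hypothetical name) might look like this:
void aussie_vector_addition_ptr(
    float v[], int n, bool do_add, float scalar)
{
    float* p = v;
    float* pend = v + n; // One-past-the-end
    if (do_add) { // Add scalar
        for (; p != pend; p++) *p += scalar;
    } else { // Subtract scalar
        for (; p != pend; p++) *p -= scalar;
    }
}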
Loop Reordering
In neural networks, there are many loops, and many ways of nesting them, or running them in sequence. The convolution layers in CNNs can have literally seven layers of nested loops. Hence, there are various research papers exploring different orders to perform the various computations.
Loop reordering is the general class of optimizations that involves reordering loops or their iterations. This can refer to changing the ordering of two sequential loops or two nested loops. The reordering optimization to reverse the inner and outer nested loops is more precisely called “loop interchange.” A single loop can also be reordered with “loop reversal.”
Loop reordering is an optimization that doesn't reduce the total number of computations, because it always executes the same number of iterations as the original version. However, loop reordering may have several benefits:
- Vectorization. Putting the loop in a different order may make it more vectorizable, or may allow other loop transformations to be applied before vectorization.
- Data locality. Reordering the loops may improve data locality and cache access speed by doing the operations in a different order (a sketch follows this list). This reduces the cost of moving data between memory and the low-level caches, rather than the cost of the arithmetic. It is therefore related to memory/dataflow optimizations and pipelining optimizations.
- Reduced loop overhead. Both loop interchange and loop reversal can reduce the general overhead of loop testing. Loop interchange allows the shorter loop to be on the outside. Loop reversal allows “looping down to zero” which reduces overhead.
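As a minimal illustration of the data locality point, both fragments below touch the same n*m elements of a 2-D array, but the first walks memory contiguously in C++'s row-major layout, while the second strides through it column-by-column, which is far less cache-friendly:
// Row-major traversal: contiguous, cache-friendly
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        sum += arr[i][j];

// Column-major traversal: strided, cache-hostile
for (int j = 0; j < m; j++)
    for (int i = 0; i < n; i++)
        sum += arr[i][j];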
Loop Iterator Strength Reduction
Loop strength reduction is the arithmetic optimization of “strength reduction” applied to loop iteration variables. For example, strength reduction aims to replace multiplication with addition. Consider this loop:
for (int i = 0; i < n; i++) { a[i] = 10 * i; }
This can be optimized to change the multiplication into an incremental addition:
for (int i = 0, x = 0; i < n; i++) { a[i] = x; x += 10; }
Note that the loop strength reduction optimization isn't a good choice for loop parallelization. Although it would be desirable to change a vectorized multiplication to addition, this optimization has changed to an incremental algorithm. This makes each loop iteration dependent on the prior one, with the results dependent on the previous computation, so they cannot be done in parallel.
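If parallelism matters more, a hedged compromise (my sketch, assuming n is a multiple of 4) is to strength-reduce per strip, so each strip restarts from a directly computed base value and the strips stay independent:
for (int i = 0; i < n; i += 4) {
    int x = 10 * i; // One multiplication per strip of 4
    a[i] = x;
    a[i + 1] = x + 10;
    a[i + 2] = x + 20;
    a[i + 3] = x + 30;
}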
Loop Coalescing
Loop coalescing is a loop optimization that involves flattening two nested loops into one non-nested loop. Typically, loop coalescing will still operate on a 2-dimensional array, whereas flattening both the nested loops and the array is called “loop collapsing.”
As a dummy example, consider a matrix initialization via nested loops:
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        arr[i][j] = 0.0f;
    }
}
Loop coalescing involves changing to a single loop, but still using two indices i and j, which are calculated from the main linear index.
int maxx = n * m;
for (int x = 0; x < maxx; x++) {
    int i = x / m;
    int j = x % m;
    arr[i][j] = 0.0f;
}
The benefit in speed from loop coalescing can arise by simplifying the loop, which makes it easier to parallelize via hardware acceleration, and possibly from a different data access pattern that improves data locality and cache freshness.
This optimization is not always possible, as nested loop logic is often quite complicated, and flattening a nested loop may actually worsen data locality in many instances. However, the linear nature of a simple loop can make the code to send off chunks to a GPU much easier.
Loop Collapsing
Loop collapsing is closely related to loop coalescing, since both aim to flatten nested loops, but loop collapsing is a special situation where the array is also flattened to one dimension.
Consider a matrix initialization via nested loops over a 2-dimensional array:
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        arr[i][j] = 0.0f;
    }
}
The loop collapsed version has one big loop over a different one-dimensional array:
int maxx = n * m;
for (int x = 0; x < maxx; x++) {
    arr2[x] = 0.0f;
}
This loop transformation to a single loop is obviously more amenable to vectorization.
Loop Peeling
Loop peeling is a type of loop unrolling that involves unraveling only the first few iterations of a long loop. This is also similar to “loop splitting” with two sections, where the first section is over the early range, and the second range is the main section of all remaining iterations.
Loop peeling is beneficial to the overall loop efficiency if there is code in the loop body that is only required for one or two early iterations, which can then be removed from the main loop body. Similarly, there can be benefit in unraveling the last few iterations of a loop, which is a similar technique.
One common case of loop peeling is when the first iteration is different from the rest, so peeling off a single iteration is valuable.
for (int i = 0; i < n; i++) { arr[i] = (i == 0) ? 0.0f : 1.0f; }
In this case, we can peel off the first “i==0” iteration into a single unrolled instruction, and change the main loop to start at 1. This is also a trivial form of “loop distribution,” where we are hoisting an “if” conditional test out of the loop. The new code becomes:
arr[0] = 0.0f; // Peeled
for (int i = 1 /*not 0*/; i < n; i++) {
    arr[i] = 1.0f;
}
This peeled version is faster in both sequential and parallel execution. The loop body has less computation and is also more amenable to vectorization.
Loop Splitting
Loop splitting refers to splitting the sequential iterations of a loop into two loops, which each perform part of the original loop's iterations. Loop splitting is closely related to “loop sectioning” (“strip mining”), but often relates to more complex arithmetic in the loop body. Note that “loop peeling” is a special case of loop splitting where the first section is a small range of a few initial iterations, but these few iterations are unrolled rather than looped.
Loop splitting takes a single loop and transforms it into at least two “split-out” loops, one for the early iterations, and one for the remainder. However, loops can also be split out into more than two loops.
In loop splitting, each split-out loop is shorter than the original loop. Unlike loop fission, the two loops operate over different subportions of the iterator variable range, executing the same number of total iterations, rather than double iterations as in loop fission.
Example: Loop Splitting: Here's some example code to “sqrtize” a vector, using a cached optimization for the numbers up to 100.
void aussie_vector_do_sqrt(float v[], int n)
{
    for (int i = 0; i < n; i++) {
        if (i < 100) { // Fast cases
            v[i] = aussie_sqrt_optimized(v[i]);
        } else { // General case
            v[i] = sqrtf(v[i]);
        }
    }
}
However, we can use loop splitting to split this big loop into two shorter disjoint ranges. Instead of 0..n-1, we do 0..99, and then 100..n-1. Each loop is over part of the range, and has a simpler loop body. Note that this code fails with an array bounds violation for values of n less than 100.
void aussie_vector_do_sqrt_loop_splitting(float v[], int n)
{
    for (int i = 0; i < 100; i++) { // Fast cases
        v[i] = aussie_sqrt_optimized(v[i]);
    }
    for (int i = 100; i < n; i++) { // General cases
        v[i] = sqrtf(v[i]);
    }
}
The loop splitting optimization is beneficial if the loop body has different sections of code that only relate to a subset of the iterator range. Hence, the loop bodies of the two loops can be reduced to execute less code. Overall, there is still the same number of iterations performed in the two loops combined, but each loop performs only a proportion of the original iterations on a simpler loop body. This optimizes sequential execution and the simpler code in each loop body may make vectorization of one or both subloops easier. Furthermore, both subloops could run in parallel.
Loop Interchange
Loop interchange is an optimization of nested loops that switches the inner and outer loops. In a typical nested loop, the outer loop body and loop test is executed rarely, almost lazily, whereas the inner loop body is scrambling along in a frantic mess. Loop interchange simply switches them, reversing their roles.
Why is this an optimization? Although the same number of loop iterations still occur in total, and the newly-made inner loop body is also thrashed, various improvements can arise from reversing the iterator variables, usually to make the innermost loop the longest. Possible optimizations result from:
- Fewer outside computations. A shorter outside loop reduces the arithmetic operations of the outer loop, whereas the inner loop's number of computations is unchanged in either loop structure.
- Data locality. Another possible improvement is in data locality, which can reduce cache misses and speeds up the overall execution. Note that this benefit is not guaranteed just by switching loops, and sometimes loop interchange can worsen data locality; careful analysis is needed.
- Inner loop vectorization. Another important possibility is that reversing nested loops can create opportunities to apply other loop optimizations to the new inner loop, notably to vectorize the inner loop.
Shortest loop outside, longest innermost loop: One of the considerations of loop interchange is the optimization of putting the shortest loop on the outside, and making the innermost loop with the longest range of iterations. This is an optimization for both sequential or parallel execution. For sequential execution, there is less overhead from the outer loop, because it is shorter. For parallelization, there is improved vectorization of the inner loop, which now has a longer range.
Consider this example:
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 50; j++) {
        // ...
    }
}
The current loop nesting has the longest loop (to 1000) on the outside, and the shorter loop (to 50) as the innermost loop. Loop interchange simply makes it the reverse nesting:
for (int j = 0; j < 50; j++) {
    for (int i = 0; i < 1000; i++) {
        // ...
    }
}
Considering sequential execution, the inner loop body is executed the same number of times, so there's no difference. This also includes the inner loop's conditional test and incrementer, which are different variables in the two examples, but also execute the same number of times (50,000 times). However, consider the different outer loops. The first example is 1000 iterations, whereas the second example's outer loop is only 50 times. Hence, the loop reordering optimization of “shortest outer loop” and “longest innermost loop” has saved 950 of the outer loop's calculations (i.e. loop test and incrementer). Any extra code that's in the outer loop, either before or after the inner loop, would also be executed fewer times.
There is also an advantage for vectorization. In the first example, we could possibly have 1000 vectorized operations of data size 50. In the interchanged loops, there are 50 operations on vectors size 1000. Hence, there is more opportunity for much larger vectorization gains in the second format with the longest inner loop.
Loop Sentinel
Loop sentinels are an optimization that removes the overhead of the index-bounds test in a loop scanning an array or pointer chain. The technique adds a pretend extra element onto the end of the data that is guaranteed to match. And since we're guaranteed to always succeed, we don't need to check for the end of the data while scanning the loop.
This technique is not particularly useful for vectorization, but is quite powerful for long sequential scanning of arrays. It also has the downside of requiring at least one writeable array element, so it cannot run on read-only arrays.
Example: Check Vector Negatives:
Here's the basic loop sentinel version that sets up a dummy success in “v[n]”:
bool aussie_vector_has_negative_sentinel(float v[], int n)
{
    v[n] = -99.0; // Dummy negative (BUG!)
    int i = 0;
    for ( ; /*GONE!*/; i++) {
        if (v[i] < 0.0) break; // Found negative
    }
    if (i == n) return false; // Fake success
    return true; // Found a negative (for real)
}
However, this is actually buggy, since “v[n]” is potentially an array overflow. A better version can manipulate the last valid element “v[n-1]” instead of modifying “v[n]”. Then, we have to remember to fix it before we leave town. And we also have to remember to check that the last vector element we temporarily overwrote wasn't also a real success.
bool aussie_vector_has_negative_sentinel2(float v[], int n)
{
    float save = v[n - 1]; // Save it!
    v[n - 1] = -99.0;      // Dummy negative at end
    int i = 0;
    for ( ; /*GONE!*/; i++) {
        if (v[i] < 0.0) break; // Found negative
    }
    v[n - 1] = save; // Restore it!
    if (i == n - 1) { // At the dummy (fake success)
        if (save < 0.0) return true; // Must check
        return false;
    }
    return true; // Found a negative (for real)
}
Loop Strip Mining (Loop Sectioning)
Loop strip mining is a loop optimization that scans or “mines” various “strips” of an array. It is related to “loop tiling” on arrays in two dimensions, but strip mining only applies to processing one-dimensional arrays. Loop strip mining is also called “loop sectioning” because it breaks an array up into sections that are operated on.
For a basic example, consider a simple array initialization:
for (int i = 0; i < n; i++) { arr[i] = 0.0f; }
Let's assume we can parallelize this with 16 elements at a time (e.g. 512 bits of total parallel processing, which is 16 separate 32-bit float variables). So, we want to process “strips” of length 16. For simplicity, let us assume that n is divisible exactly by 16, so there's no leftover work after the main loop.
for (int i = 0; i < n; i += 16) {
    // Initialize arr[i]...arr[i+15] in parallel
}
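For example, on an x86 CPU with AVX-512 support, each strip could be a single 512-bit intrinsic store. This is an untested sketch with an assumed function name:
#include <immintrin.h> // AVX-512 intrinsics

void aussie_clear_array_strips(float arr[], int n)
{
    // Assumes n is an exact multiple of 16
    __m512 zero = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        _mm512_storeu_ps(&arr[i], zero); // One 16-float strip
    }
}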
Obviously, this is a dummy example, where memset would do better for zeroing the array. Also, this really looks exactly like “vectorization” to me, where we are vectorizing 512 bits at a time (16 floats), and indeed the research mentions vectorization as one application. But loop strip mining and vectorization are not exactly the same technique, because loop strip mining is a more general idea with other applications.
Loop Spreading
Loop spreading is an optimization of two non-nested sequential loops that have different iteration ranges. Typically, this refers to where the end ranges differ significantly. If the loop ranges only differ by an off-by-one issue, then only loop normalization is required.
Loop spreading modifies one of the loops, so that part of this loop fully overlaps with the other loop (i.e. ideally one loop “spreads out” further to match the other loop's end bounds). Hence, after loop spreading has occurred, this subloop can be fused with the other loop, and possibly parallelized. The remaining iterations that are not overlapping then have to be addressed in a followup partial loop (only for one of the loops).
Loop spreading mainly enables loop fusion as a followup optimization. For using loop fission on the two loops, it is not necessary to do loop spreading, since the two loops are already split apart, and each loop could already potentially be vectorized independently.
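As a small hedged sketch (with hypothetical arrays a and b), suppose loop 1 runs over 0..n-1 and loop 2 runs over 0..n+3. Spreading lets the overlapping range be fused (and potentially vectorized), with the leftovers finished in a followup loop:
for (int i = 0; i < n; i++) { // Fused overlapping range
    a[i] = 0.0f;
    b[i] = 0.0f;
}
for (int i = n; i < n + 4; i++) { // Leftover iterations of loop 2
    b[i] = 0.0f;
}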
Loop Normalization
Loop normalization is not directly an optimization, but is a preliminary loop transformation that can make further loop optimizations easier. Followup optimizations might be to fuse the two loops with loop fusion, or to parallelize each loop, such as with loop fission or vectorization.
The goal of loop normalization is to make the loop iteration variables act across the same range. This applies to two sequential loops, rather than nested loops. Hence, loop normalization is needed when two loops in sequence start at different offsets (e.g. one starts at i=1 and another at i=0), or finish at different endpoints (e.g. n versus n-1). If two loops have the same number of computations, but with different ranges, then one loop can be changed with an offset. For example, these loops differ with ranges 0..n-1 and 1..n:
for (int i = 0; i < n; i++) a[i] = 0;
for (int j = 1; j <= n; j++) b[j] = 0;
These can be adjusted to the same ranges with a “j+1” index offset, as follows:
for (int i = 0; i < n; i++) a[i] = 0;
for (int j = 0; j < n; j++) b[j + 1] = 0;
If the two loops have a different number of iterations, typically off by 1 or 2, then “loop peeling” can be used to unroll and split off one or two iterations and shorten the longer loop, so that both loops have the same number of iterations over the same range. For example, here one loop is 0..n-1 and another is 0..n:
for (int i = 0; i < n; i++) a[i] = 0;
for (int j = 0; j <= n; j++) b[j] = 0;
The way to normalize the loop ranges is to “peel” off the last iteration of the “j” loop:
for (int i = 0; i < n; i++) a[i] = 0;
for (int j = 0; j < n; j++) b[j] = 0;
b[n] = 0; // Peeled
This example has peeled the longer loop to make it shorter. An alternative would be “loop spreading” to lengthen the shorter loop, such as by adding an extra padding element into the array.
Normalizing two loops doesn't change the number of arithmetic computations. However, once two loops have normalized ranges, it becomes easier to see opportunities for further optimizations such as loop fusion or loop fission.
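For example, after normalization, the two loops above share the range 0..n-1 and can now be fused:
for (int i = 0; i < n; i++) { // Fused loop
    a[i] = 0;
    b[i] = 0;
}
b[n] = 0; // Peeled iteration stays outside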
Loop Skewing
Loop skewing is a somewhat mind-bending method to change nested loops to make them more parallelizable. This technique applies when there are two nested loops, but the inner loop is difficult to parallelize because of a dependency on the outer loop variable. The performance advantage from loop skewing is not directly its usage, but because skewing changes then make possible other loop optimizations, especially loop interchange, which reorders the inner and outer loop.
The loop skewing solution is far from obvious. The range bounds of the inner loop are changed by “skewing” them by a factor based on the outer loop variable. And then, by some magical potion, this somehow breaks the dependence on the outer loop, and then the inner loop can run fast on a GPU. Who knew?
As a simplistic example, consider two nested loops:
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 50; j++) {
        arr[i][j] = something;
    }
}
We can skew the inner loop by adding a skew factor based on the outer loop variable (e.g. “i” or “i+1” or something similar). Add this skew factor to the ranges of j, but then subtract the skew factor (“i”) from any usages of the index “j” inside the inner loop's body.
for (int i = 0; i < 1000; i++) {
    for (int j = i; j < 50 + i; j++) {
        arr[i][j - i] = something;
    }
}
Hence, j has changed from the range (0..50) to the skewed range (i..i+50), by adding the skew factor “i” to the start and end. The use of “j” in the inner loop body has changed from “j” to “j-i” (i.e. subtracting the skew factor “i”). The result is a kind of skewed and “triangular” shape of the i and j indices, but the actual arithmetic calculations are unchanged.
This newly skewed code isn't any faster, does exactly the same calculations on the 50,000 elements of array arr, and indeed is actually worse because of the extra “50+i” and “j-i” computations. However, in some cases, doing this weird skewing transformation then allows us to follow up with a loop interchange optimization, switching the inner and outer loops. And I'm not even going to pretend to understand this, but there are situations where the non-skewed inner loop cannot be vectorized or interchanged, but after we've skewed the loop, then we can interchange it, and then we get via hocus pocus a different inner loop that can then be vectorized. Hopefully, the GPU knows what's going on.
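For the curious, here's a hedged sketch (my code) of what the interchanged version of the skewed loops above could look like: j becomes the outer loop over the full skewed range, and the inner i loop is clipped so that only the original (i, j) pairs are visited. Whether the new inner loop actually vectorizes still depends on the original data dependencies.
for (int j = 0; j < 50 + 999; j++) { // Full skewed range of j
    int ilo = (j - 49 > 0) ? (j - 49) : 0; // max(0, j-49)
    int ihi = (j < 999) ? j : 999;         // min(999, j)
    for (int i = ilo; i <= ihi; i++) {
        arr[i][j - i] = something;
    }
}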