Aussie AI

Loop Unrolling

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Loop unrolling is a code optimization where the body of a loop is repeated in sequential code. This speeds up the algorithm because the overhead of both the loop incrementer and the loop iteration test is avoided. In some cases, the entire loop can be unrolled, usually when the number of loop iterations is finite and known at compile-time. In other cases, the loop can be partially unrolled: the loop body is repeated multiple times, so the loop test only occurs every few iterations.

For an AI engine, loop unrolling is used as an optimization in a few places. It is one of the optimizations used by kernel fusion, along with loop fusion and others. Since many meta-parameters of AI models are finite and fixed numbers (e.g. the “model dimension”), there are many cases where an entire loop can be unrolled and then vectorized on the GPU.

The logical extension of loop unrolling is done by machine learning compilers, at least from a conceptual point of view. These ML compilers unroll the inference loop and the lower-level loops in matrix operations, thereby creating a finite graph representation of the entire inference sequence. If all is unrolled, there are no loops in the graph (an “acyclic” graph) and it is of finite size. The process of model inference is propagation of data through the graph. There are many “graph optimizations” that can be made on this graph representation of the AI model.

Example: C++ Loop Unrolling of Vector Dot Product. Here is the basic C++ non-unrolled vector dot product code:

   float aussie_vecdot_basic(float v1[], float v2[], int n)
   {
        // Basic vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
   }

If we know the value of n, e.g. that n=5, then we can completely unroll it:

   return v1[0] * v2[0]
        + v1[1] * v2[1]
        + v1[2] * v2[2]
        + v1[3] * v2[3]
        + v1[4] * v2[4]
        ;
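
When the length is a compile-time constant, full unrolling can also be delegated to the compiler. Here is a hypothetical sketch (not from the book) using a C++17 template, where the recursion is expanded at compile time so that the generated code contains no loop test or incrementer at all:

```cpp
// Hypothetical compile-time unrolling via template recursion.
// The compiler expands vecdot_unrolled<5> into five multiply-adds
// with no loop; requires C++17 for "if constexpr".
template <int N>
float vecdot_unrolled(const float v1[], const float v2[])
{
    if constexpr (N == 0)
        return 0.0f;  // base case: empty product
    else
        return v1[N - 1] * v2[N - 1] + vecdot_unrolled<N - 1>(v1, v2);
}
```

For example, `vecdot_unrolled<5>(v1, v2)` generates the same straight-line code as the hand-unrolled return statement above.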

If we don't know the value of n, we can still unroll multiple iterations. Here's an example of 4-level loop unrolling of vector dot product in C++ by assuming that n is a multiple of 4:


   float aussie_vecdot_unroll4(float v1[], float v2[], int n)
   {
        // Loop-unrolled Vector dot product 
        if (n % 4 != 0) {
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        float sum = 0.0;
        for (int i = 0; i < n; ) {
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
        }
        return sum;
   }

And here's a generalization of that 4-level unrolling with extra code to handle the leftover cases if n is not a multiple of 4. Although the extra cases look messy, they are not actually the main performance bottleneck.

    float aussie_vecdot_unroll4b(float v1[], float v2[], int n)
    {   
        // Better loop-unrolled Vector dot product 
        int i = 0;
        float sum = 0.0;
        if (n % 4 != 0) {
            // Handle the extra cases...
            switch (n % 4) {
            case 1: 
                sum += v1[i] * v2[i]; i++; 
                break;
            case 2: 
                sum += v1[i] * v2[i]; i++;
                sum += v1[i] * v2[i]; i++;
                break;
            case 3:
                sum += v1[i] * v2[i]; i++;
                sum += v1[i] * v2[i]; i++;
                sum += v1[i] * v2[i]; i++;
                break;
            default: yassert_not_reached(); break;
            } // end switch
            // Keep going with rest of the vector
        }
        for (; i < n; ) {  // Unrolled 4 times...
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
        }
        return sum;
    }

This code is just an example for explanation. There are various further code optimizations that can be done for production-level efficiency. For parallelization, the loop body should call an intrinsic function to vectorize the method. For an AI engine, we could choose our model dimension and other meta-parameters as multiples of the loop unrolling factor, and thereby avoid ever having any of the “leftover” cases.

For sequential code, we could switch from array indices to pointer arithmetic, try replacing the four i++ operators with a single i+=4, and change the integer modulo operator (%) to a bitwise-and test (i.e., use “n&3” rather than “n%4”, which works because 4 is a power-of-two). It might also be better to use “+” rather than the “+=” operator. Finally, if we carefully code the leftover cases, the main loop could be unrolled to many more levels than just four.
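
A hypothetical sketch (not from the book) combining several of those sequential optimizations, using pointer arithmetic, one pointer bump per unrolled block, a single “+” chain instead of repeated “+=”, and the bitwise-and test, might look like this, again assuming n should be a multiple of 4:

```cpp
// Hypothetical pointer-arithmetic variant of 4-level unrolling:
// "n & 3" is the power-of-two equivalent of "n % 4".
float vecdot_unroll4_ptr(const float* v1, const float* v2, int n)
{
    if ((n & 3) != 0)   // same test as n % 4 != 0
        return 0.0f;    // fail: n is not a multiple of 4
    float sum = 0.0f;
    const float* end = v1 + n;
    for (; v1 != end; v1 += 4, v2 += 4) {
        sum += v1[0] * v2[0]   // one "+" chain per unrolled block
             + v1[1] * v2[1]
             + v1[2] * v2[2]
             + v1[3] * v2[3];
    }
    return sum;
}
```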

 
