Loop Skewing

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Loop skewing is a somewhat mind-bending method of changing nested loops to make them more parallelizable. This technique applies when there are two nested loops, but the inner loop is difficult to parallelize because of a dependency on the outer loop variable. The performance advantage does not come from the skewing itself, but from the other loop optimizations that skewing makes possible, especially loop interchange, which swaps the inner and outer loops.
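
For instance, the kind of loop nest where skewing matters looks something like this hypothetical “wavefront” computation (the array a, the sizes N and M, and the function name are made up for illustration, not taken from the book): each element depends on the element above it and the element to its left, so the inner j loop carries a dependence and cannot simply be vectorized or parallelized as written.

    const int N = 1000;   // Outer dimension (illustrative size)
    const int M = 50;     // Inner dimension (illustrative size)

    // Hypothetical wavefront-style computation: each cell depends on its
    // upper neighbor a[i-1][j] and its left neighbor a[i][j-1], so the
    // inner j loop carries a dependence and cannot run in parallel as-is.
    void wavefront(float a[N][M])
    {
        for (int i = 1; i < N; i++) {
            for (int j = 1; j < M; j++) {
                a[i][j] = a[i-1][j] + a[i][j-1];
            }
        }
    }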

The loop skewing solution is far from obvious. The range bounds of the inner loop are changed by “skewing” them by a factor based on the outer loop variable. And then, as if by some magical potion, this reshapes the dependence on the outer loop so that, after a follow-up interchange, the inner loop can run fast on a GPU. Who knew?

As a simplistic example, consider two nested loops:

    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 50; j++) {
            arr[i][j] = something;
        }
    }

We can skew the inner loop by adding a skew factor based on the outer loop variable (e.g. “i” or “i+1” or something similar). Add this skew factor to the loop bounds of j, but then subtract the skew factor (“i”) from any usages of the index “j” inside the inner loop's body.

    for (int i = 0; i < 1000; i++) {
        for (int j = i; j < 50 + i; j++) {
            arr[i][j - i] = something;
        }
    }

Hence, j has changed from the range (0...50) to the skewed range (i...i+50), by adding the skew factor “i” to the start and end bounds. The use of “j” in the inner loop body has changed from “j” to “j-i” (i.e. subtracting the skew factor “i”). The result is a skewed, parallelogram-like shape of i and j indices, but the actual arithmetic calculations are unchanged.

This newly skewed code isn't any faster. It does exactly the same calculations on the 50,000 elements of array arr, and is actually a little worse because of the extra “50+i” and “j-i” computations. However, in some cases, doing this weird skewing transformation then allows us to follow up with a loop interchange optimization, swapping the inner and outer loops. And I'm not even going to pretend to fully understand this, but there are situations where the non-skewed inner loop cannot be vectorized or interchanged, yet once the loop has been skewed, it can be interchanged, and then, via this hocus pocus, we get a different inner loop that can be vectorized. Hopefully, the GPU knows what's going on.
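
To make the hand-waving a little more concrete, here is roughly what the skew-then-interchange sequence does to the hypothetical wavefront loop sketched earlier (again, a, N, and M are illustrative names, not from the book). After skewing j by i and swapping the loops, the new inner loop walks along an anti-diagonal of the array, and every cell on that diagonal depends only on values computed on earlier diagonals, so the inner loop iterations are independent and can (in principle) be vectorized or run in parallel.

    // Skewed and interchanged version of the hypothetical wavefront loop.
    // The outer loop runs over the skewed index j2 = i + j (an anti-diagonal);
    // the inner loop over i has no loop-carried dependence, because both
    // values it reads come from the previous diagonal (j2 - 1).
    void wavefront_skewed_interchanged(float a[N][M])
    {
        for (int j2 = 2; j2 <= N + M - 2; j2++) {
            int ilo = (j2 - (M - 1) > 1) ? j2 - (M - 1) : 1;   // max(1, j2-(M-1))
            int ihi = (j2 - 1 < N - 1) ? j2 - 1 : N - 1;       // min(N-1, j2-1)
            for (int i = ilo; i <= ihi; i++) {   // independent iterations
                int j = j2 - i;                  // undo the skew
                a[i][j] = a[i-1][j] + a[i][j-1];
            }
        }
    }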

 
