Loop Fission
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Loop fission is an optimization that is the opposite of loop fusion. Instead of fusing two loops into one, we take one loop and split parts of it into two loops. Loop fission has also been called by other names, such as “loop splitting” or “loop distribution.”
Loop fission can be more efficient for parallel execution (e.g. vectorization for GPUs), but is often slower for sequential execution. Whereas loop fusion aims to remove the overhead of one of the loops, loop fission tolerates an increased loop overhead in return for simpler loop bodies that can be parallelized. The kernel optimization of “kernel fission” is based on loop fission, and loop fission is one technique used to achieve vectorization for GPUs.
The main reason to use loop fission is hardware acceleration via loop parallelization. A single complicated loop can often run faster when split into two simpler loops, provided hardware acceleration is available for the simpler loops. This is true even if the two resulting loops must run sequentially, because the iterations within each loop are parallelized, and there is a double benefit if the two whole loops can also run in parallel.
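As a simple illustrative sketch (a hypothetical example, not taken from the book), a single loop that performs two independent operations can be split into two loops, each of which is a basic vector operation that a compiler or GPU can parallelize:

// Before fission: one loop doing two independent operations per iteration
void scale_and_offset_fused(float a[], float b[], int n, float scale, float offset)
{
    for (int i = 0; i < n; i++) {
        a[i] *= scale;    // operation 1: re-scale a
        b[i] += offset;   // operation 2: shift b
    }
}

// After fission: two simpler loops, each a basic vector operation
void scale_and_offset_fissioned(float a[], float b[], int n, float scale, float offset)
{
    for (int i = 0; i < n; i++) a[i] *= scale;    // multiply-by-scalar
    for (int i = 0; i < n; i++) b[i] += offset;   // add-scalar
}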
Example: Loop Fission in BatchNorm
A good example arises in part of the code for batch normalization. Each element of the vector needs to have two operations performed on it: subtract the mean (re-centering) and scale by a factor derived from the variance (re-scaling). The naive implementation of the second half of BatchNorm looks like this:
float denom = sqrtf(varc + eps);     // Scale factor
for (int i = 0; i < n; i++) {        // Normalize: re-center and scale
    v[i] = (v[i] - fmean) / denom;
}
This is difficult to hardware accelerate, because it is unlikely that there is a combined “subtract-and-then-divide” operation that can be applied to all elements of a vector in parallel. One possibility is that the platform offers an “add-and-then-multiply” operation, in which case we can use the negative of the additive factor and the reciprocal of the scaling factor, as in the sketch below.
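Here is a minimal sketch of that single-loop rewrite (hypothetical; whether an accelerated add-and-then-multiply vector primitive exists depends on the platform):

float negmean = -fmean;                   // Negate, so subtraction becomes addition
float recip = 1.0f / sqrtf(varc + eps);   // Reciprocal, so division becomes multiplication
for (int i = 0; i < n; i++) {
    v[i] = (v[i] + negmean) * recip;      // add-then-multiply form
}

However, assuming no such combined operation is available, loop fission can be used to split the single complicated loop into two sequential loops: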
float negmean = -fmean;              // Use negative for addition
float denom = sqrtf(varc + eps);     // std. deviation
float recip = 1.0f / denom;          // reciprocal multiply
// Loop 1: Re-center using mean
aussie_vector_add_scalar(v, n, negmean);
// Loop 2: Re-scale by factor
aussie_vector_multiply_scalar(v, n, recip);
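For reference, plain unvectorized versions of those two helpers might look like the following (the signatures are inferred from the calls above; the book's actual implementations may differ):

void aussie_vector_add_scalar(float v[], int n, float c)
{
    for (int i = 0; i < n; i++) v[i] += c;    // Loop 1: add a scalar to every element
}

void aussie_vector_multiply_scalar(float v[], int n, float c)
{
    for (int i = 0; i < n; i++) v[i] *= c;    // Loop 2: multiply every element by a scalar
}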
Each of the two loops is now easy to hardware accelerate, because they are both very simple vector operations: “add-scalar” and “multiply-by-scalar.” Every platform is likely to have hardware acceleration APIs for those simpler operations.
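As an illustration of why such simple loops accelerate well, here is a sketch of an add-scalar loop vectorized with x86 AVX intrinsics (a hypothetical version, not the book's implementation; it assumes n is a multiple of 8, and a production version would need a scalar tail loop):

#include <immintrin.h>   // AVX intrinsics (x86)

void vector_add_scalar_avx(float v[], int n, float c)
{
    __m256 cvec = _mm256_set1_ps(c);         // Broadcast the scalar into 8 lanes
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(&v[i]);   // Load 8 floats
        x = _mm256_add_ps(x, cvec);          // Add the scalar to all 8 at once
        _mm256_storeu_ps(&v[i], x);          // Store the 8 results back
    }
}

The multiply-by-scalar loop vectorizes the same way, with _mm256_mul_ps in place of _mm256_add_ps. So, to summarize, we got an explosive boost to hypersonic rocket speed using atomic operations with loop fission. Isn't that just the bomb?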