Aussie AI

Duffs Device for Loop Unrolling

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Duff's Device for Loop Unrolling

There's a neat coding trick called “Duff's Device” for loop unrolling, which uses a switch with case fallthrough to mimic assembler coding style. However, it's not great for vectorization as it's likely to confuse the compiler, so may be mostly of theoretical interest.

    float aussie_unroll4_duff(float v1[], float v2[], int n)  
    {
        // Unrolled dot product with Duff's Device 
        int i = 0;
        float sum = 0.0;
        switch (n % 4) {
            for (; i < n; ) {
                case 0: sum += v1[i] * v2[i]; i++;
                case 3: sum += v1[i] * v2[i]; i++;
                case 2: sum += v1[i] * v2[i]; i++;
                case 1: sum += v1[i] * v2[i]; i++;
                default:;
            } // end for
        } // end switch
        return sum;
    }

What's happening here? My brain hurts looking at this code! The trick is that the outside switch branches into a case that is inside the body of a for loop. This is not normal everyday coding, because there's a loop inside a switch, and the loop body crosses over several case statements. Also, none of the case statements has a “break” statement and they instead rely on fallthrough semantics. Similarly, the “default” clause is mainly just to avoid getting a spurious compilation warning (i.e. “missing default”), and also has no “break” with only a lonely semicolon. Note also that the case labels are written in reverse order from top to bottom (3..2..1), except for 0 at the top.

How does this even work? The first point is that it does. This code performs the exactly correct number of iterations for any value of n (except n==0), and similar versions with an unrolling factor of more than 4 will also work (i.e. if you change “n%4” and add more case constants). The code looks like a hack, but actually uses standardized C++ semantics of case fallthrough and switch multi-way control flow and should work on all platforms. Branching into the middle of a loop with a switch is valid in C++ provided it doesn't bypass any local variable initialization (hence, don't put “sum” into the switch). Also, the case fallthrough semantics (i.e. without a “break” ending each “case”) are standard for C and C++ since inception. Finally, note that this code is buggy for n==0, because it incorrectly does 4 iterations, so it ideally needs a parameter validation assertion at the start.

Bug alert! Note that you cannot tweak the “i++” instruction using the standard idiom:

   sum += v1[i] * v2[i++];  // Bug!

The obscure problem is that the “*” operator doesn't guarantee left-to-right evaluation of its operands. The code assumes evaluation order of: v1[i], v2[i], *, i++, starting from the left. However, the C++ optimizer can legally do this order of operations: v2[i], i++, v1[i], *, which is not what you intended and gets the wrong array element for v1[i]. This code might be unreliable across platforms, or it might work in the debugger mode, but fall over once you turn on high levels of optimization. So, there is an “order of evaluation” pitfall if you put “++” in an operand of the “*” operator or many other binary arithmetic operators.

Is Duff's Device any faster? The short answer is “not really,” although it looks very appealing (or appalling). Firstly, note that this trick is not actually very useful for vectorization, because a switch cannot branch into the middle of a vectorized intrinsic (i.e. if you replace the loop body with a SIMD instruction). Furthermore, although I haven't tested it, I doubt many optimizers will be able to auto-optimize that complex control flow with SIMD instructions. In sequential code, this method also isn't much faster, as it doesn't really have any fewer operations than a basic unrolled loop (i.e. with extra cases handled separately before or after the main loop). The above example of Duff's Device can be further sped up using pointer arithmetic and “looping down to zero” optimizations, but so can the other unrolled versions. However, there is a minor speed advantage in terms of “instruction locality” because the above code is very concise.

The main advantage of Duff's Device is to bamboozle your colleagues. You can use Duff's Device with any unrolling factor, not just 4 as in the example shown above (e.g. change to 8 by using “n%8” and adding cases for 4, 5, 6, and 7, ordered from 7 down to 1, leaving 0 on top). Actually, the unrolling factor needn't be a power-of-two. Make it a prime number for extra bonus points. If you want more of this kind of coding trickery, also search up Jensen's device and Pigeon's device.

 

Next:

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs

Generative AI in C++ The new AI programming book by Aussie AI co-founders:
  • AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++