Aussie AI Blog

CUDA Thread Divergence

  • September 26, 2024
  • by David Spuler, Ph.D.

What is CUDA Thread Divergence?

Thread divergence, also called "warp divergence" or "branch divergence," is a computation bottleneck that occurs when a subset of the threads in a warp takes a different path at a control flow branch, such as an if statement or a loop conditional test. The reason it slows the GPU down is rather strange: CUDA runs both branches sequentially, rather than in parallel. Apparently, the underlying algorithm on the GPU runs like this at a conditional branch:

  • Evaluate the condition on all 32 threads (in parallel)
  • For all threads where the condition was true, run the first "if" path, with the other threads disabled.
  • Reverse the logic, and run the code in the "else" branch, with the first subset of threads disabled.

The reasons why a GPU works this way are somewhat involved, but basically it's a huge vector processor that's much closer to SIMD than it appears. It runs the "if" and "else" sections as one non-stop sequence of SIMD operations, using a masking trick to suppress the results on whichever subset of the 32 threads is disabled.
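
As an illustration, here's a minimal sketch of a kernel that diverges inside every warp, because even and odd threads take different paths (the kernel name and the arithmetic are just illustrative assumptions):

    __global__ void diverge_demo(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (threadIdx.x % 2 == 0)
                data[i] *= 2.0f;   // Even threads run first; odd threads are masked off
            else
                data[i] += 1.0f;   // Then odd threads run; even threads are masked off
        }
    }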

The good news is that this is all totally hidden from the programmer by the CUDA stack. Programmers can just write if statements, loop conditions, or even switch statements, without worrying about enabling or disabling threads in a warp. Kudos to the compiler design engineers at NVIDIA for making this work so well.

The bad news is that branch divergence causes a significant slow-down in the execution of a kernel, because it forces serial execution. Ugh! Hence, one kernel optimization is to try to avoid divergent branches as much as possible.

Optimizations to Reduce Thread Divergence

Methods to reduce thread divergence (warp divergence) include:

  • Reduce control flow statements in kernels (e.g., if statements, loops, switch statements).
  • Avoid complex side effects in the short-circuiting C++ operators (i.e., &&, || and ?:).
  • Use CUDA math functions or other CUDA SIMD intrinsic function calls.
  • Hoist if statements out of kernels to higher-level logic.
  • Use predication for if statements in kernels.
  • Use compile-time preprocessor directives to avoid live if statements.
  • Use conditions at the warp level, so that in each warp, all 32 threads follow either the if or the else pathway (see the sketch after this list).
  • Keep else pathways empty (so the extra serialized part of the code does nothing).
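
As a sketch of the warp-level trick mentioned above, branching on the warp index instead of the thread index keeps all 32 threads of a warp on the same pathway (the kernel and the branch logic are illustrative assumptions):

    __global__ void warp_aligned_demo(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int warp_id = threadIdx.x / warpSize;   // warpSize is 32
        if (i < n) {
            if (warp_id % 2 == 0)
                data[i] *= 2.0f;   // All 32 threads of this warp take this path
            else
                data[i] += 1.0f;   // All 32 threads of this warp take the other path
        }
    }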

Remember the basics. General advice about minimizing branch divergence by reducing if statements can be taken too far. The general code optimization advice follows:

  • Help the compiler auto-optimize the branches (it uses "predication").
  • Examine the PTX assembler (e.g., compile with nvcc -ptx) to see whether any branches are actually happening.
  • Don't micro-optimize branches in non-critical kernels that don't matter.
  • A lot of other optimizations matter more than branch divergence.

Device code only. One point to make, which is probably stating the obvious, but I'm going to say it anyway: host code is fine! Branch divergence is only a problem for kernel code running on the GPU device. The basic C++ code on the host is serialized on the CPU anyway, so use as many if statements and loops as you like in the host code.

The compiler fixes many cases. The CUDA C++ compiler will optimize a lot of cases to avoid branch divergence. And many micro-optimizations either won't help, or will result in the same code. You can even make things worse for the compiler if you use complex semantics, because it might stop the auto-optimizer from fixing things for you. You need to examine the assembler code to determine whether your super-cool method to avoid an if statement is actually changing anything at all.

Another point about the compiler handling things for you: not all loops are bad! Simple loops do not cause thread divergence if every thread executes all iterations. A simple example is a constant loop:

    for (int i = 0; i < 8; i++) {  // Harmless
        // ...
    }
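
By contrast, a loop bound that varies per thread does diverge, because threads with smaller bounds finish early and then idle while the rest keep iterating:

    for (int i = 0; i < threadIdx.x; i++) {  // Divergent
        // ... threads with a low threadIdx.x finish early and stall
    }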

Control Flow

Branch divergence can occur with any of the control flow statements where there are two or more pathways:

  • if and if-else statements
  • for loops
  • while loops
  • do-while loops
  • switch statements

Note that some other C++ control flow statements alter the flow path, but do not really introduce divergence, because they are non-conditional branches that don't offer two pathways at that point:

  • return statement
  • continue statement
  • break statement

Also, as discussed further below, there are several CUDA C++ operators that have control-flow characteristics, such as the ternary operator.

Predication

The method of predication is to make the if and else clauses more explicit. The compiler will often do this optimization for you via auto-predication, because it mirrors how the GPU actually executes branches. However, you can do it manually.

The normal non-predication way of writing a branch is like this code:

    if (condition) {
        // Code
    }
    else {
        // Other code path
    }

Manual predication means writing this:

    if (condition) {
        // Code
    }
    if (!condition) {
        // Other code path
    }

There are no else statements in predication, but don't go crazy, because the compiler usually handles it anyway. To predicate by hand, you would also need to manually convert any ternary operators (?:) into if statements, as in the sketch below. And operator short-circuiting is also an occasional concern.
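
Here's a small sketch of that manual ternary conversion (the variable names are hypothetical):

    // Original ternary form:
    y = cond ? a : b;

    // Manually predicated form:
    if (cond)  y = a;
    if (!cond) y = b;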

CUDA C++ operators

Some C++ operators have a hidden control flow aspect where two blocks of code may differ. The main ones are:

  • && logical and operator — short-circuited if first operand is false.
  • || logical or operator — short-circuited if first operand is true.
  • ?: ternary operator — has an if-else meaning.

The Boolean logical operators do short-circuiting in C++ and also in CUDA C++. The && operator has an "and then" flow, and the || operator has an "or else" flow. But this won't really matter in regard to branch divergence in simple expressions involving basic variables (e.g., i < 10 && i != 0), because the compiler won't really evaluate such conditions using branching. The only issues arise if you're doing tricky side effects in the second operand expression of these operators, such as assignments, increments/decrements, or function calls.
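
For example, here's a hedged sketch of the kind of second-operand side effect that creates a real hidden branch (the buffer and index are hypothetical):

    // Risky: buf[i++] executes only when i < n, so the short-circuit
    // becomes a hidden conditional branch that can diverge.
    if (i < n && buf[i++] != 0) {
        // ...
    }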

On the other hand, the ternary operator (?:) definitely affects control flow, and can lead to branch divergence. The ternary operator is not really an optimization compared to using if-else statements. It's the same!

Use CUDA Math Functions

A lot of the traditional C++ tricks aren't that great for CUDA because of the branch divergence problem. Here's some examples of things not to do in kernel code:

    #define MAX(x,y) ( ((x) > (y) ) ? (x) : (y) )  // BAD
    #define MIN(x,y) ( ((x) < (y) ) ? (x) : (y) )  // BAD

Also, don't do the equivalent in small inline functions, because the problem for branch divergence is the presence of an if statement, not the function call overhead:

    inline float abs(float x) 
    { 
        if (x < 0.0)      // BAD
            return -x; 
        else 
            return x; 
    }

Instead, you should use the many builtin CUDA functions inside kernels, such as these:

  • max
  • min
  • abs

It seems reasonable to assume that they've been implemented in a way that doesn't suffer from branch divergence, and indeed, many of them map to single PTX instructions, with no branches or function call overhead. There are many more such functions in the CUDA runtime libraries.
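
As a quick sketch, the float-specific math functions can be called directly in device code (the wrapper function here is a hypothetical example):

    __device__ float clamped_abs(float x, float limit)
    {
        return fminf(fabsf(x), limit);  // Branchless: maps to PTX abs and min instructions
    }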

Example: RELU activation function. Implementing the RELU activation function on CUDA is a good example. Don't use a manual method:

    inline float RELU(float x)
    {
        if (x <= 0.0) return 0.0;
        return x;
    }

Instead, you can just use the builtin CUDA "max" function, which has various overloaded declarations for different types. You can even use a preprocessor macro for this simple change (the 0.0f literal keeps both arguments float, so the float overload is chosen):

    #define RELU(x)  (max((x),0.0f))
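
Here's a minimal usage sketch of that macro in a full kernel (the kernel name and launch configuration are assumptions):

    __global__ void relu_kernel(float* v, int n)  // Uses the RELU macro defined above
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)               // Boundary check: diverges in at most the final warp
            v[i] = RELU(v[i]);
    }

    // Example launch: relu_kernel<<<(n + 255) / 256, 256>>>(d_v, n);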

Research on Branch Divergence Optimizations

There is a research area that focuses on techniques to reduce branch divergence, and how to automate them in the compiler, including papers over a decade old. Some of the basic methods are:

  • Branch fusion
  • Tail merging
  • Control-flow melding
  • Thread data remapping

I'm not going to go into details of these methods, but they sure are fun!

CUDA C++ Optimization Book

The new CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization
