Aussie AI Blog
Basic CUDA C++ Programming Mistakes
September 23, 2024
by David Spuler, Ph.D.
There are many common CUDA idioms, and they exist for a reason. Break the pattern, and you might trigger a bug or a slug. The common problems include:
- Not enough threads
- Not enough blocks
- Not enough threads per block
- Too many blocks (each with threads)
- Vector underflows (silently)
- Vector buffer overflows (gasp!)
All these combinations are enough to make your head spin! Let's examine some of these problems.
Not Enough Threads
This is the most basic beginner issue when you're creating your first CUDA program. Using this style is a slug:
mykernel<<<1,1>>>(v,n); // SEQUENTIAL!
It's not parallel programming if you're only running one thread!
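For contrast, here's a minimal sketch of a genuinely parallel launch of the same (hypothetical) mykernel, with one full warp per block:

int n = 64;
int threads_per_block = 32;  // one full warp per block
int blocks = 2;              // 2 blocks x 32 threads = 64 parallel threads
mykernel<<<blocks, threads_per_block>>>(v, n);  // PARALLEL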
Block Size Not Multiple of 32
The number of threads per block (aka the "block size") should be a multiple of the warp size, which is 32 threads. This is not good:
float v[54];
...
int blocks = 2;
mykernel<<<blocks, 27>>>(v,n); // BAD
This might actually work, but it's very inefficient, and offends the sensibilities of any experienced CUDA programmer, for reasons discussed below.
But first, note that this is worse:
float v[54];
...
int blocks = 54;
mykernel<<<blocks, 1>>>(v,n); // BAD
If the threads per block is not 32, or a multiple of 32, there will be odd threads in a warp that aren't properly utilized (or might be doing the wrong thing). The reason is that CUDA schedules threads in "warps" that contain exactly 32 threads. With 27 threads per block, there are 5 wasted thread slots per warp, and with 1 thread per block, 31 slots are wasted. So, instead, you want something more like this:
float v[64];
int n = 64;
...
int blocks = 2;
mykernel<<<blocks,32>>>(v,n); // BETTER
Or you can do this:
float v[64];
int n = 64;
...
int blocks = 1;
mykernel<<<blocks,64>>>(v,n); // BETTER
CUDA can only schedule a full warp of 32 threads at a time, so if you ask for fewer than that, the leftover threads in the warp either still run, which is a bug or a slug, or sit idle and are unavailable for other work, which is a slug.
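Rather than hard-coding 32, you can also query the warp size at runtime via the standard cudaGetDeviceProperties API. A minimal sketch (assuming device 0):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // query properties of GPU device 0
int warp_size = prop.warpSize;      // 32 on all current NVIDIA GPUs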
Wrongly Generalizing Block and Thread Computations
You don't always know the incoming size N of your vector data structure (well, actually, you often do in AI engines, because they have static dimensions, but anyway).
Let's try to generalize our computation of how many blocks to launch, with each block having a fixed number of threads.
There are a few basic points:
- Each block has the same number of threads (i.e., the threads-per-block)
- You can't run half a block (all its threads will run, even if you don't need that many).
Too Few CUDA Blocks
So let's try to generalize our computations:
int n = 33;
float *v = (float*)malloc(n * sizeof(float));
...
int threads_per_block = 32;  // or 64 or 256...
int blocks = n / threads_per_block;  // BUGGY!
mykernel<<<blocks,threads_per_block>>>(v,n);
Does this work? No, because the "/" operator is integer division of "33/32", which truncates to 1, and we can't get fractions of a block.
The result is:
threads_per_block == 32
blocks == 1
We're only running 32 threads, but our vector has 33 elements. So if each kernel thread is processing one vector element, we've missed the extra element. This is a code logic bug. It won't crash CUDA, and nobody's going to whisper in your ear that the end of your vector didn't get processed.
Now, maybe the kernel code has a loop to handle the extra vector elements, which would fix the code logic bug, but it's still a slug. But I'm jumping ahead by mentioning a loop, when your kernel probably doesn't even have an if statement yet. It would be better just to use better block size computations.
Too Many CUDA Blocks
Let's try to fix it. Here's a computation that adds an extra block:
int threads_per_block = 32;
int blocks = (n + threads_per_block) / threads_per_block;  // BUGGY!
This is actually the same as:
int threads_per_block = 32;
int blocks = (n / threads_per_block) + 1;  // BUGGY!
This seems to work better because "(33+32)/32" and "(33/32)+1" both evaluate to 2 blocks, which is what we want.
However, we want the computation to work for all values of n, and it's actually wrong if n==32, because it needlessly uses 2 blocks when 1 block of 32 threads is enough. We have a whole extra block doing nothing, or maybe crashing stuff. You'd think the GPU would be thrilled if you threw it an extra block, but no, not so much.
The trick to get it working for all values is to use a common CUDA idiom:
int blocks = (n + threads_per_block - 1) / threads_per_block; // CORRECT
The subtraction of 1 in the numerator fixes everything using the power of mathematics. For n==32, we get 1 block, and for n==33 there are correctly 2 blocks. It also works correctly for n==63 or n==64, although I feel like I should add a unit test.
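Here's a minimal sketch of that unit test, as plain host-side C++ with assert (the num_blocks helper is just my illustrative wrapper around the idiom):

#include <assert.h>

int num_blocks(int n, int threads_per_block)
{
    // The common CUDA ceiling-division idiom.
    return (n + threads_per_block - 1) / threads_per_block;
}

void test_num_blocks(void)
{
    assert(num_blocks(32, 32) == 1);  // exact fit: 1 block
    assert(num_blocks(33, 32) == 2);  // one element over: 2 blocks
    assert(num_blocks(63, 32) == 2);  // just under 2 full blocks
    assert(num_blocks(64, 32) == 2);  // exact fit again
}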
Odd Vector Sizes
All of this fuss about the number 32 being special because it's the warp size might make you realize something: all of your data should be in arrays (or matrices, or tensors) whose sizes are multiples of 32. It's really hard to fix everything if N is a prime number or other oddity.
That's why a lot of the AI engines have parameters and "hidden dimensions" and "context window sizes" that are a multiple of 32 (e.g., 4096 or 32k or whatever). You won't see this in ChatGPT's source code:
float mytensor[27][1531][17]; // Weird
Better would be:
float mytensor[4096][256][32768]; // The power of 32!
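If you're stuck with an awkward size, one standard workaround (my addition, not from the article) is to pad the allocation up to the next multiple of 32 and ignore the tail elements:

int n = 1531;                   // awkward size
int padded_n = (n + 31) & ~31;  // round up to a multiple of 32 (here, 1536)
float *v = (float*)malloc(padded_n * sizeof(float));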
So now you know: it's all CUDA's fault. Which is more true than we like to admit, since all of this AI frenzy might not have happened without CUDA's blazing speed.
An Underwhelming Kernel (Array Underflow)
Okay, so now that our computations of blocks and threads are sorted, let's actually show a very simple kernel. All good kernels start with the special "__global__" keyword, which means that the function runs on the GPU. Here's a vector-clearing-to-zero kernel, which is a great piece of work if we pretend we don't know about functions like memset or cudaMemset.
Anyway, here's the kernel code for the GPU to run lots of times in parallel:
__global__ void aussie_clear_vector_kernel_buggy1(float* v, int n)
{
    int id = threadIdx.x;  // BUG!
    v[id] = 0.0;  // Clear one vector element..
}
And the host code that runs on the CPU, only once, and launches the kernel off into the GPU, looks like this:
int n = 33;
float *v = (float*)malloc(n * sizeof(float));
...
int threads_per_block = 32;
int blocks = (n + threads_per_block - 1) / threads_per_block;
aussie_clear_vector_kernel_buggy1<<<blocks, threads_per_block>>>(v, n);
I've left out a lot of the details on the CPU side before and after the fancy <<<...>>> syntax (it's called "triple chevron" syntax if you really want to know). The host code also has to allocate memory (twice, once on each side), set up the vectors, and copy them from the CPU to the GPU, and back again.
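As a hedged sketch of that omitted plumbing (standard cudaMalloc/cudaMemcpy calls; error checking omitted; host_v and dev_v are my illustrative names), it looks roughly like:

float *host_v = (float*)malloc(n * sizeof(float));  // CPU-side buffer
float *dev_v = NULL;
cudaMalloc((void**)&dev_v, n * sizeof(float));      // GPU-side buffer
cudaMemcpy(dev_v, host_v, n * sizeof(float), cudaMemcpyHostToDevice);
aussie_clear_vector_kernel_buggy1<<<blocks, threads_per_block>>>(dev_v, n);
cudaMemcpy(host_v, dev_v, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(dev_v);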
But for now, let's just look at the blocks and threads.
Is there a bug? Umm, well, there is the word "BUG" in the comments, and also in the function name, so maybe I gave it away a little bit. But why is it a bug? First, let's look at how many blocks and threads-per-block we've got:
threads_per_block == 32
blocks == 2
So we've now got 32*2=64 total threads running, each setting one element in a vector of size 33.
Aha! Obviously, it's a buffer overflow inside the kernel when we access v[id]! Nice try, but no cigar. Actually, the problem is this statement:
int id = threadIdx.x; // BUG!
What is threadIdx.x when it's at home (on a GPU)? Well, the variable "threadIdx" stands for "thread index" and the ".x" part means the offset in the x-dimension (i.e., 1D vectors). There's also a ".y" for 2D (matrices) and a ".z" for 3D (tensors), but we're just using a vector, so the ".x" part is correct.
The problem is that the "thread index" means the offset of a thread within its current block of threads, not across all the threads in multiple blocks. We have set the threads per block to 32, so each block launches 32 threads. Hence, the offset of any thread within its own block has the range 0..31 only.
By launching 2 blocks of 32 threads each, we get this sequence. The first block launches 32 threads, each having a "threadIdx.x" value of 0..31, so they correctly set the first 32 vector elements to zero (in parallel). Our second block launches another 32 threads at the same time, in parallel with our first block, but even though they're in the second block, their "threadIdx.x" values only relate to their own block, so they also have offsets of 0..31. Hence, we get another 32 threads running the kernel function, setting exactly the same values in the vector.
Overall, the outcome is a logic bug in the vector clearing function. The vector elements 0..31 all get cleared correctly (twice!), but the 33rd element, v[32], is not touched. This CUDA kernel will only work correctly if n<=threads_per_block.
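One quick way to see the duplication for yourself is device-side printf, which CUDA kernels do support (output order is nondeterministic). This diagnostic kernel is my own illustration, not from the article:

#include <stdio.h>

__global__ void show_thread_ids(void)
{
    // With <<<2, 32>>>, this prints threadIdx.x values 0..31 twice:
    // once with blockIdx.x == 0 and once with blockIdx.x == 1.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

Launch it as show_thread_ids<<<2, 32>>>(); followed by cudaDeviceSynchronize(); so the output gets flushed before the program exits.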
An Overwhelming Kernel (Array Overflow)
Like the Six Million Dollar Man, we can rebuild the kernel. We need our first block to have values 0..31 and our second block to have values 32..63. But how do we know which block we're in?
The answer is another CUDA builtin variable called "blockIdx", which stands for "block index" and has value 0 for the first block, 1 for the next, and so on. It also has fields x, y, and z, but we only care about the ".x" values for a vector. By the way, these three-dimensional builtin index variables have the CUDA type "uint3" (the related blockDim uses "dim3"), but you don't need to declare them yourself.
Using the block index allows us to compute index offsets higher than the threads-per-block, such as:
int id = blockIdx.x * 32 + threadIdx.x; // BETTER
This is workable, but inelegant, and we could define a global constant for the threads-per-block.
const int threads_per_block = 32;
int id = blockIdx.x * threads_per_block + threadIdx.x; // PRETTIER
This looks like good C++ code, but the CUDA idiom is actually to use another builtin variable called "blockDim", which stands for "block dimension" and in this case means threads-per-block (in the x direction). Hence, the fully correct CUDA code, which is somewhat uglier, looks like:
int id = blockIdx.x * blockDim.x + threadIdx.x; // CORRECT
The full GPU kernel function looks like this:
__global__ void aussie_clear_vector_kernel_buggy2(float* v, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // CORRECT
    v[id] = 0.0;  // Clear one vector element.. OKAY?
}
Now you can run your corrected kernel, and it will clear all the elements of your vector, rather than only the first 32. But if you're running it in production, you might want to get your mop and bucket ready to clean up the mess, because this kernel is going to segfault all over the place.
Remember that array bounds violation you were worried about before? Now it's happening for real.
The blockIdx.x value is 0..1, blockDim.x is 32 here (because that was our threads per block), and threadIdx.x has range 0..31. If you crank through all the options (sequentially, since you don't have a GPU in your head), you'll find that the id array index variable has range 0..63, all running in parallel. That isn't great for computing v[id] on a vector of size 33. You get 31 array bounds violations, in 31 different threads, all at the same time.
The correction is another common CUDA kernel idiom: the safety if statement. Yes, you can use control flow statements like if statements and loops inside the kernel.
The non-crashing version of the kernel looks like this:
__global__ void aussie_clear_vector_kernel_safe(float* v, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // CORRECT
    if (id < n) v[id] = 0.0;  // Clear one vector element.. Safely!
}
For some reason, most of the CUDA examples write it as above, without curly braces, because adding braces would slow it down or something. I'd personally prefer this minor tweak to the C++ styling:
if (id < n) { v[id] = 0.0; }
Finally, it works! This safety test stops any of the 31 threads that would have overflowed the vector from writing to bad memory, and they simply do nothing. The other 33 threads correctly clear all the elements of the vector, 32 of them in the first block, and 1 in the second block.
This code won't crash, but it's somewhat sluggy for two reasons:
- Redundant threads — the extra 31 threads run but do nothing.
- Warp divergence — some threads take different paths at the if statement.
The redundant threads don't actually slow anything down, since they run harmlessly at the same time in parallel with the useful threads. However, they waste electricity, and prevent the GPU from having better utilization of its silicon wafers to do other computations.
Warp divergence or "branch divergence" refers to the fact that GPUs like all their threads to follow exactly the same instruction path in parallel. Adding an if statement or a loop is problematic for performance, because some threads go left and others go right, and this slows down the whole warp.
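As a small illustrative sketch (my own example, with a made-up kernel name): a branch that depends on the thread index splits every warp into two paths, which is exactly the divergence problem:

__global__ void divergence_demo(float* v, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;  // safety test: diverges only in the last warp
    if (id % 2 == 0)      // even and odd lanes of every warp take
        v[id] = 0.0f;     // ...different paths here, so each warp
    else                  // ...effectively executes both branches.
        v[id] = 1.0f;
}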
Too Many CUDA Blocks for GPU
Just when you thought it was safe to go back into the water, here's the news: that safe kernel is actually broken. The kernel fails if N is too large, because it always tries to create exactly one thread per vector element. Hence, a very large value of N means a large number of blocks.
This can blow the little GPU's mind. If you try to work on a vector with a billion elements, that's a billion divided by 32, or over 31 million warps. Even the amazing NVIDIA GPUs have their limits. For example, a V100 can only keep about 164,000 threads resident at once (80 SMs at 2,048 threads each), which is a lot less than a billion.
The solution is obviously to modify the kernels so that they process more than one vector element. If you're only a little too high, you can maybe manually modify the kernel so that each thread sets 2 vector elements by doing two assignments in sequence.
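Here's a minimal sketch of that two-elements-per-thread idea (the kernel name and the launch math are my own illustration):

__global__ void aussie_clear_two_per_thread(float* v, int n)
{
    int id = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (id < n) v[id] = 0.0f;          // first element for this thread
    if (id + 1 < n) v[id + 1] = 0.0f;  // second element for this thread
}

// Host side: each thread covers 2 elements, so we need ceil(n/2) threads.
int threads_needed = (n + 1) / 2;
int blocks = (threads_needed + threads_per_block - 1) / threads_per_block;
aussie_clear_two_per_thread<<<blocks, threads_per_block>>>(v, n);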
For a more general case, you actually need to add a loop into the kernel function, and do the calculations to work out how many elements each kernel thread needs to process, which is basically N divided by the maximum number of allowed threads.
There's nothing wrong with a kernel like this, and it just looks like an ordinary C++ loop, but we're getting a bit too advanced now. So we'll defer the discussion of how to do loops in kernels, because there are scary things like "grid-stride loops" and "coalesced data accesses" to think about.
CUDA C++ Debugging Book
The new CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging