Aussie AI Blog

CUDA C++ Memory Coalescing Optimizations

  • September 26, 2024
  • by David Spuler, Ph.D.

What is Coalesced Data Access in CUDA C++?

An important way to optimize CUDA kernel data access costs is to use a "coalesced" pattern when accessing chunks of data in global memory. Coalesced data accesses are parallel accesses to contiguous memory addresses. In other words, a group of threads accesses contiguous memory, all at the same time. The basic idea is that:

  • Each thread accesses an adjacent memory address
  • None of the threads access the same address (no contention)

For a warp of 32 threads, the warp accesses 32 adjacent memory addresses in a contiguous block, with each thread accessing a separate address. These coalesced accesses to global memory are faster in a GPU than random accesses to global memory. Kernels run faster when the memory accessed by each warp is close together.

Note that this is similar to the CPU speedup from accessing contiguous memory blocks in sequence, because of memory cache prefetch optimizations. However, that refers to sequential CPU logic, which we have totally forgotten now that we've got a better parallel processor (called a GPU).
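
For contrast, the plain sequential CPU version is just a loop (a trivial sketch, with a made-up function name):

// Plain sequential C++ version, for contrast: a single thread walks
// the whole array in order, helped along by cache prefetching.
void aussie_clear_vector_sequential(float* v, int n)
{
	for (int i = 0; i < n; i++) {
		v[i] = 0.0f;
	}
}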

What do coalesced memory accesses look like in a kernel? Nothing exciting, let me assure you. In fact, they're very basic and unexciting:

__global__ void aussie_clear_vector_coalesced(float* v, int n)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	if (id < n) {
		v[id] = 0.0;  // Clear one vector element..
	}
}

Why is this coalesced? Because every thread accesses a different element of the vector v in global memory. For simplicity, let's assume that n==128 and we're running 2 blocks of 64 threads each (i.e. 2 warps of 32 threads in each block). When it runs, all 128 threads run in parallel, all in lock-step, and they all process v[id] at exactly the same time. The first warp has 32 threads that access indices 0..31, the second warp is also in the first block, and it accesses 32..63, and the next two warps in the second block access array elements 64..95 and 96..127.
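
For concreteness, a host-side launch for that configuration might look like this (just a sketch; the launcher function name is made up, and it assumes dev_v has already been allocated in GPU global memory):

// Hypothetical host-side launch for the n==128 example:
// 2 blocks of 64 threads = 128 threads (i.e. 4 warps of 32).
void aussie_launch_clear_vector(float* dev_v, int n)
{
	int threads_per_block = 64;
	int blocks = (n + threads_per_block - 1) / threads_per_block;  // 2 blocks when n==128
	aussie_clear_vector_coalesced<<<blocks, threads_per_block>>>(dev_v, n);
}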

These accesses are all coalesced. Each warp of 32 threads accesses all 32 of the contiguous memory elements, all in parallel.

But that's just an ordinary CUDA kernel, I hear you say. That's correct, and the basic thread model of kernels tends to use coalesced memory accesses, unless we try hard not to.

Non-Coalesced Memory Accesses

Let's look at a different kernel, where each thread is clearing 4 elements of the vector, rather than just one. This is not usually a good plan, unless our vectors are so large that we can't run enough threads to do each element in a separate thread. Hence, here's the code:

__global__ void aussie_clear_vector_non_coalesced(float* v, int n)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	id *= 4;  // Each thread now covers 4 elements (assumes n is a multiple of 4)
	if (id < n) {
		v[id] = 0.0;  // Clear 4 vector elements..
		v[id + 1] = 0.0;
		v[id + 2] = 0.0;
		v[id + 3] = 0.0;
	}
}

Now, if you've got your old CPU sequential C++ programming hat on, this code looks better. It looks like a loop unrolling optimization that does 4 assignments in a row, and should run a lot faster.

Not at all! Everything is upside-down in parallel world (which is why Australians like it), because this kernel is actually:

  • Slower when run in parallel, and
  • Using non-coalesced memory accesses

Because each thread is clearing 4 elements, we actually need to run 4 times fewer threads, so it's much less parallel. And each thread is doing 4 accesses to global memory, one after another, in a sequential mini-algorithm. This is much less parallelization, except to the extent that the NVCC compiler auto-parallelizes anything. We really should expect this kernel to run roughly 4 times slower.
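
To see the reduced parallelism in concrete terms, a matching launch only needs n/4 threads (again a sketch with a made-up launcher name, assuming dev_v is device memory and n is a multiple of 4):

// Hypothetical host-side launch for the 4-elements-per-thread kernel:
// only n/4 threads are needed, i.e. one quarter of the parallelism.
void aussie_launch_clear_vector_v4(float* dev_v, int n)
{
	int threads_per_block = 64;
	int total_threads = n / 4;  // assumes n is a multiple of 4
	int blocks = (total_threads + threads_per_block - 1) / threads_per_block;
	aussie_clear_vector_non_coalesced<<<blocks, threads_per_block>>>(dev_v, n);
}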

Furthermore, it's not a coalesced memory access pattern. Let's just examine the first warp of 32 threads. These will set the id variable to 0..31, which is then quadrupled to values of 0, 4, ..., 124 (i.e. 31*4). This seems superficially like it might be a coalesced memory pattern, since the 4 assignments in each thread will iterate over a contiguous memory block of 128 vector elements, and ensure that elements v[0]..v[127] are all cleared properly.

But it's not a coalesced access order. The first assignment v[id] in these 32 threads will set v[0], v[4], v[8], etc. This is not a coalesced pattern, since none of these addresses are adjacent. Each warp-wide access is now spread across roughly four times as much memory, so the GPU needs roughly four times as many memory transactions to service it.

The second assignment is also non-coalesced, which also reduces performance. Each of the threads will run it in lock-step, but v[id+1] will set v[1], v[5], v[9], etc. Similarly, the third and fourth assignments are not coalesced.

Stride Memory Accesses

The way to get coalesced memory accesses, if you can't have each thread accessing only one address, is to have each thread "stride" through the array. In this way, each thread can do multiple memory access operations, but each step of the sequence has all the 32 threads in a warp accessing adjacent memory addresses, so as to achieve coalescing.

How does a stride work? We want to have 32 threads, each doing 4 assignments, but in such a way that each of the 4 assignments covers a contiguous range of 32 vector elements across the warp.

Obviously, we want the first assignment v[id] to set v[0]...v[31] for the 32 threads in the warp. The second assignment should set v[32]...v[63]. The third should cover 64..95 and the fourth has 96..127. So if we look at just the first thread in the warp, this should set:

  • First assignment — v[0]
  • Second assignment — v[32]
  • Third assignment — v[64]
  • Fourth assignment — v[96]

You can see the pattern: we want each thread to step forward by 32 vector elements each time. This value of 32 is called the stride of the array. Or, rather, the number of bytes is more properly called the stride of an array, but you get the idea.

So here's our first attempt at a strided kernel with coalesced memory access patterns:

__global__ void aussie_clear_vector_stride(float* v, int n)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	if (id < n) {
		int stride = 32;
		v[id] = 0.0;  // Clear 4 vector elements.. STRIDED!
		v[id + stride] = 0.0;  // BUG!!
		v[id + 2*stride] = 0.0;  // BUG!!
		v[id + 3*stride] = 0.0;  // BUG!!
	}
}

Well, this is indeed coalescing its memory accesses, but it's just a little buggy. The test "id<n" does not prevent array bounds overflow for the subsequent assignments such as "v[id+stride]". So this code craters whenever there are threads near the end of the range (i.e. with id within 3*stride of n). It really needs an "if" safety test before each of the four assignments, rather than just the first.
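
One direct way to patch it is simply to bounds-check every one of the strided assignments (a quick sketch, with a made-up kernel name):

__global__ void aussie_clear_vector_stride_checked(float* v, int n)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = 32;
	if (id < n) v[id] = 0.0;                             // safe
	if (id + stride < n) v[id + stride] = 0.0;           // safe
	if (id + 2 * stride < n) v[id + 2 * stride] = 0.0;   // safe
	if (id + 3 * stride < n) v[id + 3 * stride] = 0.0;   // safe
}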

The cleaner fix is to use a "stride loop" idiom, where the array bounds are checked on each iteration. Note that this code isn't yet one of the fabled "grid-stride loops" that live in the GPU unicorn forest. This is just a basic stride loop, not a loop over the whole grid size.

__global__ void aussie_clear_vector_stride_loop(float* v, int n)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = 32;
	for (int i = 0; i < 4; i++, id += stride) {
		if (id < n) {
			v[id] = 0.0;  // Clear one element per iteration (4 in total).. STRIDED!
		}
	}
}

The loop version is basically like the non-loop stride version, but re-rolled into a short loop. It does the same set of array accesses in the same order, but it is now safe because the "id<n" bounds check is performed on each iteration. This is a coalesced access order, as discussed above, because each iteration's assignment runs across the warp in lock-step. The contiguous memory addresses of v are processed like this by the 32 threads in the first warp: 0..31, 32..63, 64..95, and 96..127.
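
As a usage sketch, note that this kernel only covers the whole vector when the hard-coded stride of 32 matches the total number of threads launched, so for the n==128 example that means a single block of 32 threads (hypothetical host code below, assuming dev_v is a device pointer). Generalizing the stride to the total number of threads in the whole grid is essentially the grid-stride loop idiom, but that's a story for another day.

// Hypothetical launch: one warp of 32 threads, each clearing 4 elements,
// so 32 threads x 4 iterations = 128 elements covered (matches n==128).
aussie_clear_vector_stride_loop<<<1, 32>>>(dev_v, 128);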

CUDA C++ Optimization Book



The new CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization

More AI Research Topics

Read more about: