Aussie AI

CUDA C++ Optimization Research

  • Last Updated 10 December, 2024
  • by David Spuler, Ph.D.

CUDA C++ Blog Articles

See also these Aussie AI blog articles:

CUDA Introductory Articles

Articles and tutorials for CUDA include:

CUDA Programming Books

CUDA Optimization Techniques

Articles and research on CUDA performance optimization techniques:

CUDA Profiling and Timing

Articles on using profiler tools for timing the efficiency of CUDA kernels:

CUDA Memory Optimization

Articles and papers on CUDA memory optimization techniques:

CUDA C++ Matrix Multiplication (MatMul/GEMM)

Articles on coding MatMul/GEMM in CUDA C++:

CUDA Debugging Techniques

Articles on CUDA debugging tools and techniques:

CUDA Portability

Articles and papers on the portability of CUDA code:

CUDA Compatibility and Versions

Articles about version compatibility and CUDA:

CUDA Emulator

Can you run CUDA without actually having a real GPU, on just a CPU? This would be useful for educational purposes, and also for programmers working at home on their laptops.

What about a fully functional CUDA emulator? There are rumors that such a beast existed around 2010 (almost 15 years ago), and was even supported by NVIDIA, but it no longer seems to exist.

Here are some articles on CUDA emulation:

CUDA Static Analysis

Research on static analysis (code checking or auto-tuning) of CUDA programs:

CUDA Programming Bugs

Papers on the types of programming errors that can occur in CUDA kernels:

CUDA Floating-Point Errors

CUDA Floating-Point Runtime Error Checkers

Research papers on tools that detect floating-point errors and exceptions at runtime:

CUDA Auto-Tuning Research

Research on CUDA "auto-tuning," which is the automatic analysis of how to run a CUDA kernel faster:

CUDA Tools

Research papers on the theory of some of the CUDA C++ tools and extensions:

Warp Divergence in CUDA Kernels

Warp divergence, also called "thread divergence" or "branch divergence," occurs when some threads in a 32-thread warp follow different control flow paths (i.e., at if statements or loop conditions). Divergence impedes fast GPU execution of code, so avoiding it around control flow statements in CUDA kernels is an important optimization technique.
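As a sketch of the idea (hypothetical kernel names, not from any particular library), here is a branch that diverges within a warp, and a branch-free rewrite in which every lane follows the same path:

```cuda
// Divergent: even and odd lanes of the same warp take different
// branches, so the hardware serializes the two paths.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f;   // even lanes
        else
            x[i] = x[i] + 1.0f;   // odd lanes
    }
}

// Branch-free alternative: compute both candidate results and select.
// The conditional expression typically compiles to a select or
// predicated instruction rather than a divergent branch.
__global__ void uniform(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float doubled = x[i] * 2.0f;
        float bumped  = x[i] + 1.0f;
        x[i] = (i % 2 == 0) ? doubled : bumped;
    }
}
```

Note that the initial bounds check (`i < n`) is also a branch, but it only diverges in the final partial warp, which is usually harmless.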

Articles and research on thread divergence:

Warp Shuffle

Warp shuffle is a set of CUDA GPU primitives that allow the 32 threads in a warp to exchange register values directly. The shuffle intrinsics are faster than shared memory ("__shared__"), but are limited to 32 threads (i.e., one warp only), whereas shared memory is visible to all threads in all the warps of a block.
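A classic use of warp shuffle is a warp-level sum reduction. This sketch (a hypothetical helper, assuming CUDA 9 or later for the `_sync` variants) halves the active distance each step, so after five steps lane 0 holds the sum of all 32 lanes:

```cuda
// Warp-level sum reduction via __shfl_down_sync.
// Each step adds in the value held by the lane `offset` positions
// higher; after log2(32) = 5 steps, lane 0 has the warp total.
__device__ float warpReduceSum(float val) {
    const unsigned FULL_MASK = 0xFFFFFFFFu;  // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(FULL_MASK, val, offset);
    return val;  // result is valid in lane 0
}
```

No `__syncthreads()` is needed because the exchange never leaves the warp.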

Memory Address Alignment

Aligned memory accesses are faster in CUDA than those with non-aligned addresses.
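One common way to exploit alignment is vectorized loads through `float4`, which issues one 128-bit memory transaction per thread instead of four 32-bit ones. This is a sketch (hypothetical kernel name), assuming the buffer length is a multiple of 4 and the pointer comes from `cudaMalloc`, which returns addresses aligned to at least 256 bytes:

```cuda
// Scale an array using 16-byte-aligned float4 loads and stores.
// n4 is the element count divided by 4.
__global__ void scale4(float4 *x, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = x[i];          // one 128-bit aligned load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        x[i] = v;                 // one 128-bit aligned store
    }
}
```

If the length is not a multiple of 4, a scalar loop (or a second kernel) has to handle the tail elements.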

Articles and papers on memory alignment in CUDA:

Shared Memory Optimizations in CUDA

Articles on CUDA optimizations using shared memory:

Memory Address Coalescing in CUDA

A coalesced memory access is one where the threads of a warp access adjacent locations in global memory. This is much faster in CUDA than having different threads access non-adjacent locations.
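To illustrate the difference (hypothetical kernel names): in the first kernel, consecutive threads touch consecutive addresses, so a warp's 32 loads merge into a few wide transactions; in the second, each lane lands on a different cache line and the warp may issue up to 32 separate transactions:

```cuda
// Coalesced: thread i accesses element i, so a warp covers one
// contiguous 128-byte region.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced (anti-pattern): thread i accesses element i * 32,
// scattering the warp's accesses across 32 different cache lines.
__global__ void copyScattered(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) out[i * 32] = in[i * 32];
}
```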

Articles on memory coalescing in CUDA:

Array Stride Optimizations in CUDA

Using the "stride" of an array is an important access pattern for kernels in CUDA. One of its advantages is that it achieves coalesced memory accesses.
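The best-known form is the "grid-stride loop," sketched below (hypothetical SAXPY-style kernel). Each thread starts at its global index and advances by the total number of threads in the grid, so any grid size covers any array length, and neighboring threads always touch neighboring elements on each iteration:

```cuda
// Grid-stride loop: stride equals the total thread count, so the
// accesses stay coalesced and the kernel works for any n.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}
```

The same kernel can then be launched with a fixed, occupancy-friendly grid size rather than one sized to the data.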

Articles and papers on array stride optimizations:

CUDA Kernel Fusion

Kernel fusion is the merging of two kernels into one. Typically, if two kernels run sequentially on the same set of data, with the output of one kernel feeding into the second, it can be more efficient to combine them into a single kernel.
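A minimal sketch (hypothetical kernels, fusing an elementwise add with a ReLU): the unfused version launches twice and round-trips the intermediate through global memory, while the fused version keeps it in a register:

```cuda
// Unfused: two launches; `tmp` is written to and re-read from
// global memory between them.
__global__ void addKernel(const float *a, const float *b, float *tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a[i] + b[i];
}
__global__ void reluKernel(const float *tmp, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(tmp[i], 0.0f);
}

// Fused: one launch; the intermediate sum never leaves a register,
// saving one global-memory round trip plus one launch overhead.
__global__ void addReluFused(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(a[i] + b[i], 0.0f);
}
```

The fused version also no longer needs the `tmp` buffer allocated at all.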

Papers on kernel fusion in CUDA include:

CUDA Security Issues

Research on CUDA security issues, such as buffer overflow exploits:

General Research on CUDA

Articles and papers on CUDA programming:

CUDA C++ Optimization Book



The new CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization


CUDA C++ Debugging Book



The new CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: