Aussie AI

CUDA C++ Optimization Research

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

CUDA C++ Blog Articles

See also these Aussie AI blog articles:

CUDA Introductory Articles

Articles and tutorials for CUDA include:

CUDA Programming Books

CUDA Optimization Techniques

Articles and research on CUDA performance optimization techniques:

CUDA Profiling and Timing

Articles on using profiler tools for timing the efficiency of CUDA kernels:

Performance Profiling

Research papers on performance profiling:

CUDA Memory Optimization

Articles and papers on CUDA memory optimization techniques:

CUDA C++ Matrix Multiplication (MatMul/GEMM)

Articles on coding MatMul/GEMM in CUDA C++:

CUDA Megakernels

CUDA megakernels are a kernel optimization method that extends the idea of a thread pool with the producer-consumer idiom. A "megakernel" has persistent workers that can "consume" multiple different types of work, rather than only one.
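
As a rough sketch of the idea (illustrative only, with made-up work types and kernel names, not taken from any particular paper), persistent thread blocks can pull tagged work items from a global queue and dispatch on the item type:

    // Minimal "megakernel" sketch: persistent thread blocks pull tagged work
    // items from a global queue and dispatch on the item type. Illustrative
    // only; real designs add in-kernel producers, per-type queues, and back-off.
    #include <cstdio>
    #include <cuda_runtime.h>

    enum WorkType { WORK_SCALE = 0, WORK_OFFSET = 1 };

    struct WorkItem {
        int   type;    // which kind of work this item is
        float value;   // parameter for that work
    };

    __global__ void megakernel(const WorkItem *queue, int num_items,
                               float *data, int chunk_size, int *next_item)
    {
        for (;;) {
            __shared__ int item_idx;
            if (threadIdx.x == 0)
                item_idx = atomicAdd(next_item, 1);   // block claims the next item
            __syncthreads();
            if (item_idx >= num_items)
                return;                               // queue drained: worker exits

            WorkItem item = queue[item_idx];
            float *chunk = data + (size_t)item_idx * chunk_size;  // one chunk per item
            for (int i = threadIdx.x; i < chunk_size; i += blockDim.x) {
                if (item.type == WORK_SCALE)  chunk[i] *= item.value;  // work type 1
                if (item.type == WORK_OFFSET) chunk[i] += item.value;  // work type 2
            }
            __syncthreads();  // finish this item before thread 0 reuses item_idx
        }
    }

    int main()
    {
        const int num_items = 64, chunk_size = 1024;
        WorkItem h_queue[num_items];
        for (int i = 0; i < num_items; ++i)
            h_queue[i] = { (i % 2 == 0) ? WORK_SCALE : WORK_OFFSET, 2.0f };

        WorkItem *d_queue; float *d_data; int *d_next;
        cudaMalloc(&d_queue, sizeof(h_queue));
        cudaMalloc(&d_data, num_items * chunk_size * sizeof(float));
        cudaMalloc(&d_next, sizeof(int));
        cudaMemcpy(d_queue, h_queue, sizeof(h_queue), cudaMemcpyHostToDevice);
        cudaMemset(d_data, 0, num_items * chunk_size * sizeof(float));
        cudaMemset(d_next, 0, sizeof(int));

        // Launch fewer blocks than work items: each block loops, consuming items.
        megakernel<<<8, 256>>>(d_queue, num_items, d_data, chunk_size, d_next);
        cudaDeviceSynchronize();
        printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(d_queue); cudaFree(d_data); cudaFree(d_next);
        return 0;
    }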

CUDA Debugging Techniques

Articles on CUDA debugging tools and techniques:

CUDA Debugging Tools

Papers on CUDA debug tools:

Sanitizers

Research on sanitizer tools:

CUDA Portability

Articles and papers on the portability of CUDA code:

CUDA Compatibility and Versions

Articles about version compatibility and CUDA:

CUDA Emulator

Can you run CUDA without actually having a real GPU? On just a CPU? This would be useful for educational purposes, and also for programmers working at home on their laptops.

What about a fully functional CUDA emulator? There are rumors that such a beast existed back in 2010 (almost 15 years ago), and was even supported by NVIDIA, but it no longer seems to exist.

Here are some articles on CUDA emulation:

CUDA Static Analysis

Research on static analysis (code checking or auto-tuning) of CUDA programs:

CUDA Programming Bugs

Papers on the types of programming errors that can occur in CUDA kernels:

CUDA Floating-Point Errors

CUDA Floating-Point Runtime Error Checkers

Research papers on tools that detect floating-point errors and exceptions at runtime:

CUDA Auto-Tuning Research

Research on CUDA "auto-tuning" is about automatically finding the fastest way to run a CUDA kernel, typically by searching over launch configurations and code variants:
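
As a minimal illustration of the basic idea (a hedged host-side sketch with a made-up kernel, not taken from any of these papers), an auto-tuner can time a kernel with CUDA events across several candidate block sizes and keep the fastest:

    // Minimal auto-tuning sketch: time one kernel over several block sizes
    // with CUDA events and keep the fastest. Real auto-tuners search much
    // larger configuration spaces (tiling, unrolling, code variants, etc.).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *x, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main()
    {
        const int n = 1 << 24;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));

        int candidates[] = { 64, 128, 256, 512, 1024 };
        int best_block = 0;
        float best_ms = 1e30f;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        for (int block : candidates) {
            int grid = (n + block - 1) / block;
            scale_kernel<<<grid, block>>>(d_x, n, 1.01f);   // warm-up launch
            cudaEventRecord(start);
            scale_kernel<<<grid, block>>>(d_x, n, 1.01f);   // timed launch
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block=%4d  time=%.3f ms\n", block, ms);
            if (ms < best_ms) { best_ms = ms; best_block = block; }
        }
        printf("best block size: %d\n", best_block);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_x);
        return 0;
    }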

CUDA Tools

Research papers on the theory of some of the CUDA C++ tools and extensions:

Warp Divergence in CUDA Kernels

Warp divergence, also called "thread divergence" or "branch divergence," occurs when some threads in a warp of 32 threads follow different control flow paths (i.e., at if statements or loop conditions). Divergence impedes fast GPU execution, so avoiding it around control flow statements in CUDA kernels is an important optimization technique.
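
A small illustrative sketch (hypothetical kernels, not from the cited research) of a divergent branch and a rewrite that avoids it:

    // Sketch of warp divergence versus a branchless rewrite (illustrative only).
    #include <cuda_runtime.h>

    // Divergent: lanes within a warp take different paths at the inner "if",
    // so the warp executes both paths one after the other.
    __global__ void clamp_divergent(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] < 0.0f)
                x[i] = 0.0f;         // only some lanes take this branch
            else
                x[i] = x[i] * 2.0f;  // the rest take this one
        }
    }

    // Branchless: every lane executes the same instructions; the compiler can
    // typically turn the conditional expression into a select/predicated
    // instruction rather than a divergent branch.
    __global__ void clamp_branchless(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            x[i] = (v < 0.0f) ? 0.0f : v * 2.0f;
        }
    }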

Articles and research on thread divergence:

Warp Shuffle

Warp shuffle is a set of CUDA GPU primitives that let the 32 threads in a warp exchange register values directly. The shuffle APIs are faster than shared memory ("__shared__"), but are limited to 32 threads (i.e., one warp only), whereas shared memory works across threads in all the warps of a block.
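
A common illustration is a warp-level sum reduction built from shuffles; this sketch (with made-up kernel names) uses __shfl_down_sync:

    // Sketch of a warp-level sum reduction using warp shuffle primitives,
    // avoiding shared memory within the 32-thread warp.
    #include <cuda_runtime.h>

    __device__ float warp_reduce_sum(float val)
    {
        // Each step halves the number of active summands; after 5 steps,
        // lane 0 holds the sum of all 32 lanes in the warp.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }

    __global__ void sum_kernel(const float *x, float *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? x[i] : 0.0f;   // no early return, so the full warp participates
        v = warp_reduce_sum(v);
        if ((threadIdx.x & 31) == 0)       // one atomic per warp, issued by lane 0
            atomicAdd(result, v);
    }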

Memory Address Alignment

Aligned memory accesses are faster in CUDA than those with non-aligned addresses.
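
A minimal sketch of alignment-friendly access (illustrative, assuming the array length is a multiple of four floats): vectorized float4 loads require 16-byte-aligned addresses, which cudaMalloc provides, and user-defined structs can request alignment explicitly:

    // Sketch of alignment-friendly access: loading four floats at a time via
    // float4 requires 16-byte-aligned addresses and lets the hardware issue
    // wider, aligned loads and stores.
    #include <cuda_runtime.h>

    __global__ void scale_vec4(float4 *x, int n4, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = x[i];        // one aligned 16-byte load
            v.x *= s; v.y *= s; v.z *= s; v.w *= s;
            x[i] = v;               // one aligned 16-byte store
        }
    }

    // A user-defined struct can request the same 16-byte alignment explicitly:
    struct __align__(16) Particle {
        float x, y, z, w;
    };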

Articles and papers on memory alignment in CUDA:

Shared Memory Optimizations in CUDA

Articles on CUDA optimizations using shared memory:

Memory Address Coalescing in CUDA

Coalesced memory access means that the threads in a warp access adjacent locations in global memory, allowing the hardware to combine them into fewer memory transactions. This is much faster in CUDA than having different threads access non-adjacent locations.
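
A small sketch (hypothetical copy kernels) contrasting a coalesced access pattern with an uncoalesced one:

    // Sketch of coalesced vs. uncoalesced global memory access (illustrative only).
    #include <cuda_runtime.h>

    // Coalesced: consecutive threads in a warp read consecutive addresses,
    // so the warp's 32 loads combine into a few wide memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced: each thread handles one row of a row-major matrix, so at
    // each loop iteration consecutive threads touch addresses "cols" floats
    // apart, and the warp's accesses scatter into many separate transactions.
    __global__ void copy_rows(const float *in, float *out, int rows, int cols)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < rows)
            for (int col = 0; col < cols; ++col)
                out[row * cols + col] = in[row * cols + col];
    }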

Articles on memory coalescing in CUDA:

Array Stride Optimizations in CUDA

Striding through an array (e.g., the "grid-stride loop" pattern) is an important access pattern for CUDA kernels. One of its advantages is achieving coalesced memory accesses.
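
A minimal sketch of the grid-stride loop pattern (using a made-up SAXPY-style kernel):

    // Sketch of a grid-stride loop: each thread strides through the array by
    // the total number of launched threads, so any grid size covers any n,
    // and neighboring threads still touch neighboring elements (coalesced).
    #include <cuda_runtime.h>

    __global__ void saxpy_grid_stride(float a, const float *x, float *y, int n)
    {
        int stride = gridDim.x * blockDim.x;   // total threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    // Launch with any reasonable grid size, e.g.:
    //   saxpy_grid_stride<<<256, 256>>>(a, d_x, d_y, n);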

Articles and papers on array stride optimizations:

CUDA Kernel Fusion

Kernel fusion is the merging of two kernels into one. Typically, if two kernels work sequentially on the same data, with the output of the first kernel feeding into the second, it can be more efficient to combine them into a single kernel, saving a kernel launch and a round trip of the intermediate results through global memory.
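
A small illustrative sketch (hypothetical elementwise kernels) of two separate kernels and their fused equivalent:

    // Sketch of kernel fusion: two elementwise kernels run back-to-back are
    // merged into one, saving a kernel launch and keeping the intermediate
    // value in a register instead of global memory. Illustrative only.
    #include <cuda_runtime.h>

    __global__ void scale_pass(const float *x, float *tmp, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = x[i] * s;          // writes intermediate to global memory
    }

    __global__ void add_pass(const float *tmp, float *y, int n, float b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = tmp[i] + b;          // reads the intermediate back again
    }

    // Fused version: no tmp array, no second launch.
    __global__ void scale_add_fused(const float *x, float *y, int n, float s, float b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = x[i] * s + b;        // intermediate stays in a register
    }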

Papers on kernel fusion in CUDA include:

CUDA Security Issues

Research on CUDA security issues, such as buffer overflow exploits:

CUDA Examples

More CUDA examples:

General Research on CUDA

Articles and papers on CUDA programming:

CUDA C++ Optimization Book



The new CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization


CUDA C++ Debugging Book



The new CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



The Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: the generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++




More AI Research

Read more about: