Aussie AI
CUDA C++ Optimization Research
-
Last Updated 10 December, 2024
-
by David Spuler, Ph.D.
CUDA C++ Blog Articles
See also these Aussie AI blog articles:
- CUDA C++ Floating Point Exceptions
- CUDA C++ Optimization (New Book)
- Debugging CUDA C++ Kernels (New Book)
- CUDA Memory Coalescing Optimizations
- CUDA GPU Thread Divergence
- CUDA Basic C++ Programming Mistakes
CUDA Introductory Articles
Articles and tutorials for CUDA include:
- NVIDIA, 2024, CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- Mark Harris, Jan 25, 2017, An Even Easier Introduction to CUDA, https://developer.nvidia.com/blog/even-easier-introduction-cuda/
- Mark Harris, Oct 31, 2012, An Easy Introduction to CUDA C and C++, https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/
- Umangshrestha, Jul 2, 2021, Introduction to CUDA, https://medium.com/geekculture/introduction-to-cuda-7bf6909ea57c
- Read The Docs, Sep 2024, Tutorial 01: Say Hello to CUDA, https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
- Geeks for Geeks, 14 Mar, 2023, Introduction to CUDA Programming, https://www.geeksforgeeks.org/introduction-to-cuda-programming/
- Saurabh Kumar Singh, April 25, 2020, Introduction to CUDA: The Beginner's Guide. What is CUDA and CUDA architecture?, https://imsrbh.github.io/IntroToCUDA
- NVIDIA, Sep 2024, GPU Accelerated Computing with C and C++, https://developer.nvidia.com/how-to-cuda-c-cpp
- Martin Heller, Sep 16, 2022, What is CUDA? Parallel programming for GPUs, InfoWorld, https://www.infoworld.com/article/2256401/what-is-cuda-parallel-programming-for-gpus.html
- Stephen Jones, 2024, How To Write A CUDA Program: The Ninja Edition, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62401/ https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/How-To-Write-A-CUDA-Program.pdf
- NVIDIA, 2024, CUDA Code Samples, https://developer.nvidia.com/cuda-code-samples
- Prof. Mike Giles, 2024, Course on CUDA Programming on NVIDIA GPUs, Oxford University Mathematical Institute, UK, https://people.maths.ox.ac.uk/gilesm/cuda/
- Paul Norvig, January 12, 2024, GPU Programming with CUDA C++ (2024): Short guide to GPU programming with CUDA C++: coding, performance optimization, latest features, https://www.paulnorvig.com/guides/gpu-programming-with-cuda-c.html
- Cyril Zeller, 2011, CUDA C/C++ Basics : Supercomputing 2011 Tutorial, https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
- Daniel Warfield, Jun 15, 2024, CUDA for AI — Intuitively and Exhaustively Explained: Parallelized AI from scratch, https://towardsdatascience.com/cuda-for-ai-intuitively-and-exhaustively-explained-6ba6cb4406c5
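Most of the introductions above open with a kernel of roughly this shape: a vector add where each thread computes one output element. This is a generic minimal sketch (not taken from any single article), using unified memory to keep it short:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; the bounds check guards the
// final partial block when n is not a multiple of the block size.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the example compact; production code may
    // prefer explicit cudaMalloc/cudaMemcpy for control over transfers.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;  // round up to cover all n elements
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```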
CUDA Programming Books
- Brett Neutreon, 2024, Mastering CUDA C++ Programming: A Comprehensive Guidebook, https://www.overdrive.com/media/10749608/mastering-cuda-c-programming https://books.apple.com/us/book/mastering-cuda-c-programming-a-comprehensive-guidebook/id6502452953
- William Smith, 4 August 2024, CUDA Programming with C++: From Basics to Expert Proficiency, https://www.amazon.com/dp/1964899451
- Jason Sanders, Edward Kandrot, 19 July 2010, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 1st edition, https://www.amazon.com/CUDA-Example-Introduction-General-Purpose-Programming/dp/0131387685/
- Jordan P. Syntax, 30 May 2024, The CUDA C++ Programming Beginner's Guide: Unlock the Potential of GPU Computing with a Step-by-Step Explanation and Real-World Applications, https://www.amazon.com/Programming-Beginners-Guide-Step-Step/dp/B0D5MKQ3Q3/
- John Cheng, Max Grossman, Ty McKercher, 29 August 2014, Professional CUDA C Programming, Wrox, https://www.amazon.com/Professional-Cuda-Programming-John-Cheng/dp/1118739329/
- Greg Ruetsch, Massimiliano Fatica, 26 July 2024, CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming, Morgan Kaufmann, 2nd edition, https://www.amazon.com/CUDA-Fortran-Scientists-Engineers-Programming/dp/044321977X/
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
CUDA Optimization Techniques
Articles and research on CUDA performance optimization techniques:
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, Dongrui Fan, 16 Aug 2024, Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels, https://arxiv.org/abs/2408.08490 (Improving CUDA kernel performance by reducing small memory-bound kernels.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374 http://kerneltuner.github.io/
- Alexander Brandt, Davood Mohajerani, Marc Moreno Maza, Jeeva Paudel, Linxiao Wang, 5 Nov 2019, KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs, https://arxiv.org/abs/1911.02373
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- Stefan K. Muller and Jan Hoffmann. 2024. Modeling and Analyzing Evaluation Cost of CUDA Kernels. ACM Trans. Parallel Comput. 11, 1, Article 5 (March 2024), 53 pages. https://doi.org/10.1145/3639403 https://dl.acm.org/doi/full/10.1145/3639403 PDF: https://dl.acm.org/doi/pdf/10.1145/3639403
- Mark Lou and Stefan K. Muller. 2024. Automatic Static Analysis-Guided Optimization of CUDA Kernels. In Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '24). Association for Computing Machinery, New York, NY, USA, 11–21. https://doi.org/10.1145/3649169.3649249 https://dl.acm.org/doi/abs/10.1145/3649169.3649249 PDF: https://dl.acm.org/doi/pdf/10.1145/3649169.3649249
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Tianyi David Han and Tarek S. Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4). Association for Computing Machinery, New York, NY, USA, Article 3, 1–8. https://doi.org/10.1145/1964179.1964184 https://dl.acm.org/doi/10.1145/1964179.1964184
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
- Peng Wang, 2010, Fundamental Optimizations in CUDA, https://developer.download.nvidia.com/GTC/PDF/1083_Wang.pdf
- Nikolay Sakharnykh, Nov 19, 2017, Maximizing Unified Memory Performance in CUDA, https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
- Johannes Pekkilä, Oskar Lappi, Fredrik Robertsén, Maarit J. Korpi-Lagg, 13 Jun 2024, Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies, https://arxiv.org/abs/2406.08923
- A. Jangda, S. Maleki, M. M. Dehnavi, M. Musuvathi and O. Saarikivi, "A Framework for Fine-Grained Synchronization of Dependent GPU Kernels," 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Edinburgh, United Kingdom, 2024, pp. 93-105, doi: 10.1109/CGO57630.2024.10444873. https://ieeexplore.ieee.org/abstract/document/10444873
- V. Geraeinejad, Q. Qian and M. Ebrahimi, "Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 14, no. 3, pp. 469-482, Sept. 2024, doi: 10.1109/JETCAS.2024.3439193. https://ieeexplore.ieee.org/abstract/document/10623472
- Mark Harris, Apr 22, 2013, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops, https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- NVIDIA, Module 14 – Efficient Host-Device Data Transfer: Lecture 14.1 - Pinned Host Memory, GPU Teaching Kit, https://people.csail.mit.edu/xchen/gpu-programming/Lecture14-stream.pdf
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Athena Elafrou, March 2024, Introduction to CUDA Programming and Performance Optimization, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62191/
- NVIDIA, 2024, Performance Optimization Conference Sessions, https://www.nvidia.com/gtc/sessions/performance-optimization/
- Michelle Horton, Sep 11, 2024, Advanced Strategies for High-Performance GPU Programming with NVIDIA CUDA, https://developer.nvidia.com/blog/advanced-strategies-for-high-performance-gpu-programming-with-nvidia-cuda/
- Michelle Horton, Aug 29, 2024, Boosting CUDA Efficiency with Essential Techniques for New Developers, https://developer.nvidia.com/blog/boosting-cuda-efficiency-with-essential-techniques-for-new-developers/
- Thejaswi Rao and Mark Harris, Aug 20, 2019, CUDA Pro Tip: The Fast Way to Query Device Properties, https://developer.nvidia.com/blog/cuda-pro-tip-the-fast-way-to-query-device-properties/
- Julien Demouth, Jun 04, 2014, CUDA Pro Tip: Minimize the Tail Effect, https://developer.nvidia.com/blog/cuda-pro-tip-minimize-the-tail-effect/
- Andy Adinets, Oct 01, 2014, CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics, https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Mark Harris, Jun 29, 2015, GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++, https://developer.nvidia.com/blog/fast-great-circle-distance-calculation-cuda-c/
- Elmar Westphal, Aug 06, 2015, Voting and Shuffling to Optimize Atomic Operations, https://developer.nvidia.com/blog/voting-and-shuffling-optimize-atomic-operations/
- Mark Harris, Jun 10, 2015, GPU Pro Tip: Lerp Faster in C++, https://developer.nvidia.com/blog/lerp-faster-cuda/
- Maxim Milakov, Feb 11, 2015, GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA, https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/
- Jeremy Appleyard, Aug 07, 2014, CUDA Pro Tip: Optimize for Pointer Aliasing, https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/ (Use of const and restrict pointers in CUDA kernels can improve speed.)
- Mark Harris, Jul 17, 2014, CUDA Pro Tip: Occupancy API Simplifies Launch Configuration, https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
- Mark Harris, Feb 03, 2014, CUDA Pro Tip: Do The Kepler Shuffle, https://developer.nvidia.com/blog/cuda-pro-tip-kepler-shuffle/
- Yuan Lin and Vinod Grover, Jan 15, 2018, Using CUDA Warp-Level Primitives, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
- Prof. Mike Giles, 2024, Lecture 4: warp shuffles, and reduction / scan operations, Lecture 4– p. 1/38, Oxford University Mathematical Institute, https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf
- Dhanush, Aug 23, 2024, Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques, https://github.com/Dhanush295/Cuda_program/blob/main/Vector_dot_product.cu
- Emre Avci, Dot Product in CUDA C++ Which Might Outperform cuBLAS, https://emre-avci.medium.com/dot-product-in-cuda-c-which-might-outperform-cublas-t-dot-732047aa5ec5
- Bialas, P., Strzelecki, A. (2016). Benchmarking the Cost of Thread Divergence in CUDA. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2015. Lecture Notes in Computer Science, vol 9573. Springer, Cham. https://doi.org/10.1007/978-3-319-32149-3_53 https://link.springer.com/chapter/10.1007/978-3-319-32149-3_53 https://arxiv.org/abs/1504.01650 PDF: https://arxiv.org/pdf/1504.01650
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- NVIDIA, Sep 2024, Integer Mathematical Functions, https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INT.html https://docs.nvidia.com/cuda/pdf/CUDA_Math_API.pdf
- NVIDIA, Sep 2024, CUDA Math API Reference Manual, https://docs.nvidia.com/cuda/cuda-math-api/index.html
- Paul Norvig, January 12, 2024, GPU Programming with CUDA C++ (2024): Short guide to GPU programming with CUDA C++: coding, performance optimization, latest features, https://www.paulnorvig.com/guides/gpu-programming-with-cuda-c.html
- Christopher Cooper, August, 2011, GPU Computing with CUDA: Lecture 4 - Optimizations, Boston University https://www.bu.edu/pasi/files/2011/07/Lecture4.pdf
- Mark Harris, Dec 04, 2012, How to Optimize Data Transfers in CUDA C/C++, https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
- Mark Harris, Dec 13, 2012, How to Overlap Data Transfers in CUDA C/C++, https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
- Mark Harris, Jul 02, 2012, Six Ways to SAXPY, https://developer.nvidia.com/blog/six-ways-saxpy/
- Mark Harris, Optimizing Parallel Reduction in CUDA, https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- NVIDIA, Aug 2024, CUDA C++ Programming Guide, Release 12.6, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
- Daniel Warfield, Jun 15, 2024, CUDA for AI — Intuitively and Exhaustively Explained: Parallelized AI from scratch, https://towardsdatascience.com/cuda-for-ai-intuitively-and-exhaustively-explained-6ba6cb4406c5
- Arthy Sundaram, Jaydeep Marathe, Mike Murphy, Nikhil Gupta, Xiaohua Zhang and Thibaut Lutz, Apr 15, 2021, Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain, https://developer.nvidia.com/blog/programming-efficiently-with-the-cuda-11-3-compiler-toolchain/ (Includes focus on using alloca for stack memory allocation versus malloc on the heap.)
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Rob Van der Wijngaart and Fred Oh, Aug 08, 2024, Improving GPU Performance by Reducing Instruction Cache Misses, https://developer.nvidia.com/blog/improving-gpu-performance-by-reducing-instruction-cache-misses-2/ (Loop unrolling may be detrimental to instruction locality, and the compiler's auto-unrolling is not necessarily the best.)
- Alan Gray, Aug 07, 2024, Optimizing llama.cpp AI Inference with CUDA Graphs, https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/
- Christian Mills, September 11, 2024, GPU MODE Lecture 8: CUDA Performance Checklist, https://christianjmills.com/posts/cuda-mode-notes/lecture-008/
- Maxim Milakov, Natalia Gimelshein, 28 Jul 2018 (v2), Online normalizer calculation for softmax, https://arxiv.org/abs/1805.02867
- Thomas Mejstrik, 9 Aug 2023, __host__ __device__ -- Generic programming in Cuda, Association for Computing Machinery, https://arxiv.org/abs/2309.03912
- Gianluca Brilli, Paolo Burgio, 7 Oct 2023, Interference analysis of shared last-level cache on embedded GP-GPUs with multiple CUDA streams, https://arxiv.org/abs/2310.04848
- Tony Scudiero and Mike Murphy, Apr 22, 2014, Separate Compilation and Linking of CUDA C++ Device Code, https://developer.nvidia.com/blog/separate-compilation-linking-cuda-device-code/
- Samuel S. Cho, 2011, CUDA Thread Basics, https://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
- Andy Adinets, May 20, 2014, CUDA Dynamic Parallelism API and Principles, https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
- Ellery Russell, Jiqun Tu, March 2024, Accelerating Drug Discovery: Optimizing Dynamic GPU Workflows with CUDA Graphs, Mapped Memory, C++ Coroutines, and More, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61156/
- Mark Harris, March 2024, More Data, Faster: GPU Memory Management Best Practices in Python and C++, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62550/
- Jack Kosaian, Vijay Thakkar, March 2024, CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61198/
- Anastasia Stulova, Jeff Larkin, March 2024, No More Porting: Accelerated Computing With Standard C++, Fortran, and Python, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61204/
- Igor Terentyev, March 2024, Advanced Performance Optimization in CUDA, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62192/
- Jiri Kraus, March 2024, Multi GPU Programming Models for HPC and AI, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61339/
- David Olsen, Graham Lopez and Bryce Adelstein Lelbach, Aug 04, 2020, Accelerating Standard C++ with GPUs Using stdpar, https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
- Mark Harris, Jan 25, 2017, An Even Easier Introduction to CUDA, https://developer.nvidia.com/blog/even-easier-introduction-cuda/
- Deepak Unnikrishnan and Fred Oh, Jun 05, 2023, CUDA 12.1 Supports Large Kernel Parameters, https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
- Paul Delestrac. 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications. Ph.D. Thesis, Micro and nanotechnologies/Microelectronics. Université de Montpellier, 2024. English. NNT: 2024UMONS014 https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
- Vasily Volkov, August 12, 2016, Understanding Latency Hiding on GPUs, Ph.D. Thesis, Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2016-143, http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Wayne Gaudin, Scott McMillan, Pat Brooks and Akhil Docca, Nov 16, 2020, Building and Deploying HPC Applications using NVIDIA HPC SDK from the NVIDIA NGC Catalog, NVIDIA Technical Blog, https://developer.nvidia.com/blog/building-and-deploying-hpc-applications-using-hpc-sdk-from-ngc-catalog/
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
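Two techniques from the references above (grid-stride loops, per Harris; warp shuffles, per Lin and Grover and per Giles's lecture) frequently combine in reduction kernels. A hedged sketch of a sum reduction, not drawn from any single article:

```cuda
#include <cuda_runtime.h>

// Sum reduction combining a grid-stride loop (so any grid size covers
// any n) with a warp-shuffle reduction and one atomicAdd per warp.
__global__ void sumReduce(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates a strided slice of in[].
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        v += in[i];
    }
    // Reduce within the warp using register shuffles; no shared
    // memory or __syncthreads() is needed at this stage.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    // Lane 0 of each warp now holds that warp's partial sum.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}
```

Warp-aggregated atomics like this trade a small amount of atomic contention for a much simpler kernel than a full two-pass block reduction; the references above cover the more elaborate variants.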
CUDA Profiling and Timing
Articles on using profiler tools for timing the efficiency of CUDA kernels:
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- R. Lim, B. Norris and A. Malony, "Autotuning GPU Kernels via Static and Predictive Analysis," 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 523-532, doi: 10.1109/ICPP.2017.61. https://ieeexplore.ieee.org/abstract/document/8025326 https://arxiv.org/abs/1701.08547
- Sébastien Darche and Michel R. Dagenais. 2024. Low-Overhead Trace Collection and Profiling on GPU Compute Kernels. ACM Trans. Parallel Comput. 11, 2, Article 9 (June 2024), 24 pages. https://doi.org/10.1145/3649510 https://dl.acm.org/doi/abs/10.1145/3649510
- Nikolay Sakharnykh, Nov 19, 2017, Maximizing Unified Memory Performance in CUDA, https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
- Jackson Marusarz, 2023, CUDA Tutorials I. Profiling and Debugging Applications, https://www.youtube.com/watch?v=dB5Jxwj0PDw
- NVIDIA, 2024, Profiler User’s Guide, https://docs.nvidia.com/cuda/profiler-users-guide/index.html
- Mark Harris, May 28, 2013, CUDA Pro Tip: Clean Up After Yourself to Ensure Correct Profiling, https://developer.nvidia.com/blog/pro-tip-clean-up-after-yourself-ensure-correct-profiling/
- Mark Harris, Oct 23, 2013, CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler, https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
- Mark Harris, Nov 07, 2012, How to Implement Performance Metrics in CUDA C/C++, https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
- NVIDIA, Sep 2024 (accessed), nvbench: CUDA Kernel Benchmarking Library, https://github.com/NVIDIA/nvbench
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- NVIDIA, Sep 2024 (accessed), Nsight Compute CLI, https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
- Paul Delestrac. 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications. Ph.D. Thesis, Micro and nanotechnologies/Microelectronics. Université de Montpellier, 2024. English. NNT: 2024UMONS014 https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
- Greg Gutmann, Sep 2020, Running CUDA Profilers on Linux: GPU Optimizations, https://codingbyexample.com/2020/09/25/cuda-profiling-linux/
- Wang Peng, Qi Kaiyuan, Yu Zhibin, Su Guangfeng, Liu Peng, Dec 2024, MICPAT: Micro-architecture Independent Characteristics Profiling Analysis Tool for GPU Programs, https://doi.org/10.21203/rs.3.rs-5430086/v1 https://www.researchsquare.com/article/rs-5430086/v1 https://zenodo.org/records/13623324
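Besides the profiler tools above, kernels are commonly timed in code with CUDA events (the approach described in Harris's performance-metrics post). A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyWork(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded into the same stream as the kernel, so the
    // interval measures GPU execution, not host-side launch overhead.
    cudaEventRecord(start);
    busyWork<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until stop has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```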
CUDA Memory Optimization
Articles and papers on CUDA memory optimization techniques:
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
- Nikolay Sakharnykh, Nov 19, 2017, Maximizing Unified Memory Performance in CUDA, https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
- NVIDIA, Module 14 – Efficient Host-Device Data Transfer: Lecture 14.1 - Pinned Host Memory, GPU Teaching Kit, https://people.csail.mit.edu/xchen/gpu-programming/Lecture14-stream.pdf
- Dhanush, Aug 23, 2024, Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques, https://github.com/Dhanush295/Cuda_program/blob/main/Vector_dot_product.cu
- Lei Mao, June 22, 2022, CUDA Shared Memory Bank, https://leimao.github.io/blog/CUDA-Shared-Memory-Bank/
- Mark Harris, Dec 04, 2012, How to Optimize Data Transfers in CUDA C/C++, https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
- Mark Harris, Jun 19, 2017, Unified Memory for CUDA Beginners, https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
- Tutorials Point, Sep 2024, CUDA - Reducing Global Memory Traffic, https://www.tutorialspoint.com/cuda/cuda_reducing_global_memory_traffic.htm
- Arthy Sundaram, Jaydeep Marathe, Mike Murphy, Nikhil Gupta, Xiaohua Zhang and Thibaut Lutz, Apr 15, 2021, Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain, https://developer.nvidia.com/blog/programming-efficiently-with-the-cuda-11-3-compiler-toolchain/ (Includes focus on using alloca for stack memory allocation versus malloc on the heap.)
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Mark Harris, March 2024, More Data, Faster: GPU Memory Management Best Practices in Python and C++, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62550/
- Deepak Unnikrishnan and Fred Oh, Jun 05, 2023, CUDA 12.1 Supports Large Kernel Parameters, https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
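As a concrete illustration of the host-device transfer advice in the articles above (pinned host memory and asynchronous copies, per Harris's data-transfer posts and the GPU Teaching Kit lecture), a hedged sketch of queueing a transfer on a stream:

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h_pinned, *d_buf;

    // Pinned (page-locked) host memory enables true asynchronous DMA;
    // with pageable memory, cudaMemcpyAsync falls back to a staged,
    // effectively synchronous copy.
    cudaMallocHost(&h_pinned, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));
    for (int i = 0; i < n; i++) h_pinned[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The copy is queued on the stream and returns immediately, so the
    // host can do independent work while the DMA engine runs.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... independent host-side work could overlap here ...

    cudaStreamSynchronize(stream);  // wait for the transfer to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```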
CUDA C++ Matrix Multiplication (MatMul/GEMM)
Articles on coding MatMul/GEMM in CUDA C++:
- Samuel S. Cho, 2011, CUDA Thread Basics, https://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSPARSE, https://docs.nvidia.com/cuda/cusparse/index.html
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, Dec 05, 2017, CUTLASS: Fast Linear Algebra in CUDA C++, https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Rajib Nath, Stanimire Tomov, and Jack Dongarra, November 18, 2010, An Improved Magma Gemm For Fermi Graphics Processing Units, The International Journal of High Performance Computing Applications, Volume 24, Issue 4, https://doi.org/10.1177/1094342010385729 https://journals.sagepub.com/doi/10.1177/1094342010385729
- Fengguang Song, Stanimire Tomov, Jack Dongarra, 2012, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS’12, June 25–29, 2012, San Servolo Island, Venice, Italy. https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf
- Wang Zhiyong, 2022, NVIDIA SGEMM Practice: Step-by-step optimization of CUDA SGEMM, https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE
- Siboehm, 2023, Fast CUDA SGEMM from Scratch, https://github.com/siboehm/SGEMM_CUDA
- Edward Kandrot, 2023, cuda_matmul: Optimized CUDA matmul with benchmarks, https://github.com/ekandrot/cuda_matmul
- Harshit Kumar, June 7, 2024, Matrix Multiplication in CUDA, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
CUDA Debugging Techniques
Articles on CUDA debugging tools and techniques:
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- M. Wu, Y. Ouyang, H. Zhou, L. Zhang, C. Liu and Y. Zhang, "Simulee: Detecting CUDA Synchronization Bugs via Memory-Access Modeling," 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea (South), 2020, pp. 937-948, doi: 10.1145/3377811.3380358. https://ieeexplore.ieee.org/document/9284094 (Simulation tool to detect CUDA bugs by interpreting the LLVM byte code.)
- Pengcheng Li, Chen Ding, Xiaoyu Hu, Tolga Soyata, 2014, LDetector: A Low Overhead Race Detector For GPU Programs, https://wodet.cs.washington.edu/wp-content/uploads/2014/02/wodet2014-final14.pdf
- Mohamed Tarek Ibn Ziad, Sana Damani, Aamer Jaleel, Stephen W. Keckler, and Mark Stephenson. 2023. CuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications. Proc. ACM Program. Lang. 7, PLDI, Article 111 (June 2023), 24 pages. https://doi.org/10.1145/3591225 https://dl.acm.org/doi/abs/10.1145/3591225 PDF: https://dl.acm.org/doi/pdf/10.1145/3591225
- Helder J. F. Luz, Paulo S. L. Souza, Simone R. S. Souza, 9 April 2024, Structural testing for CUDA programming model, https://doi.org/10.1002/cpe.8105 https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.8105
- S. Hong, H. Sun, X. Gao and S. H. Tan, "Investigating and Detecting Silent Bugs in PyTorch Programs," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 272-283, doi: 10.1109/SANER60148.2024.00035. https://ieeexplore.ieee.org/abstract/document/10589839 PDF: https://gaoxiang9430.github.io/papers/saner24a.pdf
- Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol, 1 Sep 2023 (v2), Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow, https://arxiv.org/abs/2112.13314
- M Boyer, K Skadron, W Weimer, 2008, Automated dynamic analysis of CUDA programs, Third Workshop on Software Tools, https://www.nvidia.com/docs/io/67190/stmcs08.pdf
- T Lloyd, K Ali, JN Amaral, 2019, Gpucheck: Detecting cuda thread divergence with static analysis, https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908 PDF: https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908/view/88ccced9-537e-47de-b62f-7eef434e73b6/TR19-01
- Yanan Guo, Zhenkai Zhang, Jun Yang, August 2024, GPU Memory Exploitation for Fun and Profit, Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024 • Philadelphia, PA, USA, 978-1-939133-44-1, https://www.usenix.org/conference/usenixsecurity24/presentation/guo-yanan https://www.usenix.org/system/files/usenixsecurity24-guo-yanan.pdf
- Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. https://dl.acm.org/doi/10.5555/3049832.3049840 https://computermachines.org/joe/publications/pdfs/cgo2017_clarmor.pdf
- Bob Crovella, Sep 14, 2021, CUDA Debugging, https://leimao.github.io/downloads/blog/2022-05-25-Proper-CUDA-Error-Checking/cuda_training_series_cuda_debugging.pdf
- Jackson Marusarz, 2023, CUDA Tutorials I. Profiling and Debugging Applications, https://www.youtube.com/watch?v=dB5Jxwj0PDw
- NVIDIA, Sep 2024, Compute Sanitizer, https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html
- Aurelien Chartier, Steve Ulrich, 2023, Debugging CUDA: An Overview of CUDA Correctness Tools, https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51772/
- NVIDIA, 2024, Debugger API, https://docs.nvidia.com/cuda/debugger-api/index.html
- NVIDIA, 2024, CUDA-GDB, https://docs.nvidia.com/cuda/cuda-gdb/index.html
- Greg Ruetsch, Nov 16, 2017, Pro Tip: Pinpointing Runtime Errors in CUDA Fortran, https://developer.nvidia.com/blog/pinpointing-runtime-errors-cuda-fortran/
- Paul Graham and Mozhgan Kabiri Chimeh, Jun 29, 2023, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/
- Lei Mao, Dec 15, 2023, Proper CUDA Error Checking, https://leimao.github.io/blog/Proper-CUDA-Error-Checking/
- Thomas M. Baumann, Jose Gracia, 3 Oct 2013, Cudagrind: A Valgrind Extension for CUDA, https://arxiv.org/abs/1310.0901 https://github.com/dpc-grindland/Cudagrind (Valgrind Memcheck for CUDA C++, but over 10 years old)
- Andy Adinets, May 20, 2014, CUDA Dynamic Parallelism API and Principles, https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
- CudaText add-ons, 2021 (approx), CudaText plugin: CudaLint, https://github.com/CudaText-addons/cuda_lint
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- Felipe R. Monteiro, Erickson H. da S. Alves, Isabela S. Silva, Hussama I. Ismail, Lucas C. Cordeiro, and Eddie B. de Lima Filho. 2018. ESBMC-GPU A context-bounded model checking tool to verify CUDA programs. Sci. Comput. Program. 152, C (January 2018), 63–69. https://doi.org/10.1016/j.scico.2017.09.005 https://dl.acm.org/doi/10.1016/j.scico.2017.09.005 https://pure.manchester.ac.uk/ws/files/77048933/jscp2017.pdf
- Chao Peng, 2021, Automated testing for GPU kernels, Doctor of Philosophy, Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, https://era.ed.ac.uk/handle/1842/38563 https://era.ed.ac.uk/bitstream/handle/1842/38563/Peng2021.pdf?sequence=1&isAllowed=y
- Bo Jiang, Xiaoyan Wang, W.K. Chan, T.H. Tse, Na Li and Yongfeng Yin, 2020, "CUDAsmith: A Fuzzer for CUDA Compilers," 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 2020, pp. 861-871, doi: 10.1109/COMPSAC48688.2020.0-156. https://ieeexplore.ieee.org/abstract/document/9202798 https://github.com/gongbell/CUDAsmith https://www.cs.hku.hk/data/techreps/document/TR-2020-05.pdf
- Geoff Gerfin and Vyas Venkataraman, Oct 2024, Debugging Experience with CUDA-GDB and CUDA-MEMCHECK, GPU Conference, https://picture.iczhiku.com/resource/paper/WhiSzdKjTtFhhNnm.pdf
- A Hück, T Ziegler, S Schwitanski, J Jenke, C Bischof, Nov 2024, Compiler-Aided Correctness Checking of CUDA-Aware MPI Applications, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a204/555400a204.pdf
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
CUDA Portability
Articles and papers on the portability of CUDA code:
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374 http://kerneltuner.github.io/
- Manuel Costanzo, Enzo Rucci, Carlos García Sánchez, Marcelo Naiouf, Manuel Prieto-Matías, 10 Nov 2023 (v2), Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs, https://arxiv.org/abs/2309.09609
- Ruobing Han, Jun Chen, Bhanu Garg, Xule Zhou, John Lu, Jeffrey Young, Jaewoong Sim, and Hyesoon Kim. 2024. CuPBoP: Making CUDA a Portable Language. ACM Trans. Des. Autom. Electron. Syst. 29, 4, Article 60 (July 2024), 25 pages. https://doi.org/10.1145/3659949 https://dl.acm.org/doi/full/10.1145/3659949
- Victor Anderssen, April 23, 2024, Converting CUDA programs to run on AMD GPUs, Master's Thesis, Department of Information Technologies, Faculty of Science and Engineering, Åbo Akademi University, Vasa, Finland, https://www.doria.fi/bitstream/handle/10024/188907/anderss%C3%A9n_victor.pdf?sequence=2
- Ziran Zhang, Zhiming Wang, Chenwei Sun, and Andy Huang. 2024. Smoothing the Migration from CUDA to SYCL: SYCLomatic Utility Features. In Proceedings of the 12th International Workshop on OpenCL and SYCL (IWOCL '24). Association for Computing Machinery, New York, NY, USA, Article 14, 1–2. https://doi.org/10.1145/3648115.3648132 https://dl.acm.org/doi/abs/10.1145/3648115.3648132
- Mark Harris, Sep 21, 2015, Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5, https://developer.nvidia.com/blog/simple-portable-parallel-c-hemi-2/
- Ben Funk, February 27, 2021, Yes, You Can Run NVIDIA CUDA On Intel GPUs And Libraries For It Have Hit Github, https://hothardware.com/news/cuda-on-intel-gpus-zluda (Article about ZLUDA, an open-source project; its GitHub page says it was discontinued due to legal issues, though more recent articles suggest it has since been revived, so its status is unclear.)
- Thejaswi Rao and Mark Harris, Aug 20, 2019, CUDA Pro Tip: The Fast Way to Query Device Properties, https://developer.nvidia.com/blog/cuda-pro-tip-the-fast-way-to-query-device-properties/
- Mark Harris, Jan 27, 2014, CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES, https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
- Akhil Langer, Seth Howell, Arnav Goel, Pak Markthub, Harry Petty and Fred Oh, Sep 06, 2024, Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/
- Ruobing Han, Jaewon Lee, Jaewoong Sim, Hyesoon Kim, 19 Dec 2021, COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs, https://arxiv.org/abs/2112.10034
- Manuel Costanzo, Enzo Rucci, Carlos Garcia Sanchez, Marcelo Naiouf, Manuel Prieto-Matias, 20 Jun 2022 (v2), Migrating CUDA to oneAPI: A Smith-Waterman Case Study, https://arxiv.org/abs/2203.11100
- Thomas Mejstrik, 9 Aug 2023, __host__ __device__ -- Generic programming in Cuda, Association for Computing Machinery, https://arxiv.org/abs/2309.03912
- Anton Shilov, March 5, 2024, Nvidia bans using translation layers for CUDA software — previously the prohibition was only listed in the online EULA, now included in installed files [Updated], https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers
- NVIDIA, Sep 2024 (accessed), deviceQuery.cpp, https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp (CUDA sample showing how to get device versions and properties in CUDA C++ using APIs such as cuDeviceGetAttribute, cudaGetDeviceCount, cudaGetDeviceProperties, cudaDriverGetVersion, cudaRuntimeGetVersion.)
- Jack Kosaian, Vijay Thakkar, March 2024, CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61198/
- Anastasia Stulova, Jeff Larkin, March 2024, No More Porting: Accelerated Computing With Standard C++, Fortran, and Python, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61204/
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- ENCCS, Dec 2024 (accessed), Preparing code for GPU porting, https://enccs.github.io/gpu-programming/11-gpu-porting/
- Prerit Kapadia, eInfoChips, December 26, 2023, Porting Algorithms on GPU, https://www.einfochips.com/blog/porting-algorithms-on-gpu/
CUDA Compatibility and Versions
Articles about version compatibility and CUDA:
- NVIDIA, 2024, CUDA Compatibility, https://docs.nvidia.com/deploy/cuda-compatibility/index.html https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf
- Mark Harris, Jan 27, 2014, CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES, https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
- Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda, 2024, Analyzing the impact of CUDA versions on GPU applications, Parallel Computing, Volume 120, 103081, ISSN 0167-8191, https://doi.org/10.1016/j.parco.2024.103081 https://www.sciencedirect.com/science/article/pii/S016781912400019X
- Akhil Langer, Seth Howell, Arnav Goel, Pak Markthub, Harry Petty and Fred Oh, Sep 06, 2024, Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/
- NVIDIA, Sep 2024 (accessed), deviceQuery.cpp, https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp (CUDA sample showing how to get device versions and properties in CUDA C++ using APIs such as cuDeviceGetAttribute, cudaGetDeviceCount, cudaGetDeviceProperties, cudaDriverGetVersion, cudaRuntimeGetVersion.)
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
CUDA Emulator
Can you run CUDA without actually having a real GPU, on just a CPU? This would be useful for educational purposes, and also for programmers working at home on their laptops.
What about a fully functional CUDA emulator? There are rumors that such a beast existed around 2010 (almost 15 years ago), and was even supported by NVIDIA, but it doesn't seem to exist any more.
Here are some articles on CUDA emulation:
- NVIDIA, libcu++: The C++ Standard Library for your entire system, https://github.com/NVIDIA/libcudacxx
- jrhemstad, Oct 5, 2023, Unifying the CUDA C++ Core Libraries: Towards a More Delightful CUDA C++ #520, NVIDIA, https://github.com/NVIDIA/cccl/discussions/520
- btarunr, Apr 2nd, 2010, NVIDIA CUDA Emulator for every PC, https://www.techpowerup.com/119073/nvidia-cuda-emulator-for-every-pc (Old 2010 article.)
- CASL, 2010, GPUOCelot: A dynamic compilation framework for PTX, https://github.com/gtcasl/gpuocelot (No longer maintained; the last commit was about 9 years ago.)
- NVIDIA Forums, 2010, CUDA Emulator, https://forums.developer.nvidia.com/t/cuda-emulator/15480 (This is from 2010.)
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
CUDA Static Analysis
Research on static analysis (code checking or auto-tuning) of CUDA programs:
- S. Lagouvardos, J. Dolby, N. Grech, A. Antoniadis, and Y. Smaragdakis, 2020, “Static analysis of shape in Tensorflow programs,” in 34th European Conference on Object-Oriented Programming (ECOOP 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020. https://drops.dagstuhl.de/entities/document/10.4230/DARTS.6.2.6 PDF: https://drops.dagstuhl.de/storage/05darts/darts-vol006/darts-vol006-issue002_ecoop2020/DARTS.6.2.6/DARTS.6.2.6.pdf
- H. Y. Jhoo, S. Kim, W. Song, K. Park, D. Lee, and K. Yi, “A static analyzer for detecting tensor shape errors in deep neural network training code,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 337 338. https://arxiv.org/abs/2112.09037
- Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró. 2024. Using Run-Time Information to Enhance Static Analysis of Machine Learning Code in Notebooks. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE 2024). Association for Computing Machinery, New York, NY, USA, 497–501. https://doi.org/10.1145/3663529.3663785 https://dl.acm.org/doi/abs/10.1145/3663529.3663785 PDF: https://dl.acm.org/doi/pdf/10.1145/3663529.3663785
- T Lloyd, K Ali, JN Amaral, 2019, Gpucheck: Detecting cuda thread divergence with static analysis, https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908 PDF: https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908/view/88ccced9-537e-47de-b62f-7eef434e73b6/TR19-01
- Mark Lou and Stefan K. Muller. 2024. Automatic Static Analysis-Guided Optimization of CUDA Kernels. In Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '24). Association for Computing Machinery, New York, NY, USA, 11–21. https://doi.org/10.1145/3649169.3649249 https://dl.acm.org/doi/abs/10.1145/3649169.3649249 PDF: https://dl.acm.org/doi/pdf/10.1145/3649169.3649249
- R. Lim, B. Norris and A. Malony, "Autotuning GPU Kernels via Static and Predictive Analysis," 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 523-532, doi: 10.1109/ICPP.2017.61. https://ieeexplore.ieee.org/abstract/document/8025326 https://arxiv.org/abs/1701.08547
- Nimit Singhania 2018, Static Analysis for GPU Program Performance, Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, https://repository.upenn.edu/server/api/core/bitstreams/9d32ece9-5321-4d37-ab72-251e0ecc197e/content (Optimization of uncoalesced memory accesses, block sizes, and cache reuse.)
- CudaText add-ons, 2021 (approx), CudaText plugin: CudaLint, https://github.com/CudaText-addons/cuda_lint
- A Hück, T Ziegler, S Schwitanski, J Jenke, C Bischof, Nov 2024, Compiler-Aided Correctness Checking of CUDA-Aware MPI Applications, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a204/555400a204.pdf
CUDA Programming Bugs
Papers on the types of programming errors that can occur in CUDA kernels:
- Yanan Guo, Zhenkai Zhang, Jun Yang, August 2024, GPU Memory Exploitation for Fun and Profit, Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024 • Philadelphia, PA, USA, 978-1-939133-44-1, https://www.usenix.org/conference/usenixsecurity24/presentation/guo-yanan https://www.usenix.org/system/files/usenixsecurity24-guo-yanan.pdf
- Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. https://dl.acm.org/doi/10.5555/3049832.3049840 https://computermachines.org/joe/publications/pdfs/cgo2017_clarmor.pdf
- Bang Di, Jianhua Sun, and Hao Chen. 2016, A study of overflow vulnerabilities on GPUs. In Network and Parallel Computing: 13th IFIP WG 10.3 International Conference. https://link.springer.com/chapter/10.1007/978-3-319-47099-3_9 https://www.aimlab.org/haochen/papers/npc16-overflow.pdf
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- Justin Luitjens, Sep 04, 2014, CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs, https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
- Mark Harris, Apr 22, 2013, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops, https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- Greg Ruetsch, Nov 16, 2017, Pro Tip: Pinpointing Runtime Errors in CUDA Fortran, https://developer.nvidia.com/blog/pinpointing-runtime-errors-cuda-fortran/
- Paul Graham and Mozhgan Kabiri Chimeh, Jun 29, 2023, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/
- David Spuler, September 23, 2024, Basic CUDA C++ Programming Mistakes, Aussie AI Blog, https://www.aussieai.com/blog/cuda-basic-mistakes
- Thomas Mejstrik, 9 Aug 2023, __host__ __device__ -- Generic programming in Cuda, Association for Computing Machinery, https://arxiv.org/abs/2309.03912
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- Chao Peng, 2021, Automated testing for GPU kernels, Doctor of Philosophy, Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, https://era.ed.ac.uk/handle/1842/38563 https://era.ed.ac.uk/bitstream/handle/1842/38563/Peng2021.pdf?sequence=1&isAllowed=y
CUDA Floating-Point Errors
Research on floating-point errors and numerical issues in GPU programs:
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
- Xinyi Li, Ignacio Laguna, Katarzyna Swirydowicz, Bo Fang, Ang Li, and Ganesh Gopalakrishnan. Design and evaluation of GPU-FPX: A low-overhead tool for floating-point exception detection in NVIDIA GPUs. In ACM HPDC 2023, 2023. doi:10.11578/dc.20230713.4. https://dl.acm.org/doi/pdf/10.1145/3588195.3592991
- James Demmel, Jack J. Dongarra, Mark Gates, Greg Henry, Julien Langou, Xiaoye S. Li, Piotr Luszczek, Weslley S. Pereira, E. Jason Riedy, and Cindy Rubio-González. Proposed consistent exception handling for the BLAS and LAPACK. In Correctness@SC, pages 1–9. IEEE, 2022. doi:10.1109/Correctness56720.2022.00006. https://netlib.org/utk/people/JackDongarra/PAPERS/Proposed_Consistent_Exception_Handling_for_the_BLAS_and_LAPACK.pdf https://arxiv.org/abs/2207.09281
- N. Toronto and J. McCarthy, "Practically Accurate Floating-Point Math," in Computing in Science & Engineering, vol. 16, no. 4, pp. 80-95, July-Aug. 2014, doi: 10.1109/MCSE.2014.90. https://ieeexplore.ieee.org/document/6879754 https://www.cs.umd.edu/~ntoronto/papers/toronto-2014cise-floating-point.pdf
- Peter Dinda, Alex Bernat, and Conor Hetland. Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications. In HPDC, pages 5–16. ACM, 2020. doi:10.1145/3369583.3392673. http://pdinda.org/Papers/hpdc20.pdf
CUDA Floating-Point Runtime Error Checkers
Research papers on tools that detect floating-point errors and exceptions at runtime:
- Lawrence Livermore National Laboratory, Oct 024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Xinyi Li, Ignacio Laguna, Katarzyna Swirydowicz, Bo Fang, Ang Li, and Ganesh Gopalakrishnan. Design and evaluation of GPU-FPX: A low-overhead tool for floating-point exception detection in NVIDIA GPUs. In ACM HPDC 2023, 2023. doi:10.11578/dc.20230713.4. https://dl.acm.org/doi/pdf/10.1145/3588195.3592991
- Peter Dinda, Alex Bernat, and Conor Hetland. Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications. In HPDC, pages 5–16. ACM, 2020. doi:10.1145/3369583.3392673. http://pdinda.org/Papers/hpdc20.pdf
CUDA Auto-Tuning Research
Research on CUDA "auto-tuning," i.e., automatically searching over launch configurations and code variants to make a CUDA kernel run faster:
- Stefan K. Muller and Jan Hoffmann. 2024. Modeling and Analyzing Evaluation Cost of CUDA Kernels. ACM Trans. Parallel Comput. 11, 1, Article 5 (March 2024), 53 pages. https://doi.org/10.1145/3639403 https://dl.acm.org/doi/full/10.1145/3639403 PDF: https://dl.acm.org/doi/pdf/10.1145/3639403
- Mark Lou and Stefan K. Muller. 2024. Automatic Static Analysis-Guided Optimization of CUDA Kernels. In Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '24). Association for Computing Machinery, New York, NY, USA, 11–21. https://doi.org/10.1145/3649169.3649249 https://dl.acm.org/doi/abs/10.1145/3649169.3649249 PDF: https://dl.acm.org/doi/pdf/10.1145/3649169.3649249
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Tianyi David Han and Tarek S. Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4). Association for Computing Machinery, New York, NY, USA, Article 3, 1–8. https://doi.org/10.1145/1964179.1964184 https://dl.acm.org/doi/10.1145/1964179.1964184
- R. Lim, B. Norris and A. Malony, "Autotuning GPU Kernels via Static and Predictive Analysis," 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 523-532, doi: 10.1109/ICPP.2017.61. https://ieeexplore.ieee.org/abstract/document/8025326 https://arxiv.org/abs/1701.08547
- Nimit Singhania 2018, Static Analysis for GPU Program Performance, Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, https://repository.upenn.edu/server/api/core/bitstreams/9d32ece9-5321-4d37-ab72-251e0ecc197e/content (Optimization of uncoalesced memory accesses, block sizes, and cache reuse.)
- R. L. Castro, D. Andrade and B. B. Fraguela, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning," in IEEE Access, vol. 12, pp. 70581-70599, 2024, doi: 10.1109/ACCESS.2024.3402326. https://ieeexplore.ieee.org/abstract/document/10534045 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10534045
- Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan. 2024. Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS '24). Association for Computing Machinery, New York, NY, USA, 549–561. https://doi.org/10.1145/3650200.3656626 https://dl.acm.org/doi/abs/10.1145/3650200.3656626 PDF: https://dl.acm.org/doi/pdf/10.1145/3650200.3656626
- Ben Spencer, 2011, A General Auto-Tuning Framework for Software Performance Optimisation, Third Year Project Report, University of Oxford http://mistymountain.co.uk/flamingo/report/autotuning-2011-05-30.pdf http://mistymountain.co.uk/flamingo/
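The basic idea behind much of this auto-tuning work can be sketched as an empirical search over launch configurations. The following is a minimal illustration (the kernel `myKernel` is a hypothetical placeholder, not from any of the papers above): time the same kernel at several block sizes using CUDA events and keep the fastest.

```cpp
// Minimal auto-tuning sketch: time a hypothetical kernel "myKernel"
// at several block sizes and keep the fastest configuration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x, int n) {   // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    int candidates[] = {64, 128, 256, 512, 1024};
    float bestMs = 1e30f; int bestBlock = 0;
    for (int b : candidates) {
        int grid = (n + b - 1) / b;           // cover all n elements
        cudaEventRecord(start);
        myKernel<<<grid, b>>>(d_x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestBlock = b; }
    }
    printf("Best block size: %d (%f ms)\n", bestBlock, bestMs);
    cudaFree(d_x);
    return 0;
}
```

Real auto-tuners search a much larger space (tile sizes, unroll factors, code variants) and often use static analysis or models to prune it, as in the papers above.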
CUDA Tools
Research papers on the theory of some of the CUDA C++ tools and extensions:
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker: Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Mohamed Tarek Ibn Ziad, Sana Damani, Aamer Jaleel, Stephen W. Keckler, Mark Stephenson, June 19, 2023, cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Proceedings of the ACM on Programming Languages, Volume 7, Issue PLDI, Article No.: 111, Pages 124 - 147, https://doi.org/10.1145/3591225 https://dl.acm.org/doi/10.1145/3591225 https://research.nvidia.com/publication/2023-06_cucatch-debugging-tool-efficiently-catching-memory-safety-violations-cuda https://dl.acm.org/doi/pdf/10.1145/3591225
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- Felipe R. Monteiro, Erickson H. da S. Alves, Isabela S. Silva, Hussama I. Ismail, Lucas C. Cordeiro, and Eddie B. de Lima Filho. 2018. ESBMC-GPU: A context-bounded model checking tool to verify CUDA programs. Sci. Comput. Program. 152, C (January 2018), 63–69. https://doi.org/10.1016/j.scico.2017.09.005 https://dl.acm.org/doi/10.1016/j.scico.2017.09.005 https://pure.manchester.ac.uk/ws/files/77048933/jscp2017.pdf
- Joshi, S., Muduganti, G. (2021). GPURepair: Automated Repair of GPU Kernels. In: Henglein, F., Shoham, S., Vizel, Y. (eds) Verification, Model Checking, and Abstract Interpretation. VMCAI 2021. Lecture Notes in Computer Science(), vol 12597. Springer, Cham. https://doi.org/10.1007/978-3-030-67067-2_18 https://link.springer.com/chapter/10.1007/978-3-030-67067-2_18 https://arxiv.org/pdf/2011.08373
- Bo Jiang, Xiaoyan Wang, W.K. Chan, T.H. Tse, Na Li and Yongfeng Yin, 2020, "CUDAsmith: A Fuzzer for CUDA Compilers," 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 2020, pp. 861-871, doi: 10.1109/COMPSAC48688.2020.0-156. https://ieeexplore.ieee.org/abstract/document/9202798 https://github.com/gongbell/CUDAsmith https://www.cs.hku.hk/data/techreps/document/TR-2020-05.pdf
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
Warp Divergence in CUDA Kernels
Warp divergence, also called "thread divergence" or "branch divergence," occurs when some threads in a 32-thread warp take different control-flow paths (i.e., at if statements or loop conditions). Because the divergent paths execute serially, divergence slows GPU execution, so avoiding divergent control flow in CUDA kernels is an important optimization technique.
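As a small sketch of the idea (the kernel names are illustrative, not from any particular source): the first version below diverges at the inner if statement, whereas the second computes the same result branchlessly, so every thread in the warp executes the same instructions.

```cpp
// Divergent version: threads in the same warp take different paths
// at the inner if/else, so the two branches execute serially.
__global__ void scaleReluDivergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;          // some lanes take this path...
        else
            x[i] = x[i] * 2.0f;   // ...others take this one
    }
}

// Branchless version: the same result via fmaxf, with no divergence.
// (fmaxf(x,0) is 0 for negative x, so multiplying by 2 matches above.)
__global__ void scaleReluBranchless(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = fmaxf(x[i], 0.0f) * 2.0f;
}
```

Note that the outer bounds check (`i < n`) is generally harmless: only the last warp can diverge on it.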
Articles and research on thread divergence:
- Elmar Westphal, Aug 06, 2015, Voting and Shuffling to Optimize Atomic Operations, https://developer.nvidia.com/blog/voting-and-shuffling-optimize-atomic-operations/
- HDL Wizard, January 17, 2024, An In-Depth Look at Thread Divergence in GPU Architecture, https://hdlwizard.com/an-in-depth-look-at-thread-divergence-in-gpu-architecture/
- Imen Chakroun, Mohand Mezmaz, Nouredine Melab, Ahcène Bendjoudi. 2013, Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm. Concurrency and Computation: Practice and Experience, 2013, 25 (8), pp.1121-1136. doi:10.1002/cpe.2931. hal-00731859 PDF: https://inria.hal.science/hal-00731859/document
- Charitha Saumya, Kirshanthan Sundararajah, Milind Kulkarni, 14 Jan 2022 (v3), DARM: Control-Flow Melding for SIMT Thread Divergence Reduction -- Extended Version, https://arxiv.org/abs/2107.05681
- Chakroun, I., Bendjoudi, A., Melab, N. (2012). Reducing Thread Divergence in GPU-Based B&B Applied to the Flow-Shop Problem. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2011. Lecture Notes in Computer Science, vol 7203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31464-3_57 https://link.springer.com/chapter/10.1007/978-3-642-31464-3_57
- Huanxin Lin, Cho-Li Wang, and Hongyuan Liu. 2018. On-GPU Thread-Data Remapping for Branch Divergence Reduction. ACM Trans. Archit. Code Optim. 15, 3, Article 39 (September 2018), 24 pages. https://doi.org/10.1145/3242089 https://dl.acm.org/doi/10.1145/3242089 PDF: https://dl.acm.org/doi/pdf/10.1145/3242089
- Vespa, L., Peters, G. (2021). Contrived and Remediated GPU Thread Divergence Using a Flattening Technique. In: Arabnia, H.R., et al. Advances in Parallel & Distributed Processing, and Applications. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-69984-0_46 https://link.springer.com/chapter/10.1007/978-3-030-69984-0_46
- Bialas, P., Strzelecki, A. (2016). Benchmarking the Cost of Thread Divergence in CUDA. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2015. Lecture Notes in Computer Science(), vol 9573. Springer, Cham. https://doi.org/10.1007/978-3-319-32149-3_53 https://link.springer.com/chapter/10.1007/978-3-319-32149-3_53 https://arxiv.org/abs/1504.01650 PDF: https://arxiv.org/pdf/1504.01650
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- Christopher Cooper, August, 2011, GPU Computing with CUDA: Lecture 4 - Optimizations, Boston University https://www.bu.edu/pasi/files/2011/07/Lecture4.pdf
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- Christian Mills, September 11, 2024, GPU MODE Lecture 8: CUDA Performance Checklist, https://christianjmills.com/posts/cuda-mode-notes/lecture-008/
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
Warp Shuffle
Warp shuffle is a set of CUDA GPU primitives that allow the 32 threads in a warp to exchange register values directly. The shuffle intrinsics are faster than shared memory ("__shared__"), but are limited to 32 threads (i.e., one warp only), whereas shared memory is visible to the threads of every warp in a block.
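The classic use case is a warp-level reduction. The sketch below (a common idiom, e.g. in the NVIDIA warp-primitives blog post listed under this heading) sums a value across a warp using `__shfl_down_sync`:

```cpp
// Warp-level sum reduction: each __shfl_down_sync halves the span
// between cooperating lanes (16, 8, 4, 2, 1) until lane 0 holds
// the sum of all 32 lanes' values.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // full warp sum is valid in lane 0
}
```

The mask `0xffffffff` asserts that all 32 lanes participate; no `__syncthreads()` or shared memory is needed because the exchange happens entirely within the warp.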
- Elmar Westphal, Aug 06, 2015, Voting and Shuffling to Optimize Atomic Operations, https://developer.nvidia.com/blog/voting-and-shuffling-optimize-atomic-operations/
- Mark Harris, Feb 03, 2014, CUDA Pro Tip: Do The Kepler Shuffle, https://developer.nvidia.com/blog/cuda-pro-tip-kepler-shuffle/
- Yuan Lin and Vinod Grover, Jan 15, 2018, Using CUDA Warp-Level Primitives, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
- Prof. Mike Giles, 2024, Lecture 4: warp shuffles, and reduction / scan operations, Lecture 4– p. 1/38, Oxford University Mathematical Institute, https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
Memory Address Alignment
Aligned memory accesses are faster in CUDA than accesses to non-aligned addresses.
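Pointers from `cudaMalloc` are already well aligned, but struct layout still matters: if a type's size and alignment let each thread's access map to a single wide transaction, loads and stores are more efficient. A small sketch (the struct name is illustrative):

```cpp
// Force 16-byte alignment so each element can be moved with a single
// 128-bit load/store, rather than multiple smaller transactions.
struct __align__(16) Float4Like {
    float x, y, z, w;
};

__global__ void copyAligned(const Float4Like *in, Float4Like *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // one 16-byte transaction per thread
}
```

Without the `__align__(16)` specifier the compiler may fall back to several narrower accesses per element.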
Articles and papers on memory alignment in CUDA:
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Lei Mao, Oct 18, 2022, CUDA Data Alignment, https://leimao.github.io/blog/CUDA-Data-Alignment/
Shared Memory Optimizations in CUDA
Articles on CUDA optimizations using shared memory:
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, June 22, 2022, CUDA Shared Memory Bank, https://leimao.github.io/blog/CUDA-Shared-Memory-Bank/
- Dhanush, Aug 23, 2024, Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques, https://github.com/Dhanush295/Cuda_program/blob/main/Vector_dot_product.cu
- Mark Harris, Jan 28, 2013, Using Shared Memory in CUDA C/C++, https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/shared-memory/shared-memory.cu
- Rustam, Apr 5, 2024, CUDA: Shared memory, https://medium.com/@fatlip/cuda-shared-memory-23cd1a0d4e39
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Ellery Russell, Jiqun Tu, March 2024, Accelerating Drug Discovery: Optimizing Dynamic GPU Workflows with CUDA Graphs, Mapped Memory, C++ Coroutines, and More, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61156/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
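To make the topic concrete, here is a minimal sketch of the standard shared-memory reduction pattern that several of the articles above walk through (kernel and variable names are illustrative):

```cpp
// Block-level sum reduction staged through __shared__ memory.
// Assumes blockDim.x == 256 and a power-of-two block size.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];               // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;       // stage input into shared
    __syncthreads();                          // tile fully populated
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();                      // level of tree complete
    }
    if (tid == 0) out[blockIdx.x] = tile[0];  // per-block partial sum
}
```

Each global-memory element is read once; all the repeated accesses during the reduction tree hit fast on-chip shared memory instead.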
Memory Address Coalescing in CUDA
Coalesced memory accesses refer to each thread in a warp accessing adjacent memory locations (in global memory). This is much faster in CUDA than having different threads accessing non-adjacent locations.
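The contrast can be sketched in two tiny kernels (names are illustrative). In the first, consecutive threads read consecutive addresses, so a warp's 32 loads combine into a few wide memory transactions; in the second, consecutive threads read addresses far apart, so each load can land in a different memory segment:

```cpp
// Coalesced: thread i reads element i, so a warp reads a contiguous
// 128-byte span of global memory in one or two transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: consecutive threads read addresses 32 floats apart
// (a scatter pattern purely for illustration), defeating coalescing.
__global__ void copyScattered(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];
}
```

Both kernels move the same amount of data, but the second issues many more memory transactions per warp.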
Articles on memory coalescing in CUDA:
- Christopher Cooper, August, 2011, GPU Computing with CUDA: Lecture 4 - Optimizations, Boston University https://www.bu.edu/pasi/files/2011/07/Lecture4.pdf
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Samuel Midkiff, 2019, Lecture 6.2 – Performance Considerations: Memory Coalescing in CUDA, Illinois University, NVIDIA GPU Teaching Kit, https://engineering.purdue.edu/~smidkiff/ece563/NVidiaGPUTeachingToolkit/Mod6/Lecture-6-2-memory-coalescing.pdf
- Chunwei Yan, Feb 25, 2024, Memory coalescing in CUDA (1) – VecAdd, https://superjomn.github.io/posts/cuda-memory-coalescing-access/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Lei Mao, Oct 18, 2022, CUDA Data Alignment, https://leimao.github.io/blog/CUDA-Data-Alignment/
- Nimit Singhania 2018, Static Analysis for GPU Program Performance, Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, https://repository.upenn.edu/server/api/core/bitstreams/9d32ece9-5321-4d37-ab72-251e0ecc197e/content (Optimization of uncoalesced memory accesses, block sizes, and cache reuse.)
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Christian Mills, September 11, 2024, GPU MODE Lecture 8: CUDA Performance Checklist, https://christianjmills.com/posts/cuda-mode-notes/lecture-008/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
Array Stride Optimizations in CUDA
Strided array access, particularly the grid-stride loop, is an important access pattern for CUDA kernels. One of its advantages is achieving coalesced memory accesses.
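The grid-stride loop idiom (described in the Harris "Grid-Stride Loops" post listed below) looks like this; the kernel name is illustrative:

```cpp
// Grid-stride loop: each thread starts at its global index and then
// advances by the total number of threads in the grid. Any grid size
// covers any n, and consecutive threads still touch consecutive
// elements on every iteration, so accesses stay coalesced.
__global__ void scale(float *x, float a, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}
```

This decouples the launch configuration from the data size: the same kernel works whether it is launched with one block or thousands.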
Articles and papers on array stride optimizations:
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Chunwei Yan, Feb 25, 2024, Memory coalescing in CUDA (1) – VecAdd, https://superjomn.github.io/posts/cuda-memory-coalescing-access/
- Mark Harris, Apr 22, 2013, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops, https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- Mark Harris, Optimizing Parallel Reduction in CUDA, https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
CUDA Kernel Fusion
Kernel fusion is the merging of two kernels into one. Typically, if two kernels run sequentially over the same data, with the output of the first feeding into the second, it can be more efficient to combine them into a single kernel, saving a kernel launch and avoiding a round trip of intermediate results through global memory.
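A minimal sketch of the idea (kernel names are illustrative): the unfused pair writes the intermediate array to global memory and reads it back, while the fused version keeps it in a register.

```cpp
// Unfused: the intermediate result y makes a round trip through
// global memory between two separate kernel launches.
__global__ void addKernel(const float *a, const float *b, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a[i] + b[i];
}
__global__ void squareKernel(const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] * y[i];
}

// Fused: one launch; the intermediate value never leaves a register,
// and no temporary array y is needed at all.
__global__ void addSquareFused(const float *a, const float *b,
                               float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float y = a[i] + b[i];
        z[i] = y * y;
    }
}
```

For memory-bound element-wise operations like these, fusion roughly halves the global-memory traffic as well as the launch overhead.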
Papers on kernel fusion in CUDA include:
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- J. Filipovič, M. Madzin, J. Fousek, L. Matyska, 16 Jul 2013 (v2), Optimizing CUDA Code By Kernel Fusion---Application on BLAS, https://arxiv.org/abs/1305.1183 (An early paper from 2013 on kernel fusion in CUDA.)
- DMLC, 2016, NNVM-Fusion: Implement GPU Kernel Fusion and Runtime Compilation Based on NNVM, https://github.com/dmlc/nnvm-fusion
- Ao Li, Bojian Zheng, Gennady Pekhimenko, Fan Long, 2 Jul 2020, Automatic Horizontal Fusion for GPU Kernels, https://arxiv.org/abs/2007.01277
- Nourazar, M., Booth, B.G. & Goossens, B. A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications. J Real-Time Image Proc 21, 5 (2024). https://doi.org/10.1007/s11554-023-01384-7 https://link.springer.com/article/10.1007/s11554-023-01384-7 PDF: https://backoffice.biblio.ugent.be/download/01HEJC9WDA5DGR7X1X1XR6XV48/01HFV8WMEJQFKT7F347ADBCEGB
- W. Sun, A. Li, S. Stuijk and H. Corporaal, "How Much Can We Gain From Tensor Kernel Fusion on GPUs?," in IEEE Access, vol. 12, pp. 126135-126144, 2024, doi: 10.1109/ACCESS.2024.3411473. https://ieeexplore.ieee.org/abstract/document/10551817 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10551817
CUDA Security Issues
Research on CUDA security issues, such as buffer overflow exploits:
- Yanan Guo, Zhenkai Zhang, Jun Yang, August 2024, GPU Memory Exploitation for Fun and Profit, Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024 • Philadelphia, PA, USA, 978-1-939133-44-1, https://www.usenix.org/conference/usenixsecurity24/presentation/guo-yanan https://www.usenix.org/system/files/usenixsecurity24-guo-yanan.pdf
- Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. https://dl.acm.org/doi/10.5555/3049832.3049840 https://computermachines.org/joe/publications/pdfs/cgo2017_clarmor.pdf
- Bang Di, Jianhua Sun, and Hao Chen. 2016, A study of overflow vulnerabilities on GPUs. In Network and Parallel Computing: 13th IFIP WG 10.3 International Conference. https://link.springer.com/chapter/10.1007/978-3-319-47099-3_9 https://www.aimlab.org/haochen/papers/npc16-overflow.pdf
- Andrea Miele. Buffer overflow vulnerabilities in CUDA: a preliminary analysis. Journal of Computer Virology and Hacking Techniques, 12:113–120, 2016. https://link.springer.com/article/10.1007/s11416-015-0251-1 https://arxiv.org/pdf/1506.08546
- Sang-Ok Park, Ohmin Kwon, Yonggon Kim, Sang Kil Cha, and Hyunsoo Yoon. Mind control attack: Under mining deep learning with GPU memory exploitation. Computers & Security, 102:102115, 2021. https://dl.acm.org/doi/10.1016/j.cose.2020.102115
- Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
General Research on CUDA
Articles and papers on CUDA programming:
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- M. Wu, Y. Ouyang, H. Zhou, L. Zhang, C. Liu and Y. Zhang, "Simulee: Detecting CUDA Synchronization Bugs via Memory-Access Modeling," 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea (South), 2020, pp. 937-948, doi: 10.1145/3377811.3380358. https://ieeexplore.ieee.org/document/9284094 (Simulation tool to detect CUDA bugs by interpreting the LLVM byte code.)
- Pengcheng Li, Chen Ding, Xiaoyu Hu, Tolga Soyata, 2014, LDetector: A Low Overhead Race Detector For GPU Programs, https://wodet.cs.washington.edu/wp-content/uploads/2014/02/wodet2014-final14.pdf
- Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, Dongrui Fan, 16 Aug 2024, Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels, https://arxiv.org/abs/2408.08490 (Improving CUDA kernel performance by reducing small memory-bound kernels.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374 http://kerneltuner.github.io/
- NVIDIA, Sep 2024, NVIDIA CUDA Compiler Driver NVCC, https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/ https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf
- Jonathan Goldberg, March 17, 2024, Not just the hardware: How deep is Nvidia's software moat? The inherent inertia of software ecosystems, https://www.techspot.com/news/102294-beyond-gpu-how-deep-nvidia-software-moat.html
- Zen Analyst, May 31, 2023, Nvidia's CUDA: Unleashing The Power Of Parallel Computing For AI Dominance, https://seekingalpha.com/article/4608475-nvidias-cuda-unleashing-the-power-of-parallel-computing-for-ai-dominance (CUDA has 4 million developers, 3,000+ applications, and 40 million CUDA downloads.)
- Will Ramey, August 19, 2020, 2 Million Registered Developers, Countless Breakthroughs. NVIDIA developer program doubles number of members in less than two years, https://blogs.nvidia.com/blog/2-million-registered-developers-breakthroughs/
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
CUDA C++ Optimization Book
The new CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
Memory Safety C++ Blog Articles
- DIY Preventive C++ Memory Safety
- Canary Values & Redzones for Memory-Safe C++
- Use-After-Free Memory Errors in C++
- Array Bounds Violations and Memory Safe C++
- Poisoning Memory Blocks for Safer C++
- Uninitialized Memory Safety in C++
- DIY Memory Safety in C++
- CUDA C++ Floating Point Exceptions
- Memory Safe C++ Library Functions
- Smart Stack Buffers for Memory Safe C++
- Safe C++ Text Buffers with snprintf
CUDA C++ Debugging Book
The new CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: