Aussie AI
CUDA C++ Optimization Research
-
Last Updated 10 December, 2024
-
by David Spuler, Ph.D.
CUDA C++ Blog Articles
See also these Aussie AI blog articles:
- CUDA C++ Floating Point Exceptions
- CUDA C++ Optimization (New Book)
- Debugging CUDA C++ Kernels (New Book)
- CUDA Memory Coalescing Optimizations
- CUDA GPU Thread Divergence
- CUDA Basic C++ Programming Mistakes
CUDA Introductory Articles
Articles and tutorials for CUDA include:
- NVIDIA, 2024, CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- Mark Harris, Jan 25, 2017, An Even Easier Introduction to CUDA, https://developer.nvidia.com/blog/even-easier-introduction-cuda/
- Mark Harris, Oct 31, 2012, An Easy Introduction to CUDA C and C++, https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/
- Umangshrestha, Jul 2, 2021, Introduction to CUDA, https://medium.com/geekculture/introduction-to-cuda-7bf6909ea57c
- Read The Docs, Sep 2024, Tutorial 01: Say Hello to CUDA, https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
- Geeks for Geeks, 14 Mar, 2023, Introduction to CUDA Programming, https://www.geeksforgeeks.org/introduction-to-cuda-programming/
- Saurabh Kumar Singh, April 25, 2020, Introduction to CUDA: The Beginner's Guide. What is CUDA and CUDA architecture?, https://imsrbh.github.io/IntroToCUDA
- NVIDIA, Sep 2024, GPU Accelerated Computing with C and C++, https://developer.nvidia.com/how-to-cuda-c-cpp
- Martin Heller, Sep 16, 2022, What is CUDA? Parallel programming for GPUs, InfoWorld, https://www.infoworld.com/article/2256401/what-is-cuda-parallel-programming-for-gpus.html
- Stephen Jones, 2024, How To Write A CUDA Program: The Ninja Edition, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62401/ https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/How-To-Write-A-CUDA-Program.pdf
- NVIDIA, 2024, CUDA Code Samples, https://developer.nvidia.com/cuda-code-samples
- Prof. Mike Giles, 2024, Course on CUDA Programming on NVIDIA GPUs, Oxford University Mathematical Institute, UK, https://people.maths.ox.ac.uk/gilesm/cuda/
- Paul Norvig, January 12, 2024, GPU Programming with CUDA C++ (2024): Short guide to GPU programming with CUDA C++: coding, performance optimization, latest features, https://www.paulnorvig.com/guides/gpu-programming-with-cuda-c.html
- Cyril Zeller, 2011, CUDA C/C++ Basics : Supercomputing 2011 Tutorial, https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
- Daniel Warfield, Jun 15, 2024, CUDA for AI — Intuitively and Exhaustively Explained: Parallelized AI from scratch, https://towardsdatascience.com/cuda-for-ai-intuitively-and-exhaustively-explained-6ba6cb4406c5
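Most of the introductions above open with a kernel of roughly this shape: a vector add where each thread computes one output element. This is a generic minimal sketch (not taken from any single article), using unified memory to keep it short:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; the bounds check guards the
// final partial block when n is not a multiple of the block size.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the example compact; production code may
    // prefer explicit cudaMalloc/cudaMemcpy for control over transfers.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;  // round up to cover all n elements
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```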
CUDA Programming Books
- Brett Neutreon, 2024, Mastering CUDA C++ Programming: A Comprehensive Guidebook, https://www.overdrive.com/media/10749608/mastering-cuda-c-programming https://books.apple.com/us/book/mastering-cuda-c-programming-a-comprehensive-guidebook/id6502452953
- William Smith, 4 August 2024, CUDA Programming with C++: From Basics to Expert Proficiency, https://www.amazon.com/dp/1964899451
- Jason Sanders, Edward Kandrot, 19 July 2010, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 1st edition, https://www.amazon.com/CUDA-Example-Introduction-General-Purpose-Programming/dp/0131387685/
- Jordan P. Syntax, 30 May 2024, The CUDA C++ Programming Beginner's Guide: Unlock the Potential of GPU Computing with a Step-by-Step Explanation and Real-World Applications, https://www.amazon.com/Programming-Beginners-Guide-Step-Step/dp/B0D5MKQ3Q3/
- John Cheng, Max Grossman, Ty McKercher, 29 August 2014, Professional CUDA C Programming, Wrox, https://www.amazon.com/Professional-Cuda-Programming-John-Cheng/dp/1118739329/
- Greg Ruetsch, Massimiliano Fatica, 26 July 2024, CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming, Morgan Kaufmann, 2nd edition, https://www.amazon.com/CUDA-Fortran-Scientists-Engineers-Programming/dp/044321977X/
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
CUDA Optimization Techniques
Articles and research on CUDA performance optimization techniques:
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, Dongrui Fan, 16 Aug 2024, Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels, https://arxiv.org/abs/2408.08490 (Improving CUDA kernel performance by reducing small memory-bound kernels.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374 http://kerneltuner.github.io/
- Alexander Brandt, Davood Mohajerani, Marc Moreno Maza, Jeeva Paudel, Linxiao Wang, 5 Nov 2019, KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs, https://arxiv.org/abs/1911.02373
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- Stefan K. Muller and Jan Hoffmann. 2024. Modeling and Analyzing Evaluation Cost of CUDA Kernels. ACM Trans. Parallel Comput. 11, 1, Article 5 (March 2024), 53 pages. https://doi.org/10.1145/3639403 https://dl.acm.org/doi/full/10.1145/3639403 PDF: https://dl.acm.org/doi/pdf/10.1145/3639403
- Mark Lou and Stefan K. Muller. 2024. Automatic Static Analysis-Guided Optimization of CUDA Kernels. In Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '24). Association for Computing Machinery, New York, NY, USA, 11–21. https://doi.org/10.1145/3649169.3649249 https://dl.acm.org/doi/abs/10.1145/3649169.3649249 PDF: https://dl.acm.org/doi/pdf/10.1145/3649169.3649249
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Tianyi David Han and Tarek S. Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4). Association for Computing Machinery, New York, NY, USA, Article 3, 1–8. https://doi.org/10.1145/1964179.1964184 https://dl.acm.org/doi/10.1145/1964179.1964184
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
- Peng Wang, 2010, Fundamental Optimizations in CUDA, https://developer.download.nvidia.com/GTC/PDF/1083_Wang.pdf
- Nikolay Sakharnykh, Nov 19, 2017, Maximizing Unified Memory Performance in CUDA, https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
- Johannes Pekkilä, Oskar Lappi, Fredrik Robertsén, Maarit J. Korpi-Lagg, 13 Jun 2024, Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies, https://arxiv.org/abs/2406.08923
- A. Jangda, S. Maleki, M. M. Dehnavi, M. Musuvathi and O. Saarikivi, "A Framework for Fine-Grained Synchronization of Dependent GPU Kernels," 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Edinburgh, United Kingdom, 2024, pp. 93-105, doi: 10.1109/CGO57630.2024.10444873. https://ieeexplore.ieee.org/abstract/document/10444873
- V. Geraeinejad, Q. Qian and M. Ebrahimi, "Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 14, no. 3, pp. 469-482, Sept. 2024, doi: 10.1109/JETCAS.2024.3439193. https://ieeexplore.ieee.org/abstract/document/10623472
- Mark Harris, Apr 22, 2013, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops, https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- NVIDIA, Module 14 – Efficient Host-Device Data Transfer: Lecture 14.1 - Pinned Host Memory, GPU Teaching Kit, https://people.csail.mit.edu/xchen/gpu-programming/Lecture14-stream.pdf
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Athena Elafrou, March 2024, Introduction to CUDA Programming and Performance Optimization, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62191/
- NVIDIA, 2024, Performance Optimization Conference Sessions, https://www.nvidia.com/gtc/sessions/performance-optimization/
- Michelle Horton, Sep 11, 2024, Advanced Strategies for High-Performance GPU Programming with NVIDIA CUDA, https://developer.nvidia.com/blog/advanced-strategies-for-high-performance-gpu-programming-with-nvidia-cuda/
- Michelle Horton, Aug 29, 2024, Boosting CUDA Efficiency with Essential Techniques for New Developers, https://developer.nvidia.com/blog/boosting-cuda-efficiency-with-essential-techniques-for-new-developers/
- Thejaswi Rao and Mark Harris, Aug 20, 2019, CUDA Pro Tip: The Fast Way to Query Device Properties, https://developer.nvidia.com/blog/cuda-pro-tip-the-fast-way-to-query-device-properties/
- Julien Demouth, Jun 04, 2014, CUDA Pro Tip: Minimize the Tail Effect, https://developer.nvidia.com/blog/cuda-pro-tip-minimize-the-tail-effect/
- Andy Adinets, Oct 01, 2014, CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics, https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Mark Harris, Jun 29, 2015, GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++, https://developer.nvidia.com/blog/fast-great-circle-distance-calculation-cuda-c/
- Elmar Westphal, Aug 06, 2015, Voting and Shuffling to Optimize Atomic Operations, https://developer.nvidia.com/blog/voting-and-shuffling-optimize-atomic-operations/
- Mark Harris, Jun 10, 2015, GPU Pro Tip: Lerp Faster in C++, https://developer.nvidia.com/blog/lerp-faster-cuda/
- Maxim Milakov, Feb 11, 2015, GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA, https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/
- Jeremy Appleyard, Aug 07, 2014, CUDA Pro Tip: Optimize for Pointer Aliasing, https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/ (Use of const and restrict pointers in CUDA kernels can improve speed.)
- Mark Harris, Jul 17, 2014, CUDA Pro Tip: Occupancy API Simplifies Launch Configuration, https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
- Mark Harris, Feb 03, 2014, CUDA Pro Tip: Do The Kepler Shuffle, https://developer.nvidia.com/blog/cuda-pro-tip-kepler-shuffle/
- Yuan Lin and Vinod Grover, Jan 15, 2018, Using CUDA Warp-Level Primitives, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
- Prof. Mike Giles, 2024, Lecture 4: warp shuffles, and reduction / scan operations, Lecture 4– p. 1/38, Oxford University Mathematical Institute, https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf
- Dhanush, Aug 23, 2024, Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques, https://github.com/Dhanush295/Cuda_program/blob/main/Vector_dot_product.cu
- Emre Avci, Dot Product in CUDA C++ Which Might Outperform cuBLAS, https://emre-avci.medium.com/dot-product-in-cuda-c-which-might-outperform-cublas-t-dot-732047aa5ec5
- Bialas, P., Strzelecki, A. (2016). Benchmarking the Cost of Thread Divergence in CUDA. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2015. Lecture Notes in Computer Science, vol 9573. Springer, Cham. https://doi.org/10.1007/978-3-319-32149-3_53 https://link.springer.com/chapter/10.1007/978-3-319-32149-3_53 https://arxiv.org/abs/1504.01650 PDF: https://arxiv.org/pdf/1504.01650
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- NVIDIA, Sep 2024, Integer Mathematical Functions, https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INT.html https://docs.nvidia.com/cuda/pdf/CUDA_Math_API.pdf
- NVIDIA, Sep 2024, CUDA Math API Reference Manual, https://docs.nvidia.com/cuda/cuda-math-api/index.html
- Paul Norvig, January 12, 2024, GPU Programming with CUDA C++ (2024): Short guide to GPU programming with CUDA C++: coding, performance optimization, latest features, https://www.paulnorvig.com/guides/gpu-programming-with-cuda-c.html
- Christopher Cooper, August, 2011, GPU Computing with CUDA: Lecture 4 - Optimizations, Boston University https://www.bu.edu/pasi/files/2011/07/Lecture4.pdf
- Mark Harris, Dec 04, 2012, How to Optimize Data Transfers in CUDA C/C++, https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
- Mark Harris, Dec 13, 2012, How to Overlap Data Transfers in CUDA C/C++, https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
- Mark Harris, Jul 02, 2012, Six Ways to SAXPY, https://developer.nvidia.com/blog/six-ways-saxpy/
- Mark Harris, Optimizing Parallel Reduction in CUDA, https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- NVIDIA, Aug 2024, CUDA C++ Programming Guide, Release 12.6, https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
- Daniel Warfield, Jun 15, 2024, CUDA for AI — Intuitively and Exhaustively Explained: Parallelized AI from scratch, https://towardsdatascience.com/cuda-for-ai-intuitively-and-exhaustively-explained-6ba6cb4406c5
- Arthy Sundaram, Jaydeep Marathe, Mike Murphy, Nikhil Gupta, Xiaohua Zhang and Thibaut Lutz, Apr 15, 2021, Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain, https://developer.nvidia.com/blog/programming-efficiently-with-the-cuda-11-3-compiler-toolchain/ (Includes focus on using alloca for stack memory allocation versus malloc on the heap.)
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Rob Van der Wijngaart and Fred Oh, Aug 08, 2024, Improving GPU Performance by Reducing Instruction Cache Misses, https://developer.nvidia.com/blog/improving-gpu-performance-by-reducing-instruction-cache-misses-2/ (Loop unrolling may be detrimental to instruction locality, and the compiler's auto-unrolling is not necessarily the best.)
- Alan Gray, Aug 07, 2024, Optimizing llama.cpp AI Inference with CUDA Graphs, https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/
- Christian Mills, September 11, 2024, GPU MODE Lecture 8: CUDA Performance Checklist, https://christianjmills.com/posts/cuda-mode-notes/lecture-008/
- Maxim Milakov, Natalia Gimelshein, 28 Jul 2018 (v2), Online normalizer calculation for softmax, https://arxiv.org/abs/1805.02867
- Thomas Mejstrik, 9 Aug 2023, __host__ __device__ -- Generic programming in Cuda, Association for Computing Machinery, https://arxiv.org/abs/2309.03912
- Gianluca Brilli, Paolo Burgio, 7 Oct 2023, Interference analysis of shared last-level cache on embedded GP-GPUs with multiple CUDA streams, https://arxiv.org/abs/2310.04848
- Tony Scudiero and Mike Murphy, Apr 22, 2014, Separate Compilation and Linking of CUDA C++ Device Code, https://developer.nvidia.com/blog/separate-compilation-linking-cuda-device-code/
- Samuel S. Cho, 2011, CUDA Thread Basics, https://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
- Andy Adinets, May 20, 2014, CUDA Dynamic Parallelism API and Principles, https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
- Ellery Russell, Jiqun Tu, March 2024, Accelerating Drug Discovery: Optimizing Dynamic GPU Workflows with CUDA Graphs, Mapped Memory, C++ Coroutines, and More, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61156/
- Mark Harris, March 2024, More Data, Faster: GPU Memory Management Best Practices in Python and C++, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62550/
- Jack Kosaian, Vijay Thakkar, March 2024, CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61198/
- Anastasia Stulova, Jeff Larkin, March 2024, No More Porting: Accelerated Computing With Standard C++, Fortran, and Python, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61204/
- Igor Terentyev, March 2024, Advanced Performance Optimization in CUDA, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62192/
- Jiri Kraus, March 2024, Multi GPU Programming Models for HPC and AI, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61339/
- David Olsen, Graham Lopez and Bryce Adelstein Lelbach, Aug 04, 2020, Accelerating Standard C++ with GPUs Using stdpar, https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
- Mark Harris, Jan 25, 2017, An Even Easier Introduction to CUDA, https://developer.nvidia.com/blog/even-easier-introduction-cuda/
- Deepak Unnikrishnan and Fred Oh, Jun 05, 2023, CUDA 12.1 Supports Large Kernel Parameters, https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
- Paul Delestrac. 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications. Ph.D. Thesis, Micro and nanotechnologies/Microelectronics. Université de Montpellier, 2024. English. NNT: 2024UMONS014 https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
- Vasily Volkov, August 12, 2016, Understanding Latency Hiding on GPUs, Ph.D. Thesis, Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2016-143, http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Wayne Gaudin, Scott McMillan, Pat Brooks and Akhil Docca, Nov 16, 2020, Building and Deploying HPC Applications using NVIDIA HPC SDK from the NVIDIA NGC Catalog, NVIDIA Technical Blog, https://developer.nvidia.com/blog/building-and-deploying-hpc-applications-using-hpc-sdk-from-ngc-catalog/
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
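Two techniques from the references above (grid-stride loops, per Harris; warp shuffles, per Lin and Grover and per Giles's lecture) frequently combine in reduction kernels. A hedged sketch of a sum reduction, not drawn from any single article:

```cuda
#include <cuda_runtime.h>

// Sum reduction combining a grid-stride loop (so any grid size covers
// any n) with a warp-shuffle reduction and one atomicAdd per warp.
__global__ void sumReduce(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates a strided slice of in[].
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        v += in[i];
    }
    // Reduce within the warp using register shuffles; no shared
    // memory or __syncthreads() is needed at this stage.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    // Lane 0 of each warp now holds that warp's partial sum.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}
```

Warp-aggregated atomics like this trade a small amount of atomic contention for a much simpler kernel than a full two-pass block reduction; the references above cover the more elaborate variants.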
CUDA Profiling and Timing
Articles on using profiler tools for timing the efficiency of CUDA kernels:
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- R. Lim, B. Norris and A. Malony, "Autotuning GPU Kernels via Static and Predictive Analysis," 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 523-532, doi: 10.1109/ICPP.2017.61. https://ieeexplore.ieee.org/abstract/document/8025326 https://arxiv.org/abs/1701.08547
- Sébastien Darche and Michel R. Dagenais. 2024. Low-Overhead Trace Collection and Profiling on GPU Compute Kernels. ACM Trans. Parallel Comput. 11, 2, Article 9 (June 2024), 24 pages. https://doi.org/10.1145/3649510 https://dl.acm.org/doi/abs/10.1145/3649510
- Nikolay Sakharnykh, Nov 19, 2017, Maximizing Unified Memory Performance in CUDA, https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
- Jackson Marusarz, 2023, CUDA Tutorials I. Profiling and Debugging Applications, https://www.youtube.com/watch?v=dB5Jxwj0PDw
- NVIDIA, 2024, Profiler User’s Guide, https://docs.nvidia.com/cuda/profiler-users-guide/index.html
- Mark Harris, May 28, 2013, CUDA Pro Tip: Clean Up After Yourself to Ensure Correct Profiling, https://developer.nvidia.com/blog/pro-tip-clean-up-after-yourself-ensure-correct-profiling/
- Mark Harris, Oct 23, 2013, CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler, https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
- Mark Harris, Nov 07, 2012, How to Implement Performance Metrics in CUDA C/C++, https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
- NVIDIA, Sep 2024 (accessed), nvbench: CUDA Kernel Benchmarking Library, https://github.com/NVIDIA/nvbench
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- NVIDIA, Sep 2024 (accessed), Nsight Compute CLI, https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
- Paul Delestrac. 2024, Advanced Profiling Techniques For Evaluating GPU Computing Efficiency Executing ML Applications. Ph.D. Thesis, Micro and nanotechnologies/Microelectronics. Université de Montpellier, 2024. English. NNT: 2024UMONS014 https://theses.hal.science/tel-04742193/file/DELESTRAC_2024_archivage.pdf
- Greg Gutmann, Sep 2020, Running CUDA Profilers on Linux: GPU Optimizations, https://codingbyexample.com/2020/09/25/cuda-profiling-linux/
- Wang Peng, Qi Kaiyuan, Yu Zhibin, Su Guangfeng, Liu Peng, Dec 2024, MICPAT: Micro-architecture Independent Characteristics Profiling Analysis Tool for GPU Programs, https://doi.org/10.21203/rs.3.rs-5430086/v1 https://www.researchsquare.com/article/rs-5430086/v1 https://zenodo.org/records/13623324
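Besides the profiler tools above, kernels are commonly timed in code with CUDA events (the approach described in Harris's performance-metrics post). A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyWork(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded into the same stream as the kernel, so the
    // interval measures GPU execution, not host-side launch overhead.
    cudaEventRecord(start);
    busyWork<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until stop has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```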
CUDA Memory Optimization
Articles and papers on CUDA memory optimization techniques:
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
- Nikolay Sakharnykh, Nov 19, 2017, Maximizing Unified Memory Performance in CUDA, https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/
- NVIDIA, Module 14 – Efficient Host-Device Data Transfer: Lecture 14.1 - Pinned Host Memory, GPU Teaching Kit, https://people.csail.mit.edu/xchen/gpu-programming/Lecture14-stream.pdf
- Dhanush, Aug 23, 2024, Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques, https://github.com/Dhanush295/Cuda_program/blob/main/Vector_dot_product.cu
- Lei Mao, June 22, 2022, CUDA Shared Memory Bank, https://leimao.github.io/blog/CUDA-Shared-Memory-Bank/
- Mark Harris, Dec 04, 2012, How to Optimize Data Transfers in CUDA C/C++, https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
- Mark Harris, Jun 19, 2017, Unified Memory for CUDA Beginners, https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
- Tutorials Point, Sep 2024, CUDA - Reducing Global Memory Traffic, https://www.tutorialspoint.com/cuda/cuda_reducing_global_memory_traffic.htm
- Arthy Sundaram, Jaydeep Marathe, Mike Murphy, Nikhil Gupta, Xiaohua Zhang and Thibaut Lutz, Apr 15, 2021, Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain, https://developer.nvidia.com/blog/programming-efficiently-with-the-cuda-11-3-compiler-toolchain/ (Includes focus on using alloca for stack memory allocation versus malloc on the heap.)
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Mark Harris, March 2024, More Data, Faster: GPU Memory Management Best Practices in Python and C++, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62550/
- Deepak Unnikrishnan and Fred Oh, Jun 05, 2023, CUDA 12.1 Supports Large Kernel Parameters, https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
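As a concrete illustration of the host-device transfer advice in the articles above (pinned host memory and asynchronous copies, per Harris's data-transfer posts and the GPU Teaching Kit lecture), a hedged sketch of queueing a transfer on a stream:

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h_pinned, *d_buf;

    // Pinned (page-locked) host memory enables true asynchronous DMA;
    // with pageable memory, cudaMemcpyAsync falls back to a staged,
    // effectively synchronous copy.
    cudaMallocHost(&h_pinned, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));
    for (int i = 0; i < n; i++) h_pinned[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The copy is queued on the stream and returns immediately, so the
    // host can do independent work while the DMA engine runs.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... independent host-side work could overlap here ...

    cudaStreamSynchronize(stream);  // wait for the transfer to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```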
CUDA C++ Matrix Multiplication (MatMul/GEMM)
Articles on coding MatMul/GEMM in CUDA C++:
- Samuel S. Cho, 2011, CUDA Thread Basics, https://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSPARSE, https://docs.nvidia.com/cuda/cusparse/index.html
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, Dec 05, 2017, CUTLASS: Fast Linear Algebra in CUDA C++, https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Rajib Nath, Stanimire Tomov, and Jack Dongarra, November 18, 2010, An Improved Magma Gemm For Fermi Graphics Processing Units, The International Journal of High Performance Computing Applications, Volume 24, Issue 4, https://doi.org/10.1177/1094342010385729 https://journals.sagepub.com/doi/10.1177/1094342010385729
- Fengguang Song, Stanimire Tomov, Jack Dongarra, 2012, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS’12, June 25–29, 2012, San Servolo Island, Venice, Italy. https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf
- Wang Zhiyong, 2022, NVIDIA SGEMM Practice: Step-by-step optimization of CUDA SGEMM, https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE
- Siboehm, 2023, Fast CUDA SGEMM from Scratch, https://github.com/siboehm/SGEMM_CUDA
- Edward Kandrot, 2023, cuda_matmul: Optimized CUDA matmul with benchmarks, https://github.com/ekandrot/cuda_matmul
- Harshit Kumar, June 7, 2024, Matrix Multiplication in CUDA, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
CUDA Debugging Techniques
Articles on CUDA debugging tools and techniques:
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- M. Wu, Y. Ouyang, H. Zhou, L. Zhang, C. Liu and Y. Zhang, "Simulee: Detecting CUDA Synchronization Bugs via Memory-Access Modeling," 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea (South), 2020, pp. 937-948, doi: 10.1145/3377811.3380358. https://ieeexplore.ieee.org/document/9284094 (Simulation tool to detect CUDA bugs by interpreting the LLVM byte code.)
- Pengcheng Li, Chen Ding, Xiaoyu Hu, Tolga Soyata, 2014, LDetector: A Low Overhead Race Detector For GPU Programs, https://wodet.cs.washington.edu/wp-content/uploads/2014/02/wodet2014-final14.pdf
- Mohamed Tarek Ibn Ziad, Sana Damani, Aamer Jaleel, Stephen W. Keckler, and Mark Stephenson. 2023. CuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications. Proc. ACM Program. Lang. 7, PLDI, Article 111 (June 2023), 24 pages. https://doi.org/10.1145/3591225 https://dl.acm.org/doi/abs/10.1145/3591225 PDF: https://dl.acm.org/doi/pdf/10.1145/3591225
- Helder J. F. Luz, Paulo S. L. Souza, Simone R. S. Souza, 9 April 2024, Structural testing for CUDA programming model, https://doi.org/10.1002/cpe.8105 https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.8105
- S. Hong, H. Sun, X. Gao and S. H. Tan, "Investigating and Detecting Silent Bugs in PyTorch Programs," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 272-283, doi: 10.1109/SANER60148.2024.00035. https://ieeexplore.ieee.org/abstract/document/10589839 PDF: https://gaoxiang9430.github.io/papers/saner24a.pdf
- Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol, 1 Sep 2023 (v2), Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow, https://arxiv.org/abs/2112.13314
- M Boyer, K Skadron, W Weimer, 2008, Automated dynamic analysis of CUDA programs, Third Workshop on Software Tools, https://www.nvidia.com/docs/io/67190/stmcs08.pdf
- T Lloyd, K Ali, JN Amaral, 2019, Gpucheck: Detecting cuda thread divergence with static analysis, https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908 PDF: https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908/view/88ccced9-537e-47de-b62f-7eef434e73b6/TR19-01
- Yanan Guo, Zhenkai Zhang, Jun Yang, August 2024, GPU Memory Exploitation for Fun and Profit, Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024 • Philadelphia, PA, USA, 978-1-939133-44-1, https://www.usenix.org/conference/usenixsecurity24/presentation/guo-yanan https://www.usenix.org/system/files/usenixsecurity24-guo-yanan.pdf
- Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. https://dl.acm.org/doi/10.5555/3049832.3049840 https://computermachines.org/joe/publications/pdfs/cgo2017_clarmor.pdf
- Bob Crovella, Sep 14, 2021, CUDA Debugging, https://leimao.github.io/downloads/blog/2022-05-25-Proper-CUDA-Error-Checking/cuda_training_series_cuda_debugging.pdf
- Jackson Marusarz, 2023, CUDA Tutorials I. Profiling and Debugging Applications, https://www.youtube.com/watch?v=dB5Jxwj0PDw
- NVIDIA, Sep 2024, Compute Sanitizer, https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html
- Aurelien Chartier, Steve Ulrich, 2023, Debugging CUDA: An Overview of CUDA Correctness Tools, https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51772/
- NVIDIA, 2024, Debugger API, https://docs.nvidia.com/cuda/debugger-api/index.html
- NVIDIA, 2024, CUDA-GDB, https://docs.nvidia.com/cuda/cuda-gdb/index.html
- Greg Ruetsch, Nov 16, 2017, Pro Tip: Pinpointing Runtime Errors in CUDA Fortran, https://developer.nvidia.com/blog/pinpointing-runtime-errors-cuda-fortran/
- Paul Graham and Mozhgan Kabiri Chimeh, Jun 29, 2023, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/
- Lei Mao, Dec 15, 2023, Proper CUDA Error Checking, https://leimao.github.io/blog/Proper-CUDA-Error-Checking/
- Thomas M. Baumann, Jose Gracia, 3 Oct 2013, Cudagrind: A Valgrind Extension for CUDA, https://arxiv.org/abs/1310.0901 https://github.com/dpc-grindland/Cudagrind (Valgrind Memcheck for CUDA C++, but over 10 years old)
- Andy Adinets, May 20, 2014, CUDA Dynamic Parallelism API and Principles, https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
- CudaText add-ons, 2021 (approx), CudaText plugin: CudaLint, https://github.com/CudaText-addons/cuda_lint
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- Felipe R. Monteiro, Erickson H. da S. Alves, Isabela S. Silva, Hussama I. Ismail, Lucas C. Cordeiro, and Eddie B. de Lima Filho. 2018. ESBMC-GPU A context-bounded model checking tool to verify CUDA programs. Sci. Comput. Program. 152, C (January 2018), 63–69. https://doi.org/10.1016/j.scico.2017.09.005 https://dl.acm.org/doi/10.1016/j.scico.2017.09.005 https://pure.manchester.ac.uk/ws/files/77048933/jscp2017.pdf
- Chao Peng, 2021, Automated testing for GPU kernels, Doctor of Philosophy, Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, https://era.ed.ac.uk/handle/1842/38563 https://era.ed.ac.uk/bitstream/handle/1842/38563/Peng2021.pdf?sequence=1&isAllowed=y
- Bo Jiang, Xiaoyan Wang, W.K. Chan, T.H. Tse, Na Li and Yongfeng Yin, 2020, "CUDAsmith: A Fuzzer for CUDA Compilers," 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 2020, pp. 861-871, doi: 10.1109/COMPSAC48688.2020.0-156. https://ieeexplore.ieee.org/abstract/document/9202798 https://github.com/gongbell/CUDAsmith https://www.cs.hku.hk/data/techreps/document/TR-2020-05.pdf
- Geoff Gerfin and Vyas Venkataraman, Oct 2024, Debugging Experience with CUDA-GDB and CUDA-MEMCHECK, GPU Conference, https://picture.iczhiku.com/resource/paper/WhiSzdKjTtFhhNnm.pdf
- A Hück, T Ziegler, S Schwitanski, J Jenke, C Bischof, Nov 2024, Compiler-Aided Correctness Checking of CUDA-Aware MPI Applications, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a204/555400a204.pdf
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
CUDA Portability
Articles and papers on the portability of CUDA code:
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374 http://kerneltuner.github.io/
- Manuel Costanzo, Enzo Rucci, Carlos García Sánchez, Marcelo Naiouf, Manuel Prieto-Matías, 10 Nov 2023 (v2), Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs, https://arxiv.org/abs/2309.09609
- Ruobing Han, Jun Chen, Bhanu Garg, Xule Zhou, John Lu, Jeffrey Young, Jaewoong Sim, and Hyesoon Kim. 2024. CuPBoP: Making CUDA a Portable Language. ACM Trans. Des. Autom. Electron. Syst. 29, 4, Article 60 (July 2024), 25 pages. https://doi.org/10.1145/3659949 https://dl.acm.org/doi/full/10.1145/3659949
- Victor Anderssen, April 23, 2024, Converting CUDA programs to run on AMD GPUs, Master's Thesis, Department of Information Technologies, Faculty of Science and Engineering, Åbo Akademi University, Vasa, Finland, https://www.doria.fi/bitstream/handle/10024/188907/anderss%C3%A9n_victor.pdf?sequence=2
- Ziran Zhang, Zhiming Wang, Chenwei Sun, and Andy Huang. 2024. Smoothing the Migration from CUDA to SYCL: SYCLomatic Utility Features. In Proceedings of the 12th International Workshop on OpenCL and SYCL (IWOCL '24). Association for Computing Machinery, New York, NY, USA, Article 14, 1–2. https://doi.org/10.1145/3648115.3648132 https://dl.acm.org/doi/abs/10.1145/3648115.3648132
- Mark Harris, Sep 21, 2015, Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5, https://developer.nvidia.com/blog/simple-portable-parallel-c-hemi-2/
- Ben Funk, February 27, 2021, Yes, You Can Run NVIDIA CUDA On Intel GPUs And Libraries For It Have Hit Github, https://hothardware.com/news/cuda-on-intel-gpus-zluda (Article about ZLUDA, an open-source project; its GitHub page says it was discontinued due to legal issues, though more recent articles suggest it has since been revived, so its status is unclear.)
- Thejaswi Rao and Mark Harris, Aug 20, 2019, CUDA Pro Tip: The Fast Way to Query Device Properties, https://developer.nvidia.com/blog/cuda-pro-tip-the-fast-way-to-query-device-properties/
- Mark Harris, Jan 27, 2014, CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES, https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
- Akhil Langer, Seth Howell, Arnav Goel, Pak Markthub, Harry Petty and Fred Oh, Sep 06, 2024, Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/
- Ruobing Han, Jaewon Lee, Jaewoong Sim, Hyesoon Kim, 19 Dec 2021, COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs, https://arxiv.org/abs/2112.10034
- Manuel Costanzo, Enzo Rucci, Carlos Garcia Sanchez, Marcelo Naiouf, Manuel Prieto-Matias, 20 Jun 2022 (v2), Migrating CUDA to oneAPI: A Smith-Waterman Case Study, https://arxiv.org/abs/2203.11100
- Thomas Mejstrik, 9 Aug 2023, __host__ __device__ -- Generic programming in Cuda, Association for Computing Machinery, https://arxiv.org/abs/2309.03912
- Anton Shilov, March 5, 2024, Nvidia bans using translation layers for CUDA software — previously the prohibition was only listed in the online EULA, now included in installed files [Updated], https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers
- NVIDIA, Sep 2024 (accessed), deviceQuery.cpp, https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp (CUDA sample showing how to get device versions and properties in CUDA C++ using APIs such as cuDeviceGetAttribute, cudaGetDeviceCount, cudaGetDeviceProperties, cudaDriverGetVersion, cudaRuntimeGetVersion.)
- Jack Kosaian, Vijay Thakkar, March 2024, CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61198/
- Anastasia Stulova, Jeff Larkin, March 2024, No More Porting: Accelerated Computing With Standard C++, Fortran, and Python, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61204/
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- ENCCS, Dec 2024 (accessed), Preparing code for GPU porting, https://enccs.github.io/gpu-programming/11-gpu-porting/
- Prerit Kapadia, eInfoChips, December 26, 2023, Porting Algorithms on GPU, https://www.einfochips.com/blog/porting-algorithms-on-gpu/
CUDA Compatibility and Versions
Articles about version compatibility and CUDA:
- NVIDIA, 2024, CUDA Compatibility, https://docs.nvidia.com/deploy/cuda-compatibility/index.html https://docs.nvidia.com/deploy/pdf/CUDA_Compatibility.pdf
- Mark Harris, Jan 27, 2014, CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES, https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
- Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda, 2024, Analyzing the impact of CUDA versions on GPU applications, Parallel Computing, Volume 120, 103081, ISSN 0167-8191, https://doi.org/10.1016/j.parco.2024.103081 https://www.sciencedirect.com/science/article/pii/S016781912400019X
- Akhil Langer, Seth Howell, Arnav Goel, Pak Markthub, Harry Petty and Fred Oh, Sep 06, 2024, Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0, https://developer.nvidia.com/blog/enhancing-application-portability-and-compatibility-across-new-platforms-using-nvidia-magnum-io-nvshmem-3-0/
- NVIDIA, Sep 2024 (accessed), deviceQuery.cpp, https://github.com/NVIDIA/cuda-samples/blob/master/Samples/1_Utilities/deviceQuery/deviceQuery.cpp (CUDA sample showing how to get device versions and properties in CUDA C++ using APIs such as cuDeviceGetAttribute, cudaGetDeviceCount, cudaGetDeviceProperties, cudaDriverGetVersion, cudaRuntimeGetVersion.)
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
CUDA Emulator
Can you run CUDA without actually having a real GPU, on just a CPU? This would be useful for educational purposes, and also for programmers working at home on their laptops.
What about a fully functional CUDA emulator? There are rumors that such a beast existed around 2010 (almost 15 years ago), and was even supported by NVIDIA, but it doesn't seem to exist any more.
Here are some articles on CUDA emulation:
- NVIDIA, libcu++: The C++ Standard Library for your entire system, https://github.com/NVIDIA/libcudacxx
- jrhemstad, Oct 5, 2023, Unifying the CUDA C++ Core Libraries: Towards a More Delightful CUDA C++ #520, NVIDIA, https://github.com/NVIDIA/cccl/discussions/520
- btarunr, Apr 2nd, 2010, NVIDIA CUDA Emulator for every PC, https://www.techpowerup.com/119073/nvidia-cuda-emulator-for-every-pc (Old 2010 article.)
- CASL, 2010, GPUOCelot: A dynamic compilation framework for PTX, https://github.com/gtcasl/gpuocelot (No longer maintained; the last commit was about 9 years ago.)
- NVIDIA Forums, 2010, CUDA Emulator, https://forums.developer.nvidia.com/t/cuda-emulator/15480 (This is from 2010.)
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
CUDA Static Analysis
Research on static analysis (code checking or auto-tuning) of CUDA programs:
- S. Lagouvardos, J. Dolby, N. Grech, A. Antoniadis, and Y. Smaragdakis, 2020, “Static analysis of shape in Tensorflow programs,” in 34th European Conference on Object-Oriented Programming (ECOOP 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020. https://drops.dagstuhl.de/entities/document/10.4230/DARTS.6.2.6 PDF: https://drops.dagstuhl.de/storage/05darts/darts-vol006/darts-vol006-issue002_ecoop2020/DARTS.6.2.6/DARTS.6.2.6.pdf
- H. Y. Jhoo, S. Kim, W. Song, K. Park, D. Lee, and K. Yi, “A static analyzer for detecting tensor shape errors in deep neural network training code,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 337 338. https://arxiv.org/abs/2112.09037
- Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró. 2024. Using Run-Time Information to Enhance Static Analysis of Machine Learning Code in Notebooks. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE 2024). Association for Computing Machinery, New York, NY, USA, 497–501. https://doi.org/10.1145/3663529.3663785 https://dl.acm.org/doi/abs/10.1145/3663529.3663785 PDF: https://dl.acm.org/doi/pdf/10.1145/3663529.3663785
- T Lloyd, K Ali, JN Amaral, 2019, Gpucheck: Detecting cuda thread divergence with static analysis, https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908 PDF: https://era.library.ualberta.ca/items/7ab2b28d-b111-448f-8273-2ff219132908/view/88ccced9-537e-47de-b62f-7eef434e73b6/TR19-01
- Mark Lou and Stefan K. Muller. 2024. Automatic Static Analysis-Guided Optimization of CUDA Kernels. In Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '24). Association for Computing Machinery, New York, NY, USA, 11–21. https://doi.org/10.1145/3649169.3649249 https://dl.acm.org/doi/abs/10.1145/3649169.3649249 PDF: https://dl.acm.org/doi/pdf/10.1145/3649169.3649249
- R. Lim, B. Norris and A. Malony, "Autotuning GPU Kernels via Static and Predictive Analysis," 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 523-532, doi: 10.1109/ICPP.2017.61. https://ieeexplore.ieee.org/abstract/document/8025326 https://arxiv.org/abs/1701.08547
- Nimit Singhania 2018, Static Analysis for GPU Program Performance, Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, https://repository.upenn.edu/server/api/core/bitstreams/9d32ece9-5321-4d37-ab72-251e0ecc197e/content (Optimization of uncoalesced memory accesses, block sizes, and cache reuse.)
- CudaText add-ons, 2021 (approx), CudaText plugin: CudaLint, https://github.com/CudaText-addons/cuda_lint
- A Hück, T Ziegler, S Schwitanski, J Jenke, C Bischof, Nov 2024, Compiler-Aided Correctness Checking of CUDA-Aware MPI Applications, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a204/555400a204.pdf
CUDA Programming Bugs
Papers on the types of programming errors that can occur in CUDA kernels:
- Yanan Guo, Zhenkai Zhang, Jun Yang, August 2024, GPU Memory Exploitation for Fun and Profit, Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024 • Philadelphia, PA, USA, 978-1-939133-44-1, https://www.usenix.org/conference/usenixsecurity24/presentation/guo-yanan https://www.usenix.org/system/files/usenixsecurity24-guo-yanan.pdf
- Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. https://dl.acm.org/doi/10.5555/3049832.3049840 https://computermachines.org/joe/publications/pdfs/cgo2017_clarmor.pdf
- Bang Di, Jianhua Sun, and Hao Chen. 2016, A study of overflow vulnerabilities on GPUs. In Network and Parallel Computing: 13th IFIP WG 10.3 International Conference. https://link.springer.com/chapter/10.1007/978-3-319-47099-3_9 https://www.aimlab.org/haochen/papers/npc16-overflow.pdf
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- Justin Luitjens, Sep 04, 2014, CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs, https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
- Mark Harris, Apr 22, 2013, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops, https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- Greg Ruetsch, Nov 16, 2017, Pro Tip: Pinpointing Runtime Errors in CUDA Fortran, https://developer.nvidia.com/blog/pinpointing-runtime-errors-cuda-fortran/
- Paul Graham and Mozhgan Kabiri Chimeh, Jun 29, 2023, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/
- David Spuler, September 23, 2024, Basic CUDA C++ Programming Mistakes, Aussie AI Blog, https://www.aussieai.com/blog/cuda-basic-mistakes
- Thomas Mejstrik, 9 Aug 2023, __host__ __device__ -- Generic programming in Cuda, Association for Computing Machinery, https://arxiv.org/abs/2309.03912
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- Chao Peng, 2021, Automated testing for GPU kernels, Doctor of Philosophy, Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, https://era.ed.ac.uk/handle/1842/38563 https://era.ed.ac.uk/bitstream/handle/1842/38563/Peng2021.pdf?sequence=1&isAllowed=y
CUDA Floating-Point Errors
Research on floating-point errors and numerical issues in GPU programs:
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
- Xinyi Li, Ignacio Laguna, Katarzyna Swirydowicz, Bo Fang, Ang Li, and Ganesh Gopalakrishnan. Design and evaluation of GPU-FPX: A low-overhead tool for floating-point exception detection in NVIDIA GPUs. In ACM HPDC 2023, 2023. doi:10.11578/dc.20230713.4. https://dl.acm.org/doi/pdf/10.1145/3588195.3592991
- James Demmel, Jack J. Dongarra, Mark Gates, Greg Henry, Julien Langou, Xiaoye S. Li, Piotr Luszczek, Weslley S. Pereira, E. Jason Riedy, and Cindy Rubio-González. Proposed consistent exception handling for the BLAS and LAPACK. In Correctness@SC, pages 1–9. IEEE, 2022. doi:10.1109/Correctness56720.2022.00006. https://netlib.org/utk/people/JackDongarra/PAPERS/Proposed_Consistent_Exception_Handling_for_the_BLAS_and_LAPACK.pdf https://arxiv.org/abs/2207.09281
- N. Toronto and J. McCarthy, "Practically Accurate Floating-Point Math," in Computing in Science & Engineering, vol. 16, no. 4, pp. 80-95, July-Aug. 2014, doi: 10.1109/MCSE.2014.90. https://ieeexplore.ieee.org/document/6879754 https://www.cs.umd.edu/~ntoronto/papers/toronto-2014cise-floating-point.pdf
- Peter Dinda, Alex Bernat, and Conor Hetland. Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications. In HPDC, pages 5–16. ACM, 2020. doi:10.1145/3369583.3392673. http://pdinda.org/Papers/hpdc20.pdf
CUDA Floating-Point Runtime Error Checkers
Research papers on tools that detect floating-point errors and exceptions at runtime:
- Lawrence Livermore National Laboratory, Oct 024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Xinyi Li, Ignacio Laguna, Katarzyna Swirydowicz, Bo Fang, Ang Li, and Ganesh Gopalakrishnan. Design and evaluation of GPU-FPX: A low-overhead tool for floating-point exception detection in NVIDIA GPUs. In ACM HPDC 2023, 2023. doi:10.11578/dc.20230713.4. https://dl.acm.org/doi/pdf/10.1145/3588195.3592991
- Peter Dinda, Alex Bernat, and Conor Hetland. Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications. In HPDC, pages 5–16. ACM, 2020. doi:10.1145/3369583.3392673. http://pdinda.org/Papers/hpdc20.pdf
CUDA Auto-Tuning Research
Research on CUDA "auto-tuning," i.e., automatically searching over launch configurations and code variants to make a CUDA kernel run faster:
- Stefan K. Muller and Jan Hoffmann. 2024. Modeling and Analyzing Evaluation Cost of CUDA Kernels. ACM Trans. Parallel Comput. 11, 1, Article 5 (March 2024), 53 pages. https://doi.org/10.1145/3639403 https://dl.acm.org/doi/full/10.1145/3639403 PDF: https://dl.acm.org/doi/pdf/10.1145/3639403
- Mark Lou and Stefan K. Muller. 2024. Automatic Static Analysis-Guided Optimization of CUDA Kernels. In Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '24). Association for Computing Machinery, New York, NY, USA, 11–21. https://doi.org/10.1145/3649169.3649249 https://dl.acm.org/doi/abs/10.1145/3649169.3649249 PDF: https://dl.acm.org/doi/pdf/10.1145/3649169.3649249
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Tianyi David Han and Tarek S. Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4). Association for Computing Machinery, New York, NY, USA, Article 3, 1–8. https://doi.org/10.1145/1964179.1964184 https://dl.acm.org/doi/10.1145/1964179.1964184
- R. Lim, B. Norris and A. Malony, "Autotuning GPU Kernels via Static and Predictive Analysis," 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 523-532, doi: 10.1109/ICPP.2017.61. https://ieeexplore.ieee.org/abstract/document/8025326 https://arxiv.org/abs/1701.08547
- Nimit Singhania 2018, Static Analysis for GPU Program Performance, Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, https://repository.upenn.edu/server/api/core/bitstreams/9d32ece9-5321-4d37-ab72-251e0ecc197e/content (Optimization of uncoalesced memory accesses, block sizes, and cache reuse.)
- R. L. Castro, D. Andrade and B. B. Fraguela, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning," in IEEE Access, vol. 12, pp. 70581-70599, 2024, doi: 10.1109/ACCESS.2024.3402326. https://ieeexplore.ieee.org/abstract/document/10534045 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10534045
- Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan. 2024. Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS '24). Association for Computing Machinery, New York, NY, USA, 549–561. https://doi.org/10.1145/3650200.3656626 https://dl.acm.org/doi/abs/10.1145/3650200.3656626 PDF: https://dl.acm.org/doi/pdf/10.1145/3650200.3656626
- Ben Spencer, 2011, A General Auto-Tuning Framework for Software Performance Optimisation, Third Year Project Report, University of Oxford http://mistymountain.co.uk/flamingo/report/autotuning-2011-05-30.pdf http://mistymountain.co.uk/flamingo/
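The basic idea behind much of this auto-tuning work can be sketched as an empirical search over launch configurations. The following is a minimal illustration (the kernel `myKernel` is a hypothetical placeholder, not from any of the papers above): time the same kernel at several block sizes using CUDA events and keep the fastest.

```cpp
// Minimal auto-tuning sketch: time a hypothetical kernel "myKernel"
// at several block sizes and keep the fastest configuration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x, int n) {   // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    int candidates[] = {64, 128, 256, 512, 1024};
    float bestMs = 1e30f; int bestBlock = 0;
    for (int b : candidates) {
        int grid = (n + b - 1) / b;           // cover all n elements
        cudaEventRecord(start);
        myKernel<<<grid, b>>>(d_x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestBlock = b; }
    }
    printf("Best block size: %d (%f ms)\n", bestBlock, bestMs);
    cudaFree(d_x);
    return 0;
}
```

Real auto-tuners search a much larger space (tile sizes, unroll factors, code variants) and often use static analysis or models to prune it, as in the papers above.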
CUDA Tools
Research papers on the theory of some of the CUDA C++ tools and extensions:
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker: Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Mohamed Tarek Ibn Ziad, Sana Damani, Aamer Jaleel, Stephen W. Keckler, Mark Stephenson, June 19, 2023, cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Proceedings of the ACM on Programming Languages, Volume 7, Issue PLDI, Article No.: 111, Pages 124 - 147, https://doi.org/10.1145/3591225 https://dl.acm.org/doi/10.1145/3591225 https://research.nvidia.com/publication/2023-06_cucatch-debugging-tool-efficiently-catching-memory-safety-violations-cuda https://dl.acm.org/doi/pdf/10.1145/3591225
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- David Spuler, Oct 2024, CUDA C++ Debugging: Safer GPU Kernel Programming, https://www.amazon.com/dp/B0DJJVDJBW/
- Felipe R. Monteiro, Erickson H. da S. Alves, Isabela S. Silva, Hussama I. Ismail, Lucas C. Cordeiro, and Eddie B. de Lima Filho. 2018. ESBMC-GPU: A context-bounded model checking tool to verify CUDA programs. Sci. Comput. Program. 152, C (January 2018), 63–69. https://doi.org/10.1016/j.scico.2017.09.005 https://dl.acm.org/doi/10.1016/j.scico.2017.09.005 https://pure.manchester.ac.uk/ws/files/77048933/jscp2017.pdf
- Joshi, S., Muduganti, G. (2021). GPURepair: Automated Repair of GPU Kernels. In: Henglein, F., Shoham, S., Vizel, Y. (eds) Verification, Model Checking, and Abstract Interpretation. VMCAI 2021. Lecture Notes in Computer Science(), vol 12597. Springer, Cham. https://doi.org/10.1007/978-3-030-67067-2_18 https://link.springer.com/chapter/10.1007/978-3-030-67067-2_18 https://arxiv.org/pdf/2011.08373
- Bo Jiang, Xiaoyan Wang, W.K. Chan, T.H. Tse, Na Li and Yongfeng Yin, 2020, "CUDAsmith: A Fuzzer for CUDA Compilers," 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 2020, pp. 861-871, doi: 10.1109/COMPSAC48688.2020.0-156. https://ieeexplore.ieee.org/abstract/document/9202798 https://github.com/gongbell/CUDAsmith https://www.cs.hku.hk/data/techreps/document/TR-2020-05.pdf
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
Warp Divergence in CUDA Kernels
Warp divergence, also called "thread divergence" or "branch divergence," occurs when some threads in a 32-thread warp take different control-flow paths (i.e., at if statements or loop conditions). Because the divergent paths execute serially, divergence slows GPU execution, so avoiding divergent control flow in CUDA kernels is an important optimization technique.
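As a small sketch of the idea (the kernel names are illustrative, not from any particular source): the first version below diverges at the inner if statement, whereas the second computes the same result branchlessly, so every thread in the warp executes the same instructions.

```cpp
// Divergent version: threads in the same warp take different paths
// at the inner if/else, so the two branches execute serially.
__global__ void scaleReluDivergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;          // some lanes take this path...
        else
            x[i] = x[i] * 2.0f;   // ...others take this one
    }
}

// Branchless version: the same result via fmaxf, with no divergence.
// (fmaxf(x,0) is 0 for negative x, so multiplying by 2 matches above.)
__global__ void scaleReluBranchless(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = fmaxf(x[i], 0.0f) * 2.0f;
}
```

Note that the outer bounds check (`i < n`) is generally harmless: only the last warp can diverge on it.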
Articles and research on thread divergence:
- Elmar Westphal, Aug 06, 2015, Voting and Shuffling to Optimize Atomic Operations, https://developer.nvidia.com/blog/voting-and-shuffling-optimize-atomic-operations/
- HDL Wizard, January 17, 2024, An In-Depth Look at Thread Divergence in GPU Architecture, https://hdlwizard.com/an-in-depth-look-at-thread-divergence-in-gpu-architecture/
- Imen Chakroun, Mohand Mezmaz, Nouredine Melab, Ahcène Bendjoudi. 2013, Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm. Concurrency and Computation: Practice and Experience, 2013, 25 (8), pp.1121-1136. doi:10.1002/cpe.2931. hal-00731859 PDF: https://inria.hal.science/hal-00731859/document
- Charitha Saumya, Kirshanthan Sundararajah, Milind Kulkarni, 14 Jan 2022 (v3), DARM: Control-Flow Melding for SIMT Thread Divergence Reduction -- Extended Version, https://arxiv.org/abs/2107.05681
- Chakroun, I., Bendjoudi, A., Melab, N. (2012). Reducing Thread Divergence in GPU-Based B&B Applied to the Flow-Shop Problem. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2011. Lecture Notes in Computer Science, vol 7203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31464-3_57 https://link.springer.com/chapter/10.1007/978-3-642-31464-3_57
- Huanxin Lin, Cho-Li Wang, and Hongyuan Liu. 2018. On-GPU Thread-Data Remapping for Branch Divergence Reduction. ACM Trans. Archit. Code Optim. 15, 3, Article 39 (September 2018), 24 pages. https://doi.org/10.1145/3242089 https://dl.acm.org/doi/10.1145/3242089 PDF: https://dl.acm.org/doi/pdf/10.1145/3242089
- Vespa, L., Peters, G. (2021). Contrived and Remediated GPU Thread Divergence Using a Flattening Technique. In: Arabnia, H.R., et al. Advances in Parallel & Distributed Processing, and Applications. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-69984-0_46 https://link.springer.com/chapter/10.1007/978-3-030-69984-0_46
- Bialas, P., Strzelecki, A. (2016). Benchmarking the Cost of Thread Divergence in CUDA. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2015. Lecture Notes in Computer Science(), vol 9573. Springer, Cham. https://doi.org/10.1007/978-3-319-32149-3_53 https://link.springer.com/chapter/10.1007/978-3-319-32149-3_53 https://arxiv.org/abs/1504.01650 PDF: https://arxiv.org/pdf/1504.01650
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- Christopher Cooper, August, 2011, GPU Computing with CUDA: Lecture 4 - Optimizations, Boston University https://www.bu.edu/pasi/files/2011/07/Lecture4.pdf
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- Christian Mills, September 11, 2024, GPU MODE Lecture 8: CUDA Performance Checklist, https://christianjmills.com/posts/cuda-mode-notes/lecture-008/
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
Warp Shuffle
Warp shuffle is a set of CUDA GPU primitives that allow the 32 threads in a warp to exchange register values directly. The shuffle intrinsics are faster than shared memory ("__shared__"), but are limited to 32 threads (i.e., one warp only), whereas shared memory is visible to the threads of every warp in a block.
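The classic use case is a warp-level reduction. The sketch below (a common idiom, e.g. in the NVIDIA warp-primitives blog post listed under this heading) sums a value across a warp using `__shfl_down_sync`:

```cpp
// Warp-level sum reduction: each __shfl_down_sync halves the span
// between cooperating lanes (16, 8, 4, 2, 1) until lane 0 holds
// the sum of all 32 lanes' values.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // full warp sum is valid in lane 0
}
```

The mask `0xffffffff` asserts that all 32 lanes participate; no `__syncthreads()` or shared memory is needed because the exchange happens entirely within the warp.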
- Elmar Westphal, Aug 06, 2015, Voting and Shuffling to Optimize Atomic Operations, https://developer.nvidia.com/blog/voting-and-shuffling-optimize-atomic-operations/
- Mark Harris, Feb 03, 2014, CUDA Pro Tip: Do The Kepler Shuffle, https://developer.nvidia.com/blog/cuda-pro-tip-kepler-shuffle/
- Yuan Lin and Vinod Grover, Jan 15, 2018, Using CUDA Warp-Level Primitives, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
- Prof. Mike Giles, 2024, Lecture 4: warp shuffles, and reduction / scan operations, Lecture 4– p. 1/38, Oxford University Mathematical Institute, https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
Memory Address Alignment
Aligned memory accesses are faster in CUDA than accesses to non-aligned addresses.
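Pointers from `cudaMalloc` are already well aligned, but struct layout still matters: if a type's size and alignment let each thread's access map to a single wide transaction, loads and stores are more efficient. A small sketch (the struct name is illustrative):

```cpp
// Force 16-byte alignment so each element can be moved with a single
// 128-bit load/store, rather than multiple smaller transactions.
struct __align__(16) Float4Like {
    float x, y, z, w;
};

__global__ void copyAligned(const Float4Like *in, Float4Like *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // one 16-byte transaction per thread
}
```

Without the `__align__(16)` specifier the compiler may fall back to several narrower accesses per element.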
Articles and papers on memory alignment in CUDA:
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Lei Mao, Oct 18, 2022, CUDA Data Alignment, https://leimao.github.io/blog/CUDA-Data-Alignment/
Shared Memory Optimizations in CUDA
Articles on CUDA optimizations using shared memory:
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, June 22, 2022, CUDA Shared Memory Bank, https://leimao.github.io/blog/CUDA-Shared-Memory-Bank/
- Dhanush, Aug 23, 2024, Optimizing Vector Dot Product in CUDA: Exploring Shared Memory and Reduction Techniques, https://github.com/Dhanush295/Cuda_program/blob/main/Vector_dot_product.cu
- Mark Harris, Jan 28, 2013, Using Shared Memory in CUDA C/C++, https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/shared-memory/shared-memory.cu
- Rustam, Apr 5, 2024, CUDA: Shared memory, https://medium.com/@fatlip/cuda-shared-memory-23cd1a0d4e39
- K. Cooper, 2020, Timing and Tricks, Department of Mathematics, Washington State University, https://www.math.wsu.edu/math/kcooper/CUDA/c07Timing.pdf
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Ellery Russell, Jiqun Tu, March 2024, Accelerating Drug Discovery: Optimizing Dynamic GPU Workflows with CUDA Graphs, Mapped Memory, C++ Coroutines, and More, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s61156/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
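To make the topic concrete, here is a minimal sketch of the standard shared-memory reduction pattern that several of the articles above walk through (kernel and variable names are illustrative):

```cpp
// Block-level sum reduction staged through __shared__ memory.
// Assumes blockDim.x == 256 and a power-of-two block size.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];               // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;       // stage input into shared
    __syncthreads();                          // tile fully populated
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();                      // level of tree complete
    }
    if (tid == 0) out[blockIdx.x] = tile[0];  // per-block partial sum
}
```

Each global-memory element is read once; all the repeated accesses during the reduction tree hit fast on-chip shared memory instead.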
Memory Address Coalescing in CUDA
Coalesced memory accesses refer to each thread in a warp accessing adjacent memory locations (in global memory). This is much faster in CUDA than having different threads accessing non-adjacent locations.
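The contrast can be sketched in two tiny kernels (names are illustrative). In the first, consecutive threads read consecutive addresses, so a warp's 32 loads combine into a few wide memory transactions; in the second, consecutive threads read addresses far apart, so each load can land in a different memory segment:

```cpp
// Coalesced: thread i reads element i, so a warp reads a contiguous
// 128-byte span of global memory in one or two transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: consecutive threads read addresses 32 floats apart
// (a scatter pattern purely for illustration), defeating coalescing.
__global__ void copyScattered(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];
}
```

Both kernels move the same amount of data, but the second issues many more memory transactions per warp.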
Articles on memory coalescing in CUDA:
- Christopher Cooper, August, 2011, GPU Computing with CUDA: Lecture 4 - Optimizations, Boston University https://www.bu.edu/pasi/files/2011/07/Lecture4.pdf
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Samuel Midkiff, 2019, Lecture 6.2 – Performance Considerations: Memory Coalescing in CUDA, Illinois University, NVIDIA GPU Teaching Kit, https://engineering.purdue.edu/~smidkiff/ece563/NVidiaGPUTeachingToolkit/Mod6/Lecture-6-2-memory-coalescing.pdf
- Chunwei Yan, Feb 25, 2024, Memory coalescing in CUDA (1) – VecAdd, https://superjomn.github.io/posts/cuda-memory-coalescing-access/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Lei Mao, Oct 18, 2022, CUDA Data Alignment, https://leimao.github.io/blog/CUDA-Data-Alignment/
- Nimit Singhania 2018, Static Analysis for GPU Program Performance, Ph.D. Thesis, Computer and Information Science, University of Pennsylvania, https://repository.upenn.edu/server/api/core/bitstreams/9d32ece9-5321-4d37-ab72-251e0ecc197e/content (Optimization of uncoalesced memory accesses, block sizes, and cache reuse.)
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- Christian Mills, September 11, 2024, GPU MODE Lecture 8: CUDA Performance Checklist, https://christianjmills.com/posts/cuda-mode-notes/lecture-008/
- Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu, 8 Oct 2024, Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing, https://arxiv.org/abs/2410.05686
Array Stride Optimizations in CUDA
Strided array access, particularly the grid-stride loop, is an important access pattern for CUDA kernels. One of its advantages is achieving coalesced memory accesses.
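The grid-stride loop idiom (described in the Harris "Grid-Stride Loops" post listed below) looks like this; the kernel name is illustrative:

```cpp
// Grid-stride loop: each thread starts at its global index and then
// advances by the total number of threads in the grid. Any grid size
// covers any n, and consecutive threads still touch consecutive
// elements on every iteration, so accesses stay coalesced.
__global__ void scale(float *x, float a, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}
```

This decouples the launch configuration from the data size: the same kernel works whether it is launched with one block or thousands.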
Articles and papers on array stride optimizations:
- Mark Harris, Jan 07, 2013, How to Access Global Memory Efficiently in CUDA C/C++ Kernels, https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
- Chunwei Yan, Feb 25, 2024, Memory coalescing in CUDA (1) – VecAdd, https://superjomn.github.io/posts/cuda-memory-coalescing-access/
- Mark Harris, Apr 22, 2013, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops, https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- Mark Harris, Optimizing Parallel Reduction in CUDA, https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- Athena Elafrou, Guillaume Thomas Collignon, March 18th 2024, Introduction to CUDA Performance Optimization, NVIDIA DevTech Compute GPU Technology Conference, https://developer-blogs.nvidia.com/wp-content/uploads/2024/08/CUDA-Programming-and-Optimization.pdf
- David Spuler, Oct 2024, CUDA C++ Optimization: Coding Faster GPU Kernels, https://www.amazon.com/dp/B0DK21QQYD/
CUDA Kernel Fusion
Kernel fusion is the merging of two kernels into one. Typically, if two kernels run sequentially over the same data, with the output of the first feeding into the second, it can be more efficient to combine them into a single kernel, saving a kernel launch and avoiding a round trip of intermediate results through global memory.
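A minimal sketch of the idea (kernel names are illustrative): the unfused pair writes the intermediate array to global memory and reads it back, while the fused version keeps it in a register.

```cpp
// Unfused: the intermediate result y makes a round trip through
// global memory between two separate kernel launches.
__global__ void addKernel(const float *a, const float *b, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a[i] + b[i];
}
__global__ void squareKernel(const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] * y[i];
}

// Fused: one launch; the intermediate value never leaves a register,
// and no temporary array y is needed at all.
__global__ void addSquareFused(const float *a, const float *b,
                               float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float y = a[i] + b[i];
        z[i] = y * y;
    }
}
```

For memory-bound element-wise operations like these, fusion roughly halves the global-memory traffic as well as the launch overhead.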
Papers on kernel fusion in CUDA include:
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- J. Filipovič, M. Madzin, J. Fousek, L. Matyska, 16 Jul 2013 (v2), Optimizing CUDA Code By Kernel Fusion---Application on BLAS, https://arxiv.org/abs/1305.1183 (An early paper from 2013 on kernel fusion in CUDA.)
- DMLC, 2016, NNVM-Fusion: Implement GPU Kernel Fusion and Runtime Compilation Based on NNVM, https://github.com/dmlc/nnvm-fusion
- Ao Li, Bojian Zheng, Gennady Pekhimenko, Fan Long, 2 Jul 2020, Automatic Horizontal Fusion for GPU Kernels, https://arxiv.org/abs/2007.01277
- Nourazar, M., Booth, B.G. & Goossens, B. A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications. J Real-Time Image Proc 21, 5 (2024). https://doi.org/10.1007/s11554-023-01384-7 https://link.springer.com/article/10.1007/s11554-023-01384-7 PDF: https://backoffice.biblio.ugent.be/download/01HEJC9WDA5DGR7X1X1XR6XV48/01HFV8WMEJQFKT7F347ADBCEGB
- W. Sun, A. Li, S. Stuijk and H. Corporaal, "How Much Can We Gain From Tensor Kernel Fusion on GPUs?," in IEEE Access, vol. 12, pp. 126135-126144, 2024, doi: 10.1109/ACCESS.2024.3411473. https://ieeexplore.ieee.org/abstract/document/10551817 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10551817
CUDA Security Issues
Research on CUDA security issues, such as buffer overflow exploits:
- Yanan Guo, Zhenkai Zhang, Jun Yang, August 2024, GPU Memory Exploitation for Fun and Profit, Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024 • Philadelphia, PA, USA, 978-1-939133-44-1, https://www.usenix.org/conference/usenixsecurity24/presentation/guo-yanan https://www.usenix.org/system/files/usenixsecurity24-guo-yanan.pdf
- Christopher Erb, Mike Collins, and Joseph L. Greathouse. Dynamic buffer overflow detection for GPGPUs. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017. https://dl.acm.org/doi/10.5555/3049832.3049840 https://computermachines.org/joe/publications/pdfs/cgo2017_clarmor.pdf
- Bang Di, Jianhua Sun, and Hao Chen. 2016, A study of overflow vulnerabilities on GPUs. In Network and Parallel Computing: 13th IFIP WG 10.3 International Conference. https://link.springer.com/chapter/10.1007/978-3-319-47099-3_9 https://www.aimlab.org/haochen/papers/npc16-overflow.pdf
- Andrea Miele. Buffer overflow vulnerabilities in CUDA: a preliminary analysis. Journal of Computer Virology and Hacking Techniques, 12:113–120, 2016. https://link.springer.com/article/10.1007/s11416-015-0251-1 https://arxiv.org/pdf/1506.08546
- Sang-Ok Park, Ohmin Kwon, Yonggon Kim, Sang Kil Cha, and Hyunsoo Yoon. Mind control attack: Under mining deep learning with GPU memory exploitation. Computers & Security, 102:102115, 2021. https://dl.acm.org/doi/10.1016/j.cose.2020.102115
- Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
General Research on CUDA
Articles and papers on CUDA programming:
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- M. Wu, Y. Ouyang, H. Zhou, L. Zhang, C. Liu and Y. Zhang, "Simulee: Detecting CUDA Synchronization Bugs via Memory-Access Modeling," 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea (South), 2020, pp. 937-948, doi: 10.1145/3377811.3380358. https://ieeexplore.ieee.org/document/9284094 (Simulation tool to detect CUDA bugs by interpreting the LLVM byte code.)
- Pengcheng Li, Chen Ding, Xiaoyu Hu, Tolga Soyata, 2014, LDetector: A Low Overhead Race Detector For GPU Programs, https://wodet.cs.washington.edu/wp-content/uploads/2014/02/wodet2014-final14.pdf
- Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, Dongrui Fan, 16 Aug 2024, Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels, https://arxiv.org/abs/2408.08490 (Improving CUDA kernel performance by reducing small memory-bound kernels.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374 http://kerneltuner.github.io/
- NVIDIA, Sep 2024, NVIDIA CUDA Compiler Driver NVCC, https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/ https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf
- Jonathan Goldberg, March 17, 2024, Not just the hardware: How deep is Nvidia's software moat? The inherent inertia of software ecosystems, https://www.techspot.com/news/102294-beyond-gpu-how-deep-nvidia-software-moat.html
- Zen Analyst, May 31, 2023, Nvidia's CUDA: Unleashing The Power Of Parallel Computing For AI Dominance, https://seekingalpha.com/article/4608475-nvidias-cuda-unleashing-the-power-of-parallel-computing-for-ai-dominance (CUDA has 4 million developers, 3,000+ applications, and 40 million CUDA downloads.)
- Will Ramey, August 19, 2020, 2 Million Registered Developers, Countless Breakthroughs. NVIDIA developer program doubles number of members in less than two years, https://blogs.nvidia.com/blog/2-million-registered-developers-breakthroughs/
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
CUDA C++ Optimization Book
The new CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization
Memory Safety C++ Blog Articles
- DIY Preventive C++ Memory Safety
- Canary Values & Redzones for Memory-Safe C++
- Use-After-Free Memory Errors in C++
- Array Bounds Violations and Memory Safe C++
- Poisoning Memory Blocks for Safer C++
- Uninitialized Memory Safety in C++
- DIY Memory Safety in C++
- CUDA C++ Floating Point Exceptions
- Memory Safe C++ Library Functions
- Smart Stack Buffers for Memory Safe C++
- Safe C++ Text Buffers with snprintf
CUDA C++ Debugging Book
The new CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: