Aussie AI Blog
CUDA C++ Floating-Point Exceptions
October 30th, 2024
by David Spuler, Ph.D.
One of the problems with CUDA C++ kernels is that abnormalities in floating-point computations do not raise an exception. On the host CPU, floating-point errors will often raise the SIGFPE signal, which may be fatal, or sometimes can be trapped and handled. However, the GPU's silence about similar problems is both a blessing and a curse. It's good that there's no failure, because the GPU kernel can continue, but it's also a problem because the programmer is never alerted to there being a failure.
Types of Floating-Point Exceptions
Some of the types of floating-point arithmetic exceptions include:
- Division-by-zero
- Floating-point overflow
- Floating-point underflow
- Conversion overflow (large positive floating-point to integer)
- Conversion underflow (large negative floating-point to integer)
Not all of these operations will produce an exception. Some other conditions are not necessarily failures, but may be something the application programmer needs to know about:
- Not-a-Number (NaN) floating-point values
- Negative zero value
- Denormalized numbers (very small near-zero values)
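As a rough illustration, the special values in this list can be identified on the host with the standard `<cmath>` classification functions. This is only a sketch: the helper name `describe_float` is made up for this example, and it assumes the default IEEE 754 behavior (no fast-math or flush-to-zero compiler options).

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: name the special category of a float, if any.
std::string describe_float(float x) {
    switch (std::fpclassify(x)) {
        case FP_NAN:       return "nan";
        case FP_INFINITE:  return std::signbit(x) ? "-inf" : "+inf";
        case FP_SUBNORMAL: return "denormal";
        case FP_ZERO:      return std::signbit(x) ? "-0" : "+0";  // sign bit separates -0.0 from +0.0
        default:           return "normal";
    }
}
```

Note that negative zero compares equal to positive zero with `==`, which is why the sketch tests the sign bit rather than comparing values.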
Arithmetic exceptions are not limited to floating-point types, of course. Integer overflow and underflow tend not to trigger exceptions in C++ in either CPU or GPU code. The common exception is integer division by zero, or equivalently modulo-by-zero with the % integer remainder operator.
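Since integer problems are mostly silent, one defensive option is to test the operands before doing the arithmetic. The following is a minimal sketch, not a complete checked-arithmetic library, and the helper names `checked_div` and `checked_add` are invented for this example.

```cpp
#include <climits>

// Hypothetical helper: refuses a division that would be undefined behavior
// (zero divisor, or the INT_MIN / -1 overflow case).
bool checked_div(int num, int den, int& quotient) {
    if (den == 0) return false;
    if (num == INT_MIN && den == -1) return false;
    quotient = num / den;
    return true;
}

// Hypothetical helper: signed addition that reports overflow instead of
// silently invoking undefined behavior.
bool checked_add(int a, int b, int& sum) {
    if (b > 0 && a > INT_MAX - b) return false;  // would exceed INT_MAX
    if (b < 0 && a < INT_MIN - b) return false;  // would go below INT_MIN
    sum = a + b;
    return true;
}
```

The same pre-check applies to the % operator, because remainder-by-zero fails in the same way as division-by-zero.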
Detecting Kernel Floating-Point Exceptions
Generally, floating-point exceptions cannot be detected, trapped or handled in GPU device code. A different approach is needed to detect GPU arithmetic errors. There are two main approaches:
(a) programmatically scanning your data structures for values indicating a failure (e.g., NaN), or
(b) using a runtime instrumentation floating-point error checker tool.
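Approach (a) can be as simple as a loop over the results. Here is a minimal host-side sketch, assuming the kernel's output has already been copied back into a host buffer (e.g., with cudaMemcpy). The names `FpScanResult` and `scan_for_fp_errors` are hypothetical, not part of any CUDA API.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical result type: counts of suspicious values found in a buffer.
struct FpScanResult {
    std::size_t nan_count = 0;
    std::size_t inf_count = 0;
    bool clean() const { return nan_count == 0 && inf_count == 0; }
};

// Scan a host buffer for NaN and infinity values.
FpScanResult scan_for_fp_errors(const float* data, std::size_t n) {
    FpScanResult r;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i])) ++r.nan_count;
        else if (std::isinf(data[i])) ++r.inf_count;
    }
    return r;
}
```

A scan like this only tells you that something went wrong somewhere upstream, not which operation caused it, so it is best placed after each major kernel while debugging.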
There are several tools available in the literature to detect floating-point errors (see references list below), but there does not appear to be an officially supported commercial tool from NVIDIA. Note that compute-sanitizer does not usually catch these types of floating-point errors.
Trapping Floating-Point Exceptions in Host Code
In host code, some floating-point exceptions may trigger the SIGFPE signal, which can be intercepted and handled to a limited extent. Certain floating-point library functions may instead set the errno variable to an error code, which the caller can inspect after the call.
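Here is a small sketch of checking a math library call on the host. It tests both errno and the standard floating-point status flags from `<cfenv>`, because which of the two mechanisms is used is platform-dependent. The helper name `log_checked` is made up for this example, and the code assumes default compiler settings (no fast-math).

```cpp
#include <cerrno>
#include <cfenv>
#include <cmath>

// Hypothetical helper: computes log(x) and returns false if the library
// reported a problem through errno or the floating-point status flags.
bool log_checked(double x, double& result) {
    errno = 0;                               // clear any stale error code
    std::feclearexcept(FE_ALL_EXCEPT);       // clear stale status flags
    volatile double input = x;               // discourage compile-time folding
    result = std::log(input);
    bool errno_set = (errno == EDOM || errno == ERANGE);
    bool flag_set =
        std::fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW) != 0;
    return !(errno_set || flag_set);
}
```

For example, `log_checked(-1.0, r)` should report a failure (domain error), whereas `log_checked(1.0, r)` should succeed with a result of zero.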
Recovery from these floating-point errors is not always possible, but at least you are alerted to the bug. For example, you might attempt to return from a signal handler. Here's an example:
#include <signal.h>
#include <stdio.h>

void sigfpe_handler(int sig)
{
    printf("Trapped SIGFPE\n");
    return;  // Try to continue
}

void init_signal()  // Called earlier
{
    signal(SIGFPE, sigfpe_handler);  // Register SIGFPE handler
}
Note that in theory you're not supposed to do output operations such as a printf call in a signal handler. The printf function is not async-signal-safe, so this is undefined behavior according to the official standards. Nevertheless, I've done it successfully many times on various UNIX platforms, but maybe I'm just lucky.
However, many causes of SIGFPE cannot be easily recovered from via signal handlers, and will re-raise the same signal. The above code might work on some platforms, but on others it may cause an infinite spin as it re-raises the signal and handles it again.
Another oddity on some platforms is that calling a signal handler de-registers the handler. Hence, you might need to re-register the handler itself inside the handler:
void sigfpe_handler(int sig)
{
    printf("Trapped SIGFPE\n");
    signal(SIGFPE, sigfpe_handler);  // Re-register SIGFPE handler
    // Try to continue
}
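On POSIX platforms, both the printf concern and the de-registration oddity can be sidestepped by using sigaction and write instead. The following is a sketch under that assumption (it will not compile on Windows), and the names `g_sigfpe_count`, `sigfpe_counter`, and `install_sigfpe_counter` are invented for this example.

```cpp
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t g_sigfpe_count = 0;  // hypothetical counter

extern "C" void sigfpe_counter(int) {
    // write() is async-signal-safe, unlike printf()
    static const char msg[] = "Trapped SIGFPE\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    g_sigfpe_count = g_sigfpe_count + 1;
}

bool install_sigfpe_counter() {
    struct sigaction sa = {};
    sa.sa_handler = sigfpe_counter;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  // no SA_RESETHAND, so the handler stays registered
    return sigaction(SIGFPE, &sa, nullptr) == 0;
}
```

This does not fix the deeper problem: returning from the handler after a hardware-generated SIGFPE may still re-raise the signal. It behaves predictably when the signal is sent artificially, such as with `raise(SIGFPE)` in a test.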
You can also try to suppress the signal with SIG_IGN early in your application code, although this is not guaranteed to work on all platforms:
#include <signal.h>

signal(SIGFPE, SIG_IGN);  // Try to suppress it (call from main or an init function)
Another attempt at recovery inside a signal handler would be to throw a C++ exception, which is certainly undefined behavior. It might actually be more likely to work using the older C-style setjmp/longjmp functions, although this is also not guaranteed to work.
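As a sketch of the jump-based idea, here is a version using the POSIX variants sigsetjmp/siglongjmp, which I've swapped in for plain setjmp/longjmp because they also restore the signal mask after jumping out of the handler. Treat it as an experiment rather than a safe pattern: it assumes a POSIX platform, the names `g_fpe_jump`, `g_fpe_armed`, `fpe_jump_handler`, and `run_guarded` are hypothetical, and jumping over C++ objects with non-trivial destructors is undefined behavior.

```cpp
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf g_fpe_jump;                     // hypothetical jump target
static volatile sig_atomic_t g_fpe_armed = 0;     // is the jump target valid?

extern "C" void fpe_jump_handler(int) {
    if (g_fpe_armed) {
        g_fpe_armed = 0;
        siglongjmp(g_fpe_jump, 1);                // abandon the failed computation
    }
    signal(SIGFPE, SIG_DFL);                      // not armed: fall back to default
    raise(SIGFPE);
}

// Runs fn(); returns true if it completed, false if SIGFPE interrupted it.
bool run_guarded(void (*fn)()) {
    struct sigaction sa = {}, old_sa = {};
    sa.sa_handler = fpe_jump_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, &old_sa);

    volatile bool completed = false;
    if (sigsetjmp(g_fpe_jump, 1) == 0) {          // 1 = save the signal mask
        g_fpe_armed = 1;
        fn();
        completed = true;
    }
    g_fpe_armed = 0;
    sigaction(SIGFPE, &old_sa, nullptr);          // restore the previous handler
    return completed;
}
```

Because the handler never returns to the faulting instruction, this avoids the infinite re-raise loop, at the cost of abandoning whatever state the interrupted computation was in.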
Note that in CPU implementations, the SIGFPE signal can also be triggered by some integer arithmetic errors, even though it has "floating-point" in the name of the signal. Examples include division-by-zero or remainder-by-zero, although it is platform-dependent whether or not these will raise the SIGFPE signal.
References
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
Memory-Safety Blog Articles
See also these articles:
- Canary Values & Redzones for Memory-Safe C++
- Use-After-Free Memory Errors in C++
- Array Bounds Violations and Memory Safe C++
- Poisoning Memory Blocks for Safer C++
- Uninitialized Memory Safety in C++
- DIY Memory Safety in C++
- Memory Safe C++ Library Functions
- Smart Stack Buffers for Memory Safe C++
- Safe C++ Text Buffers with snprintf