Aussie AI Blog

CUDA C++ Floating-Point Exceptions

October 30th, 2024

by David Spuler, Ph.D.

One of the problems with CUDA C++ kernels is that abnormalities in floating-point computations do not raise an exception. On the host CPU, floating-point errors can raise the SIGFPE signal, which may be fatal, or sometimes can be trapped and handled. However, the GPU's silence about similar problems is both a blessing and a curse. It's good that there's no failure, because the GPU kernel can continue, but it's also a problem because the programmer is never alerted that a failure occurred.

Types of Floating-Point Exceptions

Some of the types of floating-point arithmetic exceptions include:

Division-by-zero
Floating-point overflow
Floating-point underflow
Conversion overflow (large positive floating-point to integer)
Conversion underflow (large negative floating-point to integer)

Not all of these operations will produce an exception. Some other conditions are not necessarily failures, but the application programmer may still need to know about them:

Not-a-Number (NaN) floating-point values
Negative zero value
Denormalized numbers (very small near-zero values)
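These special values can be recognized on the host with the standard `<cmath>` classification functions. The following is a minimal sketch, not a definitive implementation; the `FloatKind` enum and `classify_float` function are hypothetical names invented for illustration.

```cpp
#include <cmath>

// Hypothetical categories for illustration only
enum class FloatKind {
    Normal, NotANumber, Infinite, NegativeZero, PositiveZero, Denormal
};

// Classify one float using std::fpclassify and std::signbit
FloatKind classify_float(float x)
{
    switch (std::fpclassify(x)) {
    case FP_NAN:
        return FloatKind::NotANumber;
    case FP_INFINITE:
        return FloatKind::Infinite;
    case FP_SUBNORMAL:
        return FloatKind::Denormal;   // very small near-zero value
    case FP_ZERO:
        // -0.0f == 0.0f compares true, so the sign bit must be tested
        return std::signbit(x) ? FloatKind::NegativeZero
                               : FloatKind::PositiveZero;
    default:
        return FloatKind::Normal;
    }
}
```

For example, `classify_float(-0.0f)` yields `FloatKind::NegativeZero`, even though an equality test against `0.0f` cannot tell the two zeros apart.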

Arithmetic exceptions are not limited to floating-point types, of course. Integer overflow and underflow tend not to trigger exceptions in C++ on either CPU or GPU code. The common exception is integer division by zero, or equivalently modulo-by-zero with the % integer remainder operator.

Detecting Kernel Floating-Point Exceptions

Generally, floating-point exceptions cannot be detected, trapped or handled in GPU device code. A different approach is needed to detect GPU arithmetic errors. There are two main approaches:

(a) programmatically scanning your data structures for values indicating a failure (e.g., NaN), or

(b) using a runtime instrumentation floating-point error checker tool.

There are several tools available in the literature to detect floating-point errors (see references list below), but there does not appear to be an officially supported commercial tool from NVIDIA. Note that compute-sanitizer does not usually catch these types of floating-point errors.
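Approach (a) can be sketched as a host-side scan, assuming the kernel's output has already been copied back into a host buffer (for example with cudaMemcpy). The `FloatScanResult` struct and `scan_for_bad_floats` function are hypothetical names for this illustration, and a real application would decide for itself which values count as failures.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical summary of a scan over a results buffer
struct FloatScanResult {
    std::size_t nan_count;        // number of NaN values found
    std::size_t inf_count;        // number of +/- infinity values found
    std::size_t first_bad_index;  // index of first NaN/Inf, or n if none
};

// Scan n floats for NaN or infinity values
FloatScanResult scan_for_bad_floats(const float* data, std::size_t n)
{
    FloatScanResult result = { 0, 0, n };
    for (std::size_t i = 0; i < n; i++) {
        if (std::isnan(data[i])) {
            result.nan_count++;
        } else if (std::isinf(data[i])) {
            result.inf_count++;
        } else {
            continue;  // ordinary finite value
        }
        if (result.first_bad_index == n) {
            result.first_bad_index = i;  // remember the first failure
        }
    }
    return result;
}
```

Recording the first bad index is a design choice for debugging: it narrows down which thread or element produced the value, since the kernel itself gave no alert.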

Trapping Floating-Point Exceptions in Host Code

On host code, some floating-point exceptions may trigger the SIGFPE signal, which can be intercepted and handled to a limited extent. Certain types of floating-point library functions may set the errno variable with an error code, which the caller can inspect after the call.
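Here is a minimal sketch of checking errno around a math library call. Whether errno is set at all is platform-dependent, so this sketch also polls the standard `<cfenv>` status flags as a fallback; that second check is my addition, not something every codebase needs. The `MathCallStatus` struct and `checked_log` function are hypothetical names.

```cpp
#include <cerrno>
#include <cfenv>
#include <cmath>

// Hypothetical result type for illustration
struct MathCallStatus {
    double value;          // value returned by std::log
    bool errno_set;        // errno reported EDOM or ERANGE
    bool fe_flag_raised;   // FE_INVALID or FE_DIVBYZERO status flag was set
};

// Call std::log and report any error indication
MathCallStatus checked_log(double x)
{
    volatile double input = x;          // discourage compile-time folding
    std::feclearexcept(FE_ALL_EXCEPT);  // clear sticky status flags
    errno = 0;                          // clear any stale error code
    double value = std::log(input);

    MathCallStatus status;
    status.value = value;
    status.errno_set = (errno == EDOM || errno == ERANGE);
    status.fe_flag_raised =
        std::fetestexcept(FE_INVALID | FE_DIVBYZERO) != 0;
    return status;
}
```

Calling `checked_log(-1.0)` returns NaN with at least one of the two indicators set on a conforming platform, whereas `checked_log(2.0)` sets neither.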

Recovery from these floating-point errors is not always possible, but at least you are alerted to the bug. For example, you might attempt to return from a signal handler. Here's an example:

    #include <signal.h>
    #include <stdio.h>   // printf

    void sigfpe_handler(int sig)
    {
        printf("Trapped SIGFPE\n");
        return;   // Try to continue
    }

    void init_signal()
    {
        // Called earlier
        signal(SIGFPE, sigfpe_handler);  // Register SIGFPE handler
    }

Note that in theory you're not supposed to do output operations such as a printf call in a signal handler. The printf function is not on the POSIX list of async-signal-safe functions, so calling it from a handler is undefined behavior. Nevertheless, I've done it successfully many times on various UNIX platforms, but maybe I'm just lucky.

However, many causes of SIGFPE cannot be easily recovered from via the signal handlers, and will re-raise the same signal. The above code might work on some platforms, but on others it may cause an infinite spin as it re-raises the signal and handles it again.

Another oddity on some platforms is that calling a signal handler de-registers the handler. Hence, you might need to re-register the handler itself inside the handler:

    void sigfpe_handler(int sig)
    {
        printf("Trapped SIGFPE\n");
        signal(SIGFPE, sigfpe_handler); // Re-Register SIGFPE handler
        // Try to continue
    }
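On POSIX platforms, one alternative is to register the handler with sigaction rather than signal, because the handler then stays installed unless SA_RESETHAND is requested. The sketch below assumes a POSIX system and uses write instead of printf, since write is async-signal-safe. The names `g_sigfpe_count`, `sigfpe_safe_handler` and `install_sigfpe_handler` are hypothetical.

```cpp
#include <signal.h>   // sigaction, sig_atomic_t (POSIX)
#include <unistd.h>   // write, STDERR_FILENO (POSIX)

// Count of SIGFPE deliveries, safe to update from a handler
static volatile sig_atomic_t g_sigfpe_count = 0;

extern "C" void sigfpe_safe_handler(int /*sig*/)
{
    static const char msg[] = "Trapped SIGFPE\n";
    // write() is async-signal-safe, unlike printf()
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    g_sigfpe_count = g_sigfpe_count + 1;
}

// Returns true if the handler was registered
bool install_sigfpe_handler()
{
    struct sigaction action = {};
    action.sa_handler = sigfpe_safe_handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;   // no SA_RESETHAND: handler stays registered
    return sigaction(SIGFPE, &action, nullptr) == 0;
}
```

This only addresses the de-registration and output issues. Returning from the handler after a genuine arithmetic fault may still re-execute the faulting instruction and raise the signal again.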

This signal also cannot necessarily be suppressed with SIG_IGN. You can try this early in your application code:

    #include <signal.h>

    signal(SIGFPE, SIG_IGN);  // Try to suppress it

Another attempt at recovery inside a signal handler would be to throw a C++ exception, which is certainly undefined behavior. It might actually be more likely to work using the older C-style setjmp/longjmp functions, although this is also not guaranteed to work.
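The setjmp/longjmp idea might look like the following sketch, which assumes a POSIX platform and uses the sigsetjmp/siglongjmp variants so that the signal mask is restored after the jump. To keep the example deterministic, `raise(SIGFPE)` stands in for a real arithmetic fault. All function and variable names here are hypothetical, and as noted above, none of this is guaranteed to work for genuine faults on every platform.

```cpp
#include <csignal>
#include <setjmp.h>   // sigsetjmp / siglongjmp are POSIX, not standard C++

static sigjmp_buf g_recovery_point;   // where the handler jumps back to

extern "C" void sigfpe_jump_handler(int /*sig*/)
{
    siglongjmp(g_recovery_point, 1);  // abandon the faulting code path
}

// Stand-in for code that triggers SIGFPE
void simulate_sigfpe()
{
    std::raise(SIGFPE);
}

// Runs risky(); returns true if SIGFPE was trapped and recovered from
bool run_with_sigfpe_recovery(void (*risky)())
{
    void (*old_handler)(int) = std::signal(SIGFPE, sigfpe_jump_handler);
    if (old_handler == SIG_ERR) {
        return false;                 // could not install the handler
    }
    if (sigsetjmp(g_recovery_point, 1) == 0) {
        risky();                      // normal path: no signal occurred
        std::signal(SIGFPE, old_handler);
        return false;
    }
    // Arrived here via siglongjmp from the handler
    std::signal(SIGFPE, old_handler);
    return true;
}
```

With this sketch, `run_with_sigfpe_recovery(simulate_sigfpe)` returns true, and the caller can then skip or retry the failed computation.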

Note that in CPU implementations, the SIGFPE signal can also be triggered by some integer arithmetic errors, even though it has "floating-point" in the name of the signal. Examples include division-by-zero or remainder-by-zero, although it is platform-dependent whether or not these will raise the SIGFPE signal.
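Since recovery from the signal is unreliable, a simpler defensive option is to test the operands before dividing. This is a minimal sketch with a hypothetical `checked_int_div` helper. It also rejects `INT_MIN / -1`, which overflows and is another case that traps on x86 CPUs.

```cpp
#include <climits>

// Divide only when safe; returns false instead of faulting
bool checked_int_div(int numerator, int denominator, int* quotient)
{
    if (denominator == 0) {
        return false;   // division-by-zero
    }
    if (numerator == INT_MIN && denominator == -1) {
        return false;   // result would overflow int
    }
    *quotient = numerator / denominator;
    return true;
}
```

The same guard applies to the % remainder operator, because it fails on the same operand combinations.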

References

GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
