Aussie AI
42. Debugging
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“If debugging is the process of removing software bugs,
then programming must be the process of putting them in.”
— Edsger Dijkstra.
AI Engine Debugging
I heard a rumor that AI frameworks are just code, and AI models are just data. So, this means that there must be bugs! And this chapter is about real, hard-core coding bugs, the nasty kind that sneak in with all of this performance tuning that's going around.
The detection and correction of errors in programs is usually called “debugging.” For the most part there is no standard method of debugging programs and much responsibility rests on the programmer’s creativity and ingenuity in finding the cause of a program’s malfunction. However, there are a number of useful techniques that programmers can use to aid in the debugging of programs.
But before you blame the software, always remember where you are: AI thrashes everything. It's not uncommon in production to have memory failures, disk failures, and even CPU failures. These glitches are often not a binary works-or-doesn't-work situation. Reliability issues in hardware don't necessarily crash the app, but they can result in crazy computations. Sure, random crashes occur too, and they might be your code, so it's a judgement call whether to blame carbon or silicon.
GPUs are the most complex hardware and are prone to overheating. Even without failing, overheating can lead to the output of dubious results. If you're using GPU hosters or buying used chips yourself, the origin of the GPUs is a concern. Many GPU chips have been Bitcoin miners in a previous life, hammering away on crypto rocks, rather than enjoying days of leisure doing video editing. An AI application will also drive a GPU hard, more so than just playing Call of Duty a few times a week, so the GPUs can gradually degrade. Hence, it's worth pondering hardware reliability, particularly with GPUs, when troubleshooting bizarre gibberish coming from an AI app.
Different types of bugs arise at the top of the AI stack. This chapter is not about “model evaluation” and the higher-level AI problems of safety and accuracy. Model evaluation is really a type of testing, which is how you find the bugs. There's another whole debugging-like area of expertise in trying to figure out why an LLM gave a wrong answer, a biased result, or whatever other high-level semantic failure. Such cases are often non-algorithmic errors, such as incorrect or omitted training data, so you might need a Data Scientist rather than an ML Engineer. Fortunately, you can both blame the writers.
The remainder of this chapter is only about making sure all our fancy C++ kernel algorithms are running correctly. Hence, much of the material is generic to any large C++ application, and it also applies to debugging massive Transformer engines.
Debugging Techniques
The term “debugging” mainly refers to the process whereby a programmer tries to find the cause of a bug. Debugging is only part of the code reliability puzzle: many other important techniques go into improving overall code reliability for the long term, but this chapter is about how to quickly find and fix a known bug.
The best debugging technique really depends on the symptoms of the bug, and on your experience and intuition telling you what type of coding error is likely. Some of the practical techniques to find a bug include:
- Interactive debugger tools (e.g. stepping through code in the Windows IDE or GNU gdb on Linux).
- Postmortem debugging (e.g. using gdb if you have a Linux core dump file).
- Review compiler warnings for bugs hiding in plain sight.
- Enable more debug trace statements.
- Add some more assertions, unit tests, or self-testing code.
- Memory debugging tools (e.g. run the code in Valgrind on Linux).
Very Difficult Bugs. Some bugs are like roaches and keep coming out of the woodwork. General strategies for solving a tricky bug include:
- Can you reproduce it? That's the key.
- Write a unit test that triggers it (if you can).
- Gather as much information about the context as possible (e.g. if it's a user-reported error).
- Think about what code you just changed recently (or was just committed by someone else).
- Try to cut down the input to the smallest case that triggers the fault.
- Memory-related failures often cause weird errors nowhere near the cause.
- Review the debug trace output carefully (i.e. maybe something failed earlier).
- Step through the code in the debugger about ten more times.
- Run a static analysis (“linter”) tool on the code.
- Run an AI copilot debugger tool. I hear they're terrific.
- Refactor a large module into smaller functions that are more easily unit-tested (often you accidentally fix the bug!).
If you really get stuck, you could try talking to another human (gasp!). Show your code to someone else and they'll find the bug in three seconds.
Level Up Your Post-Debugging Routine. Assuming you can fix it, think about the next level of professionalism to avoid having a repetition of similar problems. Consider doing followups such as:
- Add a unit test or regression test to re-check that problematic input every build.
- Write it up and close the incident in the bug tracking database like a Goody Two-Shoes.
- Add safety input validation tests so that a similar failure is tolerated (and logged).
- Add a self-check in a C++ debug wrapper function to check for it next time at runtime (a sketch follows this list).
- Is there a tool that would have found it? Can you run it automatically? Every build?
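For example, a debug wrapper is just a thin function that validates its arguments before calling the real routine. Here's a minimal sketch, assuming a hypothetical wrapper around memcpy (the name and the particular checks are illustrative, not a standard API):

#include <cassert>
#include <cstring>

// Hypothetical debug wrapper: self-checks its arguments before calling memcpy.
void* ydebug_memcpy(void* dest, const void* src, size_t n)
{
    assert(dest != NULL);              // Catch null pointers in debug builds
    assert(src != NULL);
    assert(n < 100u * 1024u * 1024u);  // Sanity limit: flag absurdly large copies
    return memcpy(dest, src, n);       // Forward to the real function
}

In debug builds the wrapper is called instead of the real function, so the next time a similar bad argument shows up, an assertion fires right at the point of the mistake.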
Interactive Debuggers
I used to be a big fan of gdb
and dbx
on various Unix platforms,
but lately I'm addicted to Microsoft Visual Studio's interactive debugging tools
in the Windows GUI.
Everyone uses debuggers differently,
and some programmers even hate using IDE debuggers to step through code,
but I find it invaluable.
Here are some of my own thoughts on how to use interactive debugging tools to find bugs:
Breakpoints. I use breakpoints a lot.
I have standard breakpoints set inside my assertion and self-testing failure code,
so that my code automatically stops when that triggers.
Similarly, if I have an unexpected problem, I set a breakpoint right before the
offending error message.
Another trick with breakpoints is to define a C++ function called “breakpoint” and call it from various other places.
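A minimal sketch of that trick (the function name and the global counter are just illustrative):

// A do-nothing function that exists only to hang an IDE breakpoint on.
static int g_breakpoint_count = 0;   // Something to watch in the debugger

void breakpoint(void)
{
    g_breakpoint_count++;   // Set your debugger breakpoint on this line
}

Then scatter calls to breakpoint() into any suspicious code paths, and the debugger stops exactly where you want it to.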
Stepping Through Code. I find stepping through the C++ code useful for both debugging and new code development. You need to get proficient with the different stepwise actions, such as Step, Step Over, Step Into, and Step Out (run to the end of this function). Sometimes stepping gets annoying, in which case I find myself setting a breakpoint a few lines of code further down and hitting “Continue”.
Restart: I find that I'm always doing “Restart” when debugging. It's very helpful to restart stepping, and also if you add a minor code change, you can then Restart to rebuild and rerun from the beginning. This is most useful if you have a unit test that triggers the error.
Watches: You can watch the value of a variable, or an expression, as it changes throughout execution. This is an extremely useful feature to have. Set some watches as you step through the code.
Edit and Continue. This is an IDE feature where you can edit the value of a variable, or even edit the C++ code in the middle of a run, and the compiler will incrementally re-compile your small changes into the executable, and keep going from where it is (i.e. rather than re-starting from the beginning). Lots of programmers like this feature, and as a former compiler engineer, I offer kudos to the engineers who have made it actually work(!), but sorry, I cannot stand this feature as a user of the debugger. I wish I could turn it off and never be prompted about it again.
Postmortem debugging.
On Linux or other Unix platforms, if you have a “core” file as part of your user's error report, it's helpful to run gdb to get a stack trace using the “where” command. And you can also sometimes get other useful context about the values of variables. One important point about this is that you need a copy of the actual version of the executable that shipped to the user. Trying to postmortem debug a core dump file using your developer's version of the executable doesn't work too well. Also, you ideally need a version of that shipped Unix executable that still has debugging symbols, and hasn't had its debug info removed with the “strip” tool, so the software release process needs to save copies of both of those.
Command-Line Debuggers:
A symbolic debugger such as gdb on Linux can also be used to debug the program interactively. The programmer runs the program from within the debugger and sets breakpoints to control execution. The “where” command is useful to show the function call stack. The values of variables can be examined at run-time, and function calls can be monitored carefully. When used properly these tools are a highly effective method of finding an error. However, a programmer should not fall into the trap of using a symbolic debugger to test a program, because this form of testing is not easily reproducible. Instead, symbolic debuggers should be used mainly when a particular failure has been identified.
Debugging with Keyboard Signals:
An interesting but rarely used debugging procedure is to trap keyboard interrupts such as those from the <ctrl-c> keyboard shortcut, which causes a SIGINT signal. These interrupts can be caught in C++ so as to cause the execution of a “signal handler” function via <signal.h>, and this code can then output helpful debugging information (or do whatever you need), such as a report on the current heap state or the function call stack.
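Here's a minimal sketch of the idea (the handler body is illustrative, and note that fprintf is not strictly async-signal-safe, but this is debug-only code):

#include <signal.h>
#include <stdio.h>

void ctrl_c_handler(int sig)
{
    fprintf(stderr, "DEBUG: caught SIGINT (%d), dumping state...\n", sig);
    // ... e.g. report heap statistics or the current call stack here ...
    signal(SIGINT, ctrl_c_handler);   // Re-install (some platforms reset the handler)
}

int main()
{
    signal(SIGINT, ctrl_c_handler);   // Trap <ctrl-c> for debugging output
    // ... the rest of the program ...
}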
Trapping Fatal Signals.
Another tip is that on Linux or Unix you can trap fatal signals such as SIGSEGV, SIGILL, and SIGFPE. You can set a breakpoint in there, and there's also a supportability benefit. The signal handler can't ignore or correct a fatal signal, but it can print a nice message for the user before dumping core.
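A sketch of what that might look like; printing from a fatal signal handler is technically unsafe, but as a last-gasp supportability message it usually works:

#include <signal.h>
#include <stdio.h>

void fatal_signal_handler(int sig)
{
    fprintf(stderr, "ERROR: fatal signal %d received; please report this crash\n", sig);
    signal(sig, SIG_DFL);   // Restore the default handler...
    raise(sig);             // ...and re-raise it so the core dump still happens
}

// Somewhere in program startup:
//   signal(SIGSEGV, fatal_signal_handler);
//   signal(SIGILL,  fatal_signal_handler);
//   signal(SIGFPE,  fatal_signal_handler);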
Don’t Blame the Compiler
The biggest temptation when faced with an error you do not understand is to blame the compiler. However, although compiler bugs are not as uncommon as one would hope, there is probably a 99% or more chance that it is your mistake, especially if you are just learning the C++ language.
There is an even larger temptation to place the blame on the compiler if the code suddenly fails when optimization is invoked. Because of the complexity of optimization technology, there have been many well-known errors in optimizers over the years. However, there are a large number of coding errors that may not cause failures when compiled in one way, but may cause failures with other compiler or optimizer settings. A compiler or optimizer bug is a very convenient excuse, but also an unlikely excuse. More likely is that you have a memory error or some other undefined behavior that was only working by fluke in the non-optimized version.
I remember well the first time that I demonstrated this form of human frailty while
learning the C language. A program wasn’t working, and after some debugging effort,
the problem was traced to a for
loop that was executing only once instead of many
times. An experienced programmer can probably diagnose the error from the single
statement in the previous sentence, but the program’s behavior seemed very strange to
me. Although I don’t remember the exact code, the loop was similar to:
for (i = 0; i < n; i++); { /* do something */ }
Instead of repeating n times, the loop body was executed only once. Can you spot the bug? The error is, of course, the semicolon immediately after the for loop header, which does not cause a syntax error, but makes the for loop body an empty statement (i.e. a single semicolon). This mistake causes the intended loop body to be executed only once, after i has been incremented in a do-nothing loop from 0 to n.
Finding this error took me a great deal of time and effort. I spent a lot of time trying
to determine the problem with debugging output statements, but with no success. Then,
beginning to suspect a compiler bug, I created assembly output using the “cc -S” command.
Sure enough, the assembly code showed the compiler generating instructions where control
prematurely returned to the top of the loop. The compiler was “erroneously” placing the
branch instruction before the first statement of the loop body,
which I considered to be “proof” that there was a bug in the compiler.
Finally, I demonstrated my “compiler bug” to a friend, who immediately pointed out the extra semicolon. A little
knowledge is a dangerous thing.
Random Number Seeds
Neural network code often uses random numbers to improve accuracy via a stochastic algorithm. For example, the top-k decoding uses randomness for creativity and to prevent the repetitive looping that can occur with greedy decoding. And you might use randomness to generate input tests when you're trying to thrash the model with random prompt strings.
But that's not good for debugging! We don't want randomness when we're trying to reproduce a bug!
Hence, we want it to be random for users, but not when we're debugging.
Random numbers need a “seed” to get started,
so we can just save and re-use the seed for a debugging session.
This idea can be applied to the old-style rand/srand functions or to the newer <random> libraries. Seeding the random number generator in old-style C++ is done via the “srand” function. The longstanding way to initialize the random number generator, so that it differs on every run, is to use the current time:
srand(time(NULL));
Note that seeding with a guessable value is a security risk.
Hence, it's safer to use some additional arithmetic on the time
return value.
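For example, one common trick (a sketch, not a cryptographic fix; it assumes a POSIX getpid from <unistd.h>) is to mix the time with another varying value:

srand((unsigned)time(NULL) ^ ((unsigned)getpid() << 16));  // Harder to guess than the time alone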
After seeding, the “rand” function can be used to get a hard-to-predict sequence of pseudo-random numbers. The random number generator works well and is efficient.
A generalized plan is to have a debugging or regression testing mode
where the seed is fixed.
if (g_aussie_debug_srand_seed != 0) {   // Debugging mode
    srand(g_aussie_debug_srand_seed);   // Non-random randomness!
}
else {   // Normal run
    srand(time(NULL));
}
The test harness has to set the global debug variable “g_aussie_debug_srand_seed” whenever it's needed for a regression test. For example, either it's manually hard-coded into a testing function, or it could be set via a command-line argument to your test harness executable, so the program can be scripted to run with a known seed.
This is better, but if we have a bug in production, we won't know the seed number. So, the better code also prints out the seed number (or logs it) in case you need to use it later to reproduce a bug that occurred live.
if (g_aussie_debug_srand_seed != 0) {
    srand(g_aussie_debug_srand_seed);   // Debug mode
}
else {   // Normal run
    long int iseed = (long)time(NULL);
    fprintf(stderr, "INFO: Random number seed: %ld 0x%lx\n", iseed, iseed);
    srand(iseed);
}
An extension would be to also print out the seed in error context information on assertion failures or other internal errors.
Debug Stacktrace
There are various situations where it can be useful to have a programmatic method for reporting the “stack trace” or “backtrace” of the function call stack in C++. Some examples include:
- Your assertion macro can report the full stack trace on failure.
- Self-testing code similarly can report the location.
- Debug wrapper functions too.
- Writing your own memory allocation tracker library.
C++ is about to have standard stacktrace capabilities with its standardization in C++23. This is available via the “std::stacktrace” facility, such as printing the current stack via:
std::cout << "Stacktrace: " << std::stacktrace::current() << std::endl;
The C++23 stacktrace library is already supported by GCC, and early support in MSVS is available via the compiler flag “/std:c++latest”.
There are also two different longstanding implementations of stacktrace capabilities: glibc backtrace and Boost Stacktrace. The C++23 standardized version is based on Boost's version.
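If you can't use C++23 yet, the glibc version is a sketch like this (Linux-specific, and function names only appear in the output if you link with the -rdynamic option):

#include <execinfo.h>   // glibc backtrace (Linux)

void print_stacktrace(void)
{
    void* addresses[50];
    int n = backtrace(addresses, 50);                  // Capture up to 50 stack frames
    backtrace_symbols_fd(addresses, n, 2 /*stderr*/);  // Print them to stderr
}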
Error Logging
Error logging is the small matter of what your code should do when it detects a problem.
This is not the question of whether to manually check error return codes versus using a full stack of try-catch exception handling. Rather, it is the question of what either of those should actually do when triggered.
This depends on whether your code is running on an iPhone app versus a website backend,
and includes options such as:
- Pop up a user error message on their phone.
- Pop up a GUI dialog on Windows desktop.
- Log an error message to the Apache error logs.
- Print it to stderr and hope someone's listening.
Note that there are several different types of “errors” that you need to think about:
- User input errors
- Configuration errors
- Assertion failures (internal errors)
- Self-testing failures (e.g. debug wrappers)
- External resource failures (e.g. file not found)
- Internal resource failures (e.g. memory allocation failed)
- Debug tracing messages
Some of these need to go to the user, whereas others you would prefer to be seen only by the software development team. On the other hand, if some of the internal errors occur, you want a way for users to submit them to the support team, so deciding what to disclose publicly is a judgement call.
And there are also non-error messages that can often be handled by the same subsystem, such as:
- Informational messages
- Configuration reports
- Statistics tracking
- Time measurements tracking
- Supportability messages
There are a few standardized error logging classes available:
- std::clog (global iostream object)
- Boost.Log
- Log4cxx (Apache)
BYO Error Logging: It's common for professional C++ programmers to skip the standard error logging classes, and BYO. That is, Build Your Own. Typically, each project defines a simple API to log an error, and the error logging subsystem has to decide what to do.
Here's a simple version using C++ varargs preprocessor macros:
#define errprintf(fmt,...) fprintf(stderr, (fmt), __VA_ARGS__ )
This is using the special tokens “...” in the macro parameter list and “__VA_ARGS__” in the macro body, which are standard C/C++ features since C99.
And since we're using this for error logging, we might sometimes want to also emit some extra context information:
#define errprintf2(fmt,...) \
    fprintf(stderr, "ERROR [%s:%d:%s]: ", __FILE__, __LINE__, __func__), \
    fprintf(stderr, (fmt), __VA_ARGS__ )
Note that this macro uses the comma operator between the two fprintf calls.
Debug Tracing Messages
A common debugging method is adding debug trace output statements to a program to print out important information at various points in the program. Judicious use of these statements can be highly effective in localizing the cause of an error, but this method can also lead to huge volumes of not particularly useful information. One desirable feature of this method is that the output statements can be selectively enabled at either compile-time or run-time.
Debug tracing messages are informational messages that you only enable during debugging. These are useful to software developers to track where the program is executing, and what data it is processing. The simplest version of this idea looks like:
#if DEBUG
std::cerr << "DEBUG: I am here!" << std::endl;
#endif
A better solution is to code some BYO debug tracing macros. Here's a C-like version:
#define ydebug(str) ( fprintf(stderr, "%s\n", (str)) )
...
ydebug("DEBUG: I am here!");
And here's the C++ style version:
#define ydebug(str) ( std::cerr << str << std::endl )
...
ydebug("DEBUG: I am here!");
In order to only show these when debug mode is enabled in the code, our header file looks like this:
#if DEBUG
#define ydebug(str) ( std::cerr << str << std::endl )
#else
#define ydebug(str)  // nothing
#endif
Missing Semicolon Bug:
Professional programmers prefer to use “0” rather than emptiness when removing the debug code from the production version. It is also good to typecast it to “void” type so it cannot accidentally be used as the number “0” in expressions.
Hence, we get this improved version:
#define ydebug(str) ((void)0) // better!
It's not just a stylistic preference.
The reason is that the “nothing” version can introduce an insidious bug if you forget a semicolon
after the debug trace call
in an if
statement:
if (something)
    ydebug("Hello world")   // Missing semicolon
x++;
If the “nothing” macro expansion is used, then the missing semicolon leads to this code:
if (something)
    // nothing
x++;
Can you see why it's a bug?
Instead, if the expansion is “((void)0)” then this missing semicolon typo will get a compilation error.
Variable-Argument Debug Macros
A neater solution is to use varargs preprocessor macros with the special tokens “...” and “__VA_ARGS__”, which are standard in C and C++ (since 1999):
#define ydebug(fmt,...)  fprintf(stderr, (fmt), __VA_ARGS__ )
...
ydebug("DEBUG: I am here!\n");
That's not especially helpful, so we can add more context:
// Version with file/line/function context
#define ydebug(fmt,...) \
    ( fprintf(stderr, "DEBUG [%s:%d:%s]: ", \
        __FILE__, __LINE__, __func__ ), \
      fprintf(stderr, (fmt), __VA_ARGS__ ))
...
ydebug("I am here!\n");
This will report the source code filename, line number, and function name.
Note the use of the comma operator between the two fprintf statements (whereas a semicolon would be a macro bug). Also required are parentheses around the whole thing, and around each use of the “fmt” parameter.
Here's a final example that also detects if you forgot a newline in your format string (how kind!):
// Version that makes the newline optional
#define ydebug(fmt,...) \
    (fprintf(stderr, "DEBUG [%s:%d:%s]: ", \
        __FILE__, __LINE__, __func__ ), \
     fprintf(stderr, (fmt), __VA_ARGS__ ), \
     (strchr((fmt), '\n') != NULL \
      || fprintf(stderr, "\n")))
...
ydebug("I am here!");   // Newline optional
Dynamic Debug Tracing Flag
Instead of using “#if DEBUG”, it can be desirable to have the debug tracing dynamically controlled at runtime.
debug tracing dynamically controlled at runtime.
This allows you to turn it on and off without a rebuild,
such as via a command-line argument.
And you can decide whether or not you want to ship it to production
with the tracing available to be used.
This idea can use a single Boolean flag:
extern bool g_aussie_debug_enabled;
We can add some macros to control it:
#define aussie_debug_off()  ( g_aussie_debug_enabled = false )
#define aussie_debug_on()   ( g_aussie_debug_enabled = true )
And then the basic debug tracing macros simply need to check it:
#define ydbg(fmt,...)  ( g_aussie_debug_enabled && \
    fprintf(stderr, (fmt), __VA_ARGS__ ))
So, this adds a small runtime cost: testing a global flag every time this line of code is executed.
Here's the version with file, line, and function context:
#define ydbg(fmt,...) \
    ( g_aussie_debug_enabled && \
      ( fprintf(stderr, "DEBUG [%s:%d:%s]: ", \
          __FILE__, __LINE__, __func__ ), \
        fprintf(stderr, (fmt), __VA_ARGS__ )))
And here's the courtesy newline-optional version:
#define ydbg(fmt,...) \
    ( g_aussie_debug_enabled && \
      (fprintf(stderr, "DEBUG [%s:%d:%s]: ", \
         __FILE__, __LINE__, __func__ ), \
       fprintf(stderr, (fmt), __VA_ARGS__ ), \
       (strchr((fmt), '\n') != NULL \
        || fprintf(stderr, "\n"))))
Multi-Statement Debug Trace Macro
An alternative method of using debugging statements is to use a special macro that allows any arbitrary statements. For example, debugging output statements can be written as:
YDBG( printf("DEBUG: Entered function print_list\n"); )
Or using C++ iostream output style:
YDBG( std::cerr << "DEBUG: Entered function print_list\n"; )
This allows use of multiple statements of debugging, with self-testing code coded as:
YDBG( count++; )
YDBG( if (count != count_elements(table)) { )
YDBG(     aussie_internal_error("ERROR: Element count wrong"); )
YDBG( } )
But in many cases it's actually easier to wrap multiple lines of code, or a whole block, in a single call. This alternative use of YDBG with multiple statements is valid, provided that the enclosed statements do not include any comma tokens (unless they are nested inside matching brackets). The presence of a comma would separate the tokens into two or more macro arguments for the preprocessor, and the YDBG macro above requires only one parameter:
YDBG(
    count++;
    if (count != count_elements(table)) {   // self-test
        aussie_internal_error("ERROR: Element count wrong");   // error
    }
)
The multi-statement YDBG macro is declared in a header file as:
#if YDEBUG
#define YDBG(token_list)  token_list   // Risky
#else
#define YDBG(token_list)  // nothing
#endif
The above version of YDBG is actually non-optimal for the macro error reasons already examined. A safer idea is to add surrounding braces and the “do-while(0)” trick to the YDBG macro:
#if YDEBUG
#define YDBG(token_list)  do { token_list } while(0)   // Safer
#else
#define YDBG(token_list)  ((void)0)
#endif
Note that this now requires a semicolon after every expansion of the YDBG macro, whereas the earlier definition did not:
YDBG( std::cerr << "Value of i is " << i << "\n"; );
Whenever debugging is enabled, the statements inside the YDBG argument are activated, but when debugging is disabled they disappear completely. Thus, this offers a very simple way of removing debugging code from the production version of a program, if you like that kind of thing.
This YDBG
macro may be considered poor style since it does not mimic any usual
syntax. However, it is a neat and general method of introducing debugging statements,
and is not limited to output statements.
Multiple Levels of Debug Tracing
Once you've used these debug methods for a while, you start to see that you get too much output. For a while, you're just commenting and uncommenting calls to the debug routines. A more sustainable solution is to add numeric levels of tracing, where a higher number gets more verbose.
To make this work well, we declare both a Boolean overall flag and a numeric level:
extern bool g_aussie_debug_enabled;
extern int g_aussie_debug_level;
Here are the macros to enable and disable the basic level:
#define aussie_debug_off() ( \
    g_aussie_debug_enabled = false, \
    g_aussie_debug_level = 0 )
#define aussie_debug_on() ( \
    g_aussie_debug_enabled = true, \
    g_aussie_debug_level = 1 )
And here's the new macro that sets a numeric level of debug tracing (higher number means more verbose):
#define aussie_debug_set_level(lvl) ( \
    g_aussie_debug_enabled = (((lvl) != 0)), \
    g_aussie_debug_level = (lvl) )
Here's what a basic debug macro looks like:
#define ydbglevel(lvl,fmt,...) ( \
    g_aussie_debug_enabled && \
    (lvl) <= g_aussie_debug_level && \
    fprintf(stderr, (fmt), __VA_ARGS__ ))
...
ydbglevel(1, "Hello world");
ydbglevel(2, "More details");
Now we see the reason for having two global variables.
In non-debug mode, the only cost is a single Boolean flag test, rather than a more costly integer “<” operation.
And for convenience we might add multiple macro name versions for different levels:
#define ydbglevel1(fmt)  (ydbglevel(1, (fmt)))
#define ydbglevel2(fmt)  (ydbglevel(2, (fmt)))
...
ydbglevel1("Hello world");
ydbglevel2("More details");
Very volatile.
Note that if you are altering debug tracing levels inside a symbolic debugger (e.g. gdb) or an IDE debugger, you might want to consider declaring the global level variables with the “volatile” qualifier.
This applies in this situation because their values can be changed (by you!)
in a dynamic way that the optimizer cannot predict.
On the other hand, you can skip this, as this issue won't affect production usage,
and only rarely impacts your own interactive debugging usage.
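For example, the declarations might change to something like:

extern volatile bool g_aussie_debug_enabled;   // volatile: may be changed from the debugger
extern volatile int g_aussie_debug_level;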
BYO debug printf:
All of the above examples are quite fast in execution, but heavy in space usage. They will be adding a fair amount of executable code for each “ydebug” statement. I'm not sure that I really should care that much about the code size, but anyway, we could fix it easily by declaring our own variable-argument debug printf-like function.
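A sketch of such a function, using the standard <cstdarg> facilities (the function name is illustrative):

#include <cstdarg>
#include <cstdio>

extern bool g_aussie_debug_enabled;   // Declared earlier

void ydebug_printf(const char* fmt, ...)
{
    if (!g_aussie_debug_enabled) return;   // Cheap test, and little code at each call site
    va_list args;
    va_start(args, fmt);
    fprintf(stderr, "DEBUG: ");
    vfprintf(stderr, fmt, args);           // Forward the variable arguments
    va_end(args);
}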
Advanced Debug Tracing
The above ideas are far from being the end of the options for debug tracing. The finesses to using debug tracing messages include:
- Environment variable to enable debug messages (a sketch follows this list).
- Command-line argument to enable them (and set the level).
- Configuration settings (eg. changeable inside the GUI, or in a config file).
- Add unit tests running in trace mode (because sometimes debug tracing crashes!).
- Extend to multiple sets or named classes of debug messages, not just numeric levels, so you can trace different aspects of execution dynamically.
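As a sketch of the first idea, an environment variable check at program startup might look like this (the variable name YDEBUG is an assumption, not a standard; it re-uses the aussie_debug_set_level macro above):

#include <cstdlib>

void aussie_debug_init_from_environment(void)
{
    const char* s = getenv("YDEBUG");       // e.g. run with: YDEBUG=2 ./myprogram
    if (s != NULL) {
        aussie_debug_set_level(atoi(s));    // Enable tracing at the requested level
    }
}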
Supportability Tip: Think about customers and debug tracing messages: are there times when you want users to enable them? Usually, the answer is yes. Whenever a user has submitted an error report, you'd like the user to submit a run of the program with tracing enabled to help with reproducibility. Hence, consider what you want to tell customers about enabling tracing (if anything). Similarly, debug tracing messages could be useful to phone support staff in various ways to diagnose or resolve customer problems. Consider how a phone support person might help a customer to enable these messages.
Valgrind Limitation Workarounds
If you're a fan like me of Valgrind on Linux, especially the “Memcheck” tool, then you've probably noticed it has some major limitations. For starters, you might struggle to get Valgrind to cope with a huge engine and a full model, but even if that fails, it's still useful for finding memory bugs from running unit tests. Obviously, not running on Windows is also a biggie. Finally, there's also the problem that it cannot detect memory overruns on:
- Global arrays or buffers
- Static local arrays or buffers
- Stack arrays or buffers
This is quite a major limitation! I'm not aware of any easy way to make Valgrind detect problems on these variables, but you can add code workarounds to increase the level of error detection. The trick is simply to re-compile your code to use allocated memory instead of global or stack arrays. Here's a simplistic idea:
#if AUSSIE_COMPILE_FOR_VALGRIND
char* buf = ::new char[BUFSIZE];
buf[0] = 0;
float *farr = ::new float[BUFSIZE];
farr[0] = 0.0f;
#else
char buf[BUFSIZE] = "";
float farr[BUFSIZE] = { 0 };
#endif
There are several practical problems with this workaround method including:
1. It requires a re-compile to switch between Valgrind and non-Valgrind modes.
2. “sizeof buf” will be wrong if you're using it (e.g. for memset).
3. Matching “delete” statements are needed, otherwise Valgrind finds leaked memory.
4. Tons of boilerplate code that is prone to copy-paste bugs.
There is a way to fix the problem that this requires a full re-compile rather than a runtime test. If you want to control this dynamically at runtime, you can do so at the cost of doubling your memory usage. Valgrind has a special built-in macro called “RUNNING_ON_VALGRIND”, declared in “valgrind.h”, that can be used.
char hiddenbuf[BUFSIZE] = "";
char* buf2 = hiddenbuf;
if (RUNNING_ON_VALGRIND) {
    buf2 = ::new char[BUFSIZE];
    buf2[0] = 0;   // Init!
}
Note that on non-Linux platforms, or in production builds where you are not including “valgrind.h”, you'll get a compile error about RUNNING_ON_VALGRIND being an “undefined identifier”, and you'll need to do something like this:
#if !LINUX
#define RUNNING_ON_VALGRIND 0
#endif
Extension: Valgrind Smart Buffer Class:
If you are feeling like a challenge, you can also define your own special “smart buffer” class, which hides these details behind constructor code that tests RUNNING_ON_VALGRIND. This class can also fix the problem that the allocated memory version doesn't free the memory properly, by calling “delete” in the destructor. Since the destructor is automatically called whenever the smart buffer variable goes out of scope, you don't need to manually add any code at the end of a function using a smart buffer. This idea can work at run-time, fix the memory leaks, and avoid boilerplate, but it still has the sizeof problem.
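Here's a minimal sketch of what such a class might look like for a char buffer (illustrative only; it assumes valgrind.h, or the fallback #define above, is in scope, and a real version would probably be a template):

#include <cstddef>

class SmartBuffer {
public:
    SmartBuffer(char* staticbuf, size_t n)
        : m_buf(staticbuf), m_allocated(NULL)
    {
        if (RUNNING_ON_VALGRIND) {          // Use heap memory so Memcheck can see overruns
            m_allocated = ::new char[n];
            m_allocated[0] = 0;
            m_buf = m_allocated;
        }
    }
    ~SmartBuffer() { delete[] m_allocated; }   // Fixes the memory leak automatically
    operator char*() { return m_buf; }         // Use the object like a plain char buffer
private:
    char* m_buf;
    char* m_allocated;
};

// Usage sketch:
//   char hiddenbuf[BUFSIZE] = "";
//   SmartBuffer buf(hiddenbuf, BUFSIZE);
//   strcpy(buf, "hello");   // sizeof(buf) is still "wrong", as noted above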
Making the Correction
An important part of the debugging phase that is often neglected is actually making the correction. You’ve found the cause of the failure, but how do you fix it? It is imperative that you actually understand what caused the error before fixing it; don’t be satisfied when a correction works and you don’t know why.
Here are some thoughts on the best practices for the “fixing” part of debugging:
- Test it one last time.
- Add a unit test or regression test.
- Re-run the entire unit test or regression test suite.
- Update status logs, bug databases, change logs, etc.
- Update documentation (if applicable)
Another common pitfall is to make the correction and then not test whether it actually fixed the problem. Furthermore, making a correction will often uncover (or introduce!) another new bug. Hence, not only should you test for this bug, but it’s a very good idea to use extensive regression tests after making an apparently successful correction.