Aussie AI

Debugging AI Models and Frameworks

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

I heard a rumor that AI frameworks are just code, and AI models are just data. So there must be bugs! This article is about real, hard-core coding bugs, the nasty kind that sneak in with all of the performance tuning that's going around, not the higher-level AI problems of safety and accuracy.

The reality is that an AI engine is some of the most difficult code you'll ever see. Parallelized code of any kind (e.g., low-level hardware acceleration, multi-threading, multi-GPU, etc.) multiplies this complexity by another order of magnitude. Hence, the basics of high-quality coding practices are more important than ever, such as:

  • Unit tests
  • Assertions and self-testing code
  • Debug tracing code
  • Automated system tests (regression testing)
  • Error handling (e.g. starting with checking error return codes)
  • Exception handling (wrapping code in a full exception handling stack)

All of these techniques involve a significant chunk of extra coding work. Theory says that full exception handling can be 80% of a finalized software product, so it's a four-fold amount of extra work! Maybe that estimate is a little outdated, given improvements in modern tech stacks, but it still contains many grains of truth.

There are many programming tools to help the debugging cycle:

  • C++ memory debugging tools (e.g. Valgrind on Linux)
  • Performance profiling tools (for "de-slugging")
  • Memory usage tracking (i.e., allocated memory measurement)
  • Interactive debugging tools (e.g., in the IDE, GNU gdb, etc.)

Random Number Seeds

Neural network code often uses random numbers: to improve accuracy, in stochastic algorithms, or even just for randomized testing. Random number generators need a "seed" to get started, which is set via the "srand" function in C++. The typical way to initialize the random number generator, so that every run is different, is to use the current time:

    srand((unsigned)time(NULL));

But that's not good for debugging! We don't want randomness when we're trying to reproduce a bug!

A generalized plan is to have a debugging or regression-testing mode where the seed is fixed:

    if (g_yapi_debug_srand_seed != 0) {
        srand((unsigned)g_yapi_debug_srand_seed);   // Non-random randomness!
    }
    else {  // Normal run
        srand((unsigned)time(NULL));
    }

The test harness has to set this global debug variable when it's running a regression test. For example, it can be hard-coded in a testing function, or set via a command-line argument to your test harness executable.

This is better, but if we hit a bug in production, we won't know which seed was used. So the better code also prints out the seed number, in case you need it later to reproduce a bug that occurred live.

    if (g_yapi_debug_srand_seed != 0) {
        srand((unsigned)g_yapi_debug_srand_seed);   // Non-random randomness!
    }
    else {  // Normal run
        long int iseed = (long)time(NULL);
        fprintf(stderr, "INFO: Random number seed: %ld 0x%lx\n", iseed, iseed);
        srand((unsigned)iseed);
    }

Research on Debugging AI Framework Code

Papers on the issues of debugging the actual code that runs AI models, including the code inside frameworks and ML compilers, include:

General Debugging Techniques Research

Research on general program debugging methods:

More AI Research

Read more about: