Aussie AI
AI Engine Debugging
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
I heard a rumor that AI frameworks are just code, and AI models are just data. So, this means that there must be bugs! And this chapter is about real, hard-core coding bugs, the nasty kind that sneak in with all of this performance tuning that's going around.
The detection and correction of errors in programs is usually called “debugging.” For the most part, there is no standard method of debugging programs, and much of the responsibility rests on the programmer’s creativity and ingenuity in finding the cause of a program’s malfunction. However, there are a number of useful techniques that programmers can use to aid in the debugging of programs.
But before you blame the software, always remember where you are: AI thrashes everything. It's not uncommon in production to have memory failures, disk failures, and even CPU failures. These glitches are often not a binary works-or-does-not-work error. Reliability issues in hardware don't necessarily crash the app, but can result in crazy computations. Sure, random crashes occur, too, and they might be your code, so it's a judgement call whether to blame carbon or silicon.
GPUs are the most complex hardware and are prone to overheating. Even without failing outright, an overheating GPU can output dubious results. If you're using GPU hosting providers or buying used chips yourself, the origin of the GPUs is a concern. Many GPU chips have been Bitcoin miners in a previous life, hammering away on crypto rocks, rather than enjoying days of leisure doing video editing. An AI application will also drive a GPU hard, more so than just playing Call of Duty a few times a week, so the GPUs can gradually degrade. Hence, it's worth pondering hardware reliability, particularly with GPUs, when troubleshooting bizarre gibberish coming from an AI app.
Different types of bugs arise at the top of the AI stack. This chapter is not about “model evaluation” and the higher-level AI problems of safety and accuracy. Model evaluation is really a type of testing, which means finding the bugs. There's another whole debugging-like area of expertise in trying to figure out why an LLM gave a wrong answer, a biased result, or some other high-level semantic failure. Such cases often stem from a non-algorithmic error, such as incorrect or omitted training data, so you might need a Data Scientist rather than an ML Engineer. Fortunately, you can both blame the writers.
The remainder of this chapter focuses on making sure all our fancy C++ kernel algorithms are running correctly. Hence, much of the material in this chapter is generic to any large C++ application, and it also generalizes to debugging massive Transformer engines.