
40. Reliability

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Simplicity is prerequisite for reliability.”

— Edsger Dijkstra.

AI Engine Reliability

We want our AI model to be predictable, not irrational. And it should show bravery in the face of adversity, rather than crumble into instability at the first sign of prompt confusion. At a high level, there are various facets to AI engine reliability:

  • Accuracy of model responses
  • Safety issues (e.g., bias, toxicity)
  • Engine basic quality (e.g., not crashing or spinning)
  • Resilience to dubious inputs
  • Scalability to many users

How to make a foundation model that's smart and accurate is a whole discipline in itself. The issues include the various training and other algorithms in the Transformer architecture, along with the general quality of the training dataset. Similarly, safety issues such as bias or toxic responses are an ongoing area of research, and aren't covered in this chapter.

Aspects of the C++ code inside your Transformer engine are important for its basic quality. Writing C++ that doesn't crash or spin is a code quality issue with many techniques. This involves coding methods such as assertions and self-testing, along with external quality assurance techniques that examine the product from the outside.

Resilience is tolerance of situations that were largely unexpected by programmers. Appropriate handling of questionable inputs is a cross between a coding issue and a model accuracy issue, depending on what type of inputs are causing the problem. Similarly, the engine should be able to cope with resource failures, or at least to gracefully fail with a meaningful response to users in such cases. Checking return statuses and exception handling is a well-known issue here.
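
For example, here's a minimal sketch of a top-level request handler that fails gracefully, where run_inference and log_error are hypothetical placeholders for your engine's real entry points:

    #include <exception>
    #include <new>
    #include <string>

    std::string run_inference(const std::string& prompt);  // hypothetical engine call
    void log_error(const std::string& msg);                // hypothetical logger

    // Top-level handler: never let an exception escape to the user.
    std::string handle_request(const std::string& prompt)
    {
        try {
            return run_inference(prompt);
        }
        catch (const std::bad_alloc&) {
            log_error("out of memory during inference");
            return "Sorry, the server is busy. Please try again later.";
        }
        catch (const std::exception& e) {
            log_error(e.what());
            return "Sorry, something went wrong with that request.";
        }
    }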

A system is only as reliable as its worst component. Hence, it's not just the Transformer and LLM to consider, but also the quality of the other components, such as:

  • Backend server software (e.g. web server, request scheduler)
  • RAG components (e.g., retriever and document database)
  • Vector database
  • Application-specific logic (i.e., whatever your “AI thingy” does)
  • Output formatting component
  • User interface

The rest of this chapter is about how to make your C++ code reliable, whether it's in an AI engine or other components. This includes various aspects of “code quality” and also ways to tolerate problems, such as exception handling and defensive programming.

Code Reliability

Code reliability means that the execution is predictable and produces the desired results. The reality is that an AI engine is some of the most difficult code you'll ever see. Parallelized code of any kind (e.g. low-level hardware acceleration, multi-threaded, multi-GPU, etc.) multiplies this complexity by another order of magnitude. Hence, the basics of high-quality coding practices are all the more important for code reliability, such as:

  • Unit tests
  • Assertions
  • Self-testing code
  • Debug tracing methods
  • Automated system tests
  • Function argument validation
  • Error detection (e.g. starting with checking error return codes)
  • Exception handling (wrapping code in a full exception handling stack)
  • Resilience and failure tolerance
  • Regression testing
  • Test automation
  • Test coverage measurement

One useful method of catching program failures is making the program apply checks to itself. Assertions and other self-testing code have the major advantage that they catch such errors early, rather than letting the program continue and cause a failure much later.
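
For example, a vector routine in an AI engine can assert its preconditions on entry, so a bad pointer is caught at the call site rather than as a mysterious crash several stack frames deeper. A minimal sketch:

    #include <cassert>

    float dot_product(const float* v1, const float* v2, int n)
    {
        assert(v1 != nullptr);  // catch bad inputs here, not much later
        assert(v2 != nullptr);
        assert(n > 0);
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += v1[i] * v2[i];
        return sum;
    }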

All of these techniques involve a significant chunk of extra coding work. Theory says that full exception handling can be 80% of a finalized software product, so it's a four-fold amount of extra work! Maybe that estimate is a little outdated, given improvements in modern tech stacks, but it still contains many grains of truth.

There are many programming tools to help improve code reliability throughout the development, testing and debugging cycle:

  • C++ memory debugging tools (e.g. Valgrind on Linux)
  • Performance profiling tools (for “de-slugging”)
  • Memory usage tracking (i.e. memory leaks and allocated memory measurement)
  • Interactive debugging tools (e.g. debuggers in the IDE, Gnu gdb, etc.)
  • Static analysis tools (“linters”)
  • Bug tracking databases (for cussing at each other)

Building More Bugs

Advanced build engineering is a non-obvious way to improve code quality. The basic idea is to build lots of test versions of your program to shake out more of the insidious bugs. The strategy is:

  • Build multiple executable versions.
  • Run them against the full test suite (e.g. unit testing, regression testing, mutation testing, etc.).
  • Make sure someone's watching to see if anything fails.

The vast majority of your C++ code should be standardized and platform-independent, so a simple way to test it fully is to thrash it across multiple platforms and compilers. You can watch for both compile-time warnings and runtime failures. Here are some suggestions:

  • Multiple OS platforms (Linux, Windows, Mac)
  • Multiple C++ compilers
  • Multiple optimization levels
  • Multiple CPU architectures
  • 32-bit and 64-bit OS versions
  • With self-testing or debug code enabled and disabled.

This method will give a huge cascade of failures if you've got a simple bug. But the more important idea of this whole strategy is to watch for the singular failures, where one platform has tickled an obscure code weakness, such as a race condition in multi-threading code.

Warning-Free Compilation

Don't ignore compiler warnings! I'm not a fervent advocate of having a set of C++ coding style guidelines for a project, but I do feel strongly about this one. Modern compiler warnings are so good that it's like running a “linter” for free.

A very good goal for C++ software quality is to get to a warning-free compile. You should think of compiler warnings as doing “static analysis” of your code, which was an idea that started back on Unix with the C “lint” tool, back in the days before we had electricity. To maximize this idea, turn on more warning options, such as -Wall for gcc. The warnings are rarely wrong in modern compilers, although some are about harmless things.

Harmless doesn't mean unimportant. And anyway, the so-called “harmless” warnings aren't actually harmless, because if there are too many of them in the compilation output, then the bad bugs won't get seen. Hence, make the effort to fix the minor issues in the C++ code that are causing warnings. For example, fix the “unused variable” warnings or “mixing float and double” type warnings, even though they're rarely a real bug. And yet, sometimes they are! This is why it's powerful to have a warning-free compile.
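
For example, here's the typical one-character fix for a “mixing float and double” warning (whether you see this warning depends on your compiler and warning level):

    float scale1 = 0.1;     // warning: conversion from double to float
    float scale2 = 0.1f;    // fixed: the 'f' suffix makes a float constant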

If you are a true believer in warning-free compilation, you can elevate all GCC warnings to compilation errors using the “-Werror” command-line option. Personally, I don't recommend going this far, as some warnings are really tricky to get rid of.

Two compilers are better than one! Another powerful quality trick is to compile your code on multiple compiler platforms. For example, I try to write code that is portable enough that it compiles on both the Microsoft Visual Studio C++ IDE for Windows and command-line GCC on Linux. This is usually a matter of adding #ifdef statements for Windows versus Linux. By checking your code twice, with a small overhead of a Makefile or Project file, you get warnings from both of these amazing C++ compilers. It's like getting double-linted for free!
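
As a minimal sketch, the platform #ifdef approach relies on the standard predefined macros, something like:

    #if defined(_WIN32)     // Microsoft Visual Studio C++ (Windows)
    #include <windows.h>
    #define SLEEP_MSEC(n)  Sleep(n)
    #else                   // GCC (Linux)
    #include <unistd.h>
    #define SLEEP_MSEC(n)  usleep((n) * 1000)
    #endif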

Tracking compilation warnings. One way to take warning-free compilation to the next level is to actually store and analyze the compiler output. It's like log file analysis in DevOps, only it's not for systems management. On Linux, I typically use this idea:

    make build |& tee makebuild.txt

Here's an actual example from a Linux Makefile in an Aussie AI project:

    build:
        -@make build2 |& tee makebuild.txt
        -@echo 'See output in makebuild.txt'

The Makefile uses the “-” and “@” prefix flags: the “@” prefix stops make from echoing the command to output, and the “-” prefix means make doesn't stop if one of the steps triggers an error.

When the build has finished, then we have a text file “makebuild.txt” which can be viewed for warning messages. To go further, I usually use grep to remove some of the common informational messages, to leave only warning messages. Typically, my Linux command looks like:

    make warnings

Here's an example of the “warnings” target in a Linux Makefile for one of my Aussie AI projects:

    warnings:
        -@cat makebuild.txt | grep -v '^r -' \
        | grep -v '^g++ ' | grep -v '^Compiling' \
        | grep -v '^Making' | grep -v '^ar ' \
        | grep -v '^make\[' | grep -v '^ranlib' \
        | grep -v '^INFO:' | grep -v 'Regressions failed: 0' \
        | grep -v 'Assertions failed: 0' | grep -v SUCCESS \
        |more

Note that this uses grep to remove the informational messages from g++, ar, ranlib, and make. And it also removes the unit testing success messages if all tests pass (but not if they fail!). The idea is to show only the bad stuff because log outputs with too many lines get boring far too quickly and then nobody's watching.

Finally, your warning-free tracking method should ideally be part of your “nightly builds” that do more extensive analysis than the basic CI/CD acceptance testing. You should email those warnings to the whole team, at about 2am ideally, because C++ programmers don't deserve any sleep.

Static Analysis Tools (Linters)

Static analysis tools are those that automatically check over source code to find problems. These are called “Linters” in honor of the old “lint” tool for C on Unix, which dates back to the late 1970s. Getting to a warning-free compilation with your C++ compilers is like running a Linter for free, but there are some much more advanced static analysis tools, both free and commercial.

The idea of “warning-free compilation” can and should be extended to using Linters in your workflow. However, since these tools are more “picky” at finding stuff, they tend to emit too many warnings. Personally, I take a pragmatic non-purist view that you should focus on the warnings that are most likely to indicate a bug, rather than purely coding style issues (e.g. “function defined without a prior prototype declaration in a header file”).

You don't want programmers doing too much “busy work” fixing minor coding style warnings with little practical impact on code reliability. Hence, you might find that your policy needs to suppress some of the pickier warnings. And that'll be a fun meeting to have.

Refactoring versus Rewriting

Refactoring was something I was doing for years, but I called it “code cleanup.” The seminal work on refactoring is Martin Fowler's book “Refactoring” from 1999. This was the first work to gain traction in popularizing and formalizing the ideas of cleaning up code into a disciplined approach.

Refactoring is a code maintenance task that you do mainly for code quality reasons, and it needs to be considered an overhead cost. True refactoring does not add any new functionality for customers, and marketing won't be happy if you do refactoring all day long. But refactoring is a powerful way to achieve consistency in code quality and adhere to principles such as DRY. In highly technical special cases such as writing an API, you'll need to refactor multiple times until the API is “good.”

Rewriting is where you pick up the dusty server containing the old source code repo, walk over to the office window and toss it out. You watch it smash ten floors below, drive over to CompUSA to buy a new server, and then start tapping away with a big smile on your face.

The goals of refactoring and rewriting are quite different. Refactoring aims to:

  • Make the existing code “better” (e.g. modularized, layered).
  • Add unit testing and other formality.
  • Retain all the old features and functionality.
  • Not add any new functionality.

Rewriting projects tend to:

  • Throw away all the existing code.
  • Choose a new tech stack, new UI, new tools, etc.
  • Not support backward compatibility.
  • Add some new functionality.

In practice, refactoring and rewriting are two ends of a spectrum, and there's a lot of middle ground between them. If you're fixing some old code by rewriting one of the main modules, is it refactoring or rewriting?

The reality is that rewriting versus refactoring is always an engineering choice, and it's a difficult one without a clear right or wrong answer. You can't try both to see which one works better, so there's never any proof either way.

Defensive Programming

Defensive programming is a mindset where you assume that everything will go wrong. The user input will be garbage. Anyone else's code will be broken. The operating system intrinsics will fail. And your poor helpless AI needs to keep chugging along.

Many of the high-level types of defensive coding are discussed elsewhere in this book. Good practices that attempt to prevent bugs include: assertions, self-testing code, unit tests, regression tests, check return codes, validate incoming parameter values, exception handling, error logging, debug tracing code, warning-free compilation, memory debugging tools, static analysis tools, document your code, and call your mother on Sunday.

Using Compiler Errors for Good, not Evil: One of the advanced types of defensive programming is to intentionally trigger compiler errors that prevent compilation. For example, you can enforce security coding policies:

    #define tmpnam dont_use_tmpnam_please

Or if you are using debug wrappers for some low-level system functions, you can enforce that:

    #define memset please_use_memset_wrapper

Politeness is always required. You don't want your colleagues going home crying.
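
As a minimal sketch, you could collect these policies into a project-wide “banned functions” header that every source file includes; the replacement names are illustrative, and the #pragma poison alternative is GCC-specific:

    // banned.h -- include after the system headers in every source file.
    #define tmpnam dont_use_tmpnam_please    // insecure temporary file names
    #define gets please_use_fgets_instead    // classic buffer overflow

    #ifdef __GNUC__
    // GCC alternative: any later use of a poisoned identifier is a hard error.
    #pragma GCC poison strcpy sprintf
    #endif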

Defensive Coding Style Policies: You might want to consider some specific bug-prevention coding styles, for defensive programming, maintainability, and general reliability. Some examples might be:

  • All variables must be initialized when declared. Don't want to see this anymore: int x;
  • All switch statements need a default clause.
  • Null the pointer after every delete. You can define a macro to help (see the sketch after this list).
  • Null the pointer after every free. If you use a debug wrapper for free, make it pass-by-reference and NULL the pointer's value inside the wrapper function.
  • Null the file pointer after fclose. Also can be nulled by a wrapper function.
  • Unreachable code should be marked as such with an assertion (a special type).
  • Prefer inline functions to preprocessor macros.
  • Define numeric constants using const rather than #define.
  • Validate enum variables are in range. Add a dummy EOL item at the end of an enum list, which can be used as an upper-bound to range-check any enum has a valid value. Define a self-test macro to range-check the value.
  • Use [[nodiscard]] attributes for functions. All of them.
  • Start different enums at different numbers (e.g. token numbers start at 10,000 and some other IDs start at 200,000), so that they can't get mixed up, even if they end up in int variables. And you'll need a bottom and top value to range-check their validity. You have to remove the commas from these numbers, though!
  • All allocated memory must be zeroed. This might be a policy for each coder, or it could be auto-handled by intercepting the new operator and malloc/calloc into debug wrappers, and only returning cleared memory.
  • Constructors should use memset to zero their own memory. This seems like bad coding style in a way, but how many times have you forgotten to initialize a data member in a constructor?
  • Zero means “not set” for every flag, enum, status code, etc. This is a policy supporting the “zero all memory” defensive idea.
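
Here's a minimal sketch of a few of these policies in code; the macro and wrapper names (DELETE_AND_NULL, free_and_null, malloc_zeroed) are illustrative inventions, not standard names:

    #include <cstdlib>
    #include <cstring>

    // Policy: null the pointer after every delete.
    #define DELETE_AND_NULL(ptr)  do { delete (ptr); (ptr) = nullptr; } while (0)

    // Policy: null the pointer after every free (pass-by-reference wrapper).
    template <typename T>
    void free_and_null(T*& ptr)
    {
        std::free(ptr);
        ptr = nullptr;
    }

    // Policy: all allocated memory must be zeroed.
    inline void* malloc_zeroed(size_t nbytes)
    {
        void* ptr = std::malloc(nbytes);
        if (ptr != nullptr)
            std::memset(ptr, 0, nbytes);
        return ptr;
    }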

Assume failures will happen: Plan ahead to make failures easier to detect and debug (supportability!), even when they happen in production code:

  • Use extra messages in assertions, and make them brief but useful.
  • If an assertion keeps failing in testing, or fails in production for users, change it to more detailed self-checking code that emits a more detailed error.
  • Add unique code numbers to error messages to make identifying causes easier (supportability).
  • Separately check different error occurrences. Don't use only one combined assertion like assert(s && *s); (see the sketch after this list).
  • Review assertions for cases where lazy code jockeys have used them to check return codes (e.g. file not found).
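
For example, splitting a combined assertion gives two distinct failure points, and unique code numbers make each report self-identifying (the E1001-style codes here are illustrative):

    #include <cassert>

    void process_token(const char* s)
    {
        assert(s != nullptr && "E1001: token string is null");
        assert(*s != '\0' && "E1002: token string is empty");
        // ... rest of the processing ...
    }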

Maintainability

My first software engineering job was maintaining low-level systems management code, written in C with hardly any comments, on a lumbering Ultrix box. You'd think I'd hate code maintenance, right? No, I had the opposite reaction: it was the best job ever!

If you think you don't like code maintenance, consider this: Code maintenance is what you do every day. I mean, except for those rare days where you're starting a new project from scratch, you're either maintaining your own code or someone else's, or both. There are two main modes: you're either debugging issues or extending the product with new features, but in both cases it is at some level a maintenance activity.

So, how do you improve future maintainability of code? And how do you fix up old code that's landed on your desk, flapping around like a seagull, because your company acquired a small startup?

Let's consider your existing code. How would you make your code better so that a future new hire can be quickly productive? The answer is probably not that different to the general approach to improving reliability of your code. Things like unit tests, regression testing, exception handling, and so on will make it easier for a new hire. You can't stop that college intern from re-naming all the source code files or re-indenting the whole codebase, but at least you can help them to not break stuff.

One way to think about future maintainability is to take a step back and think of it as a “new hire induction” problem. After you've shown your new colleague the ping pong table in the lunch room and the restrooms, they need to know:

  • Where is the code, and how do I check it out of the repo?
  • How do I build it? Run it? Test it?
  • Where's the bug database, requirements documents, or enhancements list?
  • What are the big code libraries? Which directories?

After that, then you can get into the nitty-gritty of how the C++ is laid out. Where are the utility libraries that handle low-level things like files, memory allocation, strings, hash tables, and whatnot? Which code modules do the higher-level AI engine features like activation functions, MatMul, tokenization, and so on? Where do I add a new unit test? A new command-line argument or configuration property?

Maintenance safety nets: How do you make your actual C++ code resilient to the onslaught of a new hire programmer? Assume that future changes to the code will often introduce bugs, and try to plan ahead to catch them using various coding tricks. Actually, the big things in preventing future bugs are the large code reliability techniques (e.g. unit tests, assertions, comment your code, blah blah blah). There are a lot of little things you can do, which are really quite marginal compared to the big things, but are much more fun, so here's my list:

  • All variables should be initialized, even if it'll be immediately overwritten (i.e. “int x=3;” never just “int x;”). The temptation to not initialize is mainly from variables that are only declared so as to be passed into some other function to be set as a reference parameter. And yes, in this case, it's an intentional micro-inefficiency to protect against a future macro-crashability.
  • Unreachable code should be marked with at least a comment or preferably an attribute or assertion (e.g. use the “yassert_not_reached” assertion idea).
  • Prefer statement blocks with curly braces to single-statements in any if, else, or loop body. Also for case and default. Use braces even if all fits on one line. Otherwise, some newbie will add a second statement, guaranteed.
  • Once-only initialization code that isn't in a constructor should also be protected (e.g. the “yassert_once” idea).
  • All switch statements need a default (even if it just triggers an assertion).
  • Don't use case fallthrough, except it's allowed for Duff's Device and any other really cool code abuses. Tag it with [[fallthrough]] if you must use it.
  • Avoid preprocessor macros. Prefer inline functions rather than function-like macro tricks, and do named constants using const or enum names rather than #define. I've only used macros in this book for educational purposes, and you shouldn't even be looking at my dubious coding style.
  • Declare a dummy enum at the end of an enum list (e.g. “MyEnum_EOL_Dummy”), and use this EOL name in any range-checking of values of enum variables. Otherwise, it breaks when someone adds a new enum at the end. EOL means “end-of-list” if you were wondering. (There's a sketch of this idea after the list.)
  • Add some range-checking of your enum variables, because you forgot about that. Otherwise array indices and enum variables tend to get mixed up when you have a lot of int variables.
  • Assert the exact numeric values of a few random enum symbols, and put cuss words in the optional message, telling newbie programmers that they shouldn't add a new enum at the top of the list.
  • sizeof(varname) is better than sizeof(int) for when someone changes the variable to long type. Similarly, use sizeof(arr[0]) and sizeof(*ptr). No, the * operator isn't actually evaluated inside sizeof.
  • All classes should have the “big four” (constructor, destructor, copy constructor, and assignment operator), even if they're silly, like when the destructor is just {}.
  • If your class should not ever be bitwise-copied, then declare a dummy copy constructor and assignment operator (i.e. as “private” and without a function body), so the compiler prevents a newbie from accidentally doing something that would be an object bitwise copy.
  • If your AI code needs a mathematical constant, like the reciprocal of the square root of pi, just work it out on your calculator and type the number in directly. Job security.
  • A switch over an enum should usually have the default clause as an error or assertion. This detects the code maintenance situation where a newly added enum code isn't being handled.
  • Avoid long if-else-if sequences. They get confusing. They also break completely if someone adds a new “if” section in the middle, but forgets it should be “else if” instead.
  • Instigate a rule that whoever breaks the build has to bring kolaches tomorrow.
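
Here's a minimal sketch of a few of these safety nets together: the EOL dummy enum, range-checking, an asserting default clause, and a non-copyable class (the names TokenType and ModelWeights are illustrative; modern C++ can use “= delete” instead of the old private-without-body trick):

    #include <cassert>

    enum TokenType {
        TOKEN_WORD = 10000,   // distinct starting number for this enum
        TOKEN_NUMBER,
        TOKEN_PUNCT,
        TOKEN_EOL_DUMMY       // end-of-list marker for range checks
    };

    #define VALID_TOKEN_TYPE(t)  ((t) >= TOKEN_WORD && (t) < TOKEN_EOL_DUMMY)

    const char* token_name(TokenType t)
    {
        assert(VALID_TOKEN_TYPE(t));
        switch (t) {
            case TOKEN_WORD:   return "word";
            case TOKEN_NUMBER: return "number";
            case TOKEN_PUNCT:  return "punctuation";
            default:
                assert(false && "unhandled TokenType value");  // catches newly added enums
                return "unknown";
        }
    }

    class ModelWeights {   // must never be bitwise-copied
    public:
        ModelWeights() {}
        ~ModelWeights() {}
        ModelWeights(const ModelWeights&) = delete;             // C++11 style
        ModelWeights& operator=(const ModelWeights&) = delete;
    };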

But don't sweat it. New hires will break your code, and then just comment out the unit test that fails.

Maintaining OPC. What about brand-new code? It's from that startup that got acquired, and it's really just a hacked-up prototype that should never have shipped. Now it's landed on your desk with a big red bow wrapped around it and a nice note from your boss telling you how much it'll be appreciated if you could have a little look at this. At least it's a challenge, and maybe you could even learn a little Italian, because that's the language the comments are written in.

So, refactoring has to be top of the list. You need to move code around so that it is modular, easier to unit test, and so on. Split out smaller functions and group all the low-level factory type routines. Writing some internal documentation about new code doesn't hurt either! And “canale” means “channel” in Italian so now you're bilingual.

Technical Debt

When programmers talk in disparaging tones about “technical debt” in code, what they often mean is that the code wasn't written “properly.” A prototype got shipped long ago, and was never designed well, or in fact, was never designed at all. Some other giveaways of high technical debt are basically:

  • Unit tests? That's someone else's job.
  • Documentation? Never heard of it. Oh, you meant code comments? We don't use those.
  • File Explorer is a source code control system.
  • And a backup tool.
  • Bug tracking tool? Do you mean the whiteboard?
  • Requirements documentation. Also the whiteboard.
  • Test plan? Eating free bananas while I test it.

Or to summarize all these points into one:

  • You work at an AI startup.

Debt-Free Code: The good news is that there is a popular software development paradigm that has zero technical debt. It's called Properly-Written Code (PWC) and programmers are always talking about it in hushed or strident tones. Personally, I've been watching for years, but haven't yet been fortunate enough to actually see any, but apparently it exists somewhere out in the wild, kind of like the Loch Ness Monster, but with semicolons.

Exactly what properly-written code means is rather vague, but the suggested solution is usually a refactor or a rewrite. Personally, I favor refactoring, because I think rewrites actually increase technical debt, since the brand-new code:

    a) Lacks unit tests.

    b) Lacks internal documentation.

    c) Hasn't been “tested by fire” in real customer usage.

    d) Hasn't been tested by anyone, for that matter.

    e) Is a “version 1.0” no matter how you try to spin it.

So, here's my probably-unpopular list of suggestions for reducing technical debt without rewriting anything:

  • Comment your code!
  • Fix compiler warnings to get warning-free compilation.
  • Add more assertions and self-checking code.
  • Check return codes from system functions (e.g. file operations).
  • Add parameter validation checks to your functions.
  • Add debug wrapper functions for selected system calls.
  • Add automated tests (unit tests or regression tests).
  • Port the platform-independent code modules to another platform. Even if only to get compiler warnings and run tests.
  • Add performance instrumentation (i.e. time); see the timer sketch after this list.
  • Add memory usage instrumentation (i.e. space).
  • Add file usage instrumentation.
  • Document the architecture, APIs, classes, data formats, or interfaces. With words.
  • Add unique codes to error messages (for supportability).
  • Document your DevOps procedures.
  • Run nightly builds, and with tests running, too.
  • Do a backup once in a while.
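
As a minimal sketch of time instrumentation, a scope-based timer is a cheap place to start (ScopedTimer is an illustrative name, not a standard class):

    #include <chrono>
    #include <cstdio>

    class ScopedTimer {   // prints elapsed time when it goes out of scope
    public:
        explicit ScopedTimer(const char* label)
            : m_label(label), m_start(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            auto usec = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - m_start).count();
            std::fprintf(stderr, "TIMER %s: %lld usec\n", m_label, (long long)usec);
        }
    private:
        const char* m_label;
        std::chrono::steady_clock::time_point m_start;
    };

    // Usage: declare "ScopedTimer t("matmul");" at the top of any function.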

And if you're at a startup or a new project, get your tools sorted out for professional software development workflows:

  • Compilers and IDEs. Two is better than one.
  • Memory error detection (e.g. Valgrind on Linux is my favorite)
  • Source code control (e.g. SVN or git or CVS)
  • CI/CD/CT build system
  • Bug tracking system
  • Internal documentation tools
  • User support database

What really makes better code? Well, that's a rather big question about the entirety of software development practices, so I'll offer only one final suggestion: humans. My overarching view is that the quality of code is most impacted by the ability and motivation of the programmers, rather than by new tools or a trendy programming language (or even an AI coding copilot). A small team that is “on fire” can outpace a hundred coders sitting in meetings talking about the right way to do agile development processes. Hence, morale of the team is important, too.
