
AI Engine Automated Testing

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Automated testing can be applied to AI engines and incorporated into every build. There are two main types of testing: unit tests and regression tests, although there's some overlap between them.

Unit Testing. Unit tests have saved my bacon so many times. And yet, I have also often neglected to unit test a function, only to regret it later. The more unit tests you add, the better. When I express the unpopular view that rewriting code adds technical debt, rather than reducing it, the absence of unit tests is one of the main reasons why.

An AI engine benefits from unit testing and regression testing just as much as any other large project. An AI engine's unit tests would be things like testing a vector dot product kernel, whereas a regression test would run an entire query through all the layers of the model to get an output response.

Unit Testing Harness. The unit testing harness is the code that runs multiple unit tests. The JUnit tool for Java was a major advance at the time, leapfrogging past C++, but there are now a variety of free C++ test harness libraries including:

  • Google Test (BSD-3-clause license)
  • Boost.Test (Boost Software License)
  • CppUnit (a port of Java's JUnit; LGPLv2 license)

Alternatively, you can create one yourself. I often build my own simple unit test API (a sketch follows the list) with basic features including:

  • Test API versions for Boolean, integer, and float types.
  • Error context information on failures (e.g., filename and line number).
  • A running count of failures.
  • A count of the total number of tests run.
  • A success or failure report at the end.
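
Here's a minimal sketch of such a harness in C++. The macro and function names are illustrative assumptions, not any standard API; the macros capture the failing expression's text, filename, and line number automatically.

    // Minimal sketch of a home-grown unit test harness (illustrative names).
    #include <cstdio>
    #include <cmath>

    static int g_test_count = 0;   // total tests executed
    static int g_fail_count = 0;   // total failures seen

    // Macros capture the expression text, filename, and line number.
    #define TEST_BOOL(cond) test_bool((cond), #cond, __FILE__, __LINE__)
    #define TEST_INT(expr, expected) \
        test_int((expr), (expected), #expr, __FILE__, __LINE__)
    #define TEST_FLOAT(expr, expected) \
        test_float((expr), (expected), #expr, __FILE__, __LINE__)

    void test_bool(bool ok, const char* str, const char* file, int line)
    {
        ++g_test_count;
        if (!ok) {
            ++g_fail_count;
            fprintf(stderr, "TEST FAILED: %s (%s:%d)\n", str, file, line);
        }
    }

    void test_int(long val, long expected, const char* str, const char* file, int line)
    {
        ++g_test_count;
        if (val != expected) {
            ++g_fail_count;
            fprintf(stderr, "TEST FAILED: %s == %ld, expected %ld (%s:%d)\n",
                str, val, expected, file, line);
        }
    }

    void test_float(float val, float expected, const char* str, const char* file, int line)
    {
        ++g_test_count;
        if (fabsf(val - expected) > 1e-6f) {   // tolerance for float comparison
            ++g_fail_count;
            fprintf(stderr, "TEST FAILED: %s == %g, expected %g (%s:%d)\n",
                str, val, expected, file, line);
        }
    }

    void test_report(void)   // final summary of the whole run
    {
        if (g_fail_count == 0)
            printf("SUCCESS: all %d tests passed\n", g_test_count);
        else
            printf("FAILURE: %d of %d tests failed\n", g_fail_count, g_test_count);
    }

Usage then looks like unit-testing any low-level kernel, such as the vector dot product mentioned above:

    // Example: unit-testing a simple vector dot product kernel.
    float vector_dot(const float* a, const float* b, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) sum += a[i] * b[i];
        return sum;
    }

    int main(void)
    {
        float v1[3] = { 1.0f, 2.0f, 3.0f };
        float v2[3] = { 4.0f, 5.0f, 6.0f };
        TEST_FLOAT(vector_dot(v1, v2, 3), 32.0f);   // 4 + 10 + 18 = 32
        TEST_BOOL(vector_dot(v1, v1, 3) > 0.0f);
        test_report();
        return g_fail_count != 0;
    }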

Regression Testing. Unit tests are distinct from “regression tests.” The idea with unit tests is to call very low-level C++ functions to ensure they work in a basic way. Regression tests are much higher-level and test whether the whole component is running as desired.

A typical regression test for an AI engine is to pass it a query and check that its answer is the same as it was last Tuesday. To do so, you need a batch method to run a prompt through the model, such as a command-line interface.
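
As a rough sketch, an exact-match regression check can compare the engine's answer against a stored "golden" output file. The engine entry point is passed in as a function pointer here, since the real batch interface will vary by engine:

    // Sketch: exact-match regression check against a saved "golden" answer.
    #include <fstream>
    #include <sstream>
    #include <string>

    std::string read_file(const std::string& path)
    {
        std::ifstream f(path);
        std::ostringstream ss;
        ss << f.rdbuf();
        return ss.str();
    }

    // Returns true if the engine's answer still matches the stored output.
    // run_query is the engine's batch entry point (engine-specific).
    bool regression_check(const std::string& prompt_path,
                          const std::string& golden_path,
                          std::string (*run_query)(const std::string&))
    {
        std::string answer = run_query(read_file(prompt_path));
        return answer == read_file(golden_path);
    }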

To run a full regression test through an AI engine, you need to control everything. Various aspects of the code must be managed so that the regression test case still produces identical output, including:

  • Random number seed
  • Time-specific code
  • RAG component (“retriever”)
  • Inference cache component
  • Data source integrations
  • Tool interfaces
  • Model file

Randomness in the top-k decoding algorithm is the enemy of regression tests. But you can simply curtail the model's creativity by using a fixed random number seed. This can be done either by allowing the regression test to specify a seed, or by having a special “testing” mode where a hard-coded fixed seed is used.
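
A minimal sketch of this idea, with illustrative names for the mode flag and initialization function:

    // Sketch: seed control for reproducible top-k decoding (names assumed).
    #include <cstdlib>
    #include <ctime>

    static bool g_testing_mode = false;      // enabled by the test harness
    static unsigned int g_test_seed = 42u;   // hard-coded seed for test mode

    void set_testing_mode(bool on) { g_testing_mode = on; }

    void init_decoder_rng(unsigned int user_seed)
    {
        if (g_testing_mode)
            srand(g_test_seed);               // deterministic sampling for tests
        else if (user_seed != 0u)
            srand(user_seed);                 // seed specified by the caller
        else
            srand((unsigned int)time(NULL));  // normal mode: varies per run
    }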

Another area that can change is the current time. The AI engine probably interfaces with a “tool” that helps it answer time-specific questions. To make this consistent in regression tests, you also need to intercept these API calls so that the test harness can specify what time to use.
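
One simple approach is a clock wrapper that the engine always calls instead of time() directly, so the test harness can pin the clock. A sketch, with assumed names:

    // Sketch: intercepting the clock so tests can fix the current time.
    #include <ctime>

    static bool g_time_is_fixed = false;
    static time_t g_fixed_time = 0;

    void set_fixed_time(time_t t)    // called by the regression test harness
    {
        g_time_is_fixed = true;
        g_fixed_time = t;
    }

    time_t engine_current_time()     // engine code calls this, never time() directly
    {
        return g_time_is_fixed ? g_fixed_time : time(NULL);
    }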

To the extent that the AI engine relies on other major components, you need to control their interactions for regression testing. For example, a regression test of the engine itself needs to provide a fixed input from the RAG retrieval component. You may need to alter the RAG component so that it logs its answers to a file, which can then be re-used in regression tests.
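
One way to structure this is a retriever interface with a replay implementation that serves the previously logged answers. The class names here are assumptions for illustration:

    // Sketch: record/replay wrapper for the RAG retriever component.
    #include <map>
    #include <string>
    #include <utility>

    struct Retriever {
        virtual std::string retrieve(const std::string& query) = 0;
        virtual ~Retriever() { }
    };

    // Replay mode: serve canned answers logged from an earlier live run.
    class ReplayRetriever : public Retriever {
        std::map<std::string, std::string> m_logged;  // query -> logged chunks
    public:
        explicit ReplayRetriever(std::map<std::string, std::string> logged)
            : m_logged(std::move(logged)) { }
        std::string retrieve(const std::string& query) override {
            auto it = m_logged.find(query);
            return it == m_logged.end() ? std::string() : it->second;
        }
    };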

Similarly, anything else that alters the conditions of a test needs to be held fixed. We probably want to disable any caching components, except when we're trying to test the cache module itself. Integrations with data sources and tools are also areas where the AI engine sends requests and receives data, so these need to be predictable for regression tests to work correctly.

The simplest idea is that the AI engine itself could simply log its own input prompts and output results for later usage in testing. It would also need to log its configuration settings (e.g. temperature), random number seed, and time-related data. In this way, a huge test suite of prompt-response pairs with fixed configuration settings can be generated for later re-testing.
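
A sketch of what one such logged record might contain (the field list is illustrative, not exhaustive):

    // Sketch: one logged test case record for later replay.
    #include <ctime>
    #include <string>

    struct TestCaseRecord {
        std::string prompt;      // input prompt text
        std::string response;    // engine output at logging time
        float temperature;       // decoding configuration settings
        int top_k;
        unsigned int rng_seed;   // random number seed used for the run
        time_t fixed_time;       // clock value during the run
    };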

Finally, note that we don't change the model file for regression testing. The goal of regression testing is not to test the model; that's called “model evaluation” and is a separate discipline. Instead, regression testing aims to re-check the C++ in the Transformer engine's kernels, and it actually works best if we keep the old model files around and re-use them. An updated model is unlikely to produce exactly the same output on a large range of inputs (nor would we want it to!), so you need to build a new regression test suite for a new model.

Advanced regression testing. The above section assumes you are sending the same inputs and testing for the exact output string. A more flexible approach is to test the results approximately, using techniques such as the following (sketched in code after the list):

  • Substring match (e.g. factual questions get an answer containing the right words).
  • Vector match (i.e., the answer's vector embeddings are “close enough” semantically to the expected result).
  • Disallowed words (e.g., substring match to ensure “non-safe” words are absent).
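
The first and third checks reduce to simple substring scans, and the vector match is a cosine similarity over embedding vectors compared against a chosen threshold. A rough sketch:

    // Sketch: approximate pass/fail checks on a model response.
    #include <cmath>
    #include <string>
    #include <vector>

    // Substring match: the answer must contain every required word.
    bool contains_all(const std::string& answer,
                      const std::vector<std::string>& required)
    {
        for (const auto& w : required)
            if (answer.find(w) == std::string::npos) return false;
        return true;
    }

    // Disallowed words: the answer must contain none of the banned words.
    bool contains_none(const std::string& answer,
                       const std::vector<std::string>& banned)
    {
        for (const auto& w : banned)
            if (answer.find(w) != std::string::npos) return false;
        return true;
    }

    // Vector match: cosine similarity between the answer's embedding and
    // the expected embedding; pass if above a threshold (e.g., 0.9).
    float cosine_similarity(const float* a, const float* b, int n)
    {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (int i = 0; i < n; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (sqrtf(na) * sqrtf(nb) + 1e-12f);  // guard zero norms
    }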

Many of these methods go beyond basic regression testing and into the area of “model evaluation.” Running a suite of test prompts through a new model is one part of evaluating its smartness and safety.

 

