Aussie AI
Inference Cache
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Inference Cache
A full inference cache is where the entire results of a model inference are stored, and re-used for a later identical query. For example, such an approach would recognize that 100 users are all submitting “This is a test,” whether concurrently or over time, and would do the inference computation only the first time, and retrieve it from the cache for the other 99 users.
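The exact-match case can be sketched with a hash map keyed on the prompt text. This is a minimal illustration, not a production design: the class name `InferenceCache`, the `Model` callback type, and the hit/miss counters are all hypothetical names introduced here, and the model is stood in for by any callable that maps a prompt string to a response string.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical exact-match inference cache: maps the full prompt text to
// the response text produced the first time that prompt was seen.
class InferenceCache {
public:
    // Stand-in for the real inference engine (an assumption of this sketch).
    using Model = std::function<std::string(const std::string&)>;

    explicit InferenceCache(Model model) : model_(std::move(model)) {
        assert(model_ && "a model callback is required");
    }

    // Return the cached response if this exact prompt was seen before;
    // otherwise run the model once and remember its output.
    std::string query(const std::string& prompt) {
        auto it = cache_.find(prompt);
        if (it != cache_.end()) {
            ++hits_;
            return it->second;
        }
        ++misses_;
        std::string response = model_(prompt);
        cache_.emplace(prompt, response);
        return response;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    Model model_;
    std::unordered_map<std::string, std::string> cache_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

With 100 identical calls to `query("This is a test")`, the model callback runs once and the other 99 calls are served from the map. The sketch is not thread-safe and never evicts entries, so concurrent users would need a lock around `query`, and a real deployment would need a size limit or expiry policy.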
Inference caching could involve storing the final results in text form, in which case all users would get exactly the same response. Alternatively, a more flexible approach that still avoids most computations is storing the near-final results in some intermediate form, such as logits (the raw scores that Softmax converts into probabilities), so that a brief final computation can still emit different results to different users. In this way, most of the computation is avoided, and some variability is added to the final output. Another simpler way to add variability to responses would be to cache more than one possible answer for a given input.
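The simpler option of caching several answers per input might look like the following sketch. The class name `MultiAnswerCache`, its methods, and the fixed default seed are assumptions made for illustration; the idea is only that each prompt maps to a small list of full responses, and one is chosen at random on each cache hit.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical cache that keeps up to max_answers responses per prompt
// and returns one of them at random on each lookup.
class MultiAnswerCache {
public:
    explicit MultiAnswerCache(std::size_t max_answers, unsigned seed = 12345u)
        : max_answers_(max_answers), rng_(seed) {
        assert(max_answers_ > 0);
    }

    // True if this prompt has fewer stored answers than the limit, so the
    // caller should run a real inference and add() the result.
    bool wants_more(const std::string& prompt) const {
        auto it = answers_.find(prompt);
        return it == answers_.end() || it->second.size() < max_answers_;
    }

    // Store one more answer for this prompt (ignored once the limit is hit).
    void add(const std::string& prompt, std::string answer) {
        auto& slot = answers_[prompt];
        if (slot.size() < max_answers_) slot.push_back(std::move(answer));
    }

    // Pick one stored answer uniformly at random.
    // Returns false if nothing is cached for this prompt.
    bool pick(const std::string& prompt, std::string& out) {
        auto it = answers_.find(prompt);
        if (it == answers_.end() || it->second.empty()) return false;
        std::uniform_int_distribution<std::size_t> dist(0, it->second.size() - 1);
        out = it->second[dist(rng_)];
        return true;
    }

private:
    std::size_t max_answers_;
    std::mt19937 rng_;
    std::unordered_map<std::string, std::vector<std::string>> answers_;
};
```

A caller would check `wants_more()` first, run full inference while it returns true, and switch to `pick()` once enough answers are stored. The cost is a few full inferences per prompt instead of one, in exchange for some variety across users.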
Caching logit arrays fails. Here's an idea that sounds promising. For every output token, cache the array of logits (or the probabilities obtained from them via Softmax), or rather, just cache the top-k logits for each output token. With k=50, this uses about 50 times more space than the token list alone, but it also holds more useful information. Then, rather than emit the exact same token list for a cached user query, we can re-run the top-k decoding algorithm to get a different random choice from the top-50 tokens at each position.
Unfortunately, it doesn't work very well, because if you change one of the words early in a sequence (i.e. the random top-k choice picks a different token), then the rest of the sentence should change too. Changing one choice of token should alter which words are in the top-k for all subsequent tokens, but they probably won't be the ones we've cached, and the probabilities would be wrong even if we did happen to cache them.
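To make the flaw concrete, here is a rough sketch of the cached top-k idea. The names `Candidate`, `CachedStep`, `sample_step`, and `resample` are hypothetical, and the sampling is a plain Softmax-weighted random choice over the cached candidates. Notice that `resample` draws each position independently of the tokens chosen before it, which is exactly why the output can become incoherent.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One cached candidate: a token id and its logit at some output position.
struct Candidate {
    int token_id;
    float logit;
};

// The cached top-k candidates for a single output position.
using CachedStep = std::vector<Candidate>;

// Sample one token from a cached step, weighting by Softmax of the logits.
int sample_step(const CachedStep& step, std::mt19937& rng) {
    assert(!step.empty());
    float max_logit = step[0].logit;
    for (const auto& c : step) max_logit = std::max(max_logit, c.logit);

    // Subtract the maximum before exponentiating for numerical stability.
    // std::discrete_distribution normalizes the weights itself.
    std::vector<double> weights;
    weights.reserve(step.size());
    for (const auto& c : step) {
        weights.push_back(std::exp(static_cast<double>(c.logit - max_logit)));
    }
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return step[dist(rng)].token_id;
}

// Re-run sampling over every cached position.
// FLAW: each position is sampled without regard to earlier choices, but the
// cached logits were computed for one specific prefix only.
std::vector<int> resample(const std::vector<CachedStep>& cached, std::mt19937& rng) {
    std::vector<int> tokens;
    tokens.reserve(cached.size());
    for (const auto& step : cached) tokens.push_back(sample_step(step, rng));
    return tokens;
}
```

The cached candidates at position 5 are only valid if positions 1 through 4 match the original output. As soon as `resample` picks a different token at an early position, every later `CachedStep` describes a continuation of a sentence that is no longer being generated.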
Another use case for a full inference cache is where the input is similar or continuous. This is typically the case in image processing for machine vision (e.g. self-driving cars) or video analysis (e.g. security camera monitoring). There are many frames per second and they are often not very different. In such cases, the cached inference results from the previous image or frame can often be re-used with modifications for a faster incremental algorithm, or alternatively, the entire results from the previous frame can simply replace the current inference computation (i.e. “frame skipping”).
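Frame skipping can be sketched as a wrapper that re-uses the previous result whenever the new frame is close enough to the last one that was actually processed. Everything here is an assumption for illustration: the `Frame` type is a flat array of 8-bit pixel values, the similarity test is a mean absolute pixel difference, and `FrameSkipper`, `mean_abs_diff`, and the threshold parameter are hypothetical names. A real system would likely use a cheaper or more robust similarity measure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <utility>
#include <vector>

// A frame as a flat array of 8-bit pixel values (an assumption of this sketch).
using Frame = std::vector<std::uint8_t>;

// Mean absolute per-pixel difference between two frames of equal size.
double mean_abs_diff(const Frame& a, const Frame& b) {
    assert(a.size() == b.size() && !a.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        total += std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
    }
    return total / static_cast<double>(a.size());
}

// Hypothetical frame-skipping wrapper: re-uses the last inference result when
// the new frame is within `threshold` of the last frame that was inferred.
// Result must be default-constructible and copy-assignable.
template <typename Result>
class FrameSkipper {
public:
    using Model = std::function<Result(const Frame&)>;

    FrameSkipper(Model model, double threshold)
        : model_(std::move(model)), threshold_(threshold) {
        assert(model_ && threshold_ >= 0.0);
    }

    const Result& process(const Frame& frame) {
        assert(!frame.empty());
        if (has_result_ && frame.size() == last_frame_.size() &&
            mean_abs_diff(frame, last_frame_) <= threshold_) {
            ++skipped_;           // close enough: skip inference entirely
            return last_result_;
        }
        last_result_ = model_(frame);  // full inference on this frame
        last_frame_ = frame;
        has_result_ = true;
        return last_result_;
    }

    std::size_t skipped() const { return skipped_; }

private:
    Model model_;
    double threshold_;
    Frame last_frame_;
    Result last_result_{};
    bool has_result_ = false;
    std::size_t skipped_ = 0;
};
```

The comparison is made against the last frame that was actually run through the model, not the most recent frame seen, so a slow drift across many skipped frames eventually exceeds the threshold and triggers a fresh inference.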