
  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What is Caching?

Caching is the general optimization method where computed results are stored and re-used, rather than repeating the same computation later. The idea is to trade off extra memory usage to save execution time. This works best when exactly the same computation is repeated, but it can also work for repetitions of near-identical computations. In the research literature, caching algorithms for neural networks are also called “memoization,” “data re-use,” or “computation re-use” algorithms.
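
As a small illustration of this memory-for-time trade-off, here is a minimal C++ memoization sketch for a single-argument function. The function expensive_activation, the unbounded std::unordered_map, and the store-everything policy are illustrative assumptions, not code from any particular engine.

    // Minimal memoization sketch: cache the results of an expensive pure
    // function, keyed by its input, trading extra memory for execution time.
    #include <cmath>
    #include <iostream>
    #include <unordered_map>

    // Hypothetical expensive computation to be memoized.
    double expensive_activation(double x) {
        return std::tanh(x);  // stand-in for a costly operation
    }

    double cached_activation(double x) {
        static std::unordered_map<double, double> cache;  // input -> result
        auto it = cache.find(x);
        if (it != cache.end()) {
            return it->second;  // cache hit: re-use the stored result
        }
        double result = expensive_activation(x);  // cache miss: compute it
        cache[x] = result;                        // store for next time
        return result;
    }

    int main() {
        std::cout << cached_activation(0.5) << "\n";  // computed
        std::cout << cached_activation(0.5) << "\n";  // re-used from cache
    }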

There are at least seven caching optimizations known for Transformers:

  • KV caching
  • Encoder/prefill KV caching
  • Inference cache
  • Semantic cache
  • Vector dot product caching
  • Input similarity caching
  • Cached matrix transpose

KV caching is the best known of these optimizations, and relates to the K and V tensors used in the QKV attention mechanism. It was quickly discovered that some of the K and V tensor calculations could be cached between tokens, thereby avoiding repeated matrix computations in the usual autoregressive decoding loop. This is temporary caching used only while processing a single query, rather than across multiple user queries.
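
A simplified single-head sketch of the idea in C++ follows. The structure name KVCache, the per-token project() routine, and the weight-matrix layout are assumptions for illustration, not the actual engine code.

    // Simplified KV cache sketch for autoregressive decoding (illustrative).
    #include <vector>

    struct KVCache {
        // One entry per processed token: its K and V projection vectors.
        std::vector<std::vector<float>> keys;    // [num_tokens][head_dim]
        std::vector<std::vector<float>> values;  // [num_tokens][head_dim]
    };

    // Hypothetical per-token projection: weight matrix times token embedding.
    std::vector<float> project(const std::vector<float>& token_embedding,
                               const std::vector<std::vector<float>>& weight) {
        std::vector<float> out(weight.size(), 0.0f);
        for (size_t row = 0; row < weight.size(); ++row)
            for (size_t col = 0; col < token_embedding.size(); ++col)
                out[row] += weight[row][col] * token_embedding[col];
        return out;
    }

    // On each decoding step, only the new token's K and V are computed;
    // the K and V vectors of all earlier tokens are re-used from the cache.
    void append_token(KVCache& cache,
                      const std::vector<float>& new_token_embedding,
                      const std::vector<std::vector<float>>& Wk,
                      const std::vector<std::vector<float>>& Wv) {
        cache.keys.push_back(project(new_token_embedding, Wk));
        cache.values.push_back(project(new_token_embedding, Wv));
        // Attention for this step scans cache.keys and cache.values in full,
        // but never re-runs the K/V matrix multiplications for earlier tokens.
    }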

Caching can also be done at the highest level, with complete model inference answers stored in a global cache. Inference results can be cached across multiple queries from multiple users, so that repeated identical queries need not be re-computed. When the entire result of an inference calculation is saved and re-used, the optimization is called an “inference cache.”
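
Conceptually, a minimal inference cache is just a map from the exact query text to its stored response, as in this sketch; the class name and the hypothetical run_model_inference() call in the usage comment are assumptions, and a real deployment would add thread-safety, eviction, and expiry.

    // Inference cache sketch (illustrative): full responses keyed by exact
    // query text, shared across users and queries.
    #include <optional>
    #include <string>
    #include <unordered_map>

    class InferenceCache {
    public:
        std::optional<std::string> lookup(const std::string& query) const {
            auto it = cache_.find(query);
            if (it == cache_.end()) return std::nullopt;  // cache miss
            return it->second;                            // cache hit
        }
        void store(const std::string& query, const std::string& response) {
            cache_[query] = response;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;  // query -> response
    };

    // Usage sketch: check the cache before running the full model.
    //   if (auto hit = cache.lookup(user_query)) return *hit;
    //   std::string answer = run_model_inference(user_query);  // hypothetical
    //   cache.store(user_query, answer);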

Vectorized caching is also possible for non-exact cache matches, in at least two ways. Semantic caching with vector hashing or vector databases can identify user queries that are non-identical but have the same meaning, so they need not be fully re-computed. Incremental caching of full inference results can be used with “input similarity” algorithms, such as when analyzing the individual frames of a video in an AI engine; optimizations such as frame skipping or partial image caching then become possible.
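
A semantic cache lookup might be sketched as follows, matching queries by cosine similarity over their embedding vectors. The 0.95 threshold and the linear scan are illustrative assumptions; a production system would typically use vector hashing or a vector database for the nearest-neighbor search.

    // Semantic cache sketch (illustrative): queries are matched by embedding
    // similarity rather than exact text.
    #include <cmath>
    #include <optional>
    #include <string>
    #include <vector>

    struct CachedEntry {
        std::vector<float> embedding;  // embedding of the original query
        std::string response;          // previously computed answer
    };

    float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
        float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            norm_a += a[i] * a[i];
            norm_b += b[i] * b[i];
        }
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b) + 1e-8f);
    }

    // Linear scan for clarity; a vector index would replace this in practice.
    std::optional<std::string> semantic_lookup(const std::vector<CachedEntry>& cache,
                                               const std::vector<float>& query_embedding,
                                               float threshold = 0.95f) {
        for (const auto& entry : cache) {
            if (cosine_similarity(entry.embedding, query_embedding) >= threshold)
                return entry.response;  // semantically close enough: re-use
        }
        return std::nullopt;  // no near-duplicate: run full inference instead
    }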

Some research papers have attempted to use caching deep inside the engine to reduce the sheer number of weight multiplications in matrix computations. Low-level caching and computation re-use can be done even at the vector dot product level. By detecting when a similar vector has been seen before, such as by using Locality-Sensitive Hashing (LSH), the cached result of the dot product calculation can be looked up and re-used instead of re-computing it. This approach does not seem to have reached production usage yet.
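
One way such a scheme could be sketched is with a sign-based LSH signature over the incoming activation vector, used as a key into a map of previously computed dot products. The hash design, the single fixed weight vector, and the tolerance for approximate re-use are all assumptions for illustration, not a specific paper's algorithm.

    // Vector dot product caching sketch (illustrative, research-style idea):
    // hash the activation vector with a simple sign-based LSH, and re-use a
    // cached dot product when a near-identical vector hits the same bucket.
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Sign-of-projection LSH: one bit per random hyperplane (up to 64).
    uint64_t lsh_signature(const std::vector<float>& v,
                           const std::vector<std::vector<float>>& hyperplanes) {
        uint64_t sig = 0;
        for (size_t h = 0; h < hyperplanes.size() && h < 64; ++h) {
            float dot = 0.0f;
            for (size_t i = 0; i < v.size(); ++i) dot += v[i] * hyperplanes[h][i];
            if (dot >= 0.0f) sig |= (uint64_t{1} << h);
        }
        return sig;
    }

    // Cache keyed by the LSH bucket of the activation vector. This assumes a
    // single fixed weight vector; a real engine would keep one cache per
    // weight row (or fold the row index into the key).
    std::unordered_map<uint64_t, float> dot_cache;

    float cached_dot_product(const std::vector<float>& activation,
                             const std::vector<float>& weights,
                             const std::vector<std::vector<float>>& hyperplanes) {
        uint64_t bucket = lsh_signature(activation, hyperplanes);
        auto it = dot_cache.find(bucket);
        if (it != dot_cache.end()) return it->second;  // similar vector seen before
        float dot = 0.0f;
        for (size_t i = 0; i < activation.size(); ++i) dot += activation[i] * weights[i];
        dot_cache[bucket] = dot;  // approximate re-use for later near-identical vectors
        return dot;
    }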

Before examining the various caching methods in detail, one proviso: not all queries can be cached. Time-dependent queries cannot be cached over a long duration, either in terms of their text outputs or their KV calculations, because the correct answer changes over time. Consider the response to this user query: “What day is today?”
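
One common way to handle this is to exclude such queries from the cache, or to give their entries a short time-to-live (TTL) so they expire quickly. This small sketch, with an assumed TimedEntry structure, shows the freshness check.

    // TTL sketch (illustrative): a cached response is valid only while fresh.
    #include <chrono>
    #include <string>

    struct TimedEntry {
        std::string response;
        std::chrono::steady_clock::time_point stored_at;
    };

    // A time-dependent answer like "What day is today?" would get a very
    // short TTL (or a zero TTL, meaning it is never served from the cache).
    bool is_fresh(const TimedEntry& entry, std::chrono::seconds ttl) {
        return std::chrono::steady_clock::now() - entry.stored_at < ttl;
    }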

 
