
Global KV Prefill/Encoder Caching

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


Basic KV caching stores the values of K and V across multiple token processing steps, but only within a single query. It is a form of temporary local cache. This reduces the cost of autoregression in long sequences, but the cache is not retained between queries, so nothing is shared across multiple queries. At the other extreme is a full “inference cache” that stores the results for identical queries in a global cache of all prior answers, completely avoiding the inference expense whenever a cached answer is used.
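
As a minimal sketch of that second extreme, an exact-match global inference cache can be as simple as a map from the full query text to a previously generated answer. The class below is illustrative only, not a real API; a cache hit skips inference entirely.

    #include <optional>
    #include <string>
    #include <unordered_map>

    // Hypothetical sketch of a global "inference cache": an exact-match map
    // from the full query text to a previously generated answer.
    class InferenceCache {
    public:
        std::optional<std::string> lookup(const std::string& query) const {
            auto it = cache_.find(query);
            if (it == cache_.end()) return std::nullopt;  // Miss: must run inference.
            return it->second;  // Hit: return the stored answer, no inference at all.
        }
        void store(const std::string& query, const std::string& answer) {
            cache_[query] = answer;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };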

In between these two approaches is the encoder/prefill KV caching method. This is on-disk caching of the prefill/encoder KV results across multiple user queries. For an encoder-decoder architecture, this stores the K and V results after the encoder has finished. For a decoder-only architecture, the KV results after the encoder-like “prefill” phase are stored.

This idea avoids the expense of running the encoder or the prefill phase, but the full decoder stack is still executed. Hence, it is only a partial caching of the inference computation, and significantly different answers can still result from the randomness inherent in the various decoding algorithms (e.g., top-k decoding).

The KV results can be cached for identical queries, across many users, so that when a user inputs the same text, the KV computations do not have to be redone, but can simply be loaded from the disk cache. If no cached KV results are found, the full encoder/prefill phase is performed as usual, and its results can then be added to the cache.
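
The code below is one possible sketch of this lookup-or-compute flow in C++, caching prefill KV results on disk keyed by a hash of the exact query text. The KVLayer layout, the file format, and the run_prefill() stub are all assumptions for illustration; a real engine would use its own tensor types and serialization.

    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical per-layer K and V tensors from the prefill/encoder phase,
    // flattened to 1-D for simple serialization.
    struct KVLayer {
        std::vector<float> keys;
        std::vector<float> values;
    };

    // Placeholder for the real engine's prefill; returns dummy data here.
    std::vector<KVLayer> run_prefill(const std::string& query) {
        (void)query;
        return { KVLayer{ {0.0f}, {0.0f} } };
    }

    // Illustrative cache key: a stable hash of the exact query text.
    static std::string cache_path(const std::string& query) {
        return "kvcache/" + std::to_string(std::hash<std::string>{}(query)) + ".kv";
    }

    // Serialize the per-layer KV tensors to a binary cache file.
    static void save_kv(const std::string& path, const std::vector<KVLayer>& layers) {
        FILE* f = std::fopen(path.c_str(), "wb");
        if (!f) return;  // Cache directory missing; skip caching this time.
        uint64_t n = layers.size();
        std::fwrite(&n, sizeof n, 1, f);
        for (const auto& l : layers) {
            uint64_t ks = l.keys.size(), vs = l.values.size();
            std::fwrite(&ks, sizeof ks, 1, f);
            std::fwrite(l.keys.data(), sizeof(float), ks, f);
            std::fwrite(&vs, sizeof vs, 1, f);
            std::fwrite(l.values.data(), sizeof(float), vs, f);
        }
        std::fclose(f);
    }

    // Load cached KV tensors; returns false on a cache miss or bad file.
    static bool load_kv(const std::string& path, std::vector<KVLayer>& layers) {
        FILE* f = std::fopen(path.c_str(), "rb");
        if (!f) return false;  // Cache miss: no file for this query.
        uint64_t n = 0;
        bool ok = std::fread(&n, sizeof n, 1, f) == 1;
        layers.resize(ok ? n : 0);
        for (auto& l : layers) {
            uint64_t ks = 0, vs = 0;
            ok = ok && std::fread(&ks, sizeof ks, 1, f) == 1;
            l.keys.resize(ok ? ks : 0);
            ok = ok && std::fread(l.keys.data(), sizeof(float), ks, f) == ks;
            ok = ok && std::fread(&vs, sizeof vs, 1, f) == 1;
            l.values.resize(ok ? vs : 0);
            ok = ok && std::fread(l.values.data(), sizeof(float), vs, f) == vs;
        }
        std::fclose(f);
        return ok;
    }

    // Lookup-or-compute: load cached prefill KV results if present,
    // else run the full prefill and add its results to the disk cache.
    std::vector<KVLayer> load_or_compute_prefill(const std::string& query) {
        std::vector<KVLayer> layers;
        const std::string path = cache_path(query);
        if (load_kv(path, layers))
            return layers;            // Cache hit: prefill phase skipped.
        layers = run_prefill(query);  // Cache miss: compute as usual.
        save_kv(path, layers);        // Store for later identical queries.
        return layers;
    }

Note that the cached KV results are only valid for an exactly identical token sequence, so the cache key should cover the full prompt, including any system prompt or template tokens, since a single differing token invalidates the exact-match cache.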

The simplest approach is to cache KV results only for exactly identical queries. It is also possible to extend this idea to a “semantic cache” with vector caching, which reuses the encoder/prefill KV results for any “close enough” queries. This differs from a full semantic cache because only the encoder/prefill data is cached, rather than the resulting logits or the full answer text.
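
Here is a minimal sketch of the “close enough” test for such a semantic cache, assuming the query embeddings come from a separate embedding model: the new query's vector is compared against each cached query's vector, and any cosine similarity above a threshold counts as a hit, allowing that entry's stored prefill KV results to be reused. The 0.95 threshold is an arbitrary assumption for this sketch.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Cosine similarity between two equal-length embedding vectors.
    static float cosine_similarity(const std::vector<float>& a,
                                   const std::vector<float>& b) {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
    }

    // Illustrative semantic lookup: return the index of the first cached
    // query embedding that is "close enough" to the new query's embedding,
    // or -1 on a miss. The hit's stored prefill KV results can then be
    // reused. The 0.95 threshold is an assumption, not a recommendation.
    int semantic_lookup(const std::vector<float>& query_vec,
                        const std::vector<std::vector<float>>& cached_vecs,
                        float threshold = 0.95f) {
        for (std::size_t i = 0; i < cached_vecs.size(); ++i) {
            if (cosine_similarity(query_vec, cached_vecs[i]) >= threshold)
                return static_cast<int>(i);
        }
        return -1;  // No sufficiently similar query in the cache.
    }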

 
