Aussie AI
KV Caching
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
KV Caching
KV caching stores the results of the K and V vector computations performed in the Transformer's attention heads. Analysis of the vanilla Transformer by researchers has identified at least two distinct ways to cache these results:
- Autoregressive KV caching
- Global encoder/prefill KV caching
Autoregressive decoder KV caching: This is in-memory caching within a single query as the decoder processes multiple tokens. Partial KV tensor computations can be cached in memory during decoding, across the processing of multiple tokens, avoiding re-computation in the next cycle of the decoder stacks. In autoregressive decoder mode, the KV computations for the newest token are not yet in the cache and must be performed fresh, but all prior KV-related calculations can be reused from the cache. This is a subtype of autoregression optimization.
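The following is a minimal C++ sketch of this idea, assuming a simple per-layer cache holding one K vector and one V vector per already-processed token. The names KVCache, compute_kv_for_token, and decode_step are hypothetical illustrations, not an actual library API; a real model would apply learned W_K and W_V projection matrices where the placeholder projection appears.

    #include <vector>
    #include <utility>
    #include <cstddef>

    struct KVCache {
        // One K vector and one V vector per already-processed token.
        std::vector<std::vector<float>> keys;
        std::vector<std::vector<float>> values;

        void append(std::vector<float> k, std::vector<float> v) {
            keys.push_back(std::move(k));
            values.push_back(std::move(v));
        }
        std::size_t size() const { return keys.size(); }
    };

    // Placeholder projection: a real model multiplies by learned W_K and W_V
    // weight matrices; here we copy the embedding to keep the sketch runnable.
    static std::pair<std::vector<float>, std::vector<float>>
    compute_kv_for_token(const std::vector<float>& token_embedding) {
        return { token_embedding, token_embedding };
    }

    // One decoding step: only the new token's K and V are computed fresh;
    // K and V for all earlier tokens are reused straight from the cache.
    void decode_step(KVCache& cache, const std::vector<float>& new_token_embedding) {
        auto kv = compute_kv_for_token(new_token_embedding);      // fresh work for the new token
        cache.append(std::move(kv.first), std::move(kv.second));  // stored for later steps
        // Attention for the new token now scans cache.keys and cache.values,
        // covering the whole sequence without recomputing prior K or V.
    }

The key point is that each decoding step appends exactly one K vector and one V vector, so the cost of the KV projections stays constant per token instead of growing with the sequence length.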
Uncaching KV: Care must be taken in special cases to keep the KV cache accurate and up to date. This is particularly true in algorithms that “look ahead” but must sometimes “backtrack” to a prior token. Caching is efficient when moving forwards, but whenever a token is rejected, some of the cached items must be flushed and recalculated. For example, this occurs in speculative decoding, parallel decoding, beam search decoding, and various other non-autoregressive decoding algorithms. It may also arise in research techniques such as token pruning, token merging, and prompt compression.
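As a hedged sketch of what “uncaching” can look like in practice, the function below rolls the hypothetical KVCache from the previous example back to an accepted prefix length, discarding the K and V entries for every rejected token. The surrounding speculative-decoding flow and the verify_draft_tokens name are assumptions for illustration only.

    // Roll the cache back to the accepted prefix length, so the next decoding
    // step sees a cache consistent with the tokens that were actually kept.
    void rollback_kv_cache(KVCache& cache, std::size_t accepted_length) {
        if (accepted_length < cache.size()) {
            cache.keys.resize(accepted_length);
            cache.values.resize(accepted_length);
        }
    }

In a speculative-decoding loop, for instance, the cache would first be extended optimistically for a batch of draft tokens; after verification (e.g., a hypothetical verify_draft_tokens call returns how many drafts were accepted), rollback_kv_cache trims the cache back to the accepted prefix before decoding continues.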