Aussie AI
KV Caching
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
KV Caching
KV caching stores the results of the K and V vector computations performed in the Transformer's attention heads. Analysis of the vanilla Transformer by researchers has identified at least two distinct ways to cache these results:
- Autoregressive KV caching
- Global encoder/prefill KV caching
Autoregressive decoder KV caching: This is in-memory caching within a single query as the decoder processes multiple tokens. Partial KV tensor computations can be cached in memory during decoding, across the processing of multiple tokens, avoiding re-computation in the next cycle of the decoder stacks. In autoregressive decoder mode, the KV computations for the newest token are not yet in the cache and must be performed fresh, but all prior KV-related calculations can be reused from the cache. This is a subtype of autoregression optimization.
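The following is a minimal C++ sketch of this idea, assuming a simple per-layer cache holding one K vector and one V vector per already-processed token. The names KVCache, compute_kv_for_token, and decode_step are hypothetical illustrations, not an actual library API; a real model would apply learned W_K and W_V projection matrices where the placeholder projection appears.

    #include <vector>
    #include <utility>
    #include <cstddef>

    struct KVCache {
        // One K vector and one V vector per already-processed token.
        std::vector<std::vector<float>> keys;
        std::vector<std::vector<float>> values;

        void append(std::vector<float> k, std::vector<float> v) {
            keys.push_back(std::move(k));
            values.push_back(std::move(v));
        }
        std::size_t size() const { return keys.size(); }
    };

    // Placeholder projection: a real model multiplies by learned W_K and W_V
    // weight matrices; here we copy the embedding to keep the sketch runnable.
    static std::pair<std::vector<float>, std::vector<float>>
    compute_kv_for_token(const std::vector<float>& token_embedding) {
        return { token_embedding, token_embedding };
    }

    // One decoding step: only the new token's K and V are computed fresh;
    // K and V for all earlier tokens are reused straight from the cache.
    void decode_step(KVCache& cache, const std::vector<float>& new_token_embedding) {
        auto kv = compute_kv_for_token(new_token_embedding);      // fresh work for the new token
        cache.append(std::move(kv.first), std::move(kv.second));  // stored for later steps
        // Attention for the new token now scans cache.keys and cache.values,
        // covering the whole sequence without recomputing prior K or V.
    }

The key point is that each decoding step appends exactly one K vector and one V vector, so the cost of the KV projections stays constant per token instead of growing with the sequence length.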
Uncaching KV: Care must be taken in special cases to keep the KV cache accurate and up to date. This is particularly true in algorithms that “look ahead” but must sometimes “backtrack” to a prior token. Caching is efficient when moving forwards, but whenever a token is rejected, some of the cached items must be flushed and recalculated. For example, this occurs in speculative decoding, parallel decoding, beam search decoding, and various other non-autoregressive decoding algorithms. It may also arise in research techniques such as token pruning, token merging, and prompt compression.
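As a hedged sketch of what “uncaching” can look like in practice, the function below rolls the hypothetical KVCache from the previous example back to an accepted prefix length, discarding the K and V entries for every rejected token. The surrounding speculative-decoding flow and the verify_draft_tokens name are assumptions for illustration only.

    // Roll the cache back to the accepted prefix length, so the next decoding
    // step sees a cache consistent with the tokens that were actually kept.
    void rollback_kv_cache(KVCache& cache, std::size_t accepted_length) {
        if (accepted_length < cache.size()) {
            cache.keys.resize(accepted_length);
            cache.values.resize(accepted_length);
        }
    }

In a speculative-decoding loop, for instance, the cache would first be extended optimistically for a batch of draft tokens; after verification (e.g., a hypothetical verify_draft_tokens call returns how many drafts were accepted), rollback_kv_cache trims the cache back to the accepted prefix before decoding continues.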