Aussie AI Blog

Generalizing Prefix KV Caching to RAG Chunks

  • October 24, 2024
  • by David Spuler, Ph.D.

Prefix KV Caching Takes Off

There have been two notable trends in relation to "prefix KV caching" in the last couple of months:

    1. Several industry implementations of prefix KV caching,

    2. Multiple research papers on the generalization to "non-prefix KV caching."

There have been quite a few industry announcements of "caching" functionality, under names such as "prompt caching" and "context caching," from platforms including:

  • vLLM
  • Character.AI
  • DeepSeek
  • Google Gemini
  • Anthropic
  • OpenAI
  • OpenVINO

Such is the efficiency benefit of prefix KV caching that the pricing for "cached tokens" is much lower than normal inference pricing. It's faster and cheaper!
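
To make the idea concrete, here's a rough Python sketch of what prefix KV caching does conceptually. The cache structure and the model.prefill interface are hypothetical stand-ins, not any particular vendor's API: if a new request starts with tokens we've already processed, reuse their KV values and only run prefill on the new suffix.

    # Conceptual sketch of prefix KV caching (hypothetical names, not a real API).
    prefix_cache = {}   # tuple(prefix tokens) -> precomputed KV tensors

    def longest_cached_prefix(tokens):
        """Find the longest prefix of `tokens` that already has a KV cache."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in prefix_cache:
                return n, prefix_cache[tuple(tokens[:n])]
        return 0, None

    def prefill_with_cache(tokens, model):
        n, kv = longest_cached_prefix(tokens)
        # Only the uncached suffix needs the expensive prefill computation.
        kv = model.prefill(tokens[n:], past_kv=kv)    # assumed model interface
        prefix_cache[tuple(tokens)] = kv              # remember the full prefix for next time
        return kv

Real implementations typically hash fixed-size token blocks rather than scanning every prefix length, but the effect is the same: a shared prefix, such as a long system prompt, is computed once and reused across requests.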

RAG is Problematic for Prefix KV Caching

The problem with RAG chunks is that they're not always a prefix. The whole idea of RAG is that the retriever returns multiple useful chunks of text, then orders them via the "reranker" component, and sends them to the LLM. But only the first chunk is a "prefix" in this layout.
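
Here's roughly what an assembled RAG prompt looks like (hypothetical variable names; the exact template varies by application), and why only the front of it is cacheable as a prefix:

    # Illustrative RAG prompt assembly (hypothetical names; templates vary).
    instructions = "You are a helpful assistant. Answer using the context below.\n"
    chunk_a = "Chunk A: warranty terms ...\n"
    chunk_b = "Chunk B: returns policy ...\n"
    chunk_c = "Chunk C: shipping times ...\n"
    user_query = "What is the warranty period?\n"

    prompt = instructions + chunk_a + chunk_b + chunk_c + user_query

    # A prefix KV cache can only cover `instructions + chunk_a`, because those
    # tokens start at position 0 of the request. The KV values for chunk_b and
    # chunk_c depend on all the text before them, so they are only reusable if
    # the retriever and reranker return the same chunks in exactly the same order.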

There have been several attempts to backfit prefix KV caching onto RAG, such as:

  • Only returning one chunk to the LLM (i.e., the reranker chooses the single best one).
  • Caching only the first chunk, then processing the rest.
  • Cache-aware reranking (i.e., the reranker tries to return a chunk that has a cache, but this undermines the reranker's main goal of assessing the quality of the chunks, not their speed).
  • Tracking multiple caches for each ordering of RAG chunks (but this is unrealistic for any significant number of chunks; the quick combinatorics sketch below shows why).
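
The combinatorics of that last option are easy to check (the chunk counts below are made up purely for illustration):

    # Number of distinct chunk orderings -- and hence distinct prefix caches --
    # if the reranker returns k chunks chosen from N candidate chunks.
    from math import perm

    print(perm(10, 3))     # 720 orderings for just 3 chunks out of 10
    print(perm(1000, 5))   # 990,034,950,024,000 -- roughly 10**15, clearly not cacheable per ordering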

Fortunately, there's been some further research on RAG chunks and prefix KV caching, with an unexpected discovery.

Research on Non-Prefix KV Caching

The generalization of prefix KV caching is obvious: non-prefix KV caching. This means using a KV cache from a general substring of tokens in the middle of a token stream, rather than only at the beginning (prefix tokens).
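
Conceptually, the cache lookup generalizes from "longest cached prefix" to "any previously seen segment of tokens, wherever it lands in the prompt." Here's a minimal sketch of the per-chunk half of that idea (hypothetical names, and it glosses over the positional-encoding details that make this non-trivial):

    # Sketch of per-chunk KV caching (hypothetical names).
    # Each RAG chunk's KV cache is precomputed once, on its own, independent of
    # where the chunk later appears in a prompt.
    chunk_kv_cache = {}   # chunk_id -> precomputed KV tensors for that chunk's tokens

    def get_chunk_kv(chunk_id, chunk_tokens, model):
        if chunk_id not in chunk_kv_cache:
            # Prefill the chunk as if it started at position 0 ("position-independent").
            chunk_kv_cache[chunk_id] = model.prefill(chunk_tokens)   # assumed interface
        return chunk_kv_cache[chunk_id]

The catch is that a token's K and V values normally depend both on its absolute position (via rotary or other positional encodings) and on attention over all earlier tokens, which is why it isn't obvious that this generalization should work at all.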

I've seen two new papers in this research area in just the last week. There are various names being used:

  • Substring KV cache
  • Fused KV cache
  • Position-independent KV cache
  • KV cache concatenation

These all amount to the same thing: just merge all the KV caches together! It seems like it shouldn't be that simple, and yet there are now at least four research papers that suggest that it is. See my list of research papers on fused KV caching.
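
Here's a minimal sketch of the "just merge them" step, assuming per-chunk KV caches have already been precomputed as in the earlier sketch (NumPy arrays stand in for the real per-layer tensors, and the shapes are purely illustrative):

    import numpy as np

    # Fused KV caching sketch: concatenate independently precomputed per-chunk
    # KV caches along the sequence axis. Shapes: (layers, heads, seq_len, head_dim).
    def fuse_kv_caches(chunk_caches):
        keys   = np.concatenate([k for (k, v) in chunk_caches], axis=2)
        values = np.concatenate([v for (k, v) in chunk_caches], axis=2)
        return keys, values

    # Example: three chunks of 100, 80, and 120 tokens.
    caches = [(np.zeros((32, 8, n, 128)), np.zeros((32, 8, n, 128))) for n in (100, 80, 120)]
    fused_k, fused_v = fuse_kv_caches(caches)
    print(fused_k.shape)   # (32, 8, 300, 128)

    # Note what the fused cache is missing: chunk B's tokens never attended to
    # chunk A's, and none of the chunks attended to the instructions. The user's
    # query is then prefilled on top of the fused cache, e.g.:
    #   model.prefill(query_tokens, past_kv=(fused_k, fused_v))   # assumed interface

That missing cross-chunk attention is exactly the approximation that the open questions below are about.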

Open Questions for Non-Prefix KV Caching

This area still needs more research! Even the basic technique will require further confirmation. Some of the other areas needing examination include:

  • Why does this even work? Does the attention computation during the decoding phase quickly make up for the total lack of inter-chunk and instruction-to-chunk attention in the KV cache?
  • To what extent is this a lossy approximation of the KV cache, and how does that depend on the position of the RAG chunk in the ordering?
  • Does the lossiness of the approximation depend on the length of the chunk? On the number of chunks? On the distance between the chunk and the user query tokens? On the length of any prepended instructions? On the length of the user's query afterwards?
  • Should the KV cache for a RAG chunk be computed with or without any prepended global instructions or meta-instructions? (It's faster without them, but using a prepended instruction sequence is like doing prefix KV caching whenever the chunk is in first position, so the answer may depend on the chunk's ranking position; see the sketch after this list.)
  • To what extent does this fused KV caching method require that global attention is used in the decoding phase? Using local attention (e.g., sliding window) or hybrid local-global attention is a common industry backend optimization. Does this method of RAG chunk KV caching undermine this other optimization method?
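
On that instruction-prepending question, the two variants look like this in the notation of the earlier sketches (hypothetical helpers; which one is better is exactly the open question):

    # Two ways to precompute a RAG chunk's KV cache (hypothetical helpers).

    # (a) Chunk alone: cheaper to compute and position-independent.
    kv_plain = model.prefill(tokenize(chunk_text))

    # (b) Global instructions prepended: like ordinary prefix KV caching whenever
    #     this chunk is ranked first, but the cache is now tied to those
    #     particular instructions.
    kv_prefixed = model.prefill(tokenize(instructions + chunk_text))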

Nevertheless, I am optimistic that we'll see an industry implementation of fused KV caching for RAG chunks in the near future!

Further Reading on RAG Caching
