Aussie AI
Caching and Reuse Optimizations
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Transformer Caching Optimizations
Transformer caching optimizations use datastores of cached information to speed up LLM inference. There are several ways that Transformer architectures can exploit caching, falling into two main categories:
- Outside the Transformer — Inference cache, Semantic cache, RAG caching, etc.
- Inside the Transformer — KV caching (basic autoregressive decoding), global KV caching, etc.
When working outside the Transformer, the goal of the cache is to avoid doing any LLM work. Instead, hit the cache, get the answers, and send the answers back to the user. No LLM weights are harmed in this process!
The types of "outside" caching include:
- Inference cache — text to text: for a given input prompt, find the result text in the cache and output it (see the sketch after this list).
- Inference cache of logits.
- Semantic cache — given a query, find a close-enough query in the cache, output the results text.
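To make the "outside" idea concrete, here is a minimal C++ sketch of an exact-match inference cache: a hash map from the full prompt text to the previously generated response. The run_llm_inference function is a hypothetical stand-in for a real LLM call, not any particular API.

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Hypothetical stand-in for a real LLM inference call.
    std::string run_llm_inference(const std::string &prompt) {
        return "LLM answer for: " + prompt;  // placeholder response
    }

    // Exact-match inference cache: prompt text -> cached response text.
    class InferenceCache {
    public:
        std::string query(const std::string &prompt) {
            auto it = cache_.find(prompt);
            if (it != cache_.end()) {
                return it->second;  // cache hit: no LLM work at all
            }
            std::string answer = run_llm_inference(prompt);  // cache miss
            cache_[prompt] = answer;  // store for next time
            return answer;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };

    int main() {
        InferenceCache cache;
        std::cout << cache.query("What is KV caching?") << "\n";  // miss: runs the LLM
        std::cout << cache.query("What is KV caching?") << "\n";  // hit: returned from the cache
    }

A semantic cache replaces the exact string key with a nearest-neighbor lookup over embedding vectors, returning the cached answer whenever a stored query is close enough to the new one.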
Single-query KV caching: When using caching internal to the Transformer inference phase, the goal is to speed up the QKV calculations in the attention heads. In a decoder-only architecture (e.g., GPT-style), another goal is to speed up the "prefill" phase, which is the initial delay before the first token is output.
The main type of "internal" KV caching is the kind that works only within the current query (and is then discarded):
- KV caching — basic internal KV vector caching during the auto-regressive decoding phase (one token at a time).
KV Cache Reuse (Multi-query KV caching): There are also several types of KV cache, called "KV cache reuse" or "global KV caching," that work across multiple queries. The idea is similar to an inference cache, which caches the text results for a single input query, but instead we cache just the KV cache data. With KV cache reuse, when we hit a query we've seen before, we don't have the output, but we do have the preliminary KV data, so we can skip the "prefill" phase and immediately start outputting the first word of our response. Types of these "global KV cache" methods include:
- Global KV caching (basic text version)
- Global KV caching (token sequence version) (also called "context caching")
- Global KV caching (semantic embeddings vector version)
- Prefix global KV caching
- Fused global KV caching
Note that global KV caching involves lookup of a query to find its corresponding KV cache data, which means that this idea can be used with: (a) inference caching text-based lookup, (b) token-based lookup in the inference cache, or (c) semantic caching and vector-based lookup. However, the prefix and fused global KV caching variants rely on token/text matches, so cannot be used with semantic caching (or can they?).
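As an illustration of the token-sequence lookup variant of global KV caching, here is a simplified C++ sketch in which the KV data is treated as an opaque blob, and compute_prefill_kv and decode_with_kv are hypothetical stand-ins for the real prefill and decoding phases.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using TokenSeq = std::vector<int>;
    struct KVCacheData { std::vector<float> blob; };  // opaque stand-in for layer-wise K/V tensors

    // Hypothetical stand-ins for the real inference engine.
    KVCacheData compute_prefill_kv(const TokenSeq &prompt) { return KVCacheData{ {0.0f} }; }
    std::string decode_with_kv(const TokenSeq &prompt, const KVCacheData &kv) { return "response"; }

    // Global KV cache: maps an exact prompt token sequence to its prefill KV data.
    std::map<TokenSeq, KVCacheData> g_kv_cache;

    std::string answer_query(const TokenSeq &prompt_tokens) {
        auto it = g_kv_cache.find(prompt_tokens);
        if (it == g_kv_cache.end()) {
            // Miss: run the full prefill phase and store the KV data for later reuse.
            it = g_kv_cache.emplace(prompt_tokens, compute_prefill_kv(prompt_tokens)).first;
        }
        // Decoding starts from the stored KV data, so a hit skips the prefill latency entirely.
        return decode_with_kv(prompt_tokens, it->second);
    }

    int main() {
        TokenSeq prompt = {101, 2054, 2003, 1029};
        std::cout << answer_query(prompt) << "\n";  // first call: prefill + decode
        std::cout << answer_query(prompt) << "\n";  // second call: decode only
    }

Prefix global KV caching generalizes this exact-match lookup to the longest matching prefix of the prompt tokens, so that only the unmatched suffix needs prefill computation.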
Low-memory KV caching: For both the local and global KV caches, one of the major problems with KV caches is that they grow too large. There are several techniques to cut down the size of the KV cache in memory (or on disk):
- Memory-efficient QKV attention algorithms — Flash attention, Paged attention, Local attention, Linear attention, etc.
- KV cache quantization
- KV cache compression — e.g. sparsity/pruning of the KV cache, KV cache layer fusion, and other variants.
- KV cache eviction (see the sliding-window sketch after this list)
- KV data recomputation — don't cache!
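As a sketch of the simplest eviction policy, here is a bounded per-layer KV cache in C++ that keeps only the most recent tokens (a sliding window). The class and its interface are illustrative only; real eviction schemes often also retain "sink" tokens or select tokens to keep by attention-score importance.

    #include <deque>
    #include <iostream>
    #include <vector>

    // One attention layer's KV cache with a fixed token budget.
    // On overflow, the oldest token's K/V vectors are evicted (simple sliding window).
    class BoundedKVCache {
    public:
        explicit BoundedKVCache(size_t max_tokens) : max_tokens_(max_tokens) {}

        void append(const std::vector<float> &k, const std::vector<float> &v) {
            if (keys_.size() == max_tokens_) {  // cache full: evict the oldest token
                keys_.pop_front();
                values_.pop_front();
            }
            keys_.push_back(k);
            values_.push_back(v);
        }

        size_t size() const { return keys_.size(); }

    private:
        size_t max_tokens_;
        std::deque<std::vector<float>> keys_;    // one K vector per cached token
        std::deque<std::vector<float>> values_;  // one V vector per cached token
    };

    int main() {
        BoundedKVCache cache(/*max_tokens=*/1024);
        for (int t = 0; t < 5000; ++t) {
            cache.append(std::vector<float>(128, 0.1f), std::vector<float>(128, 0.2f));
        }
        std::cout << "Cached tokens: " << cache.size() << "\n";  // capped at 1024
    }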
RAG architecture caching: The special features of RAG architectures can use caching in various ways:
- Datastore retrieval caching — the usual database type caching.
- Chunk-specific global KV caching — prefix global KV caching or fused global KV caching.
Chatbot architecture caching: Chatbot conversations have an interesting property: the current query and response become the input context for the next query, via the "conversation history" used as prompt context. This means that the KV cache data at the end of one answer is effectively the global KV caching data for the next cycle. This idea is similar to "prefix global KV caching" but is a chatbot-specific property.
For on-device chatbots in particular, this means that the KV cache data is immediately available, and prefill latency can be avoided. For data center chatbot architectures, there is a difficulty in mapping the KV cache data across multi-user sessions, although it can be worth doing, since the latency of prefill is also avoided in this manner for cloud architectures.
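Here is a simplified C++ sketch of the per-session idea, where the KV data is an opaque structure kept alive between turns, and extend_kv and decode_answer are hypothetical stand-ins for the real prefill and decoding steps.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct KVCacheData {
        std::vector<float> blob;    // opaque stand-in for layer-wise K/V tensors
        size_t tokens_covered = 0;  // how many context tokens the KV data covers
    };

    // Hypothetical stand-ins for the real inference engine.
    void extend_kv(KVCacheData &kv, const std::vector<int> &new_tokens) {
        kv.tokens_covered += new_tokens.size();  // prefill covers only the new tokens
    }
    std::vector<int> decode_answer(KVCacheData &kv) {
        std::vector<int> answer = {42, 43};
        kv.tokens_covered += answer.size();      // decoding also extends the KV data
        return answer;
    }

    // One KV cache per chat session, kept alive between turns.
    std::unordered_map<std::string, KVCacheData> g_session_kv;

    std::vector<int> chat_turn(const std::string &session_id, const std::vector<int> &user_tokens) {
        KVCacheData &kv = g_session_kv[session_id];  // reuse the KV data from prior turns
        extend_kv(kv, user_tokens);                  // only the new message needs prefill
        return decode_answer(kv);
    }

    int main() {
        chat_turn("session-1", {1, 2, 3});  // first turn: prefill from scratch
        chat_turn("session-1", {4, 5});     // later turn: conversation history already cached
        std::cout << "Context tokens cached: " << g_session_kv["session-1"].tokens_covered << "\n";
    }

Keeping the session's KV cache resident means each new turn only pays prefill cost for the new user message, not for the whole conversation history.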
General Caching Theory
Caching is the general optimization method where computed results are stored and re-used instead of repeating a later computation. Generally, the idea is to trade off use of extra memory in order to save on execution time. This mainly works if the same exact computations are being repeated, but can also work for repetitions of similar near-identical computations. In the literature, caching algorithms for neural networks are also called "memoization", "data re-use" or "computation re-use" algorithms.
One low-level version of caching is called "common subexpression elimination" which uses a temporary variable to eliminate any instances where the same calculation is done twice. Such optimizations are usually automated by compilers in modern programming. Another type of caching is where a loop with a repeatedly calculated value is modified to bring the calculation out in front of the loop, thereby calculating it only once.
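For example, here is a small C++ illustration of hoisting a loop-invariant common subexpression out of a loop, which is the same caching principle in miniature:

    #include <cmath>
    #include <vector>

    // Before: the invariant expression sqrt(scale) is recomputed on every iteration.
    void scale_naive(std::vector<float> &v, float scale) {
        for (size_t i = 0; i < v.size(); ++i) {
            v[i] = v[i] * std::sqrt(scale);  // same sqrt computed v.size() times
        }
    }

    // After: the common subexpression is hoisted out and computed once (a tiny cache).
    void scale_hoisted(std::vector<float> &v, float scale) {
        const float s = std::sqrt(scale);    // computed once, reused every iteration
        for (size_t i = 0; i < v.size(); ++i) {
            v[i] = v[i] * s;
        }
    }

    int main() {
        std::vector<float> v(1000, 1.0f);
        scale_naive(v, 2.0f);
        scale_hoisted(v, 2.0f);
    }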
Various optimizations to Transformers have involved caching. For example, it was discovered that some of the K and V tensor calculations could be cached between tokens, thereby avoiding repeated matrix computations in the usual autoregressive model. See Transformer optimizations.
Intermediate-level caching and computation reuse can be done at the vector dot product level. By detecting when similar vectors have been calculated before, such as using Locality-Sensitive Hashing (LSH), the cache results can be accessed and re-used instead.
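Here is a minimal C++ sketch of this idea using a sign-based LSH (SimHash): activation vectors that hash to the same bucket reuse a previously computed dot product. For simplicity it assumes a single fixed weight vector; a real memoization scheme would also key the cache on which weight row is involved and would bound the approximation error.

    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <unordered_map>
    #include <vector>

    // Sign-based LSH: one bit per random hyperplane; similar vectors tend to share a hash.
    struct SimHash {
        std::vector<std::vector<float>> planes;
        SimHash(size_t dim, size_t bits, unsigned seed = 42) {
            std::mt19937 rng(seed);
            std::normal_distribution<float> dist(0.0f, 1.0f);
            planes.assign(bits, std::vector<float>(dim));
            for (auto &p : planes)
                for (auto &x : p) x = dist(rng);
        }
        uint64_t hash(const std::vector<float> &v) const {
            uint64_t h = 0;
            for (size_t b = 0; b < planes.size(); ++b) {
                float proj = 0.0f;
                for (size_t i = 0; i < v.size(); ++i) proj += planes[b][i] * v[i];
                if (proj > 0.0f) h |= (uint64_t{1} << b);
            }
            return h;
        }
    };

    float dot(const std::vector<float> &a, const std::vector<float> &b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    int main() {
        const size_t dim = 64;
        SimHash lsh(dim, /*bits=*/16);
        std::unordered_map<uint64_t, float> cache;  // hash bucket -> cached dot product
        std::vector<float> weights(dim, 0.5f), activation(dim, 0.25f);

        uint64_t key = lsh.hash(activation);
        auto it = cache.find(key);
        float result = (it != cache.end())
            ? it->second                                 // reuse: a similar vector was seen before
            : (cache[key] = dot(weights, activation));   // compute and memoize
        std::cout << result << "\n";
    }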
Caching can also be done at the highest level with model inference. Incremental caching of full inference results can be used with "input similarity," such as analyzing the frames of a video. Inference results can also be cached across multiple queries from multiple users. When the entire result of an inference calculation is saved and re-used, the optimization is called an "Inference Cache".
Caching and computation reuse are a type of dynamic inference optimization. Comparable dynamic data and computation efficiency strategies include zero skipping, weight sharing, and layer fusion.
Transformer Calculation Caching
Transformer calculation caching is the storage of computations for later reuse in memory or on disk, as a way to optimize LLM inference. Experience has shown that some computations performed in a vanilla Transformer can be cached for faster inference. This is often called "computation reuse" in the literature; see also Transformer architectures and Transformer optimizations. Papers that discuss these Transformer caching methods are below:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference (Various Transformer optimization techniques are suggested, including caching of attention head matrix computations from already-processed tokens to reduce auto-regression costs, i.e. auto-regressive KV caching.)
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the suggested optimizations is caching of computations of K and V tensors in the attention head logic.)
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019, https://arxiv.org/abs/1904.01038, Code: https://github.com/pytorch/fairseq (Includes caching of model states from previously generated tokens.)
- Lucas D. Lingle, Sep 2023, Transformer-VQ: Linear-Time Transformers via Vector Quantization https://arxiv.org/abs/2309.16354, Code: https://github.com/transformer-vq/transformer_vq (Uses a "long range cache" in attention optimization.)
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses a "rolling buffer cache" method.)
KV Caching
KV caching is storing the results of the K and V vector operations that are performed in Transformer attention heads for LLM inference optimization. The data from the KV cache can be used to optimize the attention of the current query or multiple future queries. Analysis of the vanilla Transformer by researchers has discovered at least two distinct ways to cache these results.
Autoregressive decoder KV caching: This is in-memory caching during one query as the decoder processes multiple tokens. Partial KV tensor operations can be cached in memory during decoding, across the processing of multiple tokens, avoiding re-computations in autoregressive mode. In autoregressive decoder mode, the K and V computations for the newest token must still be done fresh, but all KV-related calculations for the prior tokens can come from the cache. This is a subtype of autoregression optimization.
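Here is a simplified C++ sketch of a single attention head with a per-query KV cache: at each decoding step, only the new token's K and V vectors are computed and appended, and the new query attends over all of the cached keys and values. Real implementations operate on batched tensors with GPU kernels; this shows only the scalar logic.

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    using Vec = std::vector<float>;

    float dot(const Vec &a, const Vec &b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // One attention head with a per-query KV cache. For each new token, only its own
    // K and V vectors are appended; earlier tokens' K/V come straight from the cache.
    class CachedAttentionHead {
    public:
        Vec step(const Vec &q_new, const Vec &k_new, const Vec &v_new) {
            k_cache_.push_back(k_new);  // cached, never recomputed for later tokens
            v_cache_.push_back(v_new);

            // Attention scores of the new query against all cached keys.
            const float scale = 1.0f / std::sqrt(static_cast<float>(q_new.size()));
            std::vector<float> scores(k_cache_.size());
            float max_s = -1e30f;
            for (size_t t = 0; t < k_cache_.size(); ++t) {
                scores[t] = dot(q_new, k_cache_[t]) * scale;
                max_s = std::max(max_s, scores[t]);
            }
            // Softmax over the scores, then a weighted sum of the cached values.
            float denom = 0.0f;
            for (float &s : scores) { s = std::exp(s - max_s); denom += s; }
            Vec out(v_new.size(), 0.0f);
            for (size_t t = 0; t < v_cache_.size(); ++t)
                for (size_t i = 0; i < out.size(); ++i)
                    out[i] += (scores[t] / denom) * v_cache_[t][i];
            return out;
        }
    private:
        std::vector<Vec> k_cache_;
        std::vector<Vec> v_cache_;
    };

    int main() {
        CachedAttentionHead head;
        Vec q(8, 0.1f), k(8, 0.2f), v(8, 0.3f);
        for (int t = 0; t < 3; ++t) {
            Vec out = head.step(q, k, v);  // each decoding step reuses all prior K/V
            std::cout << "step " << t << ": out[0]=" << out[0] << "\n";
        }
    }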
Research papers on autoregressive decoder KV caching:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference (Various Transformer optimization techniques are suggested, including caching of attention head matrix computations from already-processed tokens to reduce auto-regression costs, i.e. auto-regressive KV caching.)
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the suggested optimizations is caching of computations of K and V tensors in the attention head logic, in autoregressive mode.)
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, https://arxiv.org/abs/2211.05102 (Includes some discussion relevant to KV caching, but it is mostly in relation to operator fusion.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation (Includes a detailed analysis of the QKV tensors in relation to optimizations, with relevance to KV caching, kernel operator fusion, and matrix algebra optimizations.)
- Chen, Carol, Transformer Inference Arithmetic 2022-03-30, kipply's blog, https://kipp.ly/transformer-inference-arithmetic/ (Covers multiple optimization methods including KV caching and sparse QKV attention layers.)
- Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ (Multiple Transformer optimization methods, with some relevance to in-memory KV caching.)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-query attention shares KV tensors across multiple attention heads, which isn't exactly KV caching, but is in the same ballpark.)
- Dipkumar Patel, February 12, 2023, Speeding up the GPT - KV cache, https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/ (Useful article specifically on KV caching.)
- A Ouyang, June 2023, Understanding the Performance of Transformer Inference, Masters Thesis, Electrical Engineering and Computer Science, MIT, https://dspace.mit.edu/handle/1721.1/151543, https://dspace.mit.edu/bitstream/handle/1721.1/151543/ouyang-aouyang-meng-eecs-2023-thesis.pdf?sequence=1&isAllowed=y (Detailed analysis of Transformer performance optimizations, including the technique of autoregressive KV caching during decoding.)
- Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao, Easy and Efficient Transformer : Scalable Inference Solution For large NLP model, May 2022, https://arxiv.org/abs/2104.12470
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen, 2023, Deja vu: Contextual sparsity for efficient LLMs at inference time, Proceedings of the 40th International Conference on Machine Learning, PMLR 202:22137-22176, 2023. https://proceedings.mlr.press/v202/liu23am.html, PDF: https://proceedings.mlr.press/v202/liu23am/liu23am.pdf
- Y Jin, CF Wu, D Brooks, GY Wei, 2023, S3: Increasing GPU Utilization during Generative Inference for Higher Throughput, arXiv preprint arXiv:2306.06000, https://arxiv.org/abs/2306.06000
- Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee, July 2023, SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, https://arxiv.org/abs/2307.02628 (Early exit for non-autoregression, with consideration of the KV cache.)
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, May 2023, GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245 (Some discussion of KV caching in the context of multi-query attention.)
- Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei, July 2023, Retentive Network: A Successor to Transformer for Large Language Models https://arxiv.org/abs/2307.08621, Code: https://aka.ms/retnet (Some analysis of KV cache memory usage, but not a primary focus of the paper.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453 (Sliding window KV caching.)
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations.)
- Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo, 4 Jun 2024 (v2), Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution, https://arxiv.org/abs/2406.00059 Code: https://github.com/conveyor-sys/conveyor (Speeding up inference by partially running tools in parallel to the LLM query processing, rather than sequentially after the LLM request, by detecting tool requests deep inside the decoding algorithm and starting them off immediately, before the LLM has finished generating the fully decoded output.)
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- X Wu, L Zhang, Y Wang, Y Ren, M Hack, 2016, zExpander: a Key-Value Cache with both High Performance and Fewer Misses, EuroSys ’16 April 18–21, 2016, London, United Kingdom, https://ranger.uta.edu/~sjiang/pubs/papers/wu16_zExpander.pdf (General theory paper about prefix key-value caching in a trie or binary tree, not specific to neural networks.)
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 Code: https://github.com/niyunsheng/EMS-SD (Speculative decoding across multiple queries by avoiding padding tokens and optimizing the KV cache.)
- J. Lin, S. Q. Zhang and A. Leon-Garcia, 2024, sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Structures, 2024 25th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2024, pp. 1-6, doi: 10.1109/ISQED60706.2024.10528703. https://ieeexplore.ieee.org/abstract/document/10528703 (Optimize the global KV cache by sharing it across multiple queries.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Minsik Cho, Mohammad Rastegari, Devang Naik, 8 May 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://arxiv.org/abs/2405.05329 (Parallelization of KV cache generation in prefill phase.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
- Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
- Jesus Rodriguez, Apr 22, 2024, Some Technical Notes About Llama 3: New tokenizer, optimized pretraining and some other details about Meta AI’s new model, Towards AI, https://pub.towardsai.net/some-technical-notes-about-llama-3-042c0b19db14
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 22 Apr 2024, SnapKV: LLM Knows What You are Looking for Before Generation, https://arxiv.org/abs/2404.14469
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic, 4 Mar 2024, DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving, https://arxiv.org/abs/2403.01876
- 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
- Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu, 18 Mar 2024, LLM as a System Service on Mobile Devices, https://arxiv.org/abs/2403.11805 (On-device inference for LLMs, including a stateful on-device AI service LLMaaS, including Llama2 7B and OPT-7B with INT8 quantization, based on improved KV caching on mobile, with pipelining, recomputation and chunk-level KV cache memory management for running on phones.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, Nov 2023, Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934 (Unique and insightful advance of generalizing KV caching to multiple prompts by computing a cache for short "segments" of prompts, including methods to adjust the different KV cache values for text segments that appear in different positions of the overall prompt.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, Nov 2023, Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263
- Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, Ricardo Bianchini, 30 Nov 2023, Splitwise: Efficient generative LLM inference using phase splitting, https://arxiv.org/abs/2311.18677 (Separates the two Transformer phases of initial prompt computation or prefill to generate the KV cache, and the token generation phase or decoding algorithm onto two machines.)
- Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, Dec 2023, EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 Code: https://github.com/pan-x-c/EE-LLM
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu, 2023, KIVI : Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization, https://www.researchgate.net/profile/Zirui-Liu-29/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization/links/658b5d282468df72d3db3280/KIVI-Plug-and-play-2bit-KV-Cache-Quantization-with-Streaming-Asymmetric-Quantization.pdf (Explores quantization of values stored in the KV cache as a way to maintain a smaller KV caching and reduce memory storage requirements.)
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and highly optimized kernel with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on Llama2 3B to 20B with 20-80ms latency per token.)
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan, 2 Mar 2024, IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact, https://arxiv.org/abs/2403.01241
- Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti, 14 Mar 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, https://arxiv.org/abs/2403.09636 (Reducing the memory size of the KV cache.)
- Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath, 14 Mar 2024, Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, https://arxiv.org/abs/2403.09054 (Reducing KV cache in-memory size and related computations by focusing on a subset of tokens.)
- Haim Barad, Ekaterina Aidova, Yury Gorbachev, Nov 2023, Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO, https://arxiv.org/abs/2311.04951 Code: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/266-speculative-sampling
- Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin, 5 Jan 2024, Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, https://arxiv.org/abs/2401.02669 (Long context processing by a modification to the QKV caching mechanisms.)
- Pierre Lienhart, Jan 16, 2024, LLM Inference Series: 4. KV caching, a deeper look, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- MartinLwx, Oct 2023 LLM inference optimization - KV Cache, https://martinlwx.github.io/en/llm-inference-optimization-kv-cache/
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 30 May 2024 (v3), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 (Reduces the number of memory accesses for attention computations and the KV cache.)
- Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li, 17 Apr 2023 (v2), AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems, https://arxiv.org/abs/2301.09262
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
- H. Face, “Transformers,” https://github.com/huggingface/transformers.
- NVIDIA, “FasterTransformer,” https://github.com/NVIDIA/FasterTransformer.
- G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu
- Mistral AI, https://github.com/mistralai/mistral-src
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2024, INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference, https://openreview.net/pdf?id=wDDGQabYPQ
- The White Box, April 7, 2024, KV Cache, ChatGPT’s Memory, https://thewhitebox.ai/kv-cache-chatgpts-memory/
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Natarajan Vaidhyanathan Mar 7, 2024, How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100, https://www.qualcomm.com/developer/blog/2024/03/how-quadruple-llm-decoding-performance-speculative-decoding-spd-and-microscaling-mx-formats
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Hailin Zhang, Xupeng Miao, Xiaodong Ji, Xiaonan Nie Yilin Chen, Weipeng Chen, Fangcheng Fu, Bin Cui, 2024, PQCache: Product Quantization-based KVCache for Long Context LLM Inference, https://hugozhl.github.io/files/PQCache.pdf
- Minsik Cho, Mohammad Rastegari, Devang Naik, 25 Jul 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://proceedings.mlr.press/v235/cho24e.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/cho24e/cho24e.pdf
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 13 Aug 2024 (v3), Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003 https://github.com/zcli-charlie/Awesome-KV%20Cache
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- R Parthasarathy, R Shuttleworth, Sep 2024, Analyzing Inference Optimizations for Transformers, 6.5930 Final Project Report, https://reeceshuttle.me/assets/6.5930_Project.pdf
- David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Sowoong Kim, Eunyeong Sim, Youngsam Shin, YeonGon Cho, and Woongki Baek. 2024. Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT '24). Association for Computing Machinery, New York, NY, USA, 78–90. https://doi.org/10.1145/3656019.3676945 https://dl.acm.org/doi/abs/10.1145/3656019.3676945
KV Caching in Early Exit
KV caching in early exit refers to the need to recompute KV data when layers have been exited or skipped during LLM inference. The use of "early exit" (dynamic layer pruning) or layer skipping causes a problem with the KV cache: unexecuted layers end up with the wrong KV cache data, because the computation has not proceeded through all the layers. A future step in the inference (i.e., the next token in autoregressive decoding) will then see the wrong KV cache, and this needs to be corrected by early exit mechanisms.
There is also the converse inefficiency of unnecessary prior KV caching computations whenever early exit is successful. If the current token exits at an earlier layer than the prior token, we have needlessly stored the KV cache for the later layers of the prior token, since those layers are now being skipped by the early exit.
Several alternative ways to correct the KV cache have been considered in the research:
- Recomputation of the KV cache: note that the cache is wrong and recompute the KV cache data in a layer when needed.
- Propagation of the KV cache: This is reusing the KV cache of the last executed layer as the stored KV cache for any layers that get skipped, which becomes a kind of approximation of the KV cache.
- Exiting/caching pattern changes: Doing the layer skipping in a way that avoids the KV cache issue altogether, the simplest being a fixed number of layers, with more complex arrangements such as monotonically decreasing exit layers.
- FFN-only partial layer early exiting: Avoiding the KV cache issue by only doing early exit on the FFN weight computations (i.e. not skipping the QKV computations in early exit when skipping a layer).
KV Recomputation. The simplest idea is to mark the cache as out-of-date so that it will be recomputed at the next execution of that layer, but the downside of this approach is that it thereby loses some of the speed advantages of early exiting. Potentially, every saving of a skipped layer in an early-exit is then undermined in the next inference iteration, if all those layers need their cache re-computed, although this won't happen every time so there is still benefit to early exit.
Fixed early exit layer count. Exiting early after a fixed number of layers avoids the KV cache out-of-date problem, because none of the tokens ever does more (or fewer) layers than the fixed count, and the KV cache is only needed for this same number of layers. A little more complex is a simple version of the "deep encoder, shallow decoder" architecture, whereby there are two fixed counts: one for the encoder (or prefill in decoder-only architectures), and another for the decoding phase. However, this simplistic exit criterion is non-adaptive to the input's complexity, has poor accuracy characteristics because it doesn't check any decision criteria at all before exiting, and is effectively the same as permanently removing layers of the model with static layer pruning.
Layer exiting orders. There are several ways to avoid a mismatch between the layers and the KV cache simply by controlling how layers are skipped. The simplest is a fixed global number of layers always early exited for all queries as above. Another way is to skip a fixed subset of the layers, not necessarily in order (i.e., layer skipping, not early exit), thereby ensuring that all tokens in an output sequence run the same layers and have the same KV cache needs. The downside is that every token gets the same computation, regardless of whether it is easy or difficult. Yet another way is to monotonically sequence the exit points, so that although they change, a later token cannot go to more layers than an earlier token, which avoids KV caching problems. This approach means that the end of an output always gets less computation than the early tokens, which may not match the real computation needs.
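As a minimal sketch of the propagation option, the following C++ fragment copies the last-executed layer's K/V vectors into the skipped layers after an early exit, so that subsequent tokens see a fully populated (if approximate) cache; the data layout is illustrative only.

    #include <iostream>
    #include <vector>

    using Vec = std::vector<float>;

    // Per-layer KV cache entries for one token position.
    struct LayerKV { Vec k, v; };

    // KV propagation: after an early exit at layer `exit_layer`, copy that layer's
    // K/V vectors into all the skipped layers, so the next token's inference sees a
    // fully populated (approximate) cache instead of out-of-date entries.
    void propagate_kv_on_exit(std::vector<LayerKV> &token_kv, size_t exit_layer) {
        for (size_t layer = exit_layer + 1; layer < token_kv.size(); ++layer) {
            token_kv[layer] = token_kv[exit_layer];  // an approximation, not a recomputation
        }
    }

    int main() {
        const size_t num_layers = 6, dim = 4;
        std::vector<LayerKV> token_kv(num_layers);
        // Suppose layers 0..2 ran before the early exit decision was made at layer 2.
        for (size_t layer = 0; layer <= 2; ++layer) {
            token_kv[layer].k = Vec(dim, 0.1f * (layer + 1));
            token_kv[layer].v = Vec(dim, 0.2f * (layer + 1));
        }
        propagate_kv_on_exit(token_kv, /*exit_layer=*/2);
        std::cout << "Layer 5 K[0] = " << token_kv[5].k[0] << "\n";  // copied from layer 2
    }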
Research on KV cache correction in early exiting. Research papers on correcting the KV cache in early exit methods include:
- Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee, July 2023, SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, https://arxiv.org/abs/2307.02628 (Early exit for non-autoregression, with consideration of the KV cache. By using a monotically decreasing exit point, this avoids the possibility of a later token's inference requiring an out-of-date KV cache.)
- Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. ArXiv, abs/1910.10073, https://arxiv.org/abs/1910.10073 (Copies the internal states of the exited layer to the later layers.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, Dec 2023, EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 Code: https://github.com/pan-x-c/EE-LLM (Examines two methods of KV cache handling with early exit, and implements KV recomputation.)
- Li, L., Wang, C., Qiu, M. et al., 2024, Accelerating BERT inference with GPU-efficient exit prediction. Front. Comput. Sci. 18, 183308 (2024). https://doi.org/10.1007/s11704-022-2341-9, https://link.springer.com/article/10.1007/s11704-022-2341-9 (Propagates hidden states to the exited layers.)
- Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler, 25 Oct 2022 (v2), Confident Adaptive Language Modeling, https://arxiv.org/abs/2207.07061 (KV propagation copies computed KV states to the exited layers.)
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Uses KV recomputation.)
- Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha, Nov 2023, DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models, https://arxiv.org/abs/2311.08623 (Uses KV recomputation.)
- Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens, based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations that discusses the KV caching problems in the section on early exit; see "Hidden States Propagation" paragraph.)
KV Cache Compression (Memory Reduction)
KV cache compression is an LLM inference optimization that shrinks the KV cache data so that it is smaller to store and faster to process. Compression methods such as quantization and pruning can be applied to the KV data, resulting in less data to be stored and fewer computations required.
This is "optimizing the optimization!" The idea of KV caching is to trade extra memory (the KV cache) for faster speed. But it's been too successful, and often requires too much memory, so there are now research papers on optimization of the memory size of the KV cache, including KV cache compression, and its subtype KV cache quantization.
Papers on KV cache compression and/or KV cache quantization:
- Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Che, 14 Feb 2024, Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference, https://arxiv.org/abs/2402.09398
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang, 7 Mar 2024, QAQ: Quality Adaptive Quantization for LLM KV Cache, https://arxiv.org/abs/2403.04643 Code: http://github.com/ClubieDong/KVCacheQuantization (Reducing the size of the KV cache using quantization.)
- Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao, 11 Mar 2024 (v2), GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM, https://arxiv.org/abs/2403.05527 Code: https://github.com/HaoKang-Timmy/GEAR (Compressing the size of the KV cache using quantization, low-rank matrices, and sparse matrix.)
- 14 Mar 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti, https://arxiv.org/abs/2403.09636 (Reducing the memory size of the KV cache.)
- Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath, 14 Mar 2024, Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, https://arxiv.org/abs/2403.09054 (Reducing KV cache in-memory size and related computations by focusing on a subset of tokens.)
- Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu, 18 Mar 2024, LLM as a System Service on Mobile Devices, https://arxiv.org/abs/2403.11805 (On-device inference for LLMs, including a stateful on-device AI service LLMaaS, including Llama2 7B and OPT-7B with INT8 quantization, based on improved KV caching on mobile, with pipelining, recomputation and chunk-level KV cache memory management for running on phones.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 (Memory management of KV caches using hierarchical cache layers.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz GUStavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Frietas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839 (KV cache is bounded, rather than growing with sequence length.)
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Examines the KV cache optimization methods used by multiple frameworks. Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Amir Zandieh, Majid Daliri, Insu Han, 5 Jun 2024, QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead, https://arxiv.org/abs/2406.03482 Code: https://github.com/amirzandieh/QJL (Using 1-bit or 2-bit KV cache quantization approach based on sign bits as estimates.)
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, arXiv preprint arXiv:2405.14366, 2024, https://arxiv.org/abs/2405.14366 (Merging KV caches with similar values across nearby layers to effectively share parts of the cache across layers for a 41% reduction in size.)
- Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 23 May 2024, ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification, https://arxiv.org/abs/2405.14256 (Quantizing the KV cache with combination of other KV cache compression methods based on salient tokens.)
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
- Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen, 21 May 2024, Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression, https://arxiv.org/abs/2405.12591
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods or to layer fusion of weights.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao, 16 Jan 2024, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. https://openreview.net/forum?id=uNrFpDPMyo
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 22 Apr 2024, SnapKV: LLM Knows What You are Looking for Before Generation, https://arxiv.org/abs/2404.14469
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic, 4 Mar 2024, DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving, https://arxiv.org/abs/2403.01876
- Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji, 2024, CaM: Cache Merging for Memory-efficient LLMs Inference, https://openreview.net/pdf?id=LCTmppB165 Code: https://github.com/zyxxmu/cam (Compressing the KV cache by merging KV data that is about to be evicted into other parts of the KV cache.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava, 2023, Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time, arXiv preprint arXiv:2305.17118, https://arxiv.org/abs/2305.17118
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Pierre Lienhart, Jan 16, 2024, LLM Inference Series: 4. KV caching, a deeper look, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- Zefan Cai, Yichi Zhang, Bofei Gao, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao, 4 Jun 2024, PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling, https://arxiv.org/abs/2406.02069
- Liyuan Liu, Jianfeng Gao, May 8, 2024, LLM profiling guides KV cache optimization, Microsoft Research Blog, 12th International Conference on Learning Representations (ICLR 2024), https://www.microsoft.com/en-us/research/blog/llm-profiling-guides-kv-cache-optimization/
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao, May 2024, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, ICLR 2024, https://www.microsoft.com/en-us/research/publication/model-tells-you-what-to-discard-adaptive-kv-cache-compression-for-llms/ https://arxiv.org/pdf/2310.01801
- Noam Shazeer, 6 Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, 23 Dec 2023 (v3), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
- Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi, 8 Feb 2024, SubGen: Token Generation in Sublinear Time and Memory, https://arxiv.org/abs/2402.06082
- June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 28 Feb 2024, No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization, https://arxiv.org/abs/2402.18096
- Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Song Jiang, 18 April 2016, zExpander: a key-value cache with both high performance and fewer misses, EuroSys '16: Proceedings of the Eleventh European Conference on Computer SystemsApril 2016Article No.: 14Pages 1–15, https://dl.acm.org/doi/abs/10.1145/2901318.2901332 https://doi.org/10.1145/2901318.2901332
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao, 2023, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, arXiv preprint arXiv:2310.01801, https://arxiv.org/abs/2310.01801
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. https://arxiv.org/abs/2310.06825 arXiv preprint arXiv:2310.06825, 2023
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis, 2023, Efficient Streaming Language Models with Attention Sinks, The Twelfth International Conference on Learning Representations, https://arxiv.org/abs/2309.17453
- Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al., 2024, H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, Advances in Neural Information Processing Systems, 36, https://arxiv.org/abs/2306.14048
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056 (Examines KV cache head merging approaches for KV cache size reduction, and also examines RoPE encoding issues with relevance to fusing KV caches.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
- Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, 17 Jun 2024, A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression, https://arxiv.org/abs/2406.11430
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang, 24 Jun 2024, Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache, https://arxiv.org/abs/2406.17808 (Extends the KV cache eviction policy in sliding window attention so that the KV partially looks back further than the window.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen, 2024, Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11437-11452, https://proceedings.mlr.press/v235/dong24f.html https://raw.githubusercontent.com/mlresearch/v235/main/assets/dong24f/dong24f.pdf https://openreview.net/forum?id=uhHDhVKFMW Code: https://github.com/hdong920/LESS
- Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, 04 August 2024, CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving, ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 38 - 56, https://doi.org/10.1145/3651890.3672274 https://dl.acm.org/doi/abs/10.1145/3651890.3672274
- Giulio Corallo, Paolo Papotti, 31 Jul 2024, Finch: Prompt-guided Key-Value Cache Compression, https://arxiv.org/abs/2408.00167 (KV cache compression along the lengthwise input dimension in a blockwise KV data pruning method.)
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
- Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
- Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
- Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti, July 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:37396-37412, 2024, https://proceedings.mlr.press/v235/nawrot24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/nawrot24a/nawrot24a.pdf Code: https://github.com/NVIDIA/Megatron-LM/tree/DMC
- Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han, July 2024, QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:47901-47911, 2024, https://proceedings.mlr.press/v235/tang24l.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/tang24l/tang24l.pdf Code: https://github.com/mit-han-lab/quest
- Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo, 30 Jul 2024, ThinK: Thinner Key Cache by Query-Driven Pruning, https://arxiv.org/abs/2407.21018
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
- Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang, 16 Sep 2024, CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios, https://arxiv.org/abs/2409.10593 (KV cache compression on the "channel" or "width" dimension.)
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma, 17 Sep 2024, KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models, https://arxiv.org/abs/2409.11057
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Shiwei Gao, Youmin Chen, Jiwu Shu, Oct 2024, Fast State Restoration in LLM Serving with HCache, EuroSys '25, March 30–April 3, 2025, Rotterdam, Netherlands, https://chenyoumin1993.github.io/papers/eurosys25-hcache.pdf
- Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen, 4 Oct 2024, LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy, https://arxiv.org/abs/2410.03111
- Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
- Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin, 17 Oct 2024, SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction, https://arxiv.org/abs/2410.13846 https://github.com/sail-sg/SimLayerKV
- Anonymous authors, Oct 2024, LSH Tells You What To Discard: An Adaptive Locality-Sensitive Strategy for KV Cache Compression, ICLR 2025, https://openreview.net/pdf?id=0ZcQhdyI3n
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao, 28 Oct 2024 (v2), Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, https://arxiv.org/abs/2410.19258
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu, 29 Oct 2024, VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration, https://arxiv.org/abs/2410.23317
- B Sun, X Yu, D Tao, Nov 2024, KVSort: Drastically Improving LLM Inference Performance via KV Cache Compression, https://sc24.supercomputing.org/proceedings/poster/poster_files/post189s2-file3.pdf
- Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
- Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah, 26 Nov 2024, Attamba: Attending To Multi-Token States, https://arxiv.org/abs/2411.17685
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
- Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo, 4 Dec 2024, ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression, https://arxiv.org/abs/2412.03213
- Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu, 8 Dec 2024, XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference, https://arxiv.org/abs/2412.05896
- Xiaohuan Pei, Tao Huang, Chang Xu, 5 Dec 2024, Cross-Self KV Cache Pruning for Efficient Vision-Language Inference, https://arxiv.org/abs/2412.04652 https://github.com/TerryPei/CSP (KV cache pruning in multimodal LLMs.)
- Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh, 7 Dec 2024, Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression, https://arxiv.org/abs/2412.05693 (KV cache compression in prefill or prompt processing phase.)
KV Cache Low-Rank Matrix Factorization
KV cache low-rank matrix factorization, or KV cache decomposition, is the use of low-rank matrices to shrink the KV cache for faster LLM inference. This is another type of KV cache compression, based on low-rank matrix factorization, a well-known model compression method: a large matrix is approximated as the product of two much smaller matrices, which reduces both memory size and computation requirements.
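As a rough illustration of the general idea (a minimal sketch in NumPy, not the method of any particular paper below; the function name is my own), a cached key matrix can be approximated by a truncated SVD, and only the two small factors stored:

```python
# Minimal sketch of low-rank KV compression: approximate a cached key matrix
# K (n_tokens x d_head) by a rank-r factorization and store the two factors.
import numpy as np

def low_rank_compress(K: np.ndarray, rank: int):
    """Return factors (A, B) such that A @ B approximates K."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (n_tokens, rank)
    B = Vt[:rank, :]             # (rank, d_head)
    return A, B

# Example: 1024 cached tokens, head dimension 128, keep rank 32.
K = np.random.randn(1024, 128).astype(np.float32)
A, B = low_rank_compress(K, rank=32)
K_approx = A @ B
print("storage ratio:", (A.size + B.size) / K.size)   # ~0.28x of the original
print("relative error:", np.linalg.norm(K - K_approx) / np.linalg.norm(K))
```

Because Q K^T is approximately (Q B^T) A^T, attention scores can be computed directly from the factors without ever rebuilding the full key matrix; the same trick applies to the value cache.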
Research papers on low-rank matrix factorization applied to KV cache compression:
- Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao, 11 Mar 2024 (v2), GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, https://arxiv.org/abs/2403.05527 https://github.com/HaoKang-Timmy/GEAR
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen, 4 Oct 2024, LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy, https://arxiv.org/abs/2410.03111
KV Cache Sparsity
KV cache sparsity exploits the many zero or near-zero values in the KV data to reduce memory and computation during LLM inference. Data stored in the KV cache is often sparse, and it can be pruned in a structured or unstructured way, similar to the sparsification of model weights.
Research has shown that attention tends to be quite sparse, with some tokens having a much greater impact on attention than others. The ideas of pruning and sparsity can therefore be used on the KV cache data.
Possible types of KV cache pruning and sparsification include:
- Magnitude pruning
- Token-specific KV sparsity (lengthwise pruning of "pivotal tokens")
- Layer-wise pruning (or layer-wise fusion of KV layer-specific data, akin to "layer fusion" of weights)
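As a rough sketch of the token-level magnitude pruning idea above (illustrative only; this is not the exact algorithm of any cited paper, and the function name is my own), the cached K/V rows of tokens whose value vectors have small L2 norms can simply be dropped:

```python
# Magnitude-based KV sparsification sketch: keep only the KV rows (tokens)
# whose value vectors have the largest L2 norms, drop the rest.
import numpy as np

def sparsify_kv(K: np.ndarray, V: np.ndarray, keep_ratio: float = 0.5):
    """Return the retained K/V rows and their original token indices."""
    norms = np.linalg.norm(V, axis=-1)                # one norm per cached token
    n_keep = max(1, int(len(norms) * keep_ratio))
    keep_idx = np.sort(np.argsort(norms)[-n_keep:])   # indices of retained tokens
    return K[keep_idx], V[keep_idx], keep_idx

K = np.random.randn(1024, 128).astype(np.float32)     # cached keys
V = np.random.randn(1024, 128).astype(np.float32)     # cached values
K_small, V_small, kept = sparsify_kv(K, V, keep_ratio=0.25)
print(K_small.shape, V_small.shape)                   # (256, 128) (256, 128)
```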
Papers on KV cache compression via pruning or sparsity (that is, structured or unstructured pruning applied to the KV cache data) include:
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453 (Sliding window KV caching.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
KV Cache Token Pruning
KV cache token pruning is the idea of token pruning (lengthwise pruning) applied to the KV cache data for faster LLM inference. Unimportant tokens can have their KV cache data removed (pruned) or fused with another token's KV data. This is based on the idea of "pivotal tokens" or "salient tokens": not all tokens are equally important to the output.
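A simplified sketch of the general approach (loosely in the spirit of the "heavy hitter" and pivotal-token papers below, not an exact reimplementation of any of them; the function name is my own) is to keep only the tokens with the highest accumulated attention scores:

```python
# Attention-score-based KV token pruning sketch: prune cached tokens whose
# accumulated attention weight over recent decoding steps is low.
import numpy as np

def prune_by_attention(K, V, attn_weights, budget):
    """attn_weights: (n_queries, n_tokens) softmax attention from recent steps."""
    importance = attn_weights.sum(axis=0)              # accumulated score per cached token
    keep_idx = np.sort(np.argsort(importance)[-budget:])
    return K[keep_idx], V[keep_idx]

n_tokens, d = 512, 64
K = np.random.randn(n_tokens, d)
V = np.random.randn(n_tokens, d)
attn = np.random.rand(16, n_tokens)                    # stand-in for real attention scores
attn /= attn.sum(axis=-1, keepdims=True)
K2, V2 = prune_by_attention(K, V, attn, budget=128)
print(K2.shape)                                        # (128, 64)
```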
Research papers on KV cache token pruning:
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Hailin Zhang, Xupeng Miao, Xiaodong Ji, Xiaonan Nie Yilin Chen, Weipeng Chen, Fangcheng Fu, Bin Cui, 2024, PQCache: Product Quantization-based KVCache for Long Context LLM Inference, https://hugozhl.github.io/files/PQCache.pdf
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Giulio Corallo, Paolo Papotti, 31 Jul 2024, Finch: Prompt-guided Key-Value Cache Compression, https://arxiv.org/abs/2408.00167 (KV cache compression along the lengthwise input dimension in a blockwise KV data pruning method.)
- Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han, July 2024, QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:47901-47911, 2024, https://proceedings.mlr.press/v235/tang24l.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/tang24l/tang24l.pdf Code: https://github.com/mit-han-lab/quest
- Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo, 30 Jul 2024, ThinK: Thinner Key Cache by Query-Driven Pruning, https://arxiv.org/abs/2407.21018
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng, 15 Oct 2024, In-context KV-Cache Eviction for LLMs via Attention-Gate, https://arxiv.org/abs/2410.12876
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong, 5 Nov 2024, TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection, https://arxiv.org/abs/2411.02886
- Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
KV Cache Eviction
KV cache eviction is an LLM inference optimization that limits the KV cache size by evicting tokens. An eviction policy ensures that the KV cache does not grow too large, which is a bottleneck for processing long token sequences.
One way to compress your KV cache is simply to stop it from getting too big. Denied! This is done with an "eviction policy," whereby some of the cached KV data gets evicted from the cache according to some criterion.
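As a minimal sketch of one possible eviction policy (in the general spirit of the sliding-window and attention-sink approaches cited below, but not an exact copy of any of them; the function name is my own): keep the first few "sink" tokens plus a window of the most recent tokens, and evict everything else once the cache exceeds its budget:

```python
# Fixed-budget KV cache eviction sketch: retain a few initial "sink" tokens
# plus the most recent window of tokens, evict the middle of the sequence.
import numpy as np

def evict(K, V, n_sink=4, window=256):
    n = K.shape[0]
    if n <= n_sink + window:
        return K, V                                    # under budget: nothing to evict
    keep = np.concatenate([np.arange(n_sink), np.arange(n - window, n)])
    return K[keep], V[keep]

K = np.random.randn(1000, 64)
V = np.random.randn(1000, 64)
K2, V2 = evict(K, V)
print(K2.shape)                                        # (260, 64): 4 sinks + 256 recent tokens
```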
Research papers on KV cache eviction policies:
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji, 2024, CaM: Cache Merging for Memory-efficient LLMs Inference, https://openreview.net/pdf?id=LCTmppB165 Code: https://github.com/zyxxmu/cam (Compressing the KV cache by merging KV data that is about to be evicted into other parts of the KV cache.)
- Liyuan Liu, Jianfeng Gao, May 8, 2024, LLM profiling guides KV cache optimization, Microsoft Research Blog, 12th International Conference on Learning Representations (ICLR 2024), https://www.microsoft.com/en-us/research/blog/llm-profiling-guides-kv-cache-optimization/
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao, May 2024, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, ICLR 2024, https://www.microsoft.com/en-us/research/publication/model-tells-you-what-to-discard-adaptive-kv-cache-compression-for-llms/ https://arxiv.org/pdf/2310.01801
- June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 28 Feb 2024, No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization, https://arxiv.org/abs/2402.18096
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao, 2023, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, arXiv preprint arXiv:2310.01801, https://arxiv.org/abs/2310.01801
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. https://arxiv.org/abs/2310.06825 arXiv preprint arXiv:2310.06825, 2023
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava, 2024, Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time, Advances in Neural Information Processing Systems, 36, https://arxiv.org/abs/2305.17118
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis, 2023, Efficient Streaming Language Models with Attention Sinks, The Twelfth International Conference on Learning Representations, https://arxiv.org/abs/2309.17453
- Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al., 2024, H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, Advances in Neural Information Processing Systems, 36, https://arxiv.org/abs/2306.14048
- Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
- Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang, 24 Jun 2024, Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache, https://arxiv.org/abs/2406.17808 (Extends the KV cache eviction policy in sliding window attention so that the KV partially looks back further than the window.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen, 2024, Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11437-11452, https://proceedings.mlr.press/v235/dong24f.html https://raw.githubusercontent.com/mlresearch/v235/main/assets/dong24f/dong24f.pdf https://openreview.net/forum?id=uhHDhVKFMW Code: https://github.com/hdong920/LESS
- Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu, 8 Aug 2024 (v2), NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time, https://arxiv.org/abs/2408.03675 Code: https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL
- Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou, 16 Aug 2024 (v3), Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference, https://arxiv.org/abs/2407.11550
- vLLM, 2024, Performance and Tuning, https://docs.vllm.ai/en/latest/models/performance.html
- Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu, 25 Jul 2024, Efficient Inference of Vision Instruction-Following Models with Elastic Cache, https://arxiv.org/abs/2407.18121 https://github.com/liuzuyan/ElasticCache
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng, 15 Oct 2024, In-context KV-Cache Eviction for LLMs via Attention-Gate, https://arxiv.org/abs/2410.12876
- Anonymous authors, Oct 2024, LSH Tells You What To Discard: An Adaptive Locality-Sensitive Strategy for KV Cache Compression, ICLR 2025, https://openreview.net/pdf?id=0ZcQhdyI3n
- Z. Xu and J. Wu, "Crowd: An KV Cache Eviction Policy Which Uses Crowd Information to Select Evicted Key-Value Pairs," 2024 4th International Conference on Computer Science and Blockchain (CCSB), Shenzhen, China, 2024, pp. 608-612, doi: 10.1109/CCSB63463.2024.10735473. https://ieeexplore.ieee.org/abstract/document/10735473
KV Cache Quantization
KV cache quantization is an LLM inference optimization that applies quantization to the KV cache data as a type of KV cache compression. As with quantization of model weights or activations, this makes the KV cache both smaller in memory and faster to process, which speeds up inference and allows longer queries to be handled efficiently.
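A basic sketch of the idea is shown below (per-token asymmetric 8-bit quantization of a cached tensor; the papers below use more sophisticated schemes such as 2-bit, mixed-precision, or outlier-aware quantization, and the function names here are illustrative only):

```python
# Per-token asymmetric 8-bit quantization sketch for KV cache tensors:
# each cached row (token) gets its own scale and zero-point.
import numpy as np

def quantize_kv(X: np.ndarray, bits: int = 8):
    qmax = (1 << bits) - 1
    xmin = X.min(axis=-1, keepdims=True)
    xmax = X.max(axis=-1, keepdims=True)
    scale = (xmax - xmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero on constant rows
    q = np.clip(np.round((X - xmin) / scale), 0, qmax).astype(np.uint8)
    return q, scale, xmin

def dequantize_kv(q, scale, xmin):
    return q.astype(np.float32) * scale + xmin

K = np.random.randn(1024, 128).astype(np.float32)
q, s, z = quantize_kv(K)
K_hat = dequantize_kv(q, s, z)
print("max abs error:", np.abs(K - K_hat).max())       # small reconstruction error
print("memory: fp32", K.nbytes, "bytes -> uint8", q.nbytes, "bytes (plus scales/zero-points)")
```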
Research on the quantization of the KV cache to reduce memory, as a subtype of KV cache compression:
- Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie, 20 Feb 2024 (v2), WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More, https://arxiv.org/abs/2402.12065
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 5 Feb 2024, KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, https://arxiv.org/abs/2402.02750 Code: https://github.com/jy-yuan/KIVI (KV cache 2-bit quantization on Llama-2, Falcon and Mistral models.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao, 11 Mar 2024 (v2), GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, https://arxiv.org/abs/2403.05527 Code: https://github.com/HaoKang-Timmy/GEAR (Compressing the size of the KV cache using quantization, low-rank matrices, and sparse matrices.)
- Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang, 7 Mar 2024, QAQ: Quality Adaptive Quantization for LLM KV Cache, https://arxiv.org/abs/2403.04643 Code: http://github.com/ClubieDong/KVCacheQuantization (Reducing the size of the KV cache using quantization.)
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu, 2023, KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization, https://www.researchgate.net/profile/Zirui-Liu-29/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization/links/658b5d282468df72d3db3280/KIVI-Plug-and-play-2bit-KV-Cache-Quantization-with-Streaming-Asymmetric-Quantization.pdf (Explores quantization of values stored in the KV cache as a way to maintain a smaller KV cache and reduce memory storage requirements.)
- Amir Zandieh, Majid Daliri, Insu Han, 5 Jun 2024, QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead, https://arxiv.org/abs/2406.03482 Code: https://github.com/amirzandieh/QJL (Using 1-bit or 2-bit KV cache quantization approach based on sign bits as estimates.)
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, arXiv preprint arXiv:2405.14366, 2024, https://arxiv.org/abs/2405.14366 (Merging KV caches with similar values across nearby layers to effectively share parts of the cache across layers for a 41% reduction in size.)
- Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 23 May 2024, ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification, https://arxiv.org/abs/2405.14256 (Quantizing the KV cache with combination of other KV cache compression methods based on salient tokens.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Zhangyang Wang, 2024, Q-HITTER: A BETTER TOKEN ORACLE FOR EFFICIENT LLM INFERENCE VIA SPARSE-QUANTIZED KV CACHE, Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi : Plug-and-play 2bit kv cache quantization with streaming asymmetric quantization. 2023. doi: 10.13140/RG.2.2.28167.37282. https://rgdoi.net/10.13140/RG.2.2.28167.37282.
- Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman, 30 Mar 2024, QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, https://arxiv.org/abs/2404.00456 Code: https://github.com/spcl/QuaRot
- June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 28 Feb 2024, No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization, https://arxiv.org/abs/2402.18096
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023, https://arxiv.org/abs/2305.17888
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023. https://arxiv.org/abs/2303.06865
- Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023. https://arxiv.org/abs/2211.10438
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin, 13 May 2024 (v2), SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models, https://arxiv.org/abs/2405.06219
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Hailin Zhang, Xupeng Miao, Xiaodong Ji, Xiaonan Nie, Yilin Chen, Weipeng Chen, Fangcheng Fu, Bin Cui, 2024, PQCache: Product Quantization-based KVCache for Long Context LLM Inference, https://hugozhl.github.io/files/PQCache.pdf
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Hugging Face, Aug 2024 (accessed), Best Practices for Generation with Cache, https://huggingface.co/docs/transformers/kv_cache
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- W. Byun, J. Woo and S. Mukhopadhyay, "Hessian-Aware KV Cache Quantization for LLMs," 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 2024, pp. 243-247, doi: 10.1109/MWSCAS60917.2024.10658840. https://ieeexplore.ieee.org/abstract/document/10658840
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Yifan Tan, Haoze Wang, Chao Yan, Yangdong Deng, 25 Sep 2024, AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization, https://arxiv.org/abs/2409.16546 (Focuses on access latency of KV cache in floating point, rather than size reduction.)
- Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhua, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
- Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu, 15 Oct 2024, QSpec: Speculative Decoding with Complementary Quantization Schemes, https://arxiv.org/abs/2410.11305 (Enhance speculative decoding using quantization to reuse KV cache values and weights.)
- Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang, 16 Oct 2024, COMET: Towards Partical W4A4KV4 LLMs Serving, https://arxiv.org/abs/2410.12168
- Qian Tao, Wenyuan Yu, Jingren Zhou, 17 Oct 2024, AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations, https://arxiv.org/abs/2410.13212
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang, 27 Nov 2024, Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache, https://arxiv.org/abs/2411.18077
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
KV Cache Layer Fusion
KV cache layer fusion is a type of KV cache compression, analogous to layer fusion as a type of LLM model compression. The idea is that the KV data of some layers is similar enough that the layers can be combined (fused), so the KV data need not be stored separately for any layer that is fused with another. This parallels parameter sharing and layer fusion of LLM model weights.
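Here is a minimal sketch of the cross-layer sharing idea, assuming a hypothetical model where each group of consecutive layers reuses the KV tensors computed by the first layer of the group (the grouping and class names are illustrative simplifications; real methods such as cross-layer attention or MiniCache are more sophisticated):

```python
# Minimal sketch of cross-layer KV sharing ("layer fusion" of the KV cache).
# Every group of `share` consecutive layers reuses the KV tensors written by
# the first layer in the group, so only one slot per group is stored.
import torch

class SharedLayerKVCache:
    def __init__(self, num_layers: int, share: int = 2):
        self.slot_of_layer = [i // share for i in range(num_layers)]
        self.num_slots = max(self.slot_of_layer) + 1
        self.keys = [None] * self.num_slots     # one (K, V) pair per slot,
        self.values = [None] * self.num_slots   # not one per layer

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        slot = self.slot_of_layer[layer]
        if self.keys[slot] is None:             # first layer of the group writes
            self.keys[slot], self.values[slot] = k, v
        return self.keys[slot], self.values[slot]   # later layers just read

cache = SharedLayerKVCache(num_layers=32, share=2)   # roughly 2x smaller cache
k = v = torch.randn(1, 8, 128, 64)
print(cache.update(0, k, v)[0] is cache.update(1, k, v)[0])  # True: layer 1 reuses layer 0's keys
```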
Research papers on KV cache layer fusion:
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, https://arxiv.org/abs/2405.14366 (Compresses the KV cache on the depth dimension of layers, analogous to layer fusion.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Only computes the KV cache of some layers.)
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- AIModels.FYI, 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://www.aimodels.fyi/papers/arxiv/layer-condensed-kv-cache-efficient-inference-large
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 13 Aug 2024 (v3), Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003 https://github.com/zcli-charlie/Awesome-KV%20Cache
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
- Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He, 4 Oct 2024, SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation, https://arxiv.org/abs/2410.03960
- You Wu, Haoyi Wu, Kewei Tu, 18 Oct 2024, A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference, https://arxiv.org/abs/2410.14442
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
- Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen, 24 Oct 2024, KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing, https://arxiv.org/abs/2410.18517
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
KV Cache Layer Pruning
KV cache layer pruning is a type of KV cache compression that removes the KV data of entire layers for faster LLM inference. This makes the KV cache smaller in memory and faster to process, allowing more efficient handling of longer token sequences. It is analogous to layer pruning (depth pruning) as a type of model compression.
The idea is that the KV data of some layers is unimportant and can be discarded, which reduces the total number of layers that need to be stored, and hence the overall memory size of the KV cache.
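As a rough sketch of the idea, a KV cache structure can simply decline to store KV data for layers deemed unimportant (the choice of kept layers below is purely illustrative):

```python
# Minimal sketch of layer-wise KV cache pruning: only a chosen subset of
# layers keeps its KV data; the rest store nothing and would need
# recomputation (or are skipped by the attention variant in use).
import torch

class LayerPrunedKVCache:
    def __init__(self, kept_layers):
        self.kept = set(kept_layers)
        self.store = {}                       # layer index -> (K, V)

    def update(self, layer, k, v):
        if layer in self.kept:
            if layer in self.store:           # append along the sequence axis
                old_k, old_v = self.store[layer]
                k = torch.cat([old_k, k], dim=2)
                v = torch.cat([old_v, v], dim=2)
            self.store[layer] = (k, v)
            return k, v
        return None                           # pruned layer: no cached KV

# Keep the KV cache only for the first and last few layers (illustrative choice).
cache = LayerPrunedKVCache(kept_layers=[0, 1, 30, 31])
k = v = torch.randn(1, 8, 1, 64)              # one new token's K/V
print(cache.update(0, k, v)[0].shape, cache.update(5, k, v))
```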
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Only computes the KV cache of some layers.)
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, arXiv preprint arXiv:2405.14366, 2024, https://arxiv.org/abs/2405.14366 (Merging KV caches with similar values across nearby layers to effectively share parts of the cache across layers for a 41% reduction in size.)
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu, 29 Oct 2024, VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration, https://arxiv.org/abs/2410.23317
KV Fused Head Research
KV fused heads are an LLM inference optimization that makes the KV cache data smaller in memory and more efficient to compute, allowing processing of longer token sequences. Applied to the KV cache, head fusion is analogous to the fused-head optimizations used in the main attention algorithms of LLM inference.
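A minimal sketch of the idea is to merge groups of similar KV heads, here by simple averaging, so that fewer head slots need to be cached (the grouping and the use of a plain mean are illustrative simplifications, not any particular paper's method):

```python
# Minimal sketch of KV head fusion: groups of KV heads are merged (averaged)
# so the cache stores fewer head slots.
import torch

def fuse_kv_heads(k: torch.Tensor, v: torch.Tensor, group: int = 4):
    """k, v: (batch, num_heads, seq_len, head_dim) -> fewer fused heads."""
    b, h, s, d = k.shape
    assert h % group == 0
    k_fused = k.view(b, h // group, group, s, d).mean(dim=2)
    v_fused = v.view(b, h // group, group, s, d).mean(dim=2)
    return k_fused, v_fused          # cache these instead of the full heads

k = v = torch.randn(1, 32, 128, 64)
k_f, v_f = fuse_kv_heads(k, v, group=4)
print(k_f.shape)                      # torch.Size([1, 8, 128, 64]): 4x smaller
```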
Research papers on KV head fusion or merging:
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao, 28 Oct 2024 (v2), Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, https://arxiv.org/abs/2410.19258
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
KV Cache Reuse (Multi-Query KV Caching)
KV cache reuse is the caching of the entire KV data from one query so that it can be reused in future queries, as a type of LLM inference optimization. The basic KV cache methods only store KV data within the current query, while autoregressive decoding is happening, and then the cache data is discarded. A more general approach is to store the KV data across multiple queries.
There are several types of multi-query KV cache reuse:
- Basic KV cache reuse (global KV caching)
- Prefix KV caching
- Session-based multi-turn KV caching (e.g., in chatbot conversations)
- Fused substring KV caching
Context Caching (Global KV Prefill/Encoder Caching)
Context caching is the storage and reuse of KV cache data from the "context" of an LLM query. This allows faster LLM inference on any context tokens that the LLM has seen before. Context caching is often implemented as a type of prefix KV caching or "prefix sharing" optimization.
Cross-query multi-user KV caching: The idea of "context caching" or "global KV caching" is to store the inference context across multiple queries. The prefill/encoder KV data is cached on disk across multiple user queries and reused on subsequent identical (or similar) queries. Because the cache is shared across many users, when any user inputs the same text, the KV computations do not have to be redone and can instead be loaded from the disk cache. This type of KV caching method is a subtype of an "Inference Cache" architecture.
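A minimal sketch of this kind of cross-query KV cache might hash the exact input text and store the prefill KV tensors on disk under that key, reloading them on an identical query (similar in spirit to the llama.cpp prompt-cache approach cited below; the file layout and the prefill hook are hypothetical stand-ins for a real inference engine):

```python
# Minimal sketch of a cross-query ("global") KV cache: prefill KV data is
# stored on disk keyed by a hash of the exact input text, and reloaded when
# an identical query arrives.
import hashlib, os, torch

CACHE_DIR = "kv_cache_store"

def cache_path(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, key + ".pt")

def load_or_prefill(prompt: str, prefill_fn):
    """prefill_fn(prompt) -> per-layer KV tensors (hypothetical model hook)."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = cache_path(prompt)
    if os.path.exists(path):
        return torch.load(path)           # cache hit: skip the prefill phase
    kv = prefill_fn(prompt)               # cache miss: run prefill once
    torch.save(kv, path)
    return kv

# Usage with a dummy prefill function standing in for the real model:
fake_prefill = lambda p: [torch.randn(1, 8, len(p.split()), 64) for _ in range(2)]
kv = load_or_prefill("What is KV caching?", fake_prefill)
```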
Framework Context Caching Support. The idea of a "context cache" or "multi-query KV cache" is appearing in open-source frameworks and commercial model platforms. Some of the platforms with "context caching" (including "prefix KV caching") include:
- vLLM (open source)
- Google (commercial)
- DeepSeek V2 (commercial)
This type of caching can greatly reduce inference costs, so I expect to see it in a lot more platforms, along with pricing reductions for such tokens. For example, DeepSeek has two distinct levels of pricing for "cached" and "non-cached" tokens.
Global semantic KV caching? Can we generalize the idea of a "semantic cache" to store the KV cache across semantically-similar queries, rather than just storing the text output? Maybe not, and I've not seen a paper on this. The main problem is that KV caching is very specific to the exact token sequence, whereas semantic caching aims to cache results for very different textual queries. It's not just different lengths of token sequences, but also having different tokens. Maybe there's some way to do it?
Research papers on multi-query prefill/encoder KV caching:
- G. Gerganov, March 13, 2023, Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64, Llama.cpp project, https://github.com/ggerganov/llama.cpp/issues/64 (This is multi-user KV caching; uses a hash of the input query to store the KV computation to a disk cache for re-use.)
- Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo, 4 Jun 2024 (v2), Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution, https://arxiv.org/abs/2406.00059 Code: https://github.com/conveyor-sys/conveyor (Speeding up inference by partially running tools in parallel to the LLM query processing, rather than sequentially after the LLM request, by detecting tool requests deep inside the decoding algorithm and starting them off immediately, before the LLM has finished generating the fully decoded output.)
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini, 13 May 2024 (v2), Hydragen: High-Throughput LLM Inference with Shared Prefixes, https://arxiv.org/abs/2402.05099 Code: https://github.com/jordan-benjamin/hydragen
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- J. Lin, S. Q. Zhang and A. Leon-Garcia, 2024, sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Structures, 2024 25th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2024, pp. 1-6, doi: 10.1109/ISQED60706.2024.10528703. https://ieeexplore.ieee.org/abstract/document/10528703 (Optimize the global KV cache by sharing it across multiple queries.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf, 19 Mar 2024, Encode Once and Decode in Parallel: Efficient Transformer Decoding, https://arxiv.org/abs/2403.13112
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 PDF: https://prongs1996.github.io/assets/pdf/CachedAttention.pdf (Memory management of KV caches using hierarchical cache layers.)
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- NVIDIA, July 2024 (accessed), KV cache reuse, https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md (KV cache reuse in TensorRT is an implementation of prefix-based KV caching.)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Llama.cpp, July 2024 (accessed), Prompt Caching, https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#prompt-caching
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- Hugging Face, Aug 2024 (accessed), Best Practices for Generation with Cache, https://huggingface.co/docs/transformers/kv_cache
- David Spuler, March 2024, Global KV Prefill/Encoder Caching, in Generative AI in C++, https://www.aussieai.com/book/ch29-global-prefill-kv-caching
- Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
- Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou, 30 Sep 2024, The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems, https://arxiv.org/abs/2409.20002
- Amr Elmeleegy, Nick Comly and Thor Johnsen, Nov 08, 2024, 5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse, https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/
Prefix KV Caching
Prefix KV caching, or prefix caching, is an LLM optimization that involves reusing a cache of the KV data for a previously seen token prefix. There are several common use cases where an LLM will reprocess the same token prefix, such as in session-based conversational history or prepended global instructions for all queries.
An interesting generalization of the global KV cache is that it needn't only apply to the whole query — it can also apply to any prefix. By caching the portion of the KV cache up to a prefix of the query, a significant amount of computation is avoided. And there are several major cases where a prefix re-occurs:
- RAG document prepended to a query
- Conversational history of a chatbot or other session-based AI.
- Global instructions (prepended to all queries)
- Document re-use (multiple queries on the same context)
Each time a RAG document is prepended as data to a user's query, the query has the same prefix. The KV cache can be stored for this prefix, and, even further, it could be precomputed offline for each RAG document chunk.
Similarly, each turn of a conversation between a user and a chatbot prepends the prior conversation history. Hence, the KV data for the current query acts as a prefix KV cache for the next query from the same user. This imposes some practical issues in terms of syncing the user's session with the cache on large data center backends (e.g., using "sticky sessions" in the network load balancing hardware, or a "network-shared disk" accessible from all servers, where the KV caches are stored). There's also a very clear win on single-user systems such as on-device chatbots on AI Phones or AI PCs, where there's only really a single session and the KV cache is obviously always on the same device.
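A minimal sketch of prefix reuse is to match the new prompt's token IDs against previously cached prefixes and skip prefill for the matched tokens (a flat dictionary is used here for clarity; production systems such as SGLang's RadixAttention use a trie/radix tree instead, and the cache structure and names here are illustrative):

```python
# Minimal sketch of prefix KV cache reuse: find the longest previously cached
# token prefix of the new prompt, reuse its KV data, and only prefill the
# remaining suffix tokens.
import torch

class PrefixKVCache:
    def __init__(self):
        self.entries = {}                  # tuple(token_ids) -> KV tensors

    def put(self, token_ids, kv):
        self.entries[tuple(token_ids)] = kv

    def longest_prefix(self, token_ids):
        best, best_kv = 0, None
        for cached_ids, kv in self.entries.items():
            n = len(cached_ids)
            if n > best and tuple(token_ids[:n]) == cached_ids:
                best, best_kv = n, kv
        return best, best_kv               # tokens whose prefill can be skipped

cache = PrefixKVCache()
system_prompt = [101, 7592, 2088, 102]                  # e.g. global instructions
cache.put(system_prompt, torch.randn(1, 8, len(system_prompt), 64))
query = system_prompt + [2054, 2003, 1037]              # same prefix + user tokens
matched, kv = cache.longest_prefix(query)
print(f"reuse KV for {matched} tokens, prefill only {len(query) - matched}")
```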
Research papers on prefix KV caching include:
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- X Wu, L Zhang, Y Wang, Y Ren, M Hack, 2016, zExpander: a Key-Value Cache with both High Performance and Fewer Misses, EuroSys ’16 April 18–21, 2016, London, United Kingdom, https://ranger.uta.edu/~sjiang/pubs/papers/wu16_zExpander.pdf (General theory paper about prefix key-value caching in a trie or binary tree, not specific to neural networks.)
- Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini, 13 May 2024 (v2), Hydragen: High-Throughput LLM Inference with Shared Prefixes, https://arxiv.org/abs/2402.05099 Code: https://github.com/jordan-benjamin/hydragen
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf, 19 Mar 2024, Encode Once and Decode in Parallel: Efficient Transformer Decoding, https://arxiv.org/abs/2403.13112
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 PDF: https://prongs1996.github.io/assets/pdf/CachedAttention.pdf (Memory management of KV caches using hierarchical cache layers.)
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, Jan 17, 2024, Fast and Expressive LLM Inference with RadixAttention and SGLang, https://lmsys.org/blog/2024-01-17-sglang/ https://arxiv.org/abs/2312.07104 Code: https://github.com/sgl-project/sglang/
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E. Gonzalez, Ion Stoica, and Matei Zaharia. March 2024, Optimizing llm queries in relational workloads, https://arxiv.org/abs/2403.05821
- Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li, 17 Apr 2023 (v2), AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems, https://arxiv.org/abs/2301.09262
- Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, Luis Ceze, Feb 2, 2024, Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, https://flashinfer.ai/2024/02/02/cascade-inference.html
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2024, INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference, https://openreview.net/pdf?id=wDDGQabYPQ
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- VLLM, 2024, What is Automatic Prefix Caching, https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- NVIDIA, July 2024 (accessed), KV cache reuse, https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md (KV cache reuse in TensorRT is an implementation of prefix-based KV caching.)
- Kexin Chu, Tzechinh Liu, Yunding Li, Pengchao Yuan, Wei Zhang, 2024, CaR: An Efficient KV Cache Reuse System for Large Language Model Inference, University of Connecticut, https://llm-gnn.org/slides/CaR-Chu.pdf
- James Groeneveld, Aug 1, 2024, Prompt Design at Character.AI, Character.AI blog, https://research.character.ai/prompt-design-at-character-ai/
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- David Spuler, September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
- Open AI, Oct 2024 (accessed), Prompt Caching, https://platform.openai.com/docs/guides/prompt-caching
- Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao, 4 Oct 2024, Compute Or Load KV Cache? Why Not Both? https://arxiv.org/abs/2410.03065
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie, 20 Oct 2024, EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models, https://arxiv.org/abs/2410.15332
- David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag
- Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
- Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, Yida Wang, 28 Nov 2024, Marconi: Prefix Caching for the Era of Hybrid LLMs, https://arxiv.org/abs/2411.19379 (Prefix caching applied to hybrid SSM-Transformer LLMs.)
- Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng, 29 Nov 2024, BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching, https://arxiv.org/abs/2412.03594
- Ao Wang, Hui Chen, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding, 4 Dec 2024, PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation, https://arxiv.org/abs/2412.03409 https://github.com/THU-MIG/PrefixKV
- PromptHub, December 6, 2024, Prompt Caching with OpenAI, Anthropic, and Google Models, https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models
Prompt Caching
Prompt caching is an LLM inference optimization that uses cached data for faster query responses. This is a more general optimization than context caching, as prompt caching may refer to caching of any elements of the prompt, including both user query and context tokens.
The term "prompt caching" is appearing in industry and can mean various things. Typically, it refers to multi-query caching of the KV data for input prompt tokens. However, prompt cache may also refer to prefix KV caching, or even to a non-KV inference cache.
Research papers on "prompt cache" include:
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Google, July 2024 (accessed), Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python
- Llama.cpp, July 2024 (accessed), Prompt Caching, https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#prompt-caching
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Hanlin Zhu, Banghua Zhu, Jiantao Jiao, 2 Feb 2024, Efficient Prompt Caching via Embedding Similarity, https://arxiv.org/abs/2402.01173
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, 25 Apr 2024 (v2), Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934
- Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
- Open AI, Oct 2024 (accessed), Prompt Caching, https://platform.openai.com/docs/guides/prompt-caching
- Michael Nuñez, October 1, 2024, OpenAI’s DevDay 2024: 4 major updates that will make AI more accessible and affordable, https://venturebeat.com/ai/openai-devday-2024-4-major-updates-that-will-make-ai-more-accessible-and-affordable/
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- PromptHub, December 6, 2024, Prompt Caching with OpenAI, Anthropic, and Google Models, https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models
Session KV Caching
Session KV caching is the use of KV caching to speed up the inference in an LLM session. This typically refers to a chatbot or Q&A session, where computations from earlier in a session can be cached and used to speed up the inference of the next query.
Chatbots and other session-based Q&A interfaces have an interesting property, where the current output is used as part of the input for the next query. This allows session-based optimization for conversational history or other multi-turn interfaces.
If you consider how the conversation history is handled, the output of the current query becomes the prepended text onto the next query. Although you cannot speed up the current query this way, you can avoid a lot of processing of the conversation on the next query by retaining the KV cache. This is a special case of "prefix KV caching" in multi-turn conversational sessions. Effectively, the prefill phase for processing any conversation history text is avoided by maintaining a KV cache of the prior answer.
There are some practical problems with this, especially related to the size of the KV cache. A long chatbot conversation, where the answers may be lengthy, can become a large sequence of tokens. This is even more the case in a Q&A session with a RAG architecture, where the responses are lengthy summaries of the retrieved chunks. In either case, the session's prepended conversation history grows long, and the amount of memory used by the KV cache becomes excessive. Various approaches exist to address this, such as context summarization, KV cache truncation, or KV cache compression. However, it is not entirely clear which of these KV cache size reductions still allow the prior conversational KV cache to replace the prefill of the next query.
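A minimal sketch of session-based reuse keeps the KV data from each turn keyed by session ID, so only the new turn's tokens need prefill (the session store and the model hook below are hypothetical stand-ins, not any particular framework's API):

```python
# Minimal sketch of session-based KV caching for a multi-turn chat: the KV
# data produced while answering one turn is kept (keyed by session ID) and
# becomes the prefix cache for the next turn, so the conversation history is
# not re-prefilled each time.
import torch

sessions = {}   # session_id -> {"tokens": [...], "kv": per-layer KV tensors}

def chat_turn(session_id, new_tokens, run_model):
    """run_model(tokens, past_kv) -> (reply_tokens, updated_kv); hypothetical."""
    state = sessions.get(session_id, {"tokens": [], "kv": None})
    reply, kv = run_model(new_tokens, state["kv"])   # only new tokens are prefilled
    state["tokens"] += new_tokens + reply
    state["kv"] = kv                                  # reused on the next turn
    sessions[session_id] = state
    return reply

# Dummy model stand-in: returns a fixed reply and grows the "KV cache".
def fake_model(tokens, past_kv):
    past_len = 0 if past_kv is None else past_kv.shape[2]
    kv = torch.randn(1, 8, past_len + len(tokens) + 1, 64)
    return [42], kv

print(chat_turn("user-1", [1, 2, 3], fake_model))
print(chat_turn("user-1", [4, 5], fake_model))   # history KV reused, not recomputed
```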
Research papers on session-based prefix caching:
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 (Memory management of KV caches using hierarchical cache layers.)
- VLLM, 2024, What is Automatic Prefix Caching, https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
- Kexin Chu, Tzechinh Liu, Yunding Li, Pengchao Yuan, Wei Zhang, 2024, CaR: An Efficient KV Cache Reuse System for Large Language Model Inference, University of Connecticut, https://llm-gnn.org/slides/CaR-Chu.pdf
- James Groeneveld, Aug 1, 2024, Prompt Design at Character.AI, Character.AI blog, https://research.character.ai/prompt-design-at-character-ai/
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- Hugging Face, Aug 2024 (accessed), Best Practices for Generation with Cache, https://huggingface.co/docs/transformers/kv_cache
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
Fused KV Caching
Fused KV caching is an attempt to generalize prefix KV caching to handle multiple text sequences anywhere in the input prompt string. This allows LLM inference to be optimized in any situation where a portion of the tokens has previously been processed by the LLM.
Fused KV caching is an improvement over prefix KV caching. For example, when RAG returns two or more chunks, only the first chunk benefits from prefix KV caching: the second chunk is only a cache hit if it happens to appear again after the same first chunk. This undermines the goal of maintaining a cache for any RAG chunk that is often used in answers. The problem can be alleviated somewhat by "caching-aware reranking", which reorders the chunks so that a cached chunk is placed first; the first chunk is then almost always cached, but the KV cache for the second and later chunks must still be recomputed.
There are some major problems that block this idea of caching sequences for two or more chunks anywhere in the prompt text:
- Positional encoding
- Inter-chunk attention
- Non-linearity
Positional encoding: One of the difficulties is that the KV cache of a chunk appearing second in the sequence actually depends on the length of the first sequence, because of positional encoding. Hence, merging of the KV cache for the second or later chunks is difficult, and would really need to be per-position, which is an unrealistic memory requirement.
Inter-chunk attention: If there are two RAG chunks and we have a cached KV data set for each, neither cache includes attention to the other chunk. If a user query follows the second chunk, the precomputed KV data for the second chunk has no contribution from the first chunk. Obviously, the attention computation for the query tokens will attend to both of the chunks again, as part of prefill, but is that enough attention? It seems likely for RAG that cross-chunk attention is far less important than query-to-chunk attention, but this hasn't really been confirmed empirically.
Non-linearity of KV data: If the KV computations were linear, we could just add the two KV caches together for the two RAG chunks. Unfortunately, as we well know, LLMs don't work linearly, so more complicated approximations are required.
Research on Fused KV Caching: This is a new area of research with not many papers yet. One approach in [Gim et al 2023] is to adjust the KV cache for the positional encoding problems, and then merge them together using simple concatenation, which is an approximation of the KV cache that would be computed. The positional encoding problem is avoided by using a structured prompt layout that ensures the text always occurs at particular positions (i.e., similar to the idea of having fixed lengths for each RAG chunk). Hence, there are limited positions for each chunk, and caching per-chunk for a limited number of positions is realistic. No attempt is made to add any inter-chunk attention, with each chunk's KV cache simply used without changes, and they are all concatenated together to create the full token sequence's KV cache data. Surprisingly, this seems to all work well, as shown in the paper. So much for those concerns about non-linearity!
Another approach to fusing KV caches in [Yao et al, 2024] is selective re-computation of KV caches combined with fusing the KV cache for multiple chunks, which again is an approximation. This merges two or more KV caches together in sequence, but then does a selective re-computation of the KV caches for some tokens. Around 10-20% of the tokens have their KV cache re-computed in each layer. As the paper shows, this is much faster in terms of Time-to-first-token, but has a negligible loss in accuracy.
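A minimal sketch of the concatenation approach described above simply joins precomputed per-chunk KV caches along the sequence axis, accepting the approximations around positional encoding and inter-chunk attention (the shapes and helper names are illustrative, and this follows the general concatenation idea rather than any one paper's code):

```python
# Minimal sketch of fused KV caching for RAG: per-chunk KV caches, precomputed
# offline, are concatenated along the sequence axis as an approximation of the
# full prefill (ignoring inter-chunk attention, and assuming positions have
# been handled, e.g. via fixed chunk slots).
import torch

def fuse_chunk_caches(chunk_caches):
    """chunk_caches: list of per-chunk (K, V), each (batch, heads, chunk_len, dim)."""
    keys = torch.cat([k for k, _ in chunk_caches], dim=2)
    values = torch.cat([v for _, v in chunk_caches], dim=2)
    return keys, values     # approximate KV cache for the concatenated chunks

# Two precomputed RAG chunk caches of different lengths:
chunk_a = (torch.randn(1, 8, 100, 64), torch.randn(1, 8, 100, 64))
chunk_b = (torch.randn(1, 8, 80, 64), torch.randn(1, 8, 80, 64))
k, v = fuse_chunk_caches([chunk_a, chunk_b])
print(k.shape)              # torch.Size([1, 8, 180, 64]); the query tokens still
                            # attend to both chunks during their own prefill
```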
Research papers on fused KV caching:
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, Nov 2023, Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934 (Unique and insightful advance of generalizing KV caching to multiple prompts by computing a cache for short "segments" of prompts, including methods to adjust the different KV cache values for text segments that appear in different positions of the overall prompt.)
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457 (This paper briefly considers merging KV caches of multiple RAG chunks, but instead focuses on (a) caching of two or more chunks in one KV cache record, and (b) reordering the chunks in a cache-aware manner.)
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056 (Examines KV cache head merging approaches for KV cache size reduction, and also examines RoPE encoding issues with relevance to fusing KV caches.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- David Spuler, September 26, 2024, RAG Optimization via Caching, Aussie AI Blog, https://www.aussieai.com/blog/rag-optimization-caching
- Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang, 10 Oct 2024, TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text, https://arxiv.org/abs/2410.07590 (Fusing precomputed KV caches for each RAG chunk.)
- Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie, 20 Oct 2024, EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models, https://arxiv.org/abs/2410.15332
- David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
General Research on KV Cache Optimization
Other papers on optimizing the KV cache include:
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437 (Further optimizes paged attention algorithm for KV caching in attention, by storing the KV cache in contiguous memory and using underlying system paging.)
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 30 Mar 2024, DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference, https://arxiv.org/abs/2404.00242
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao, 23 Jun 2024 (v2), Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Disaggregates the KV cache between prefill and decoding tokens, since the KV cache size is known for prefill, thereby reducing memory fragmentation, and also applies kernel fusion to several modules including the scaled dot product attention.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Song Jiang, 18 April 2016, zExpander: a key-value cache with both high performance and fewer misses, EuroSys '16: Proceedings of the Eleventh European Conference on Computer Systems, April 2016, Article No. 14, pp. 1–15, https://dl.acm.org/doi/abs/10.1145/2901318.2901332 https://doi.org/10.1145/2901318.2901332
Computation Reuse Optimizations
Computation reuse optimizations are LLM inference improvements where previously computed data is reused rather than recomputed. There are several areas where LLM computations can be reused, including kernel optimizations and KV caching. Computations in neural network inference can be reused by storing them in a cache, a technique also called "memoization".
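As a trivial illustration of memoization, here is a minimal C++ sketch that caches the result of an expensive computation keyed on a hash of its input. The compute_layer function is a dummy stand-in, and the exact-match hashing (with no collision handling) is purely illustrative, not any framework's actual mechanism.

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Stand-in for an expensive computation (e.g. a layer or kernel invocation).
std::vector<float> compute_layer(const std::vector<float>& input) {
    std::vector<float> out(input.size());
    for (size_t i = 0; i < input.size(); ++i) out[i] = input[i] * 2.0f; // dummy work
    return out;
}

// FNV-1a hash of the raw float bits, used as the memoization key.
uint64_t hash_vector(const std::vector<float>& v) {
    uint64_t h = 1469598103934665603ULL;
    for (float x : v) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        h = (h ^ bits) * 1099511628211ULL;
    }
    return h;
}

static std::unordered_map<uint64_t, std::vector<float>> g_memo_cache;

std::vector<float> compute_layer_memoized(const std::vector<float>& input) {
    uint64_t key = hash_vector(input);
    auto it = g_memo_cache.find(key);
    if (it != g_memo_cache.end()) return it->second;   // cache hit: reuse result
    std::vector<float> result = compute_layer(input);  // cache miss: compute
    g_memo_cache[key] = result;
    return result;
}
```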
Research papers on data re-use or computation re-use include:
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey that contains a section on "Memoization" which is caching computed values for later reuse.)
- X. Jiao, V. Akhlaghi, Yu Jiang, and R. K. Gupta. 2018. Energy-efficient neural networks using approximate computation reuse. Proc. of the 2018 Design, Automation and Test in Europe Conference and Exhibition, (DATE) (2018), 1223–1228. https://ieeexplore.ieee.org/document/8342202 https://www.date-conference.com/proceedings-archive/2018/pdf/0571.pdf (Uses Bloom filters and caching of results.)
- JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
- Luca Mocerino, Valerio Tenace, and Andrea Calimera. 2019. Energy-Efficient Convolutional Neural Networks via Recurrent Data Reuse. In Design, Automation Test in Europe Conference Exhibition (DATE). 848–853. https://ieeexplore.ieee.org/document/8714880 (Caching using an associative memory unit.)
- M Riera, JM Arnau, A González, 2022, CREW: Computation reuse and efficient weight storage for hardware-accelerated MLPs and RNNs, Journal of Systems Architecture, Elsevier, https://www.sciencedirect.com/science/article/pii/S1383762122001394, https://arxiv.org/abs/2107.09408 (Hardware-assisted caching of weight computations.)
- V Janfaza, K Weston, M Razavi, 2023, MERCURY: Accelerating DNN Training By Exploiting Input Similarity, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/abstract/document/10071051/, https://arxiv.org/abs/2110.14904 (Uses a bit-signature to find similar vectors for reusing dot product calculations.)
- X Li, B Ren, X Shen, Y Wang, 2022, CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework, arXiv preprint arXiv:2206.10620, https://arxiv.org/abs/2206.10620 (Various optimizations including block pruning and deep reuse.)
- Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li, 2023, AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems, https://arxiv.org/abs/2301.09262 (Exploits input text similarities in terms of tokens and embeddings so as to cache and reuse tensor attention computations.)
- M Capra, B Bussolino, A Marchisio, M Shafique, 2020, An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, PDF: https://www.mdpi.com/1999-5903/12/7/113/pdf (Includes a section on data reuse.)
- Hegde, K., Yu, J., Agrawal, R., Yan, M., Pellauer, M., Fletcher, C.W.: UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition. In: Proceedings of the 45th Annual International Symposium on Computer Architecture. pp. 674–687. IEEE Press (2018) https://ieeexplore.ieee.org/document/8416864, https://arxiv.org/abs/1804.06508 (Combines analysis of "weight repetition" to reuse partial dot product results, further improved with sparsity.)
- E. Guo, P. Li, K. Wang, H. Feng, J. Lu, and S. Guo, Exploiting computation reuse in cloud-based deep learning via input reordering, in Proc. ICC-IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–6. https://ieeexplore.ieee.org/document/9148746
- Franyell Silfa, Jose-Maria Arnau, Antonio González, Feb 2022, Saving RNN Computations with a Neuron-Level Fuzzy Memoization Scheme https://arxiv.org/abs/2202.06563 (Uses "fuzzy memoization" in hardware for computation reuse.)
- Franyell Silfa, Gem Dot, Jose-Maria Arnau, and Antonio Gonzàlez. Neuron-Level Fuzzy Memoization in RNNs. In Annual IEEE/ACM International Symposium on Microarchitecture, pages 782–793, 2019. https://dl.acm.org/doi/10.1145/3352460.3358309 (Uses "fuzzy memoization" for caching neuron outputs for computation reuse.)
- Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, and Xipeng Shen. DREW: Efficient Winograd CNN Inference with Deep Reuse. In Proceedings of the ACM Web Conference 2022, pages 1807–1816, 2022. https://dl.acm.org/doi/10.1145/3485447.3511985 PDF: https://research.csc.ncsu.edu/picture/publications/papers/www2022.pdf (Combines the Winograd matrix multiplication algorithm with computation reuse.)
- NM Cicek, L Ning, O Ozturk, 2022, General reuse-centric CNN accelerator, IEEE Transactions on Computers, Volume 71, Issue 4, 01 April 2022, https://ieeexplore.ieee.org/abstract/document/9373953/ (Detects similarities in patches of images.)
- F Zhang, R Wu, J Guan, Z Zheng, X Guo, 2023, Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device, IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 10, 01 October 2023, https://ieeexplore.ieee.org/abstract/document/10106424/ (Further analysis of DREW, which combines Winograd optimizations with data reuse.)
- Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
- J Servais, E Atoofian, 2021, Adaptive computation reuse for energy-efficient training of deep neural networks, ACM Transactions on Embedded Computing Systems, Volume 20, Issue 6, Article No. 114, pp. 1–24, https://doi.org/10.1145/3487025, https://dl.acm.org/doi/abs/10.1145/3487025 (Optimizing training rather than inference, by using computation reuse.)
- Alireza Khadem, Haojie Ye, Trevor Mudge, Apr 2021, CoDR: Computation and Data Reuse Aware CNN Accelerator, https://arxiv.org/abs/2104.09798 (Hardware-accelerated optimization based on sparsity, weight repetition, and similarity for computation reuse.)
- Hoda Mahdiani; Alireza Khadem; Ali Yasoubi; Azam Ghanbari; Mehdi Modarressi; Masoud Daneshtalab, 2020, Computation reuse-aware accelerator for neural networks, In: Hardware Architectures for Deep Learning, Institution of Engineering and Technology , 2020, p. 147-158, https://digital-library.theiet.org/content/books/10.1049/pbcs055e_ch7, https://www.diva-portal.org/smash/record.jsf?pid=diva2:1761051
- Mohammed F. Tolba, Hani Saleh, Baker Mohammad, Mahmoud Al-Qutayri, Thanos Stouraitis, 2023, EACNN: Efficient CNN Accelerator Utilizing Linear Approximation and Computation Reuse, 2023 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/document/10181343 (Reducing multiplications by detecting weight similarity.)
- C Barrios, M Kumar, 2023, Service Caching and Computation Reuse Strategies at the Edge: A Survey, ACM Computing Surveys, Volume 56, Issue 2, Article No. 43, pp 1–38, https://dl.acm.org/doi/abs/10.1145/3609504
- B Zerom, M Tolba, H Tesfai, H Saleh, 2022, Approximate Logarithmic Multiplier For Convolutional Neural Network Inference With Computational Reuse, 2022 29th IEEE International Conference on Electronics, Circuits and Systems (ICECS), https://ieeexplore.ieee.org/document/9970861 (Combines the Logarithmic Number System, Mitchell's approximate multiplication algorithm, and data reuse strategies to speed up MAC operations.)
- A Ghanbari, M Modarressi, 2022, Energy-efficient acceleration of convolutional neural networks using computation reuse, Journal of Systems Architecture, https://www.sciencedirect.com/science/article/pii/S1383762122000674, https://dl.acm.org/doi/10.1016/j.sysarc.2022.102490
- MF Tolba, HT Tesfai, H Saleh, B Mohammad, 2022, Deep Neural Networks-Based Weight Approximation and Computation Reuse for 2-D Image Classification, IEEE Access (Volume 10), https://ieeexplore.ieee.org/abstract/document/9740128/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09740128.pdf, https://arxiv.org/abs/2105.02954
- D Ma, X Yin, M Niemier, XS Hu, X Jiao, 2020, Axr-nn: Approximate computation reuse for energy-efficient convolutional neural networks, GLSVLSI '20: Proceedings of the 2020 on Great Lakes Symposium on VLSI, September 2020, Pages 363–368, https://dl.acm.org/doi/abs/10.1145/3386263.3407595
- Ali Yasoubi; Reza Hojabr; Mehdi Modarressi, 2016, Power-efficient accelerator design for neural networks using computation reuse, IEEE Computer Architecture Letters, Volume 16, Issue 1, Jan-June 2017, https://ieeexplore.ieee.org/abstract/document/7393481/ PDF: https://www.researchgate.net/profile/Ali-Yasoubi/publication/292077006_Power-Efficient_Accelerator_Design_for_Neural_Networks_Using_Computation_Reuse/links/5b98c095299bf14ad4d0b04d/Power-Efficient-Accelerator-Design-for-Neural-Networks-Using-Computation-Reuse.pdf
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” ISCA, 2016. https://ieeexplore.ieee.org/document/7551407, PDF: http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf, PDF Slides: https://eems.mit.edu/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf, Project: http://eyeriss.mit.edu/ (Examines computation re-use as a memory-efficient dataflow, including in kernel operator fusion, which is called "folding.")
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Caching and data reuse in low-level kernel fusion optimizations.)
- Xia, C., Zhao, J., Sun, Q., Wang, Z., Wen, Y., Feng, X., Cui, H., 2023, Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 27 Apr-01 May 2023, San Diego, USA. https://eprints.whiterose.ac.uk/203681/, PDF: https://eprints.whiterose.ac.uk/203681/1/asplos24.pdf (Identifying data reuse opportunities at the ML compiler level.)
- C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf (Explores computation reuse of partial dot product sums.)
- Rohan Baskar Prabhakar, Sachit Kuhar, Rohit Agrawal, Christopher J. Hughes, and Christopher W. Fletcher. Summerge: An efficient algorithm and implementation for weight repetition-aware dnn inference. In Proceedings of the ACM International Conference on Supercomputing, ICS ’21, page 279–290, New York, NY, USA, 2021. Association for Computing Machinery, PDF: https://dl.acm.org/doi/pdf/10.1145/3447818.3460375 (Efficient dot product computation via computation reuse and weight repetition.)
- Michael E. Wolf and Monica S. Lam, 1991, A Data Locality Optimizing Algorithm, Proceedings of the ACM SIGPLAN ’91 Conference on Programming Language Design and Implementation. Toronto, Ontario, Canada, June 26-28, 1991. https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15745-s05/www/lectures/wolf-lam-pldi91.pdf (Early 1991 paper that includes optimizations to loops and matrix multiplication algorithms with a cache locality focus.)
Cached or Precomputed Transpose
Cached transpose, or cached matrix transpose, is a kernel optimization in LLM inference where the transpose of a matrix is precomputed or cached. There are several ways that MatMul or GEMM kernels can be optimized using the transpose of a matrix, because the transposed layout can hold the needed data contiguously, leading to faster memory access patterns.
One minor way to optimize matrix multiplications that involve the transpose of a matrix is to store both versions of the matrix in memory: the original matrix and its transpose. This can help to speed up inference by (a) avoiding the need to compute the transpose on-the-fly, and (b) having the transpose already laid out properly in contiguous memory for pipelining and dataflow efficiency.
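For example, here is a minimal C++ sketch (row-major layout assumed) of why a precomputed transpose helps: with the transpose BT stored at model load time, the inner dot-product loop walks both operands sequentially in memory, rather than striding through B column-by-column.

```cpp
#include <vector>

// C = A * B, where A is MxK and B is KxN, all stored row-major.
// BT is the precomputed transpose of B (NxK), stored once at model load time.
void matmul_with_cached_transpose(const std::vector<float>& A,
                                  const std::vector<float>& BT,
                                  std::vector<float>& C,
                                  int M, int K, int N) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            const float* a_row = &A[i * K];   // contiguous row of A
            const float* b_col = &BT[j * K];  // contiguous row of BT = column of B
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)
                sum += a_row[k] * b_col[k];   // sequential memory access in both
            C[i * N + j] = sum;
        }
    }
}
```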
Some papers mention this optimization technique:
- Dave Dice, Alex Kogan, Feb 2021, Optimizing Inference Performance of Transformers on CPUs, https://arxiv.org/pdf/2102.06621.pdf (Stores weight matrix and its transpose in memory, precomputed from the start.)
- Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou, 21 Feb 2022, I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels https://arxiv.org/abs/2202.10217
- Paul Springer, Paolo Bientinesi, 7 Nov 2017 (v3), Design of a high-performance GEMM-like Tensor-Tensor Multiplication, https://arxiv.org/abs/1607.00145
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- David Spuler, March 2024, Cached or Precomputed Transpose, in Generative AI in C++, https://www.aussieai.com/book/ch29-cached-precomputed-transpose
Vector Computation Reuse with Hashing (Vector Caching)
Vector caching is the use of precomputed vector dot products that are stored in a cache datastore. A lot of the low-level computation of tensor multiplications or "convolutions" in AI inference breaks down into a vector dot product computation (also called a "scalar product"). This is a multiplication-addition (multiply-accumulate or MAC) of all of the elements of two vectors to create a single number, usually a 32-bit floating point result. Hence, it's a good candidate for caching with computation reuse since it's a large amount of arithmetic, and the result is only a single floating-point number, which won't need much extra space to store. The trade-off of a small amount of extra storage to avoid significant computations is attractive. Also, one of those vectors is static during inference (e.g. the weights vector), whereas the other vector operand is dynamic. Hence, the idea with vector computation reuse is to cache the computed dot product results and then detect when the second, incoming (dynamic) vector is the same as, or similar enough to, a previous incoming vector.
Various researchers have looked into this type of vector caching. The main methods to detect similar vectors are:
- Locality-Sensitive Hashing (LSH): the most popular method, which uses cosine similarity to find reasonably "close" vectors in n-dimensional space. See also more research on hashing.
- Bit signatures
- K-means clustering
- Hyper-Cube
If the vectors are similar, the cached result is a reasonable approximation of the dot product computation, which is thereby avoided.
Assuming similar vectors can be identified efficiently, the question is: how often does AI model inference perform vector computations on similar vectors? What is the cache hit rate? Research papers seem to indicate that similar vectors occur rather often, with some reporting inference speedups of around 50%.
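Here is a minimal C++ sketch of the LSH-style approach, using random-hyperplane bit signatures: the incoming dynamic vector is reduced to a short signature, and if the same signature has been seen before, the previously computed dot product against the fixed weight vector is reused as an approximation. The 32-bit signature length and the cache structure are illustrative choices, not taken from any particular paper.

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Random hyperplanes producing a 32-bit LSH signature (sign of each projection).
struct LshSigner {
    std::vector<std::vector<float>> planes;  // 32 x dim
    explicit LshSigner(int dim, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        planes.assign(32, std::vector<float>(dim));
        for (auto& p : planes)
            for (auto& x : p) x = gauss(rng);
    }
    uint32_t signature(const std::vector<float>& v) const {
        uint32_t sig = 0;
        for (int b = 0; b < 32; ++b) {
            float dot = 0.0f;
            for (size_t i = 0; i < v.size(); ++i) dot += planes[b][i] * v[i];
            if (dot > 0.0f) sig |= (1u << b);  // one bit per hyperplane side
        }
        return sig;
    }
};

// Cache dot products against one fixed weight vector, keyed by LSH signature.
float cached_dot(const std::vector<float>& weights,
                 const std::vector<float>& input,
                 const LshSigner& signer,
                 std::unordered_map<uint32_t, float>& cache) {
    uint32_t sig = signer.signature(input);
    auto it = cache.find(sig);
    if (it != cache.end()) return it->second;  // similar vector seen: reuse (approx.)
    float dot = 0.0f;
    for (size_t i = 0; i < input.size(); ++i) dot += weights[i] * input[i];
    cache[sig] = dot;                          // store for future reuse
    return dot;
}
```

Note that computing the signature itself costs several projections, so in practice the signature would be computed once per incoming vector and amortized across many weight rows (one cache per row), which is how schemes of this kind pay for themselves.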
Research papers on LSH-based vector dot product caching and computation reuse:
- L. Ning and X. Shen, 2019, Deep reuse: Streamline CNN inference on the fly via coarse-grained computation reuse, in Proc. ACM Int. Conf. Supercomputing, pp. 438–448, https://dl.acm.org/doi/10.1145/3330345.3330384 (A dynamic method to reuse computations based on vector and sub-vector similarities detected via locality-sensitive hashing during inference. Explores LSH, K-means clustering and Hyper-Cube for vector similarity detection.)
- L. Ning, H. Guan, and X. Shen, Apr 2019, Adaptive deep reuse: Accelerating CNN training on the fly, in Proc. IEEE 35th Int. Conf. Data Eng. (ICDE), pp. 1538–1549.
- R. Moura, P. Santos, J. Lima, M. Alves, A. Beck, and L. Carro, Skipping CNN convolutions through efficient memoization, in Proc. Int. Conf. Embedded Comput. Syst. Springer, 2019, pp. 65–76. https://link.springer.com/chapter/10.1007/978-3-030-27562-4_5, PDF: https://web.inf.ufpr.br/mazalves/wp-content/uploads/sites/13/2020/03/samos2019.pdf (Caching of dot product multiplications using hashing and proximity-based clustering.)
- Vahid Janfaza, Kevin Weston, Moein Razavi, Shantanu Mandal, Farabi Mahmud, Alex Hilty, Abdullah Muzahid, 2021 (revised Nov 2022), SIMCNN: Exploiting Computational Similarity to Accelerate CNN Training in Hardware, https://www.researchgate.net/profile/Moein-Razavi-Ghods/publication/355730736_SIMCNN_--_Exploiting_Computational_Similarity_to_Accelerate_CNN_Training_in_Hardware/links/61a9605c29948f41dbbe7358/SIMCNN--Exploiting-Computational-Similarity-to-Accelerate-CNN-Training-in-Hardware.pdf, https://arxiv.org/abs/2110.14904 (Uses bit sequences to detect similar vectors.)
- NM Cicek, Feb 2021, General reuse-centric CNN accelerator, Masters Thesis, Graduate School of Engineering and Science, Bilkent university, PDF: http://repository.bilkent.edu.tr/bitstream/handle/11693/55049/Master_Thesis_Bilkent_NihatMertCicek_v3.pdf?sequence=1 (Data reuse algorithms and includes hardware-based LSH detection of vector similarity.)
- NM Cicek, X Shen, O Ozturk, 2022, Energy Efficient Boosting of GEMM Accelerators for DNN via Reuse, ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5, Article No. 43, pp 1–26, https://doi.org/10.1145/3503469, https://dl.acm.org/doi/10.1145/3503469, PDF: https://dl.acm.org/doi/pdf/10.1145/3503469 (Uses computation reuse to speed up matrix multiplications.)
- P Guo, B Hu, R Li, W Hu, 2018, Foggycache: Cross-device approximate computation reuse MobiCom '18: Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, October 2018, Pages 19–34, https://dl.acm.org/doi/abs/10.1145/3241539.3241557, PDF: https://par.nsf.gov/servlets/purl/10122201 (Uses LSH for computation reuse across multiple devices.)
- Azar Rahimi, Luca Benini, and Rajesh K. Gupta. 2013. Spatial memoization: Concurrent instruction reuse to correct timing errors in SIMD architectures. IEEE Transactions on Circuits and Systems II: Express Briefs 60, 12, 847–851. https://ieeexplore.ieee.org/document/6617694, PDF: https://iis-people.ee.ethz.ch/~arahimi/papers/TCAS-II13.pdf (A spatial approximation, but not LSH.)
Input Similarity-Based Caching and Re-Use
Input similarity caching is the reuse of cached data from a previously seen input or image. When an input is similar enough to a prior input, the previous inference results can be cached and re-used. This is applicable to the analysis of continuous feeds, such as audio or video frames, where the incremental differences between frames are relatively small. This is a type of incremental algorithm for neural network inference.
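As a simple illustration, here is a hedged C++ sketch of frame-level reuse: if the new frame differs from the previous one by less than a threshold (a plain L2 distance here, purely as an example), the previous inference result is returned instead of recomputing. The run_inference function is a dummy stand-in for the real model, and the distance metric and threshold are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Stand-in for the real model's full inference on one frame (hypothetical).
std::vector<float> run_inference(const std::vector<float>& frame) {
    return frame;  // dummy: a real model would output detections, labels, etc.
}

struct FrameReuseCache {
    std::vector<float> last_frame;
    std::vector<float> last_result;
    bool valid = false;

    std::vector<float> infer(const std::vector<float>& frame, float threshold) {
        if (valid && frame.size() == last_frame.size()) {
            float dist2 = 0.0f;
            for (size_t i = 0; i < frame.size(); ++i) {
                float d = frame[i] - last_frame[i];
                dist2 += d * d;
            }
            if (std::sqrt(dist2) < threshold)
                return last_result;          // frames similar enough: reuse result
        }
        last_result = run_inference(frame);  // frames differ: recompute
        last_frame = frame;
        valid = true;
        return last_result;
    }
};
```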
Research papers on caching from prior similar inputs include:
- Marc Riera, Jose-Maria Arnau, and Antonio Gonzalez. 2018. Computation Reuse in DNNs by Exploiting Input Similarity. In Annual International Symposium on Computer Architecture (ISCA). 57–68. https://ieeexplore.ieee.org/document/8416818 https://dl.acm.org/doi/10.1109/ISCA.2018.00016 (Caching of similar results from DNN computations in previous audio or video frames.)
- F Alantali, Y Halawani, B Mohammad, et al., 2021, SLID: Exploiting spatial locality in input data as a computational reuse method for efficient CNN, IEEE Access, https://ieeexplore.ieee.org/abstract/document/9395591/, PDF: https://ieeexplore.ieee.org/iel7/6287639/9312710/09395591.pdf (Caches partial MAC calculations and re-uses them, under conditions of input similarity.)
- Fatmah Ali Alantali, 2021, Thesis, Efficient CNN Inference using Spatial Local Input Data Similarity, MSc. Thesis, Electrical and Computer Engineering, Khalifa University, December 2021, PDF: https://khalifauniversity.elsevierpure.com/ws/portalfiles/portal/6825541/file (SLID method for reduced processing of MAC operations in CNN inference using caching, based on input similarity.)
- F. Alantali, Y. Halawani, B. Mohammad and M. Al-Qutayri, 2023, “F-CNN: Faster CNN exploiting SLID with Statistical Analysis,” 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), https://ieeexplore.ieee.org/document/10168606
- J Schmerge, D Mawhirter, C Holmes, 2021, ELIχR: Eliminating Computation Redundancy in CNN-Based Video Processing, 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA), https://ieeexplore.ieee.org/abstract/document/9651154/, PDF: https://par.nsf.gov/servlets/purl/10323450 (Incremental inference for video frames using caching.)
- H Benmeziane, H Bouzidi, H Ouarnoughi, May 2023, Treasure What You Have: Exploiting Similarity in Deep Neural Networks for Efficient Video Processing, https://arxiv.org/abs/2305.06492
- Y Yang, Y Liu, Z Yuan, W Sun, R Liu, 2021, A 65-nm Energy-Efficient Interframe Data Reuse Neural Network Accelerator for Video Applications, IEEE Journal of Solid-State Circuits, Volume 57, Issue 8, August 2022, https://ieeexplore.ieee.org/abstract/document/9631967/
Inference Cache
Inference cache is an LLM optimization that speeds up inference by reusing data stored in a cache from a previous inference computation. This can include a full inference cache of text-to-text mappings, or a partial inference cache that stores a cache of KV data.
A full inference cache is where the entire results of a model inference are stored, and re-used for a later identical query. For example, such an approach would recognize that 100 users are all submitting "This is a test", whether concurrently or over time, and would do the inference computation only once. There are multiple things that could be stored by the cache:
- The full output text
- Logit probabilities
- Prefill/encoder KV data
Basic inference caching involves storing the actual identical results, in which case all users would get exactly the same response. Alternatively, a more flexible approach that still avoids most computations is storing the near-final results, in some intermediate form with logits (probabilities), and a final brief computation can still emit different results to different users. In this way, most of the computation is avoided, and some variability is added to the final output.
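Here is a minimal C++ sketch of a full text-to-text inference cache, assuming a simple exact-match lookup keyed on the prompt string; a real deployment would add eviction, prompt normalization, and the cache-bypass logic discussed below. The generate function is a dummy stand-in for the actual LLM call.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for a full LLM inference call (hypothetical).
std::string generate(const std::string& prompt) {
    return "model output for: " + prompt;  // dummy output
}

class InferenceCache {
public:
    std::string query(const std::string& prompt) {
        auto it = cache_.find(prompt);
        if (it != cache_.end()) return it->second;  // exact match: no LLM work at all
        std::string answer = generate(prompt);      // miss: run the full inference
        cache_[prompt] = answer;
        return answer;
    }
private:
    std::unordered_map<std::string, std::string> cache_;
};
```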
Limitations of Inference Caching. An inference cache does not always work well. The problems that can arise with an inference cache include:
- Non-variability of output (as mentioned above)
- Time-dependent queries
- Context-dependent queries
Some queries cannot be cached, and an inference cache needs either heuristics or training to know when. It's not that obvious. Consider these time-dependent queries:
- What is the current time?
- What's the score in the World Cup final?
- What properties are currently for sale in Houston?
Time isn't the only problem; other aspects of the user's context should also alter the answers, not just across time, but with different answers for different users at the same time. Here are some examples:
- What's the weather forecast?
- What zip code am I in?
- What's my IP address?
- When is high tide?
If you look closely, you might notice that most or all of these problematic queries have a commonality: they're the same types of queries that require extra tool usage by the LLM (e.g. clocks, data source plug-in integrations, etc.). None of these queries can be answered by the weights that were trained from the input training data set. Hence, it's plausible to turn off caching using the same mechanism whereby tool-requiring queries are identified (e.g. fine-tuned tool "trigger" tokens, multi-step planning, or symbolic execution).
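A hedged sketch of that idea in C++: bypass the cache whenever the same classifier that detects tool-requiring or context-dependent queries fires. The keyword checks below are toy heuristics standing in for a real trigger-token or planning mechanism.

```cpp
#include <string>

// Toy heuristic stand-ins for real classifiers (e.g. fine-tuned trigger tokens).
bool requires_tool(const std::string& prompt) {
    return prompt.find("current") != std::string::npos ||
           prompt.find("today") != std::string::npos;       // time-dependent
}
bool is_context_dependent(const std::string& prompt) {
    return prompt.find(" my ") != std::string::npos ||       // e.g. "my IP address"
           prompt.find("weather") != std::string::npos;      // per-user answers
}

bool is_cacheable(const std::string& prompt) {
    // Skip the inference cache for queries the trained weights alone cannot answer.
    return !requires_tool(prompt) && !is_context_dependent(prompt);
}
```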
Global KV Caching. An example of a partial inference cache that overcomes some of these problems is global KV caching, where the interim results of K and V operations in attention are stored across queries. The KV calculations are the same for identical queries, so they can be stored and re-loaded when a previously seen query text is detected (e.g. by hashing). This is a type of inference cache, but only the KV calculations are stored, with the full decoding sequences still needing execution. The advantages of global KV caching include:
- Normal variation in answers (less "robotic")
- Avoids prefill/encoding phase cost.
- Very fast Time-to-First-Token (TTFT).
- Time-dependent and tool-requiring queries handled.
Global KV caching still avoids significant computation by skipping the entire prefill or encoding phase. But it's not as large a speedup as a full inference cache, since the decoding phase must still be run. On the other hand, avoiding prefill means almost no delay before the first output token, so the user sees a very fast response time. Decoding won't be any more efficient, but decoding in interactive applications like chatbots only needs to be faster than users can read, which is quite slow. Users are much more sensitive to the initial delay from prefill than to the speed of decoding.
Since the decoding phase still occurs, this also means that some randomness arises from the decoding algorithm (e.g. top-k decoding). Hence, the answers are varied, as they would be if the same query were typed twice without any cache.
Global KV caching also avoids many of the problems with time-dependent or other problematic queries. The full decoding phase is still executed, which presumably also generates any tool invocations in the final output. In other words, tool execution would still occur after the KV cache has been loaded, either during or after the decoding phase.
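As an illustration, here is a minimal C++ sketch of a global KV cache keyed on the exact query text: on a hit, the stored KV data is loaded and prefill is skipped entirely; on a miss, prefill runs once and its KV data is stored for later queries. The KVCacheData type and the run_prefill/run_decode functions are hypothetical stand-ins, not any serving engine's actual API.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for the per-layer K and V data produced by prefill.
struct KVCacheData {
    std::vector<std::vector<float>> keys;    // [layer][token * head_dim]
    std::vector<std::vector<float>> values;
};

// Dummy stand-ins for the two inference phases.
KVCacheData run_prefill(const std::string& /*prompt*/) { return KVCacheData{}; }
std::string run_decode(const std::string& prompt, const KVCacheData& /*kv*/) {
    return "decoded answer for: " + prompt;  // dummy output
}

static std::unordered_map<std::string, KVCacheData> g_global_kv_cache;

std::string answer(const std::string& prompt) {
    auto it = g_global_kv_cache.find(prompt);
    if (it == g_global_kv_cache.end()) {
        // Cache miss: pay the prefill cost once, store the KV data for reuse.
        it = g_global_kv_cache.emplace(prompt, run_prefill(prompt)).first;
    }
    // Decoding still runs on every query, so sampling variability is preserved.
    return run_decode(prompt, it->second);
}
```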
Frame skipping in video. Another use case for a full inference cache is where the input is similar or continuous. This is typically the case in image processing for machine vision (e.g. self-driving cars) or video analysis (e.g. security camera monitoring). There are many frames per second and they are often not very different. In such cases, the cached inference results from the previous image or frame can often be re-used with modifications for a faster incremental algorithm (see above section), or alternatively, the entire result from the previous frame can simply replace the current inference computation (i.e. "frame skipping").
Research on Inference Cache
Research papers on inference cache optimizations, where the results of entire inference operations are cached: