Aussie AI
Caching and Reuse Optimizations
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Transformer Caching Optimizations
Transformer caching optimizations use datastores of cached information to speed up LLM inference. There are several ways that Transformer architectures can exploit caching, falling into two main categories:
- Outside the Transformer — Inference cache, Semantic cache, RAG caching, etc.
- Inside the Transformer — KV caching (basic autoregressive decoding), global KV caching, etc.
When working outside the Transformer, the goal of the cache is to avoid doing any LLM work. Instead, hit the cache, get the answers, and send the answers back to the user. No LLM weights are harmed in this process!
The types of "outside" caching include:
- Inference cache — text to text: for a given input prompt, find the result text in the cache and output it (see the sketch after this list).
- Inference cache of logits.
- Semantic cache — given a query, find a close-enough query in the cache, output the results text.
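To make the "outside" idea concrete, here is a minimal C++ sketch of an exact-match inference cache: a hash map from the full prompt text to the previously generated response. The run_llm_inference function is a hypothetical stand-in for a real LLM call, not any particular API.

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Hypothetical stand-in for a real LLM inference call.
    std::string run_llm_inference(const std::string &prompt) {
        return "LLM answer for: " + prompt;  // placeholder response
    }

    // Exact-match inference cache: prompt text -> cached response text.
    class InferenceCache {
    public:
        std::string query(const std::string &prompt) {
            auto it = cache_.find(prompt);
            if (it != cache_.end()) {
                return it->second;  // cache hit: no LLM work at all
            }
            std::string answer = run_llm_inference(prompt);  // cache miss
            cache_[prompt] = answer;  // store for next time
            return answer;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };

    int main() {
        InferenceCache cache;
        std::cout << cache.query("What is KV caching?") << "\n";  // miss: runs the LLM
        std::cout << cache.query("What is KV caching?") << "\n";  // hit: returned from the cache
    }

A semantic cache replaces the exact string key with a nearest-neighbor lookup over embedding vectors, returning the cached answer whenever a stored query is close enough to the new one.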
Single-query KV caching: When using caching internal to the Transformer inference phase, the goal is to speed up the QKV calculations in the attention heads. In a decoder-only architecture (e.g., GPT-style), another goal is to speed up the "prefill" phase, which is the initial delay before the first token is output.
The main type of "internal" KV caching is the kind that works only within the current query (and is then discarded):
- KV caching — basic internal KV vector caching during the auto-regressive decoding phase (one token at a time).
KV Cache Reuse (Multi-query KV caching): There are also several types of KV cache, called "KV cache reuse" or "global KV caching," that work across multiple queries. The idea is similar to an inference cache, which caches the text results for a single input query, but instead we cache just the KV cache data. With KV cache reuse, when we hit a query we've seen before, we don't have the output, but we do have the preliminary KV data, so we can skip the "prefill" phase and immediately start outputting the first word of our response. Types of these "global KV cache" methods include:
- Global KV caching (basic text version)
- Global KV caching (token sequence version) (also called "context caching")
- Global KV caching (semantic embeddings vector version)
- Prefix global KV caching
- Fused global KV caching
Note that global KV caching involves lookup of a query to find its corresponding KV cache data, which means that this idea can be used with: (a) inference caching text-based lookup, (b) token-based lookup in the inference cache, or (c) semantic caching and vector-based lookup. However, the prefix and fused global KV caching variants rely on token/text matches, so cannot be used with semantic caching (or can they?).
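As an illustration of the token-sequence lookup variant of global KV caching, here is a simplified C++ sketch in which the KV data is treated as an opaque blob, and compute_prefill_kv and decode_with_kv are hypothetical stand-ins for the real prefill and decoding phases.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using TokenSeq = std::vector<int>;
    struct KVCacheData { std::vector<float> blob; };  // opaque stand-in for layer-wise K/V tensors

    // Hypothetical stand-ins for the real inference engine.
    KVCacheData compute_prefill_kv(const TokenSeq &prompt) { return KVCacheData{ {0.0f} }; }
    std::string decode_with_kv(const TokenSeq &prompt, const KVCacheData &kv) { return "response"; }

    // Global KV cache: maps an exact prompt token sequence to its prefill KV data.
    std::map<TokenSeq, KVCacheData> g_kv_cache;

    std::string answer_query(const TokenSeq &prompt_tokens) {
        auto it = g_kv_cache.find(prompt_tokens);
        if (it == g_kv_cache.end()) {
            // Miss: run the full prefill phase and store the KV data for later reuse.
            it = g_kv_cache.emplace(prompt_tokens, compute_prefill_kv(prompt_tokens)).first;
        }
        // Decoding starts from the stored KV data, so a hit skips the prefill latency entirely.
        return decode_with_kv(prompt_tokens, it->second);
    }

    int main() {
        TokenSeq prompt = {101, 2054, 2003, 1029};
        std::cout << answer_query(prompt) << "\n";  // first call: prefill + decode
        std::cout << answer_query(prompt) << "\n";  // second call: decode only
    }

Prefix global KV caching generalizes this exact-match lookup to the longest matching prefix of the prompt tokens, so that only the unmatched suffix needs prefill computation.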
Low-memory KV caching: For both the local and global KV caches, one of the major problems with KV caches is that they grow too large. There are several techniques to cut down the size of the KV cache in memory (or on disk):
- Memory-efficient QKV attention algorithms — Flash attention, Paged attention, Local attention, Linear attention, etc.
- KV cache quantization
- KV cache compression — e.g. sparsity/pruning of the KV cache, KV cache layer fusion, and other variants.
- KV cache eviction (see the sliding-window sketch after this list)
- KV data recomputation — don't cache!
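As a sketch of the simplest eviction policy, here is a bounded per-layer KV cache in C++ that keeps only the most recent tokens (a sliding window). The class and its interface are illustrative only; real eviction schemes often also retain "sink" tokens or select tokens to keep by attention-score importance.

    #include <deque>
    #include <iostream>
    #include <vector>

    // One attention layer's KV cache with a fixed token budget.
    // On overflow, the oldest token's K/V vectors are evicted (simple sliding window).
    class BoundedKVCache {
    public:
        explicit BoundedKVCache(size_t max_tokens) : max_tokens_(max_tokens) {}

        void append(const std::vector<float> &k, const std::vector<float> &v) {
            if (keys_.size() == max_tokens_) {  // cache full: evict the oldest token
                keys_.pop_front();
                values_.pop_front();
            }
            keys_.push_back(k);
            values_.push_back(v);
        }

        size_t size() const { return keys_.size(); }

    private:
        size_t max_tokens_;
        std::deque<std::vector<float>> keys_;    // one K vector per cached token
        std::deque<std::vector<float>> values_;  // one V vector per cached token
    };

    int main() {
        BoundedKVCache cache(/*max_tokens=*/1024);
        for (int t = 0; t < 5000; ++t) {
            cache.append(std::vector<float>(128, 0.1f), std::vector<float>(128, 0.2f));
        }
        std::cout << "Cached tokens: " << cache.size() << "\n";  // capped at 1024
    }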
RAG architecture caching: The special features of RAG architectures can use caching in various ways:
- Datastore retrieval caching — the usual database type caching.
- Chunk-specific global KV caching — prefix global KV caching or fused global KV caching.
Chatbot architecture caching: Chatbot conversations have an interesting property: the current query and response become the input context for the next query, via the "conversation history" used as prompt context. This means that the KV cache data at the end of one answer is effectively the global KV caching data for the next cycle. This idea is similar to "prefix global KV caching" but is a chatbot-specific property.
For on-device chatbots in particular, this means that the KV cache data is immediately available, and prefill latency can be avoided. For data center chatbot architectures, there is a difficulty in mapping the KV cache data across multi-user sessions, although it can be worth doing, since the latency of prefill is also avoided in this manner for cloud architectures.
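Here is a simplified C++ sketch of the per-session idea, where the KV data is an opaque structure kept alive between turns, and extend_kv and decode_answer are hypothetical stand-ins for the real prefill and decoding steps.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct KVCacheData {
        std::vector<float> blob;    // opaque stand-in for layer-wise K/V tensors
        size_t tokens_covered = 0;  // how many context tokens the KV data covers
    };

    // Hypothetical stand-ins for the real inference engine.
    void extend_kv(KVCacheData &kv, const std::vector<int> &new_tokens) {
        kv.tokens_covered += new_tokens.size();  // prefill covers only the new tokens
    }
    std::vector<int> decode_answer(KVCacheData &kv) {
        std::vector<int> answer = {42, 43};
        kv.tokens_covered += answer.size();      // decoding also extends the KV data
        return answer;
    }

    // One KV cache per chat session, kept alive between turns.
    std::unordered_map<std::string, KVCacheData> g_session_kv;

    std::vector<int> chat_turn(const std::string &session_id, const std::vector<int> &user_tokens) {
        KVCacheData &kv = g_session_kv[session_id];  // reuse the KV data from prior turns
        extend_kv(kv, user_tokens);                  // only the new message needs prefill
        return decode_answer(kv);
    }

    int main() {
        chat_turn("session-1", {1, 2, 3});  // first turn: prefill from scratch
        chat_turn("session-1", {4, 5});     // later turn: conversation history already cached
        std::cout << "Context tokens cached: " << g_session_kv["session-1"].tokens_covered << "\n";
    }

Keeping the session's KV cache resident means each new turn only pays prefill cost for the new user message, not for the whole conversation history.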
General Caching Theory
Caching is the general optimization method where computed results are stored and re-used instead of repeating a later computation. Generally, the idea is to trade off use of extra memory in order to save on execution time. This mainly works if the same exact computations are being repeated, but can also work for repetitions of similar near-identical computations. In the literature, caching algorithms for neural networks are also called "memoization", "data re-use" or "computation re-use" algorithms.
One low-level version of caching is called "common subexpression elimination" which uses a temporary variable to eliminate any instances where the same calculation is done twice. Such optimizations are usually automated by compilers in modern programming. Another type of caching is where a loop with a repeatedly calculated value is modified to bring the calculation out in front of the loop, thereby calculating it only once.
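For example, here is a small C++ illustration of hoisting a loop-invariant common subexpression out of a loop, which is the same caching principle in miniature:

    #include <cmath>
    #include <vector>

    // Before: the invariant expression sqrt(scale) is recomputed on every iteration.
    void scale_naive(std::vector<float> &v, float scale) {
        for (size_t i = 0; i < v.size(); ++i) {
            v[i] = v[i] * std::sqrt(scale);  // same sqrt computed v.size() times
        }
    }

    // After: the common subexpression is hoisted out and computed once (a tiny cache).
    void scale_hoisted(std::vector<float> &v, float scale) {
        const float s = std::sqrt(scale);    // computed once, reused every iteration
        for (size_t i = 0; i < v.size(); ++i) {
            v[i] = v[i] * s;
        }
    }

    int main() {
        std::vector<float> v(1000, 1.0f);
        scale_naive(v, 2.0f);
        scale_hoisted(v, 2.0f);
    }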
Various optimizations to Transformers have involved caching. For example, it was discovered that some of the K and V tensor calculations could be cached between tokens, thereby avoiding repeated matrix computations in the usual autoregressive model. See Transformer optimizations.
Intermediate-level caching and computation reuse can be done at the vector dot product level. By detecting when similar vectors have been calculated before, such as using Locality-Sensitive Hashing (LSH), the cache results can be accessed and re-used instead.
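Here is a minimal C++ sketch of this idea using a sign-based LSH (SimHash): activation vectors that hash to the same bucket reuse a previously computed dot product. For simplicity it assumes a single fixed weight vector; a real memoization scheme would also key the cache on which weight row is involved and would bound the approximation error.

    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <unordered_map>
    #include <vector>

    // Sign-based LSH: one bit per random hyperplane; similar vectors tend to share a hash.
    struct SimHash {
        std::vector<std::vector<float>> planes;
        SimHash(size_t dim, size_t bits, unsigned seed = 42) {
            std::mt19937 rng(seed);
            std::normal_distribution<float> dist(0.0f, 1.0f);
            planes.assign(bits, std::vector<float>(dim));
            for (auto &p : planes)
                for (auto &x : p) x = dist(rng);
        }
        uint64_t hash(const std::vector<float> &v) const {
            uint64_t h = 0;
            for (size_t b = 0; b < planes.size(); ++b) {
                float proj = 0.0f;
                for (size_t i = 0; i < v.size(); ++i) proj += planes[b][i] * v[i];
                if (proj > 0.0f) h |= (uint64_t{1} << b);
            }
            return h;
        }
    };

    float dot(const std::vector<float> &a, const std::vector<float> &b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    int main() {
        const size_t dim = 64;
        SimHash lsh(dim, /*bits=*/16);
        std::unordered_map<uint64_t, float> cache;  // hash bucket -> cached dot product
        std::vector<float> weights(dim, 0.5f), activation(dim, 0.25f);

        uint64_t key = lsh.hash(activation);
        auto it = cache.find(key);
        float result = (it != cache.end())
            ? it->second                                 // reuse: a similar vector was seen before
            : (cache[key] = dot(weights, activation));   // compute and memoize
        std::cout << result << "\n";
    }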
Caching can also be done at the highest level with model inference. Incremental caching of full inference results can be used with "input similarity," such as analyzing the frames of a video. Inference results can also be cached across multiple queries from multiple users. When the entire result of an inference calculation is saved and re-used, the optimization is called an "Inference Cache".
Caching and computation reuse are a type of dynamic inference optimization. Comparable dynamic data and computation efficiency strategies include zero skipping, weight sharing, and layer fusion.
Transformer Calculation Caching
Transformer calculation caching is the storage of computations for later reuse in memory or on disk, as a way to optimize LLM inference. Experience has shown that some computations performed in a vanilla Transformer can be cached for faster inference. This is often called "computation reuse" in the literature; see also Transformer architectures and Transformer optimizations. Papers that discuss these Transformer caching methods are below:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference (Various Transformer optimization techniques are suggested, including caching of attention head matrix computations from already-processed tokens to reduce auto-regression costs, i.e. auto-regressive KV caching.)
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the suggested optimizations is caching of computations of K and V tensors in the attention head logic.)
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019, https://arxiv.org/abs/1904.01038, Code: https://github.com/pytorch/fairseq (Includes caching of model states from previously generated tokens.)
- Lucas D. Lingle, Sep 2023, Transformer-VQ: Linear-Time Transformers via Vector Quantization https://arxiv.org/abs/2309.16354, Code: https://github.com/transformer-vq/transformer_vq (Uses a "long range cache" in attention optimization.)
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses a "rolling buffer cache" method.)
KV Caching
KV caching is storing the results of the K and V vector operations that are performed in Transformer attention heads for LLM inference optimization. The data from the KV cache can be used to optimize the attention of the current query or multiple future queries. Analysis of the vanilla Transformer by researchers has discovered at least two distinct ways to cache these results.
Autoregressive decoder KV caching: This is in-memory caching during one query as the decoder processes multiple tokens. Partial KV tensor operations can be cached in memory during decoding, across the processing of multiple tokens, avoiding re-computations in autoregressive mode. In autoregressive decoder mode, the K and V computations for the newest token must still be done fresh, but all KV-related calculations for the prior tokens can come from the cache. This is a subtype of autoregression optimization.
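Here is a simplified C++ sketch of a single attention head with a per-query KV cache: at each decoding step, only the new token's K and V vectors are computed and appended, and the new query attends over all of the cached keys and values. Real implementations operate on batched tensors with GPU kernels; this shows only the scalar logic.

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    using Vec = std::vector<float>;

    float dot(const Vec &a, const Vec &b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // One attention head with a per-query KV cache. For each new token, only its own
    // K and V vectors are appended; earlier tokens' K/V come straight from the cache.
    class CachedAttentionHead {
    public:
        Vec step(const Vec &q_new, const Vec &k_new, const Vec &v_new) {
            k_cache_.push_back(k_new);  // cached, never recomputed for later tokens
            v_cache_.push_back(v_new);

            // Attention scores of the new query against all cached keys.
            const float scale = 1.0f / std::sqrt(static_cast<float>(q_new.size()));
            std::vector<float> scores(k_cache_.size());
            float max_s = -1e30f;
            for (size_t t = 0; t < k_cache_.size(); ++t) {
                scores[t] = dot(q_new, k_cache_[t]) * scale;
                max_s = std::max(max_s, scores[t]);
            }
            // Softmax over the scores, then a weighted sum of the cached values.
            float denom = 0.0f;
            for (float &s : scores) { s = std::exp(s - max_s); denom += s; }
            Vec out(v_new.size(), 0.0f);
            for (size_t t = 0; t < v_cache_.size(); ++t)
                for (size_t i = 0; i < out.size(); ++i)
                    out[i] += (scores[t] / denom) * v_cache_[t][i];
            return out;
        }
    private:
        std::vector<Vec> k_cache_;
        std::vector<Vec> v_cache_;
    };

    int main() {
        CachedAttentionHead head;
        Vec q(8, 0.1f), k(8, 0.2f), v(8, 0.3f);
        for (int t = 0; t < 3; ++t) {
            Vec out = head.step(q, k, v);  // each decoding step reuses all prior K/V
            std::cout << "step " << t << ": out[0]=" << out[0] << "\n";
        }
    }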
Research papers on autoregressive decoder KV caching:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference (Various Transformer optimization techniques are suggested, including caching of attention head matrix computations from already-processed tokens to reduce auto-regression costs, i.e. auto-regressive KV caching.)
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the suggested optimizations is caching of computations of K and V tensors in the attention head logic, in autoregressive mode.)
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, https://arxiv.org/abs/2211.05102 (Includes some discussion relevant to KV caching, but it is mostly in relation to operator fusion.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation (Includes a detailed analysis of the QKV tensors in relation to optimizations, with relevance to KV caching, kernel operator fusion, and matrix algebra optimizations.)
- Chen, Carol, Transformer Inference Arithmetic 2022-03-30, kipply's blog, https://kipp.ly/transformer-inference-arithmetic/ (Covers multiple optimization methods including KV caching and sparse QKV attention layers.)
- Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ (Multiple Transformer optimization methods, with some relevance to in-memory KV caching.)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-query attention shares KV tensors across multiple attention heads, which isn't exactly KV caching, but is in the same ballpark.)
- Dipkumar Patel, February 12, 2023, Speeding up the GPT - KV cache, https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/ (Useful article specifically on KV caching.)
- A Ouyang, June 2023, Understanding the Performance of Transformer Inference, Masters Thesis, Electrical Engineering and Computer Science, MIT, https://dspace.mit.edu/handle/1721.1/151543, https://dspace.mit.edu/bitstream/handle/1721.1/151543/ouyang-aouyang-meng-eecs-2023-thesis.pdf?sequence=1&isAllowed=y (Detailed analysis of Transformer performance optimizations, including the technique of autoregressive KV caching during decoding.)
- Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao, Easy and Efficient Transformer : Scalable Inference Solution For large NLP model, May 2022, https://arxiv.org/abs/2104.12470
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen, 2023, Deja vu: Contextual sparsity for efficient LLMs at inference time, Proceedings of the 40th International Conference on Machine Learning, PMLR 202:22137-22176, 2023. https://proceedings.mlr.press/v202/liu23am.html, PDF: https://proceedings.mlr.press/v202/liu23am/liu23am.pdf
- Y Jin, CF Wu, D Brooks, GY Wei, 2023, S3: Increasing GPU Utilization during Generative Inference for Higher Throughput, arXiv preprint arXiv:2306.06000, https://arxiv.org/abs/2306.06000
- Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee, July 2023, SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, https://arxiv.org/abs/2307.02628 (Early exit for non-autoregression, with consideration of the KV cache.)
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, May 2023, GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245 (Some discussion of KV caching in the context of multi-query attention.)
- Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei, July 2023, Retentive Network: A Successor to Transformer for Large Language Models https://arxiv.org/abs/2307.08621, Code: https://aka.ms/retnet (Some analysis of KV cache memory usage, but not a primary focus of the paper.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453 (Sliding window KV caching.)
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations.)
- Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo, 4 Jun 2024 (v2), Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution, https://arxiv.org/abs/2406.00059 Code: https://github.com/conveyor-sys/conveyor (Speeding up inference by partially running tools in parallel to the LLM query processing, rather than sequentially after the LLM request, by detecting tool requests deep inside the decoding algorithm and starting them off immediately, before the LLM has finished generating the fully decoded output.)
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- X Wu, L Zhang, Y Wang, Y Ren, M Hack, 2016, zExpander: a Key-Value Cache with both High Performance and Fewer Misses, EuroSys ’16 April 18–21, 2016, London, United Kingdom, https://ranger.uta.edu/~sjiang/pubs/papers/wu16_zExpander.pdf (General theory paper about prefix key-value caching in a trie or binary tree, not specific to neural networks.)
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 Code: https://github.com/niyunsheng/EMS-SD (Speculative decoding across multiple queries by avoiding padding tokens and optimizing the KV cache.)
- J. Lin, S. Q. Zhang and A. Leon-Garcia, 2024, sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Structures, 2024 25th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2024, pp. 1-6, doi: 10.1109/ISQED60706.2024.10528703. https://ieeexplore.ieee.org/abstract/document/10528703 (Optimize the global KV cache by sharing it across multiple queries.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Minsik Cho, Mohammad Rastegari, Devang Naik, 8 May 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://arxiv.org/abs/2405.05329 (Parallelization of KV cache generation in prefill phase.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
- Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
- Jesus Rodriguez, Apr 22, 2024, Some Technical Notes About Llama 3: New tokenizer, optimized pretraining and some other details about Meta AI’s new model, Towards AI, https://pub.towardsai.net/some-technical-notes-about-llama-3-042c0b19db14
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 22 Apr 2024, SnapKV: LLM Knows What You are Looking for Before Generation, https://arxiv.org/abs/2404.14469
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic, 4 Mar 2024, DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving, https://arxiv.org/abs/2403.01876
- 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
- Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu, 18 Mar 2024, LLM as a System Service on Mobile Devices, https://arxiv.org/abs/2403.11805 (On-device inference for LLMs, including a stateful on-device AI service LLMaaS, including Llama2 7B and OPT-7B with INT8 quantization, based on improved KV caching on mobile, with pipelining, recomputation and chunk-level KV cache memory management for running on phones.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, Nov 2023, Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934 (Unique and insightful advance of generalizing KV caching to multiple prompts by computing a cache for short "segments" of prompts, including methods to adjust the different KV cache values for text segments that appear in different positions of the overall prompt.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, Nov 2023, Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263
- Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, Ricardo Bianchini, 30 Nov 2023, Splitwise: Efficient generative LLM inference using phase splitting, https://arxiv.org/abs/2311.18677 (Separates the two Transformer phases of initial prompt computation or prefill to generate the KV cache, and the token generation phase or decoding algorithm onto two machines.)
- Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, Dec 2023, EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 Code: https://github.com/pan-x-c/EE-LLM
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu, 2023, KIVI : Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization, https://www.researchgate.net/profile/Zirui-Liu-29/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization/links/658b5d282468df72d3db3280/KIVI-Plug-and-play-2bit-KV-Cache-Quantization-with-Streaming-Asymmetric-Quantization.pdf (Explores quantization of values stored in the KV cache as a way to maintain a smaller KV caching and reduce memory storage requirements.)
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and highly optimized kernel with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on Llama2 3B to 20B with 20-80ms latency per token.)
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan, 2 Mar 2024, IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact, https://arxiv.org/abs/2403.01241
- Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti, 14 Mar 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, https://arxiv.org/abs/2403.09636 (Reducing the memory size of the KV cache.)
- Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath, 14 Mar 2024, Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, https://arxiv.org/abs/2403.09054 (Reducing KV cache in-memory size and related computations by focusing on a subset of tokens.)
- Haim Barad, Ekaterina Aidova, Yury Gorbachev, Nov 2023, Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO, https://arxiv.org/abs/2311.04951 Code: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/266-speculative-sampling
- Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin, 5 Jan 2024, Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, https://arxiv.org/abs/2401.02669 (Long context processing by a modification to the QKV caching mechanisms.)
- Pierre Lienhart, Jan 16, 2024, LLM Inference Series: 4. KV caching, a deeper look, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- MartinLwx, Oct 2023 LLM inference optimization - KV Cache, https://martinlwx.github.io/en/llm-inference-optimization-kv-cache/
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 30 May 2024 (v3), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 (Reduces the number of memory accesses for attention computations and the KV cache.)
- Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li, 17 Apr 2023 (v2), AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems, https://arxiv.org/abs/2301.09262
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
- H. Face, “Transformers,” https://github.com/huggingface/transformers.
- NVIDIA, “FasterTransformer,” https://github.com/NVIDIA/FasterTransformer.
- G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu
- Mistral AI, https://github.com/mistralai/mistral-src
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2024, INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference, https://openreview.net/pdf?id=wDDGQabYPQ
- The White Box, April 7, 2024, KV Cache, ChatGPT’s Memory, https://thewhitebox.ai/kv-cache-chatgpts-memory/
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Natarajan Vaidhyanathan Mar 7, 2024, How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100, https://www.qualcomm.com/developer/blog/2024/03/how-quadruple-llm-decoding-performance-speculative-decoding-spd-and-microscaling-mx-formats
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Hailin Zhang, Xupeng Miao, Xiaodong Ji, Xiaonan Nie Yilin Chen, Weipeng Chen, Fangcheng Fu, Bin Cui, 2024, PQCache: Product Quantization-based KVCache for Long Context LLM Inference, https://hugozhl.github.io/files/PQCache.pdf
- Minsik Cho, Mohammad Rastegari, Devang Naik, 25 Jul 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://proceedings.mlr.press/v235/cho24e.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/cho24e/cho24e.pdf
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 13 Aug 2024 (v3), Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003 https://github.com/zcli-charlie/Awesome-KV%20Cache
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- R Parthasarathy, R Shuttleworth, Sep 2024, Analyzing Inference Optimizations for Transformers, 6.5930 Final Project Report, https://reeceshuttle.me/assets/6.5930_Project.pdf
- David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Sowoong Kim, Eunyeong Sim, Youngsam Shin, YeonGon Cho, and Woongki Baek. 2024. Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU. In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT '24). Association for Computing Machinery, New York, NY, USA, 78–90. https://doi.org/10.1145/3656019.3676945 https://dl.acm.org/doi/abs/10.1145/3656019.3676945
KV Caching in Early Exit
KV caching in early exit refers to the need to recompute KV data when layers have been exited or skipped during LLM inference. The use of "early exit" (dynamic layer pruning) or layer skipping causes a problem with the KV cache: unexecuted layers end up with the wrong KV cache data, because the computation has not proceeded through all the layers. A future step in the inference (i.e., the next token in autoregressive decoding) will then see the wrong KV cache, and this needs to be corrected by early exit mechanisms.
There is also the converse inefficiency of unnecessary prior KV caching computations whenever early exit is successful. If the current token exits at an earlier layer than the prior token, we have needlessly stored the KV cache for the later layers of the prior token, since those layers are now being skipped by the early exit.
Several alternative ways to correct the KV cache have been considered in the research:
- Recomputation of the KV cache: note that the cache is wrong and recompute the KV cache data in a layer when needed.
- Propagation of the KV cache: This is reusing the KV cache of the last executed layer as the stored KV cache for any layers that get skipped, which becomes a kind of approximation of the KV cache.
- Exiting/caching pattern changes: Doing the layer skipping in a way that avoids the KV cache issue altogether, the simplest being a fixed number of layers, with more complex arrangements such as monotonically decreasing exit layers.
- FFN-only partial layer early exiting: Avoiding the KV cache issue by only doing early exit on the FFN weight computations (i.e. not skipping the QKV computations in early exit when skipping a layer).
KV Recomputation. The simplest idea is to mark the cache as out-of-date so that it will be recomputed at the next execution of that layer, but the downside of this approach is that it thereby loses some of the speed advantages of early exiting. Potentially, every saving of a skipped layer in an early-exit is then undermined in the next inference iteration, if all those layers need their cache re-computed, although this won't happen every time so there is still benefit to early exit.
Fixed early exit layer count. Exiting early after a fixed number of layers avoids the KV cache out-of-date problem, because none of the tokens ever does more (or fewer) layers than the fixed count, and the KV cache is only needed for this same number of layers. A little more complex is a simple version of the "deep encoder, shallow decoder" architecture, whereby there are two fixed counts: one for the encoder (or prefill in decoder-only architectures), and another for the decoding phase. However, this simplistic exit criterion is non-adaptive to the input's complexity, has poor accuracy characteristics because it doesn't check any decision criteria at all before exiting, and is effectively the same as permanently removing layers of the model with static layer pruning.
Layer exiting orders. There are several ways to avoid a mismatch between the layers and the KV cache simply by controlling how layers are skipped. The simplest is a fixed global number of layers always early exited for all queries as above. Another way is to skip a fixed subset of the layers, not necessarily in order (i.e., layer skipping, not early exit), thereby ensuring that all tokens in an output sequence run the same layers and have the same KV cache needs. The downside is that every token gets the same computation, regardless of whether it is easy or difficult. Yet another way is to monotonically sequence the exit points, so that although they change, a later token cannot go to more layers than an earlier token, which avoids KV caching problems. This approach means that the end of an output always gets less computation than the early tokens, which may not match the real computation needs.
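As a minimal sketch of the propagation option, the following C++ fragment copies the last-executed layer's K/V vectors into the skipped layers after an early exit, so that subsequent tokens see a fully populated (if approximate) cache; the data layout is illustrative only.

    #include <iostream>
    #include <vector>

    using Vec = std::vector<float>;

    // Per-layer KV cache entries for one token position.
    struct LayerKV { Vec k, v; };

    // KV propagation: after an early exit at layer `exit_layer`, copy that layer's
    // K/V vectors into all the skipped layers, so the next token's inference sees a
    // fully populated (approximate) cache instead of out-of-date entries.
    void propagate_kv_on_exit(std::vector<LayerKV> &token_kv, size_t exit_layer) {
        for (size_t layer = exit_layer + 1; layer < token_kv.size(); ++layer) {
            token_kv[layer] = token_kv[exit_layer];  // an approximation, not a recomputation
        }
    }

    int main() {
        const size_t num_layers = 6, dim = 4;
        std::vector<LayerKV> token_kv(num_layers);
        // Suppose layers 0..2 ran before the early exit decision was made at layer 2.
        for (size_t layer = 0; layer <= 2; ++layer) {
            token_kv[layer].k = Vec(dim, 0.1f * (layer + 1));
            token_kv[layer].v = Vec(dim, 0.2f * (layer + 1));
        }
        propagate_kv_on_exit(token_kv, /*exit_layer=*/2);
        std::cout << "Layer 5 K[0] = " << token_kv[5].k[0] << "\n";  // copied from layer 2
    }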
Research on KV cache correction in early exiting. Research papers on correcting the KV cache in early exit methods include:
- Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee, July 2023, SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, https://arxiv.org/abs/2307.02628 (Early exit for non-autoregression, with consideration of the KV cache. By using a monotically decreasing exit point, this avoids the possibility of a later token's inference requiring an out-of-date KV cache.)
- Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. ArXiv, abs/1910.10073, https://arxiv.org/abs/1910.10073 (Copies the internal states of the exited layer to the later layers.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, Dec 2023, EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 Code: https://github.com/pan-x-c/EE-LLM (Examines two methods of KV cache handling with early exit, and implements KV recomputation.)
- Li, L., Wang, C., Qiu, M. et al., 2024, Accelerating BERT inference with GPU-efficient exit prediction. Front. Comput. Sci. 18, 183308 (2024). https://doi.org/10.1007/s11704-022-2341-9, https://link.springer.com/article/10.1007/s11704-022-2341-9 (Propagates hidden states to the exited layers.)
- Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler, 25 Oct 2022 (v2), Confident Adaptive Language Modeling, https://arxiv.org/abs/2207.07061 (KV propagation copies computed KV states to the exited layers.)
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Uses KV recomputation.)
- Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha, Nov 2023, DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models, https://arxiv.org/abs/2311.08623 (Uses KV recomputation.)
- Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens, based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations that discusses the KV caching problems in the section on early exit; see "Hidden States Propagation" paragraph.)
KV Cache Compression (Memory Reduction)
KV cache compression is an LLM inference optimization that shrinks the KV cache data so that it is smaller to store and faster to process. Compression methods such as quantization and pruning can be applied to the KV data, resulting in less data to be stored and fewer computations required.
This is "optimizing the optimization!" The idea of KV caching is to trade extra memory (the KV cache) for faster speed. But it's been too successful, and often requires too much memory, so there are now research papers on optimization of the memory size of the KV cache, including KV cache compression, and its subtype KV cache quantization.
Papers on KV cache compression and/or KV cache quantization:
- Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Che, 14 Feb 2024, Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference, https://arxiv.org/abs/2402.09398
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang, 7 Mar 2024, QAQ: Quality Adaptive Quantization for LLM KV Cache, https://arxiv.org/abs/2403.04643 Code: http://github.com/ClubieDong/KVCacheQuantization (Reducing the size of the KV cache using quantization.)
- Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao, 11 Mar 2024 (v2), GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM, https://arxiv.org/abs/2403.05527 Code: https://github.com/HaoKang-Timmy/GEAR (Compressing the size of the KV cache using quantization, low-rank matrices, and sparse matrix.)
- 14 Mar 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti, https://arxiv.org/abs/2403.09636 (Reducing the memory size of the KV cache.)
- Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath, 14 Mar 2024, Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, https://arxiv.org/abs/2403.09054 (Reducing KV cache in-memory size and related computations by focusing on a subset of tokens.)
- Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu, 18 Mar 2024, LLM as a System Service on Mobile Devices, https://arxiv.org/abs/2403.11805 (On-device inference for LLMs, including a stateful on-device AI service LLMaaS, including Llama2 7B and OPT-7B with INT8 quantization, based on improved KV caching on mobile, with pipelining, recomputation and chunk-level KV cache memory management for running on phones.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 (Memory management of KV caches using hierarchical cache layers.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz GUStavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Frietas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839 (KV cache is bounded, rather than growing with sequence length.)
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Examines the KV cache optimization methods used by multiple frameworks. Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Amir Zandieh, Majid Daliri, Insu Han, 5 Jun 2024, QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead, https://arxiv.org/abs/2406.03482 Code: https://github.com/amirzandieh/QJL (Using 1-bit or 2-bit KV cache quantization approach based on sign bits as estimates.)
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, arXiv preprint arXiv:2405.14366, 2024, https://arxiv.org/abs/2405.14366 (Merging KV caches with similar values across nearby layers to effectively share parts of the cache across layers for a 41% reduction in size.)
- Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 23 May 2024, ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification, https://arxiv.org/abs/2405.14256 (Quantizing the KV cache with combination of other KV cache compression methods based on salient tokens.)
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
- Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen, 21 May 2024, Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression, https://arxiv.org/abs/2405.12591
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods or to layer fusion of weights.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao, 16 Jan 2024, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. https://openreview.net/forum?id=uNrFpDPMyo
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 22 Apr 2024, SnapKV: LLM Knows What You are Looking for Before Generation, https://arxiv.org/abs/2404.14469
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic, 4 Mar 2024, DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving, https://arxiv.org/abs/2403.01876
- Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji, 2024, CaM: Cache Merging for Memory-efficient LLMs Inference, https://openreview.net/pdf?id=LCTmppB165 Code: https://github.com/zyxxmu/cam (Compressing the KV cache by merging KV data that is about to be evicted into other parts of the KV cache.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava, 2023, Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time, arXiv preprint arXiv:2305.17118, https://arxiv.org/abs/2305.17118
- Nikita Trukhanov, Ilya Soloveychik, 29 Mar 2024, Accurate Block Quantization in LLMs with Outliers, https://arxiv.org/abs/2403.20137 (Analyzes block floating point number formats in block quantization with a focus on the KV cache memory reduction, including the use of permutations to reorder tensor weight rows.)
- Pierre Lienhart, Jan 16, 2024, LLM Inference Series: 4. KV caching, a deeper look, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- Zefan Cai, Yichi Zhang, Bofei Gao, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao, 4 Jun 2024, PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling, https://arxiv.org/abs/2406.02069
- Liyuan Liu, Jianfeng Gao, May 8, 2024, LLM profiling guides KV cache optimization, Microsoft Research Blog, 12th International Conference on Learning Representations (ICLR 2024), https://www.microsoft.com/en-us/research/blog/llm-profiling-guides-kv-cache-optimization/
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao, May 2024, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, ICLR 2024, https://www.microsoft.com/en-us/research/publication/model-tells-you-what-to-discard-adaptive-kv-cache-compression-for-llms/ https://arxiv.org/pdf/2310.01801
- Noam Shazeer, 6 Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, 23 Dec 2023 (v3), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
- Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi, 8 Feb 2024, SubGen: Token Generation in Sublinear Time and Memory, https://arxiv.org/abs/2402.06082
- June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 28 Feb 2024, No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization, https://arxiv.org/abs/2402.18096
- Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Song Jiang, 18 April 2016, zExpander: a key-value cache with both high performance and fewer misses, EuroSys '16: Proceedings of the Eleventh European Conference on Computer SystemsApril 2016Article No.: 14Pages 1–15, https://dl.acm.org/doi/abs/10.1145/2901318.2901332 https://doi.org/10.1145/2901318.2901332
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao, 2023, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, arXiv preprint arXiv:2310.01801, https://arxiv.org/abs/2310.01801
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. https://arxiv.org/abs/2310.06825 arXiv preprint arXiv:2310.06825, 2023
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis, 2023, Efficient Streaming Language Models with Attention Sinks, The Twelfth International Conference on Learning Representations, https://arxiv.org/abs/2309.17453
- Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al., 2024, H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, Advances in Neural Information Processing Systems, 36, https://arxiv.org/abs/2306.14048
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056 (Examines KV cache head merging approaches for KV cache size reduction, and also examines RoPE encoding issues with relevance to fusing KV caches.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
- Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, 17 Jun 2024, A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression, https://arxiv.org/abs/2406.11430
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang, 24 Jun 2024, Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache, https://arxiv.org/abs/2406.17808 (Extends the KV cache eviction policy in sliding window attention so that the KV partially looks back further than the window.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen, 2024, Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11437-11452, https://proceedings.mlr.press/v235/dong24f.html https://raw.githubusercontent.com/mlresearch/v235/main/assets/dong24f/dong24f.pdf https://openreview.net/forum?id=uhHDhVKFMW Code: https://github.com/hdong920/LESS
- Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, 04 August 2024, CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving, ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 38 - 56, https://doi.org/10.1145/3651890.3672274 https://dl.acm.org/doi/abs/10.1145/3651890.3672274
- Giulio Corallo, Paolo Papotti, 31 Jul 2024, Finch: Prompt-guided Key-Value Cache Compression, https://arxiv.org/abs/2408.00167 (KV cache compression along the lengthwise input dimension in a blockwise KV data pruning method.)
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
- Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
- Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
- Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti, July 2024, Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:37396-37412, 2024, https://proceedings.mlr.press/v235/nawrot24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/nawrot24a/nawrot24a.pdf Code: https://github.com/NVIDIA/Megatron-LM/tree/DMC
- Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han, July 2024, QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:47901-47911, 2024, https://proceedings.mlr.press/v235/tang24l.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/tang24l/tang24l.pdf Code: https://github.com/mit-han-lab/quest
- Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo, 30 Jul 2024, ThinK: Thinner Key Cache by Query-Driven Pruning, https://arxiv.org/abs/2407.21018
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
- Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang, 16 Sep 2024, CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios, https://arxiv.org/abs/2409.10593 (KV cache compression on the "channel" or "width" dimension.)
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma, 17 Sep 2024, KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models, https://arxiv.org/abs/2409.11057
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Shiwei Gao, Youmin Chen, Jiwu Shu, Oct 2024, Fast State Restoration in LLM Serving with HCache, EuroSys '25, March 30–April 3, 2025, Rotterdam, Netherlands, https://chenyoumin1993.github.io/papers/eurosys25-hcache.pdf
- Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen, 4 Oct 2024, LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy, https://arxiv.org/abs/2410.03111
- Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
- Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin, 17 Oct 2024, SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction, https://arxiv.org/abs/2410.13846 https://github.com/sail-sg/SimLayerKV
- Anonymous authors, Oct 2024, LSH Tells You What To Discard: An Adaptive Locality-Sensitive Strategy for KV Cache Compression, ICLR 2025, https://openreview.net/pdf?id=0ZcQhdyI3n
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao, 28 Oct 2024 (v2), Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, https://arxiv.org/abs/2410.19258
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu, 29 Oct 2024, VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration, https://arxiv.org/abs/2410.23317
- B Sun, X Yu, D Tao, Nov 2024, KVSort: Drastically Improving LLM Inference Performance via KV Cache Compression, https://sc24.supercomputing.org/proceedings/poster/poster_files/post189s2-file3.pdf
- Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
- Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah, 26 Nov 2024, Attamba: Attending To Multi-Token States, https://arxiv.org/abs/2411.17685
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
- Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo, 4 Dec 2024, ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression, https://arxiv.org/abs/2412.03213
- Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu, 8 Dec 2024, XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference, https://arxiv.org/abs/2412.05896
- Xiaohuan Pei, Tao Huang, Chang Xu, 5 Dec 2024, Cross-Self KV Cache Pruning for Efficient Vision-Language Inference, https://arxiv.org/abs/2412.04652 https://github.com/TerryPei/CSP (KV cache pruning in multimodal LLMs.)
- Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh, 7 Dec 2024, Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression, https://arxiv.org/abs/2412.05693 (KV cache compression in prefill or prompt processing phase.)
KV Cache Low-Rank Matrix Factorization
KV cache low-rank matrix factorization, or KV cache decomposition, is the use of low-rank matrices to shrink the KV cache for faster LLM inference. This is another type of KV cache compression, based on low-rank matrix factorization, a well-known model compression method: a large matrix is approximated as the product of two much smaller matrices, which reduces both memory size and computation requirements.
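As a rough illustration of the general idea (a minimal sketch in NumPy, not the method of any particular paper below; the function name is my own), a cached key matrix can be approximated by a truncated SVD, and only the two small factors stored:

```python
# Minimal sketch of low-rank KV compression: approximate a cached key matrix
# K (n_tokens x d_head) by a rank-r factorization and store the two factors.
import numpy as np

def low_rank_compress(K: np.ndarray, rank: int):
    """Return factors (A, B) such that A @ B approximates K."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (n_tokens, rank)
    B = Vt[:rank, :]             # (rank, d_head)
    return A, B

# Example: 1024 cached tokens, head dimension 128, keep rank 32.
K = np.random.randn(1024, 128).astype(np.float32)
A, B = low_rank_compress(K, rank=32)
K_approx = A @ B
print("storage ratio:", (A.size + B.size) / K.size)   # ~0.28x of the original
print("relative error:", np.linalg.norm(K - K_approx) / np.linalg.norm(K))
```

Because Q K^T is approximately (Q B^T) A^T, attention scores can be computed directly from the factors without ever rebuilding the full key matrix; the same trick applies to the value cache.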
Research papers on low-rank matrix factorization applied to KV cache compression:
- Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao, 11 Mar 2024 (v2), GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, https://arxiv.org/abs/2403.05527 https://github.com/HaoKang-Timmy/GEAR
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen, 4 Oct 2024, LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy, https://arxiv.org/abs/2410.03111
KV Cache Sparsity
KV cache sparsity exploits the many zero or near-zero values in the KV data to reduce memory and computation during LLM inference. Data stored in the KV cache is often sparse, and it can be pruned in a structured or unstructured way, similar to the sparsification of model weights.
Research has shown that attention tends to be quite sparse, with some tokens having a much greater impact on attention than others. The ideas of pruning and sparsity can therefore be used on the KV cache data.
Possible types of KV cache pruning and sparsification include:
- Magnitude pruning
- Token-specific KV sparsity (lengthwise pruning of "pivotal tokens")
- Layer-wise pruning (or layer-wise fusion of KV layer-specific data, akin to "layer fusion" of weights)
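As a rough sketch of the token-level magnitude pruning idea above (illustrative only; this is not the exact algorithm of any cited paper, and the function name is my own), the cached K/V rows of tokens whose value vectors have small L2 norms can simply be dropped:

```python
# Magnitude-based KV sparsification sketch: keep only the KV rows (tokens)
# whose value vectors have the largest L2 norms, drop the rest.
import numpy as np

def sparsify_kv(K: np.ndarray, V: np.ndarray, keep_ratio: float = 0.5):
    """Return the retained K/V rows and their original token indices."""
    norms = np.linalg.norm(V, axis=-1)                # one norm per cached token
    n_keep = max(1, int(len(norms) * keep_ratio))
    keep_idx = np.sort(np.argsort(norms)[-n_keep:])   # indices of retained tokens
    return K[keep_idx], V[keep_idx], keep_idx

K = np.random.randn(1024, 128).astype(np.float32)     # cached keys
V = np.random.randn(1024, 128).astype(np.float32)     # cached values
K_small, V_small, kept = sparsify_kv(K, V, keep_ratio=0.25)
print(K_small.shape, V_small.shape)                   # (256, 128) (256, 128)
```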
Papers on KV cache compression via pruning or sparsity (that is, structured or unstructured pruning applied to the KV cache data) include:
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453 (Sliding window KV caching.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
KV Cache Token Pruning
KV cache token pruning is the idea of token pruning (lengthwise pruning) applied to the KV cache data for faster LLM inference. Unimportant tokens can have their KV cache data removed (pruned) or fused with another token's KV data. This is based on the idea of "pivotal tokens" or "salient tokens": not all tokens are equally important to the output.
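A simplified sketch of the general approach (loosely in the spirit of the "heavy hitter" and pivotal-token papers below, not an exact reimplementation of any of them; the function name is my own) is to keep only the tokens with the highest accumulated attention scores:

```python
# Attention-score-based KV token pruning sketch: prune cached tokens whose
# accumulated attention weight over recent decoding steps is low.
import numpy as np

def prune_by_attention(K, V, attn_weights, budget):
    """attn_weights: (n_queries, n_tokens) softmax attention from recent steps."""
    importance = attn_weights.sum(axis=0)              # accumulated score per cached token
    keep_idx = np.sort(np.argsort(importance)[-budget:])
    return K[keep_idx], V[keep_idx]

n_tokens, d = 512, 64
K = np.random.randn(n_tokens, d)
V = np.random.randn(n_tokens, d)
attn = np.random.rand(16, n_tokens)                    # stand-in for real attention scores
attn /= attn.sum(axis=-1, keepdims=True)
K2, V2 = prune_by_attention(K, V, attn, budget=128)
print(K2.shape)                                        # (128, 64)
```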
Research papers on KV cache token pruning:
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry)Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Hailin Zhang, Xupeng Miao, Xiaodong Ji, Xiaonan Nie Yilin Chen, Weipeng Chen, Fangcheng Fu, Bin Cui, 2024, PQCache: Product Quantization-based KVCache for Long Context LLM Inference, https://hugozhl.github.io/files/PQCache.pdf
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Giulio Corallo, Paolo Papotti, 31 Jul 2024, Finch: Prompt-guided Key-Value Cache Compression, https://arxiv.org/abs/2408.00167 (KV cache compression along the lengthwise input dimension in a blockwise KV data pruning method.)
- Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han, July 2024, QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:47901-47911, 2024, https://proceedings.mlr.press/v235/tang24l.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/tang24l/tang24l.pdf Code: https://github.com/mit-han-lab/quest
- Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo, 30 Jul 2024, ThinK: Thinner Key Cache by Query-Driven Pruning, https://arxiv.org/abs/2407.21018
- Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang, 2 Oct 2024, InfiniPot: Infinite Context Processing on Memory-Constrained LLMs, https://arxiv.org/abs/2410.01518 (Length-wise KV cache pruning by analyzing token importance.)
- Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng, 15 Oct 2024, In-context KV-Cache Eviction for LLMs via Attention-Gate, https://arxiv.org/abs/2410.12876
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong, 5 Nov 2024, TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection, https://arxiv.org/abs/2411.02886
- Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
KV Cache Eviction
KV cache eviction is an LLM inference optimization that limits the KV cache size by evicting tokens. An eviction policy ensures that the KV cache does not grow too large, which is a bottleneck for processing long token sequences.
One way to compress your KV cache is simply to stop it from getting too big. Denied! This is done with an "eviction policy," whereby some of the cached KV data gets evicted from the cache according to some criterion.
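As a minimal sketch of one possible eviction policy (in the general spirit of the sliding-window and attention-sink approaches cited below, but not an exact copy of any of them; the function name is my own): keep the first few "sink" tokens plus a window of the most recent tokens, and evict everything else once the cache exceeds its budget:

```python
# Fixed-budget KV cache eviction sketch: retain a few initial "sink" tokens
# plus the most recent window of tokens, evict the middle of the sequence.
import numpy as np

def evict(K, V, n_sink=4, window=256):
    n = K.shape[0]
    if n <= n_sink + window:
        return K, V                                    # under budget: nothing to evict
    keep = np.concatenate([np.arange(n_sink), np.arange(n - window, n)])
    return K[keep], V[keep]

K = np.random.randn(1000, 64)
V = np.random.randn(1000, 64)
K2, V2 = evict(K, V)
print(K2.shape)                                        # (260, 64): 4 sinks + 256 recent tokens
```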
Research papers on KV cache eviction policies:
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji, 2024, CaM: Cache Merging for Memory-efficient LLMs Inference, https://openreview.net/pdf?id=LCTmppB165 Code: https://github.com/zyxxmu/cam (Compressing the KV cache by merging KV data that is about to be evicted into other parts of the KV cache.)
- Liyuan Liu, Jianfeng Gao, May 8, 2024, LLM profiling guides KV cache optimization, Microsoft Research Blog, 12th International Conference on Learning Representations (ICLR 2024), https://www.microsoft.com/en-us/research/blog/llm-profiling-guides-kv-cache-optimization/
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao, May 2024, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, ICLR 2024, https://www.microsoft.com/en-us/research/publication/model-tells-you-what-to-discard-adaptive-kv-cache-compression-for-llms/ https://arxiv.org/pdf/2310.01801
- June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 28 Feb 2024, No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization, https://arxiv.org/abs/2402.18096
- Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao, 2023, Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, arXiv preprint arXiv:2310.01801, https://arxiv.org/abs/2310.01801
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. https://arxiv.org/abs/2310.06825 arXiv preprint arXiv:2310.06825, 2023
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava, 2024, Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time, Advances in Neural Information Processing Systems, 36, https://arxiv.org/abs/2305.17118
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis, 2023, Efficient Streaming Language Models with Attention Sinks, The Twelfth International Conference on Learning Representations, https://arxiv.org/abs/2309.17453
- Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al., 2024, H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, Advances in Neural Information Processing Systems, 36, https://arxiv.org/abs/2306.14048
- Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
- Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang, 24 Jun 2024, Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache, https://arxiv.org/abs/2406.17808 (Extends the KV cache eviction policy in sliding window attention so that the KV partially looks back further than the window.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen, 2024, Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:11437-11452, https://proceedings.mlr.press/v235/dong24f.html https://raw.githubusercontent.com/mlresearch/v235/main/assets/dong24f/dong24f.pdf https://openreview.net/forum?id=uhHDhVKFMW Code: https://github.com/hdong920/LESS
- Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu, 8 Aug 2024 (v2), NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time, https://arxiv.org/abs/2408.03675 Code: https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL
- Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou, 16 Aug 2024 (v3), Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference, https://arxiv.org/abs/2407.11550
- vLLM, 2024, Performance and Tuning, https://docs.vllm.ai/en/latest/models/performance.html
- Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu, 25 Jul 2024, Efficient Inference of Vision Instruction-Following Models with Elastic Cache, https://arxiv.org/abs/2407.18121 https://github.com/liuzuyan/ElasticCache
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng, 15 Oct 2024, In-context KV-Cache Eviction for LLMs via Attention-Gate, https://arxiv.org/abs/2410.12876
- Anonymous authors, Oct 2024, LSH Tells You What To Discard: An Adaptive Locality-Sensitive Strategy for KV Cache Compression, ICLR 2025, https://openreview.net/pdf?id=0ZcQhdyI3n
- Z. Xu and J. Wu, "Crowd: An KV Cache Eviction Policy Which Uses Crowd Information to Select Evicted Key-Value Pairs," 2024 4th International Conference on Computer Science and Blockchain (CCSB), Shenzhen, China, 2024, pp. 608-612, doi: 10.1109/CCSB63463.2024.10735473. https://ieeexplore.ieee.org/abstract/document/10735473
KV Cache Quantization
KV cache quantization is an LLM inference optimization that applies quantization to the KV cache data as a type of KV cache compression. As with quantization of model weights or activations, this makes the KV cache both smaller in memory and faster to process, which speeds up inference and allows longer queries to be handled efficiently.
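A basic sketch of the idea is shown below (per-token asymmetric 8-bit quantization of a cached tensor; the papers below use more sophisticated schemes such as 2-bit, mixed-precision, or outlier-aware quantization, and the function names here are illustrative only):

```python
# Per-token asymmetric 8-bit quantization sketch for KV cache tensors:
# each cached row (token) gets its own scale and zero-point.
import numpy as np

def quantize_kv(X: np.ndarray, bits: int = 8):
    qmax = (1 << bits) - 1
    xmin = X.min(axis=-1, keepdims=True)
    xmax = X.max(axis=-1, keepdims=True)
    scale = (xmax - xmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero on constant rows
    q = np.clip(np.round((X - xmin) / scale), 0, qmax).astype(np.uint8)
    return q, scale, xmin

def dequantize_kv(q, scale, xmin):
    return q.astype(np.float32) * scale + xmin

K = np.random.randn(1024, 128).astype(np.float32)
q, s, z = quantize_kv(K)
K_hat = dequantize_kv(q, s, z)
print("max abs error:", np.abs(K - K_hat).max())       # small reconstruction error
print("memory: fp32", K.nbytes, "bytes -> uint8", q.nbytes, "bytes (plus scales/zero-points)")
```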
Research on the quantization of the KV cache to reduce memory, as a subtype of KV cache compression:
- Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie, 20 Feb 2024 (v2), WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More, https://arxiv.org/abs/2402.12065
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 5 Feb 2024, KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, https://arxiv.org/abs/2402.02750 Code: https://github.com/jy-yuan/KIVI (KV cache 2-bit quantization on Llama-2, Falcon and Mistral models.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao, 11 Mar 2024 (v2), GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, https://arxiv.org/abs/2403.05527 Code: https://github.com/HaoKang-Timmy/GEAR (Compressing the size of the KV cache using quantization, low-rank matrices, and sparse matrices.)
- Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang, 7 Mar 2024, QAQ: Quality Adaptive Quantization for LLM KV Cache, https://arxiv.org/abs/2403.04643 Code: http://github.com/ClubieDong/KVCacheQuantization (Reducing the size of the KV cache using quantization.)
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu, 2023, KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization, https://www.researchgate.net/profile/Zirui-Liu-29/publication/376831635_KIVI_Plug-and-play_2bit_KV_Cache_Quantization_with_Streaming_Asymmetric_Quantization/links/658b5d282468df72d3db3280/KIVI-Plug-and-play-2bit-KV-Cache-Quantization-with-Streaming-Asymmetric-Quantization.pdf (Explores quantization of values stored in the KV cache as a way to maintain a smaller KV cache and reduce memory storage requirements.)
- Amir Zandieh, Majid Daliri, Insu Han, 5 Jun 2024, QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead, https://arxiv.org/abs/2406.03482 Code: https://github.com/amirzandieh/QJL (Using 1-bit or 2-bit KV cache quantization approach based on sign bits as estimates.)
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, arXiv preprint arXiv:2405.14366, 2024, https://arxiv.org/abs/2405.14366 (Merging KV caches with similar values across nearby layers to effectively share parts of the cache across layers for a 41% reduction in size.)
- Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 23 May 2024, ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification, https://arxiv.org/abs/2405.14256 (Quantizing the KV cache with combination of other KV cache compression methods based on salient tokens.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Zhangyang Wang, 2024, Q-HITTER: A BETTER TOKEN ORACLE FOR EFFICIENT LLM INFERENCE VIA SPARSE-QUANTIZED KV CACHE, Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi : Plug-and-play 2bit kv cache quantization with streaming asymmetric quantization. 2023. doi: 10.13140/RG.2.2.28167.37282. https://rgdoi.net/10.13140/RG.2.2.28167.37282.
- Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman, 30 Mar 2024, QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, https://arxiv.org/abs/2404.00456 Code: https://github.com/spcl/QuaRot
- June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 28 Feb 2024, No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization, https://arxiv.org/abs/2402.18096
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 4 Apr 2024 (v3), KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079
- Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023, https://arxiv.org/abs/2305.17888
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023. https://arxiv.org/abs/2303.06865
- Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023. https://arxiv.org/abs/2211.10438
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin, 13 May 2024 (v2), SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models, https://arxiv.org/abs/2405.06219
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu, 1 Jul 2024, KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches, https://arxiv.org/abs/2407.01527 Code: https://github.com/henryzhongsc/longctx_bench (Survey and benchmarking of several KV cache compression and long context handling techniques.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Hailin Zhang, Xupeng Miao, Xiaodong Ji, Xiaonan Nie, Yilin Chen, Weipeng Chen, Fangcheng Fu, Bin Cui, 2024, PQCache: Product Quantization-based KVCache for Long Context LLM Inference, https://hugozhl.github.io/files/PQCache.pdf
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Hugging Face, Aug 2024 (accessed), Best Practices for Generation with Cache, https://huggingface.co/docs/transformers/kv_cache
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- W. Byun, J. Woo and S. Mukhopadhyay, "Hessian-Aware KV Cache Quantization for LLMs," 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 2024, pp. 243-247, doi: 10.1109/MWSCAS60917.2024.10658840. https://ieeexplore.ieee.org/abstract/document/10658840
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Yifan Tan, Haoze Wang, Chao Yan, Yangdong Deng, 25 Sep 2024, AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization, https://arxiv.org/abs/2409.16546 (Focuses on access latency of KV cache in floating point, rather than size reduction.)
- Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhua, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
- Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu, 15 Oct 2024, QSpec: Speculative Decoding with Complementary Quantization Schemes, https://arxiv.org/abs/2410.11305 (Enhance speculative decoding using quantization to reuse KV cache values and weights.)
- Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang, 16 Oct 2024, COMET: Towards Partical W4A4KV4 LLMs Serving, https://arxiv.org/abs/2410.12168
- Qian Tao, Wenyuan Yu, Jingren Zhou, 17 Oct 2024, AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations, https://arxiv.org/abs/2410.13212
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang, 27 Nov 2024, Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache, https://arxiv.org/abs/2411.18077
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
KV Cache Layer Fusion
KV cache layer fusion is a type of KV cache compression, analogous to layer fusion as a type of LLM model compression. The idea is that the KV data of some layers is similar enough that the layers can be combined (fused), so the KV data need not be stored separately for any layer that is fused with another. This parallels parameter sharing and layer fusion of LLM model weights.
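Here is a minimal sketch of the cross-layer sharing idea, assuming a hypothetical model where each group of consecutive layers reuses the KV tensors computed by the first layer of the group (the grouping and class names are illustrative simplifications; real methods such as cross-layer attention or MiniCache are more sophisticated):

```python
# Minimal sketch of cross-layer KV sharing ("layer fusion" of the KV cache).
# Every group of `share` consecutive layers reuses the KV tensors written by
# the first layer in the group, so only one slot per group is stored.
import torch

class SharedLayerKVCache:
    def __init__(self, num_layers: int, share: int = 2):
        self.slot_of_layer = [i // share for i in range(num_layers)]
        self.num_slots = max(self.slot_of_layer) + 1
        self.keys = [None] * self.num_slots     # one (K, V) pair per slot,
        self.values = [None] * self.num_slots   # not one per layer

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        slot = self.slot_of_layer[layer]
        if self.keys[slot] is None:             # first layer of the group writes
            self.keys[slot], self.values[slot] = k, v
        return self.keys[slot], self.values[slot]   # later layers just read

cache = SharedLayerKVCache(num_layers=32, share=2)   # roughly 2x smaller cache
k = v = torch.randn(1, 8, 128, 64)
print(cache.update(0, k, v)[0] is cache.update(1, k, v)[0])  # True: layer 1 reuses layer 0's keys
```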
Research papers on KV cache layer fusion:
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, https://arxiv.org/abs/2405.14366 (Compresses the KV cache on the depth dimension of layers, analogous to layer fusion.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Only computes the KV cache of some layers.)
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- AIModels.FYI, 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://www.aimodels.fyi/papers/arxiv/layer-condensed-kv-cache-efficient-inference-large
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 13 Aug 2024 (v3), Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003 https://github.com/zcli-charlie/Awesome-KV%20Cache
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
- Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He, 4 Oct 2024, SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation, https://arxiv.org/abs/2410.03960
- You Wu, Haoyi Wu, Kewei Tu, 18 Oct 2024, A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference, https://arxiv.org/abs/2410.14442
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
- Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen, 24 Oct 2024, KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing, https://arxiv.org/abs/2410.18517
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
KV Cache Layer Pruning
KV cache layer pruning is a type of KV cache compression that removes the KV data of entire layers for faster LLM inference. This makes the KV cache smaller in memory and faster to process, allowing more efficient handling of longer token sequences. It is analogous to layer pruning (depth pruning) as a type of model compression.
The idea is that the KV data of some layers is unimportant and can be discarded, which reduces the total number of layers that need to be stored, and hence the overall memory size of the KV cache.
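As a rough sketch of the idea, a KV cache structure can simply decline to store KV data for layers deemed unimportant (the choice of kept layers below is purely illustrative):

```python
# Minimal sketch of layer-wise KV cache pruning: only a chosen subset of
# layers keeps its KV data; the rest store nothing and would need
# recomputation (or are skipped by the attention variant in use).
import torch

class LayerPrunedKVCache:
    def __init__(self, kept_layers):
        self.kept = set(kept_layers)
        self.store = {}                       # layer index -> (K, V)

    def update(self, layer, k, v):
        if layer in self.kept:
            if layer in self.store:           # append along the sequence axis
                old_k, old_v = self.store[layer]
                k = torch.cat([old_k, k], dim=2)
                v = torch.cat([old_v, v], dim=2)
            self.store[layer] = (k, v)
            return k, v
        return None                           # pruned layer: no cached KV

# Keep the KV cache only for the first and last few layers (illustrative choice).
cache = LayerPrunedKVCache(kept_layers=[0, 1, 30, 31])
k = v = torch.randn(1, 8, 1, 64)              # one new token's K/V
print(cache.update(0, k, v)[0].shape, cache.update(5, k, v))
```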
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Only computes the KV cache of some layers.)
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, arXiv preprint arXiv:2405.14366, 2024, https://arxiv.org/abs/2405.14366 (Merging KV caches with similar values across nearby layers to effectively share parts of the cache across layers for a 41% reduction in size.)
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu, 29 Oct 2024, VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration, https://arxiv.org/abs/2410.23317
KV Fused Head Research
KV fused heads are an LLM inference optimization that makes the KV cache data smaller in memory and more efficient to compute, allowing processing of longer token sequences. Applied to the KV cache, head fusion is analogous to the fused-head optimizations used in the main attention algorithms of LLM inference.
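A minimal sketch of the idea is to merge groups of similar KV heads, here by simple averaging, so that fewer head slots need to be cached (the grouping and the use of a plain mean are illustrative simplifications, not any particular paper's method):

```python
# Minimal sketch of KV head fusion: groups of KV heads are merged (averaged)
# so the cache stores fewer head slots.
import torch

def fuse_kv_heads(k: torch.Tensor, v: torch.Tensor, group: int = 4):
    """k, v: (batch, num_heads, seq_len, head_dim) -> fewer fused heads."""
    b, h, s, d = k.shape
    assert h % group == 0
    k_fused = k.view(b, h // group, group, s, d).mean(dim=2)
    v_fused = v.view(b, h // group, group, s, d).mean(dim=2)
    return k_fused, v_fused          # cache these instead of the full heads

k = v = torch.randn(1, 32, 128, 64)
k_f, v_f = fuse_kv_heads(k, v, group=4)
print(k_f.shape)                      # torch.Size([1, 8, 128, 64]): 4x smaller
```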
Research papers on KV head fusion or merging:
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao, 28 Oct 2024 (v2), Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, https://arxiv.org/abs/2410.19258
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
KV Cache Reuse (Multi-Query KV Caching)
KV cache reuse is the caching of the entire KV data from one query so that it can be reused in future queries, as a type of LLM inference optimization. The basic KV cache methods only store KV data within the current query, while autoregressive decoding is happening, and then the cache data is discarded. A more general approach is to store the KV data across multiple queries.
There are several types of multi-query KV cache reuse:
- Basic KV cache reuse (global KV caching)
- Prefix KV caching
- Session-based multi-turn KV caching (e.g., in chatbot conversations)
- Fused substring KV caching
Context Caching (Global KV Prefill/Encoder Caching)
Context caching is the storage and reuse of KV cache data from the "context" of an LLM query. This allows faster LLM inference on any context tokens that the LLM has seen before. Context caching is often implemented as a type of prefix KV caching or "prefix sharing" optimization.
Cross-query multi-user KV caching: The idea of "context caching" or "global KV caching" is to store the inference context across multiple queries. The prefill/encoder KV data is cached on disk across multiple user queries and reused on subsequent identical (or similar) queries. Because the cache is shared across many users, when any user inputs the same text, the KV computations do not have to be redone and can instead be loaded from the disk cache. This type of KV caching method is a subtype of an "Inference Cache" architecture.
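A minimal sketch of this kind of cross-query KV cache might hash the exact input text and store the prefill KV tensors on disk under that key, reloading them on an identical query (similar in spirit to the llama.cpp prompt-cache approach cited below; the file layout and the prefill hook are hypothetical stand-ins for a real inference engine):

```python
# Minimal sketch of a cross-query ("global") KV cache: prefill KV data is
# stored on disk keyed by a hash of the exact input text, and reloaded when
# an identical query arrives.
import hashlib, os, torch

CACHE_DIR = "kv_cache_store"

def cache_path(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, key + ".pt")

def load_or_prefill(prompt: str, prefill_fn):
    """prefill_fn(prompt) -> per-layer KV tensors (hypothetical model hook)."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = cache_path(prompt)
    if os.path.exists(path):
        return torch.load(path)           # cache hit: skip the prefill phase
    kv = prefill_fn(prompt)               # cache miss: run prefill once
    torch.save(kv, path)
    return kv

# Usage with a dummy prefill function standing in for the real model:
fake_prefill = lambda p: [torch.randn(1, 8, len(p.split()), 64) for _ in range(2)]
kv = load_or_prefill("What is KV caching?", fake_prefill)
```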
Framework Context Caching Support. The idea of a "context cache" or "multi-query KV cache" is appearing in open-source frameworks and commercial model platforms. Some of the platforms with "context caching" (including "prefix KV caching") include:
- vLLM (open source)
- Google (commercial)
- DeepSeek V2 (commercial)
This type of caching can greatly reduce inference costs, so I expect to see it in a lot more platforms, along with pricing reductions for such tokens. For example, DeepSeek has two distinct levels of pricing for "cached" and "non-cached" tokens.
Global semantic KV caching? Can we generalize the idea of a "semantic cache" to store the KV cache across semantically-similar queries, rather than just storing the text output? Maybe not, and I've not seen a paper on this. The main problem is that KV caching is very specific to the exact token sequence, whereas semantic caching aims to cache results for very different textual queries. It's not just different lengths of token sequences, but also having different tokens. Maybe there's some way to do it?
Research papers on multi-query prefill/encoder KV caching:
- G. Gerganov, March 13, 2023, Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64, Llama.cpp project, https://github.com/ggerganov/llama.cpp/issues/64 (This is multi-user KV caching; uses a hash of the input query to store the KV computation to a disk cache for re-use.)
- Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo, 4 Jun 2024 (v2), Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution, https://arxiv.org/abs/2406.00059 Code: https://github.com/conveyor-sys/conveyor (Speeding up inference by partially running tools in parallel to the LLM query processing, rather than sequentially after the LLM request, by detecting tool requests deep inside the decoding algorithm and starting them off immediately, before the LLM has finished generating the fully decoded output.)
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini, 13 May 2024 (v2), Hydragen: High-Throughput LLM Inference with Shared Prefixes, https://arxiv.org/abs/2402.05099 Code: https://github.com/jordan-benjamin/hydragen
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- J. Lin, S. Q. Zhang and A. Leon-Garcia, 2024, sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Structures, 2024 25th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 2024, pp. 1-6, doi: 10.1109/ISQED60706.2024.10528703. https://ieeexplore.ieee.org/abstract/document/10528703 (Optimize the global KV cache by sharing it across multiple queries.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf, 19 Mar 2024, Encode Once and Decode in Parallel: Efficient Transformer Decoding, https://arxiv.org/abs/2403.13112
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 PDF: https://prongs1996.github.io/assets/pdf/CachedAttention.pdf (Memory management of KV caches using hierarchical cache layers.)
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- NVIDIA, July 2024 (accessed), KV cache reuse, https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md (KV cache reuse in TensorRT is an implementation of prefix-based KV caching.)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Llama.cpp, July 2024 (accessed), Prompt Caching, https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#prompt-caching
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- Hugging Face, Aug 2024 (accessed), Best Practices for Generation with Cache, https://huggingface.co/docs/transformers/kv_cache
- David Spuler, March 2024, Global KV Prefill/Encoder Caching, in Generative AI in C++, https://www.aussieai.com/book/ch29-global-prefill-kv-caching
- Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
- Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou, 30 Sep 2024, The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems, https://arxiv.org/abs/2409.20002
- Amr Elmeleegy, Nick Comly and Thor Johnsen, Nov 08, 2024, 5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse, https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/
Prefix KV Caching
Prefix KV caching, or prefix caching, is an LLM optimization that involves reusing a cache of the KV data for a previously seen token prefix. There are several common use cases where an LLM will reprocess the same token prefix, such as in session-based conversational history or prepended global instructions for all queries.
An interesting generalization of the global KV cache is that it needn't only apply to the whole query — it can also apply to any prefix. By caching the portion of the KV cache up to a prefix of the query, a significant amount of computation is avoided. And there are several major cases where a prefix re-occurs:
- RAG document prepended to a query
- Conversational history of a chatbot or other session-based AI.
- Global instructions (prepended to all queries)
- Document re-use (multiple queries on the same context)
Each time a RAG document is prepended as data to a user's query, the query has the same prefix. The KV cache can be stored for this prefix, and, even further, it could be precomputed offline for each RAG document chunk.
Similarly, each turn of a conversation between a user and a chatbot prepends the prior conversation history. Hence, the KV data for the current query acts as a prefix KV cache for the next query from the same user. This imposes some practical issues in terms of syncing the user's session with the cache on large data center backends (e.g., using "sticky sessions" in the network load balancing hardware, or a "network-shared disk" accessible from all servers, where the KV caches are stored). There's also a very clear win on single-user systems such as on-device chatbots on AI Phones or AI PCs, where there's only really a single session and the KV cache is obviously always on the same device.
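A minimal sketch of prefix reuse is to match the new prompt's token IDs against previously cached prefixes and skip prefill for the matched tokens (a flat dictionary is used here for clarity; production systems such as SGLang's RadixAttention use a trie/radix tree instead, and the cache structure and names here are illustrative):

```python
# Minimal sketch of prefix KV cache reuse: find the longest previously cached
# token prefix of the new prompt, reuse its KV data, and only prefill the
# remaining suffix tokens.
import torch

class PrefixKVCache:
    def __init__(self):
        self.entries = {}                  # tuple(token_ids) -> KV tensors

    def put(self, token_ids, kv):
        self.entries[tuple(token_ids)] = kv

    def longest_prefix(self, token_ids):
        best, best_kv = 0, None
        for cached_ids, kv in self.entries.items():
            n = len(cached_ids)
            if n > best and tuple(token_ids[:n]) == cached_ids:
                best, best_kv = n, kv
        return best, best_kv               # tokens whose prefill can be skipped

cache = PrefixKVCache()
system_prompt = [101, 7592, 2088, 102]                  # e.g. global instructions
cache.put(system_prompt, torch.randn(1, 8, len(system_prompt), 64))
query = system_prompt + [2054, 2003, 1037]              # same prefix + user tokens
matched, kv = cache.longest_prefix(query)
print(f"reuse KV for {matched} tokens, prefill only {len(query) - matched}")
```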
Research papers on prefix KV caching include:
- Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., Ananthanarayanan, G., and Jiang, J., August 2024, Cachegen: Fast context loading for language model applications, Microsoft Research, https://arxiv.org/abs/2310.07240 https://www.microsoft.com/en-us/research/publication/cachegen-fast-context-loading-for-language-model-applications-via-kv-cache-streaming/
- X Wu, L Zhang, Y Wang, Y Ren, M Hack, 2016, zExpander: a Key-Value Cache with both High Performance and Fewer Misses, EuroSys ’16 April 18–21, 2016, London, United Kingdom, https://ranger.uta.edu/~sjiang/pubs/papers/wu16_zExpander.pdf (General theory paper about prefix key-value caching in a trie or binary tree, not specific to neural networks.)
- Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini, 13 May 2024 (v2), Hydragen: High-Throughput LLM Inference with Shared Prefixes, https://arxiv.org/abs/2402.05099 Code: https://github.com/jordan-benjamin/hydragen
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Lu Ye, Ze Tao, Yong Huang, Yang Li, 22 Mar 2024 (v2), ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition, https://arxiv.org/abs/2402.15220 (Identify prefixes of prompts and caching the KV values of these portions of the prompt.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf, 19 Mar 2024, Encode Once and Decode in Parallel: Efficient Transformer Decoding, https://arxiv.org/abs/2403.13112
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 PDF: https://prongs1996.github.io/assets/pdf/CachedAttention.pdf (Memory management of KV caches using hierarchical cache layers.)
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, Jan 17, 2024, Fast and Expressive LLM Inference with RadixAttention and SGLang, https://lmsys.org/blog/2024-01-17-sglang/ https://arxiv.org/abs/2312.07104 Code: https://github.com/sgl-project/sglang/
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E. Gonzalez, Ion Stoica, and Matei Zaharia. March 2024, Optimizing llm queries in relational workloads, https://arxiv.org/abs/2403.05821
- Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li, 17 Apr 2023 (v2), AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems, https://arxiv.org/abs/2301.09262
- Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, Luis Ceze, Feb 2, 2024, Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, https://flashinfer.ai/2024/02/02/cascade-inference.html
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2024, INFERCEPT: Efficient Intercept Support for Augmented Large Language Model Inference, https://openreview.net/pdf?id=wDDGQabYPQ
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- VLLM, 2024, What is Automatic Prefix Caching, https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- NVIDIA, July 2024 (accessed), KV cache reuse, https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md (KV cache reuse in TensorRT is an implementation of prefix-based KV caching.)
- Kexin Chu, Tzechinh Liu, Yunding Li, Pengchao Yuan, Wei Zhang, 2024, CaR: An Efficient KV Cache Reuse System for Large Language Model Inference, University of Connecticut, https://llm-gnn.org/slides/CaR-Chu.pdf
- James Groeneveld, Aug 1, 2024, Prompt Design at Character.AI, Character.AI blog, https://research.character.ai/prompt-design-at-character-ai/
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- David Spuler, September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
- Open AI, Oct 2024 (accessed), Prompt Caching, https://platform.openai.com/docs/guides/prompt-caching
- Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao, 4 Oct 2024, Compute Or Load KV Cache? Why Not Both? https://arxiv.org/abs/2410.03065
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie, 20 Oct 2024, EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models, https://arxiv.org/abs/2410.15332
- David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag
- Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica, 25 Nov 2024, BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching, https://arxiv.org/abs/2411.16102
- Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, Yida Wang, 28 Nov 2024, Marconi: Prefix Caching for the Era of Hybrid LLMs, https://arxiv.org/abs/2411.19379 (Prefix caching applied to hybrid SSM-Transformer LLMs.)
- Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng, 29 Nov 2024, BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching, https://arxiv.org/abs/2412.03594
- Ao Wang, Hui Chen, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding, 4 Dec 2024, PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation, https://arxiv.org/abs/2412.03409 https://github.com/THU-MIG/PrefixKV
- PromptHub, December 6, 2024, Prompt Caching with OpenAI, Anthropic, and Google Models, https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models
Prompt Caching
Prompt caching is an LLM inference optimization that uses cached data for faster query responses. This is a more general optimization than context caching, as prompt caching may refer to caching of any elements of the prompt, including both user query and context tokens.
The term "prompt caching" is appearing in industry and can mean various things. Typically, it refers to multi-query caching of the KV data for input prompt tokens. However, prompt cache may also refer to prefix KV caching, or even to a non-KV inference cache.
Research papers on "prompt cache" include:
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Google, July 2024 (accessed), Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python
- Llama.cpp, July 2024 (accessed), Prompt Caching, https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#prompt-caching
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Hanlin Zhu, Banghua Zhu, Jiantao Jiao, 2 Feb 2024, Efficient Prompt Caching via Embedding Similarity, https://arxiv.org/abs/2402.01173
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, 25 Apr 2024 (v2), Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934
- Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
- Open AI, Oct 2024 (accessed), Prompt Caching, https://platform.openai.com/docs/guides/prompt-caching
- Michael Nuñez, October 1, 2024, OpenAI’s DevDay 2024: 4 major updates that will make AI more accessible and affordable, https://venturebeat.com/ai/openai-devday-2024-4-major-updates-that-will-make-ai-more-accessible-and-affordable/
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- PromptHub, December 6, 2024, Prompt Caching with OpenAI, Anthropic, and Google Models, https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models
Session KV Caching
Session KV caching is the use of KV caching to speed up the inference in an LLM session. This typically refers to a chatbot or Q&A session, where computations from earlier in a session can be cached and used to speed up the inference of the next query.
Chatbots and other session-based Q&A interfaces have an interesting property, where the current output is used as part of the input for the next query. This allows session-based optimization for conversational history or other multi-turn interfaces.
If you consider how the conversation history is handled, the output of the current query becomes the prepended text onto the next query. Although you cannot speed up the current query this way, you can avoid a lot of processing of the conversation on the next query by retaining the KV cache. This is a special case of "prefix KV caching" in multi-turn conversational sessions. Effectively, the prefill phase for processing any conversation history text is avoided by maintaining a KV cache of the prior answer.
There are some practical problems with this, especially related to the size of the KV cache. A long chatbot conversation, where the answers may be lengthy, can become a large sequence of tokens. This is even more the case in a Q&A session with a RAG architecture, where the responses are lengthy summaries of the retrieved chunks. In either case, the session's prepended conversation history grows long, and the amount of memory used by the KV cache becomes excessive. Various approaches exist to address this, such as context summarization, KV cache truncation, or KV cache compression. However, it is not entirely clear which of these KV cache size reductions still allow the prior conversational KV cache to replace the prefill of the next query.
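A minimal sketch of session-based reuse keeps the KV data from each turn keyed by session ID, so only the new turn's tokens need prefill (the session store and the model hook below are hypothetical stand-ins, not any particular framework's API):

```python
# Minimal sketch of session-based KV caching for a multi-turn chat: the KV
# data produced while answering one turn is kept (keyed by session ID) and
# becomes the prefix cache for the next turn, so the conversation history is
# not re-prefilled each time.
import torch

sessions = {}   # session_id -> {"tokens": [...], "kv": per-layer KV tensors}

def chat_turn(session_id, new_tokens, run_model):
    """run_model(tokens, past_kv) -> (reply_tokens, updated_kv); hypothetical."""
    state = sessions.get(session_id, {"tokens": [], "kv": None})
    reply, kv = run_model(new_tokens, state["kv"])   # only new tokens are prefilled
    state["tokens"] += new_tokens + reply
    state["kv"] = kv                                  # reused on the next turn
    sessions[session_id] = state
    return reply

# Dummy model stand-in: returns a fixed reply and grows the "KV cache".
def fake_model(tokens, past_kv):
    past_len = 0 if past_kv is None else past_kv.shape[2]
    kv = torch.randn(1, 8, past_len + len(tokens) + 1, 64)
    return [42], kv

print(chat_turn("user-1", [1, 2, 3], fake_model))
print(chat_turn("user-1", [4, 5], fake_model))   # history KV reused, not recomputed
```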
Research papers on session-based prefix caching:
- Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 23 Mar 2024, AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving, https://arxiv.org/abs/2403.19708 (Memory management of KV caches using hierarchical cache layers.)
- VLLM, 2024, What is Automatic Prefix Caching, https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
- Kexin Chu, Tzechinh Liu, Yunding Li, Pengchao Yuan, Wei Zhang, 2024, CaR: An Efficient KV Cache Reuse System for Large Language Model Inference, University of Connecticut, https://llm-gnn.org/slides/CaR-Chu.pdf
- James Groeneveld, Aug 1, 2024, Prompt Design at Character.AI, Character.AI blog, https://research.character.ai/prompt-design-at-character-ai/
- DeepSeek, 02 August, 2024, DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude, https://platform.deepseek.com/api-docs/news/news0802/ (Announcement of commercial support for global KV caching with session-based and prefix KV caches.)
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- Hugging Face, Aug 2024 (accessed), Best Practices for Generation with Cache, https://huggingface.co/docs/transformers/kv_cache
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
Fused KV Caching
Fused KV caching is an attempt to generalize prefix KV caching to handle multiple text sequences anywhere in the input prompt string. This allows LLM inference to be optimized in any situation where a portion of the tokens has previously been processed by the LLM.
Fused KV caching is an improvement over prefix KV caching. For example, when RAG returns two or more chunks, only the first chunk benefits from prefix KV caching: the second chunk is only a cache hit if it happens to appear again after the same first chunk. This undermines the goal of maintaining a cache for any RAG chunk that is often used in answers. The problem can be alleviated somewhat by "caching-aware reranking", which reorders the chunks so that a cached chunk is placed first; the first chunk is then almost always cached, but the KV cache for the second and later chunks must still be recomputed.
There are some major problems that block this idea of caching sequences for two or more chunks anywhere in the prompt text:
- Positional encoding
- Inter-chunk attention
- Non-linearity
Positional encoding: One of the difficulties is that the KV cache of a chunk appearing second in the sequence actually depends on the length of the first sequence, because of positional encoding. Hence, merging of the KV cache for the second or later chunks is difficult, and would really need to be per-position, which is an unrealistic memory requirement.
Inter-chunk attention: If there are two RAG chunks and we have a cached KV data set for each, neither cache includes attention to the other chunk. If a user query follows the second chunk, the precomputed KV data for the second chunk has no contribution from the first chunk. Obviously, the attention computation for the query tokens will attend to both of the chunks again, as part of prefill, but is that enough attention? It seems likely for RAG that cross-chunk attention is far less important than query-to-chunk attention, but this hasn't really been confirmed empirically.
Non-linearity of KV data: If the KV computations were linear, we could just add the two KV caches together for the two RAG chunks. Unfortunately, as we well know, LLMs don't work linearly, so more complicated approximations are required.
Research on Fused KV Caching: This is a new area of research with not many papers yet. One approach in [Gim et al 2023] is to adjust the KV cache for the positional encoding problems, and then merge them together using simple concatenation, which is an approximation of the KV cache that would be computed. The positional encoding problem is avoided by using a structured prompt layout that ensures the text always occurs at particular positions (i.e., similar to the idea of having fixed lengths for each RAG chunk). Hence, there are limited positions for each chunk, and caching per-chunk for a limited number of positions is realistic. No attempt is made to add any inter-chunk attention, with each chunk's KV cache simply used without changes, and they are all concatenated together to create the full token sequence's KV cache data. Surprisingly, this seems to all work well, as shown in the paper. So much for those concerns about non-linearity!
Another approach to fusing KV caches in [Yao et al, 2024] is selective re-computation of KV caches combined with fusing the KV cache for multiple chunks, which again is an approximation. This merges two or more KV caches together in sequence, but then does a selective re-computation of the KV caches for some tokens. Around 10-20% of the tokens have their KV cache re-computed in each layer. As the paper shows, this is much faster in terms of Time-to-first-token, but has a negligible loss in accuracy.
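A minimal sketch of the concatenation approach described above simply joins precomputed per-chunk KV caches along the sequence axis, accepting the approximations around positional encoding and inter-chunk attention (the shapes and helper names are illustrative, and this follows the general concatenation idea rather than any one paper's code):

```python
# Minimal sketch of fused KV caching for RAG: per-chunk KV caches, precomputed
# offline, are concatenated along the sequence axis as an approximation of the
# full prefill (ignoring inter-chunk attention, and assuming positions have
# been handled, e.g. via fixed chunk slots).
import torch

def fuse_chunk_caches(chunk_caches):
    """chunk_caches: list of per-chunk (K, V), each (batch, heads, chunk_len, dim)."""
    keys = torch.cat([k for k, _ in chunk_caches], dim=2)
    values = torch.cat([v for _, v in chunk_caches], dim=2)
    return keys, values     # approximate KV cache for the concatenated chunks

# Two precomputed RAG chunk caches of different lengths:
chunk_a = (torch.randn(1, 8, 100, 64), torch.randn(1, 8, 100, 64))
chunk_b = (torch.randn(1, 8, 80, 64), torch.randn(1, 8, 80, 64))
k, v = fuse_chunk_caches([chunk_a, chunk_b])
print(k.shape)              # torch.Size([1, 8, 180, 64]); the query tokens still
                            # attend to both chunks during their own prefill
```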
Research papers on fused KV caching:
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, Nov 2023, Prompt Cache: Modular Attention Reuse for Low-Latency Inference, https://arxiv.org/abs/2311.04934 (Unique and insightful advance of generalizing KV caching to multiple prompts by computing a cache for short "segments" of prompts, including methods to adjust the different KV cache values for text segments that appear in different positions of the overall prompt.)
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457 (This paper briefly considers merging KV caches of multiple RAG chunks, but instead focuses on (a) caching of two or more chunks in one KV cache record, and (b) reordering the chunks in a cache-aware manner.)
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056 (Examines KV cache head merging approaches for KV cache size reduction, and also examines RoPE encoding issues with relevance to fusing KV caches.)
- Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- David Spuler, September 26, 2024, RAG Optimization via Caching, Aussie AI Blog, https://www.aussieai.com/blog/rag-optimization-caching
- Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang, 10 Oct 2024, TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text, https://arxiv.org/abs/2410.07590 (Fusing precomputed KV caches for each RAG chunk.)
- Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie, 20 Oct 2024, EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models, https://arxiv.org/abs/2410.15332
- David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
General Research on KV Cache Optimization
Other papers on optimizing the KV cache include:
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437 (Further optimizes paged attention algorithm for KV caching in attention, by storing the KV cache in contiguous memory and using underlying system paging.)
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 30 Mar 2024, DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference, https://arxiv.org/abs/2404.00242
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng ZHANG, Dahua Lin, Ion Stoica, Hao Zhang, 02 May 2024, MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving, ICML 2024, https://openreview.net/forum?id=R0SoZvqXyQ PDF: https://openreview.net/pdf?id=R0SoZvqXyQ Code: https://github.com/hao-ai-lab/MuxServe (Separates the prefill and decoding phases when serving, and also manages the LLM weights and KV cache data in blocks for memory efficiency.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao, 23 Jun 2024 (v2), Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Disaggregates the KV cache between prefill and decoding tokens, since the KV cache size is known for prefill, thereby reducing memory fragmentation, and also applies kernel fusion to several modules including the scaled dot product attention.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Song Jiang, 18 April 2016, zExpander: a key-value cache with both high performance and fewer misses, EuroSys '16: Proceedings of the Eleventh European Conference on Computer Systems, April 2016, Article No. 14, pp. 1–15, https://dl.acm.org/doi/abs/10.1145/2901318.2901332 https://doi.org/10.1145/2901318.2901332
Computation Reuse Optimizations
Computation reuse optimizations are LLM inference improvements where previously computed data is reused rather than recomputed. There are several areas where LLM computations can be reused, including kernel optimizations and KV caching. Computations in neural network inference can be reused by storing them in a cache, a technique also called "memoization".
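As a trivial illustration of memoization, here is a minimal C++ sketch that caches the result of an expensive computation keyed on a hash of its input. The compute_layer function is a dummy stand-in, and the exact-match hashing (with no collision handling) is purely illustrative, not any framework's actual mechanism.

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Stand-in for an expensive computation (e.g. a layer or kernel invocation).
std::vector<float> compute_layer(const std::vector<float>& input) {
    std::vector<float> out(input.size());
    for (size_t i = 0; i < input.size(); ++i) out[i] = input[i] * 2.0f; // dummy work
    return out;
}

// FNV-1a hash of the raw float bits, used as the memoization key.
uint64_t hash_vector(const std::vector<float>& v) {
    uint64_t h = 1469598103934665603ULL;
    for (float x : v) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        h = (h ^ bits) * 1099511628211ULL;
    }
    return h;
}

static std::unordered_map<uint64_t, std::vector<float>> g_memo_cache;

std::vector<float> compute_layer_memoized(const std::vector<float>& input) {
    uint64_t key = hash_vector(input);
    auto it = g_memo_cache.find(key);
    if (it != g_memo_cache.end()) return it->second;   // cache hit: reuse result
    std::vector<float> result = compute_layer(input);  // cache miss: compute
    g_memo_cache[key] = result;
    return result;
}
```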
Research papers on data re-use or computation re-use include:
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey that contains a section on "Memoization" which is caching computed values for later reuse.)
- X. Jiao, V. Akhlaghi, Yu Jiang, and R. K. Gupta. 2018. Energy-efficient neural networks using approximate computation reuse. Proc. of the 2018 Design, Automation and Test in Europe Conference and Exhibition, (DATE) (2018), 1223–1228. https://ieeexplore.ieee.org/document/8342202 https://www.date-conference.com/proceedings-archive/2018/pdf/0571.pdf (Uses Bloom filters and caching of results.)
- JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
- Luca Mocerino, Valerio Tenace, and Andrea Calimera. 2019. Energy-Efficient Convolutional Neural Networks via Recurrent Data Reuse. In Design, Automation Test in Europe Conference Exhibition (DATE). 848–853. https://ieeexplore.ieee.org/document/8714880 (Caching using an associative memory unit.)
- M Riera, JM Arnau, A González, 2022, CREW: Computation reuse and efficient weight storage for hardware-accelerated MLPs and RNNs, Journal of Systems Architecture, Elsevier, https://www.sciencedirect.com/science/article/pii/S1383762122001394, https://arxiv.org/abs/2107.09408 (Hardware-assisted caching of weight computations.)
- V Janfaza, K Weston, M Razavi, 2023, MERCURY: Accelerating DNN Training By Exploiting Input Similarity, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/abstract/document/10071051/, https://arxiv.org/abs/2110.14904 (Uses a bit-signature to find similar vectors for reusing dot product calculations.)
- X Li, B Ren, X Shen, Y Wang, 2022, CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework, arXiv preprint arXiv:2206.10620, https://arxiv.org/abs/2206.10620 (Various optimizations including block pruning and deep reuse.)
- Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li, 2023, AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems, https://arxiv.org/abs/2301.09262 (Exploits input text similarities in terms of tokens and embeddings so as to cache and reuse tensor attention computations.)
- M Capra, B Bussolino, A Marchisio, M Shafique, 2020, An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, PDF: https://www.mdpi.com/1999-5903/12/7/113/pdf (Includes a section on data reuse.)
- Hegde, K., Yu, J., Agrawal, R., Yan, M., Pellauer, M., Fletcher, C.W.: UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition. In: Proceedings of the 45th Annual International Symposium on Computer Architecture. pp. 674–687. IEEE Press (2018) https://ieeexplore.ieee.org/document/8416864, https://arxiv.org/abs/1804.06508 (Combines analysis of "weight repetition" to reuse partial dot product results, further improved with sparsity.)
- E. Guo, P. Li, K. Wang, H. Feng, J. Lu, and S. Guo, Exploiting computation reuse in cloud-based deep learning via input reordering, in Proc. ICC-IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1–6. https://ieeexplore.ieee.org/document/9148746
- Franyell Silfa, Jose-Maria Arnau, Antonio González, Feb 2022, Saving RNN Computations with a Neuron-Level Fuzzy Memoization Scheme https://arxiv.org/abs/2202.06563 (Uses "fuzzy memoization" in hardware for computation reuse.)
- Franyell Silfa, Gem Dot, Jose-Maria Arnau, and Antonio Gonzàlez. Neuron-Level Fuzzy Memoization in RNNs. In Annual IEEE/ACM International Symposium on Microarchitecture, pages 782–793, 2019. https://dl.acm.org/doi/10.1145/3352460.3358309 (Uses "fuzzy memoization" for caching neuron outputs for computation reuse.)
- Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, and Xipeng Shen. DREW: Efficient Winograd CNN Inference with Deep Reuse. In Proceedings of the ACM Web Conference 2022, pages 1807–1816, 2022. https://dl.acm.org/doi/10.1145/3485447.3511985 PDF: https://research.csc.ncsu.edu/picture/publications/papers/www2022.pdf (Combines the Winograd matrix multiplication algorithm with computation reuse.)
- NM Cicek, L Ning, O Ozturk, 2022, General reuse-centric CNN accelerator, IEEE Transactions on Computers, Volume 71, Issue 4, 01 April 2022, https://ieeexplore.ieee.org/abstract/document/9373953/ (Detects similarities in patches of images.)
- F Zhang, R Wu, J Guan, Z Zheng, X Guo, 2023, Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device, IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 10, 01 October 2023, https://ieeexplore.ieee.org/abstract/document/10106424/ (Further analysis of DREW, which combines Winograd optimizations with data reuse.)
- Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
- J Servais, E Atoofian, 2021, Adaptive computation reuse for energy-efficient training of deep neural networks, ACM Transactions on Embedded Computing Systems, Volume 20, Issue 6, Article No. 114, pp. 1–24, https://doi.org/10.1145/3487025, https://dl.acm.org/doi/abs/10.1145/3487025 (Optimizing training rather than inference, by using computation reuse.)
- Alireza Khadem, Haojie Ye, Trevor Mudge, Apr 2021, CoDR: Computation and Data Reuse Aware CNN Accelerator, https://arxiv.org/abs/2104.09798 (Hardware-accelerated optimization based on sparsity, weight repetition, and similarity for computation reuse.)
- Hoda Mahdiani; Alireza Khadem; Ali Yasoubi; Azam Ghanbari; Mehdi Modarressi; Masoud Daneshtalab, 2020, Computation reuse-aware accelerator for neural networks, In: Hardware Architectures for Deep Learning, Institution of Engineering and Technology , 2020, p. 147-158, https://digital-library.theiet.org/content/books/10.1049/pbcs055e_ch7, https://www.diva-portal.org/smash/record.jsf?pid=diva2:1761051
- Mohammed F. Tolba, Hani Saleh, Baker Mohammad, Mahmoud Al-Qutayri, Thanos Stouraitis, 2023, EACNN: Efficient CNN Accelerator Utilizing Linear Approximation and Computation Reuse, 2023 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/document/10181343 (Reducing multiplications by detecting weight similarity.)
- C Barrios, M Kumar, 2023, Service Caching and Computation Reuse Strategies at the Edge: A Survey, ACM Computing Surveys, Volume 56, Issue 2, Article No. 43, pp 1–38, https://dl.acm.org/doi/abs/10.1145/3609504
- B Zerom, M Tolba, H Tesfai, H Saleh, 2022, Approximate Logarithmic Multiplier For Convolutional Neural Network Inference With Computational Reuse, 2022 29th IEEE International Conference on Electronics, Circuits and Systems (ICECS), https://ieeexplore.ieee.org/document/9970861 (Combines the Logarithmic Number System, Mitchell's approximate multiplication algorithm, and data reuse strategies to speed up MAC operations.)
- A Ghanbari, M Modarressi, 2022, Energy-efficient acceleration of convolutional neural networks using computation reuse, Journal of Systems Architecture, https://www.sciencedirect.com/science/article/pii/S1383762122000674, https://dl.acm.org/doi/10.1016/j.sysarc.2022.102490
- MF Tolba, HT Tesfai, H Saleh, B Mohammad, 2022, Deep Neural Networks-Based Weight Approximation and Computation Reuse for 2-D Image Classification, IEEE Access (Volume 10), https://ieeexplore.ieee.org/abstract/document/9740128/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09740128.pdf, https://arxiv.org/abs/2105.02954
- D Ma, X Yin, M Niemier, XS Hu, X Jiao, 2020, Axr-nn: Approximate computation reuse for energy-efficient convolutional neural networks, GLSVLSI '20: Proceedings of the 2020 on Great Lakes Symposium on VLSI, September 2020, Pages 363–368, https://dl.acm.org/doi/abs/10.1145/3386263.3407595
- Ali Yasoubi; Reza Hojabr; Mehdi Modarressi, 2016, Power-efficient accelerator design for neural networks using computation reuse, IEEE Computer Architecture Letters, Volume 16, Issue 1, Jan-June 2017, https://ieeexplore.ieee.org/abstract/document/7393481/ PDF: https://www.researchgate.net/profile/Ali-Yasoubi/publication/292077006_Power-Efficient_Accelerator_Design_for_Neural_Networks_Using_Computation_Reuse/links/5b98c095299bf14ad4d0b04d/Power-Efficient-Accelerator-Design-for-Neural-Networks-Using-Computation-Reuse.pdf
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” ISCA, 2016. https://ieeexplore.ieee.org/document/7551407, PDF: http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf, PDF Slides: https://eems.mit.edu/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf, Project: http://eyeriss.mit.edu/ (Examines computation re-use as a memory-efficient dataflow, including in kernel operator fusion, which is called "folding.")
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Caching and data reuse in low-level kernel fusion optimizations.)
- Xia, C., Zhao, J., Sun, Q., Wang, Z., Wen, Y., Feng, X., Cui, H., 2023, Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 27 Apr-01 May 2023, San Diego, USA. https://eprints.whiterose.ac.uk/203681/, PDF: https://eprints.whiterose.ac.uk/203681/1/asplos24.pdf (Identifying data reuse opportunities at the ML compiler level.)
- C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf (Explores computation reuse of partial dot product sums.)
- Rohan Baskar Prabhakar, Sachit Kuhar, Rohit Agrawal, Christopher J. Hughes, and Christopher W. Fletcher. Summerge: An efficient algorithm and implementation for weight repetition-aware dnn inference. In Proceedings of the ACM International Conference on Supercomputing, ICS ’21, page 279–290, New York, NY, USA, 2021. Association for Computing Machinery, PDF: https://dl.acm.org/doi/pdf/10.1145/3447818.3460375 (Efficient dot product computation via computation reuse and weight repetition.)
- Michael E. Wolf and Monica S. Lam, 1991, A Data Locality Optimizing Algorithm, Proceedings of the ACM SIGPLAN ’91 Conference on Programming Language Design and Implementation. Toronto, Ontario, Canada, June 26-28, 1991. https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15745-s05/www/lectures/wolf-lam-pldi91.pdf (Early 1991 paper that includes optimizations to loops and matrix multiplication algorithms with a cache locality focus.)
Cached or Precomputed Transpose
Cached transpose, or cached matrix transpose, is a kernel optimization in LLM inference where the transpose of a matrix is precomputed or cached. There are several ways that MatMul or GEMM kernels can be optimized using the transpose of a matrix, because the transposed layout can hold the needed data contiguously, leading to faster memory access patterns.
One minor way to optimize matrix multiplications that involve the transpose of a matrix is to store both versions of the matrix in memory: the original matrix and its transpose. This can help to speed up inference by (a) avoiding the need to compute the transpose on-the-fly, and (b) having the transpose already laid out properly in contiguous memory for pipelining and dataflow efficiency.
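For example, here is a minimal C++ sketch (row-major layout assumed) of why a precomputed transpose helps: with the transpose BT stored at model load time, the inner dot-product loop walks both operands sequentially in memory, rather than striding through B column-by-column.

```cpp
#include <vector>

// C = A * B, where A is MxK and B is KxN, all stored row-major.
// BT is the precomputed transpose of B (NxK), stored once at model load time.
void matmul_with_cached_transpose(const std::vector<float>& A,
                                  const std::vector<float>& BT,
                                  std::vector<float>& C,
                                  int M, int K, int N) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            const float* a_row = &A[i * K];   // contiguous row of A
            const float* b_col = &BT[j * K];  // contiguous row of BT = column of B
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)
                sum += a_row[k] * b_col[k];   // sequential memory access in both
            C[i * N + j] = sum;
        }
    }
}
```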
Some papers mention this optimization technique:
- Dave Dice, Alex Kogan, Feb 2021, Optimizing Inference Performance of Transformers on CPUs, https://arxiv.org/pdf/2102.06621.pdf (Stores weight matrix and its transpose in memory, precomputed from the start.)
- Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou, 21 Feb 2022, I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels https://arxiv.org/abs/2202.10217
- Paul Springer, Paolo Bientinesi, 7 Nov 2017 (v3), Design of a high-performance GEMM-like Tensor-Tensor Multiplication, https://arxiv.org/abs/1607.00145
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- David Spuler, March 2024, Cached or Precomputed Transpose, in Generative AI in C++, https://www.aussieai.com/book/ch29-cached-precomputed-transpose
Vector Computation Reuse with Hashing (Vector Caching)
Vector caching is the use of precomputed vector dot products that are stored in a cache datastore. A lot of the low-level computation of tensor multiplications or "convolutions" in AI inference breaks down into a vector dot product computation (also called a "scalar product"). This is a multiplication-addition (multiply-accumulate or MAC) of all of the elements of two vectors to create a single number, usually a 32-bit floating point result. Hence, it's a good candidate for caching with computation reuse since it's a large amount of arithmetic, and the result is only a single floating-point number, which won't need much extra space to store. The trade-off of a small amount of extra storage to avoid significant computations is attractive. Also, one of those vectors is static during inference (e.g. the weights vector), whereas the other vector operand is dynamic. Hence, the idea with vector computation reuse is to cache the computed dot product results and then detect when the second, incoming (dynamic) vector is the same as, or similar enough to, a previous incoming vector.
Various researchers have looked into this type of vector caching. The main methods to detect similar vectors are:
- Locality-Sensitive Hashing (LSH): the most popular method, which uses cosine similarity to find reasonably "close" vectors in n-dimensional space. See also more research on hashing.
- Bit signatures
- K-means clustering
- Hyper-Cube
If the vectors are similar, the cached result is a reasonable approximation of the dot product computation, which is thereby avoided.
Assuming similar vectors can be identified efficiently, the question is: how often does AI model inference perform vector computations on similar vectors? What is the cache hit rate? Research papers seem to indicate that similar vectors occur rather often, with some reporting inference speedups of around 50%.
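Here is a minimal C++ sketch of the LSH-style approach, using random-hyperplane bit signatures: the incoming dynamic vector is reduced to a short signature, and if the same signature has been seen before, the previously computed dot product against the fixed weight vector is reused as an approximation. The 32-bit signature length and the cache structure are illustrative choices, not taken from any particular paper.

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Random hyperplanes producing a 32-bit LSH signature (sign of each projection).
struct LshSigner {
    std::vector<std::vector<float>> planes;  // 32 x dim
    explicit LshSigner(int dim, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        planes.assign(32, std::vector<float>(dim));
        for (auto& p : planes)
            for (auto& x : p) x = gauss(rng);
    }
    uint32_t signature(const std::vector<float>& v) const {
        uint32_t sig = 0;
        for (int b = 0; b < 32; ++b) {
            float dot = 0.0f;
            for (size_t i = 0; i < v.size(); ++i) dot += planes[b][i] * v[i];
            if (dot > 0.0f) sig |= (1u << b);  // one bit per hyperplane side
        }
        return sig;
    }
};

// Cache dot products against one fixed weight vector, keyed by LSH signature.
float cached_dot(const std::vector<float>& weights,
                 const std::vector<float>& input,
                 const LshSigner& signer,
                 std::unordered_map<uint32_t, float>& cache) {
    uint32_t sig = signer.signature(input);
    auto it = cache.find(sig);
    if (it != cache.end()) return it->second;  // similar vector seen: reuse (approx.)
    float dot = 0.0f;
    for (size_t i = 0; i < input.size(); ++i) dot += weights[i] * input[i];
    cache[sig] = dot;                          // store for future reuse
    return dot;
}
```

Note that computing the signature itself costs several projections, so in practice the signature would be computed once per incoming vector and amortized across many weight rows (one cache per row), which is how schemes of this kind pay for themselves.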
Research papers on LSH-based vector dot product caching and computation reuse:
- L. Ning and X. Shen, 2019, Deep reuse: Streamline CNN inference on the fly via coarse-grained computation reuse, in Proc. ACM Int. Conf. Supercomputing, pp. 438–448, https://dl.acm.org/doi/10.1145/3330345.3330384 (A dynamic method to reuse computations based on vector and sub-vector similarities detected via locality-sensitive hashing during inference. Explores LSH, K-means clustering and Hyper-Cube for vector similarity detection.)
- L. Ning, H. Guan, and X. Shen, Apr 2019, Adaptive deep reuse: Accelerating CNN training on the fly, in Proc. IEEE 35th Int. Conf. Data Eng. (ICDE), pp. 1538–1549.
- R. Moura, P. Santos, J. Lima, M. Alves, A. Beck, and L. Carro, Skipping CNN convolutions through efficient memoization, in Proc. Int. Conf. Embedded Comput. Syst. Springer, 2019, pp. 65–76. https://link.springer.com/chapter/10.1007/978-3-030-27562-4_5, PDF: https://web.inf.ufpr.br/mazalves/wp-content/uploads/sites/13/2020/03/samos2019.pdf (Caching of dot product multiplications using hashing and proximity-based clustering.)
- Vahid Janfaza, Kevin Weston, Moein Razavi, Shantanu Mandal, Farabi Mahmud, Alex Hilty, Abdullah Muzahid, 2021 (revised Nov 2022), SIMCNN: Exploiting Computational Similarity to Accelerate CNN Training in Hardware, https://www.researchgate.net/profile/Moein-Razavi-Ghods/publication/355730736_SIMCNN_--_Exploiting_Computational_Similarity_to_Accelerate_CNN_Training_in_Hardware/links/61a9605c29948f41dbbe7358/SIMCNN--Exploiting-Computational-Similarity-to-Accelerate-CNN-Training-in-Hardware.pdf, https://arxiv.org/abs/2110.14904 (Uses bit sequences to detect similar vectors.)
- NM Cicek, Feb 2021, General reuse-centric CNN accelerator, Masters Thesis, Graduate School of Engineering and Science, Bilkent university, PDF: http://repository.bilkent.edu.tr/bitstream/handle/11693/55049/Master_Thesis_Bilkent_NihatMertCicek_v3.pdf?sequence=1 (Data reuse algorithms and includes hardware-based LSH detection of vector similarity.)
- NM Cicek, X Shen, O Ozturk, 2022, Energy Efficient Boosting of GEMM Accelerators for DNN via Reuse, ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5, Article No. 43, pp 1–26, https://doi.org/10.1145/3503469, https://dl.acm.org/doi/10.1145/3503469, PDF: https://dl.acm.org/doi/pdf/10.1145/3503469 (Uses computation reuse to speed up matrix multiplications.)
- P Guo, B Hu, R Li, W Hu, 2018, Foggycache: Cross-device approximate computation reuse MobiCom '18: Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, October 2018, Pages 19–34, https://dl.acm.org/doi/abs/10.1145/3241539.3241557, PDF: https://par.nsf.gov/servlets/purl/10122201 (Uses LSH for computation reuse across multiple devices.)
- Azar Rahimi, Luca Benini, and Rajesh K. Gupta. 2013. Spatial memoization: Concurrent instruction reuse to correct timing errors in SIMD architectures. IEEE Transactions on Circuits and Systems II: Express Briefs 60, 12, 847–851. https://ieeexplore.ieee.org/document/6617694, PDF: https://iis-people.ee.ethz.ch/~arahimi/papers/TCAS-II13.pdf (A spatial approximation, but not LSH.)
Input Similarity-Based Caching and Re-Use
Input similarity caching is the reuse of cached data from a previously seen input or image. When an input is similar enough to a prior input, the previous inference results can be cached and re-used. This is applicable to the analysis of continuous feeds, such as audio or video frames, where the incremental differences between frames are relatively small. This is a type of incremental algorithm for neural network inference.
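As a simple illustration, here is a hedged C++ sketch of frame-level reuse: if the new frame differs from the previous one by less than a threshold (a plain L2 distance here, purely as an example), the previous inference result is returned instead of recomputing. The run_inference function is a dummy stand-in for the real model, and the distance metric and threshold are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Stand-in for the real model's full inference on one frame (hypothetical).
std::vector<float> run_inference(const std::vector<float>& frame) {
    return frame;  // dummy: a real model would output detections, labels, etc.
}

struct FrameReuseCache {
    std::vector<float> last_frame;
    std::vector<float> last_result;
    bool valid = false;

    std::vector<float> infer(const std::vector<float>& frame, float threshold) {
        if (valid && frame.size() == last_frame.size()) {
            float dist2 = 0.0f;
            for (size_t i = 0; i < frame.size(); ++i) {
                float d = frame[i] - last_frame[i];
                dist2 += d * d;
            }
            if (std::sqrt(dist2) < threshold)
                return last_result;          // frames similar enough: reuse result
        }
        last_result = run_inference(frame);  // frames differ: recompute
        last_frame = frame;
        valid = true;
        return last_result;
    }
};
```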
Research papers on caching from prior similar inputs include:
- Marc Riera, Jose-Maria Arnau, and Antonio Gonzalez. 2018. Computation Reuse in DNNs by Exploiting Input Similarity. In Annual International Symposium on Computer Architecture (ISCA). 57–68. https://ieeexplore.ieee.org/document/8416818 https://dl.acm.org/doi/10.1109/ISCA.2018.00016 (Caching of similar results from DNN computations in previous audio or video frames.)
- F Alantali, Y Halawani, B Mohammad, et al., 2021, SLID: Exploiting spatial locality in input data as a computational reuse method for efficient CNN, IEEE Access, https://ieeexplore.ieee.org/abstract/document/9395591/, PDF: https://ieeexplore.ieee.org/iel7/6287639/9312710/09395591.pdf (Caches partial MAC calculations and re-uses them, under conditions of input similarity.)
- Fatmah Ali Alantali, 2021, Thesis, Efficient CNN Inference using Spatial Local Input Data Similarity, MSc. Thesis, Electrical and Computer Engineering, Khalifa University, December 2021, PDF: https://khalifauniversity.elsevierpure.com/ws/portalfiles/portal/6825541/file (SLID method for reduced processing of MAC operations in CNN inference using caching, based on input similarity.)
- F. Alantali, Y. Halawani, B. Mohammad and M. Al-Qutayri, 2023, “F-CNN: Faster CNN exploiting SLID with Statistical Analysis,” 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), https://ieeexplore.ieee.org/document/10168606
- J Schmerge, D Mawhirter, C Holmes, 2021, ELIχR: Eliminating Computation Redundancy in CNN-Based Video Processing, 2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA), https://ieeexplore.ieee.org/abstract/document/9651154/, PDF: https://par.nsf.gov/servlets/purl/10323450 (Incremental inference for video frames using caching.)
- H Benmeziane, H Bouzidi, H Ouarnoughi, May 2023, Treasure What You Have: Exploiting Similarity in Deep Neural Networks for Efficient Video Processing, https://arxiv.org/abs/2305.06492
- Y Yang, Y Liu, Z Yuan, W Sun, R Liu, 2021, A 65-nm Energy-Efficient Interframe Data Reuse Neural Network Accelerator for Video Applications, IEEE Journal of Solid-State Circuits, Volume 57, Issue 8, August 2022, https://ieeexplore.ieee.org/abstract/document/9631967/
Inference Cache
Inference cache is an LLM optimization that speeds up inference by reusing data stored in a cache from a previous inference computation. This can include a full inference cache of text-to-text mappings, or a partial inference cache that stores a cache of KV data.
A full inference cache is where the entire results of a model inference are stored, and re-used for a later identical query. For example, such an approach would recognize that 100 users are all submitting "This is a test", whether concurrently or over time, and would do the inference computation only once. There are multiple things that could be stored by the cache:
- The full output text
- Logit probabilities
- Prefill/encoder KV data
Basic inference caching involves storing the actual identical results, in which case all users would get exactly the same response. Alternatively, a more flexible approach that still avoids most computations is storing the near-final results, in some intermediate form with logits (probabilities), and a final brief computation can still emit different results to different users. In this way, most of the computation is avoided, and some variability is added to the final output.
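Here is a minimal C++ sketch of a full text-to-text inference cache, assuming a simple exact-match lookup keyed on the prompt string; a real deployment would add eviction, prompt normalization, and the cache-bypass logic discussed below. The generate function is a dummy stand-in for the actual LLM call.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for a full LLM inference call (hypothetical).
std::string generate(const std::string& prompt) {
    return "model output for: " + prompt;  // dummy output
}

class InferenceCache {
public:
    std::string query(const std::string& prompt) {
        auto it = cache_.find(prompt);
        if (it != cache_.end()) return it->second;  // exact match: no LLM work at all
        std::string answer = generate(prompt);      // miss: run the full inference
        cache_[prompt] = answer;
        return answer;
    }
private:
    std::unordered_map<std::string, std::string> cache_;
};
```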
Limitations of Inference Caching. An inference cache does not always work well. The problems that can arise with an inference cache include:
- Non-variability of output (as mentioned above)
- Time-dependent queries
- Context-dependent queries
Some queries cannot be cached, and an inference cache needs either heuristics or training to know when. It's not that obvious. Consider these time-dependent queries:
- What is the current time?
- What's the score in the World Cup final?
- What properties are currently for sale in Houston?
Time isn't the only problem; other aspects of the user's context should also alter the answers, not just across time, but with different answers for different users at the same time. Here are some examples:
- What's the weather forecast?
- What zip code am I in?
- What's my IP address?
- When is high tide?
If you look closely, you might notice that most or all of these problematic queries have a commonality: they're the same types of queries that require extra tool usage by the LLM (e.g. clocks, data source plug-in integrations, etc.). None of these queries can be answered by the weights that were trained from the input training data set. Hence, it's plausible to turn off caching using the same mechanism whereby tool-requiring queries are identified (e.g. fine-tuned tool "trigger" tokens, multi-step planning, or symbolic execution).
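A hedged sketch of that idea in C++: bypass the cache whenever the same classifier that detects tool-requiring or context-dependent queries fires. The keyword checks below are toy heuristics standing in for a real trigger-token or planning mechanism.

```cpp
#include <string>

// Toy heuristic stand-ins for real classifiers (e.g. fine-tuned trigger tokens).
bool requires_tool(const std::string& prompt) {
    return prompt.find("current") != std::string::npos ||
           prompt.find("today") != std::string::npos;       // time-dependent
}
bool is_context_dependent(const std::string& prompt) {
    return prompt.find(" my ") != std::string::npos ||       // e.g. "my IP address"
           prompt.find("weather") != std::string::npos;      // per-user answers
}

bool is_cacheable(const std::string& prompt) {
    // Skip the inference cache for queries the trained weights alone cannot answer.
    return !requires_tool(prompt) && !is_context_dependent(prompt);
}
```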
Global KV Caching. An example of a partial inference cache that overcomes some of these problems is global KV caching, where the interim results of K and V operations in attention are stored across queries. The KV calculations are the same for identical queries, so they can be stored and re-loaded when a previously seen query text is detected (e.g. by hashing). This is a type of inference cache, but only the KV calculations are stored, with the full decoding sequences still needing execution. The advantages of global KV caching include:
- Normal variation in answers (less "robotic")
- Avoids prefill/encoding phase cost.
- Very fast Time-to-First-Token (TTFT).
- Time-dependent and tool-requiring queries handled.
Global KV caching still avoids significant computation by skipping the entire prefill or encoding phase. But it's not as large a speedup as a full inference cache, since the decoding phase must still be run. On the other hand, avoiding prefill means almost no delay before the first output token, so the user sees a very fast response time. Decoding won't be any more efficient, but decoding in interactive applications like chatbots only needs to be faster than users can read, which is quite slow. Users are much more sensitive to the initial delay from prefill than to the speed of decoding.
Since the decoding phase still occurs, this also means that some randomness arises from the decoding algorithm (e.g. top-k decoding). Hence, the answers are varied, as they would be if the same query were typed twice without any cache.
Global KV caching also avoids many of the problems with time-dependent or other problematic queries. The full decoding phase is still executed, which presumably also generates any tool invocations in the final output. In other words, tool execution would still occur after the KV cache has been loaded, either during or after the decoding phase.
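As an illustration, here is a minimal C++ sketch of a global KV cache keyed on the exact query text: on a hit, the stored KV data is loaded and prefill is skipped entirely; on a miss, prefill runs once and its KV data is stored for later queries. The KVCacheData type and the run_prefill/run_decode functions are hypothetical stand-ins, not any serving engine's actual API.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for the per-layer K and V data produced by prefill.
struct KVCacheData {
    std::vector<std::vector<float>> keys;    // [layer][token * head_dim]
    std::vector<std::vector<float>> values;
};

// Dummy stand-ins for the two inference phases.
KVCacheData run_prefill(const std::string& /*prompt*/) { return KVCacheData{}; }
std::string run_decode(const std::string& prompt, const KVCacheData& /*kv*/) {
    return "decoded answer for: " + prompt;  // dummy output
}

static std::unordered_map<std::string, KVCacheData> g_global_kv_cache;

std::string answer(const std::string& prompt) {
    auto it = g_global_kv_cache.find(prompt);
    if (it == g_global_kv_cache.end()) {
        // Cache miss: pay the prefill cost once, store the KV data for reuse.
        it = g_global_kv_cache.emplace(prompt, run_prefill(prompt)).first;
    }
    // Decoding still runs on every query, so sampling variability is preserved.
    return run_decode(prompt, it->second);
}
```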
Frame skipping in video. Another use case for a full inference cache is where the input is similar or continuous. This is typically the case in image processing for machine vision (e.g. self-driving cars) or video analysis (e.g. security camera monitoring). There are many frames per second and they are often not very different. In such cases, the cached inference results from the previous image or frame can often be re-used with modifications for a faster incremental algorithm (see above section), or alternatively, the entire result from the previous frame can simply replace the current inference computation (i.e. "frame skipping").
Research on Inference Cache
Research papers on inference cache optimizations, where the results of entire inference operations are cached: