Aussie AI
Prefill Optimization
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
What is Prefill?
Prefill is the initial phase of inference in decoder-only models. It is analogous to the encoding phase that occurs in encoder-decoder models (i.e. vanilla Transformers), but decoder-only models (e.g. GPT series) do not have an encoder. The main characteristics of the prefill phase include:
- Parallelizable across multiple tokens (unlike the inherently sequential nature of the autoregressive decoding phase)
- Known length (the token size of the prompt context and query)
- Compute-bound
- Creates the KV cache data
- Outputs at most one token at the end (with subsequent tokens emitted by the decoding phase)
Prefill processes the whole prompt input and is sometimes called the "prompt processing" phase. In some research, it is called "infill" rather than "prefill".
Whatever the name, prefill encodes all of the input tokens at once, and this can be done in parallel. This is unlike the decoding phase, which must be done sequentially, one token at a time (called "autoregression").
The prefill phase also generates the key-value (KV) data that is used by the decoding phase. Hence, prefill is inherently related to the "KV caching" optimization (see caching).
Prefill emits at most one token, at the very end of its computation. The decoding phase then outputs the subsequent tokens, one new token for each cycle of inference.
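To make the two phases concrete, here is a minimal toy sketch in Python: a tiny single-head attention with made-up dimensions and random weights, not a real inference engine, and with causal masking omitted. Prefill processes all prompt tokens in one batch of matrix operations and builds the KV cache, while decoding handles one new token per step, reusing and extending that cache.

import numpy as np

d = 16                                   # toy hidden size
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def prefill(prompt_embeddings):
    # Process the whole prompt in parallel; return the KV cache and last state.
    X = prompt_embeddings                # shape (n_prompt, d): all tokens at once
    K, V = X @ Wk, X @ Wv                # matrix-matrix multiplications (GEMM)
    out = attention(X @ Wq, K, V)
    return (K, V), out[-1]               # KV cache, plus state for the first output token

def decode_step(x_new, kv_cache):
    # Process one new token, appending its K/V entries to the cache.
    K, V = kv_cache
    K = np.vstack([K, x_new @ Wk])       # matrix-vector multiplications (GEMV)
    V = np.vstack([V, x_new @ Wv])
    out = attention((x_new @ Wq)[None, :], K, V)
    return (K, V), out[0]

prompt = np.random.randn(10, d)          # 10 prompt token embeddings
kv, state = prefill(prompt)              # one parallel prefill pass
for _ in range(5):                       # autoregressive decoding, one token per step
    kv, state = decode_step(state, kv)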
Optimizing Prefill
Prefill is often much slower than a single decoding step, so there may be a noticeable delay before the first token is output: this initial delay is the "latency" (the time to first token). The reason prefill is slower than decoding is that prefill analyzes all the tokens in your input prompt (but does so only once, at the start), whereas decoding processes only a single new token each time (but does so repeatedly).
A common performance characteristic of decoder-only models is an initial delay during prefill, and then tokens are output regularly without such a big delay, one at a time. The time between each output token during decoding is typically much less than the initial time cost of prefill.
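As an illustrative sketch of how these two latency metrics are measured, the snippet below assumes a hypothetical model object with prefill() and decode_step() methods (placeholder names, not any particular library's API), and separately reports the time to first token versus the average inter-token time.

import time

def measure_latency(model, prompt_tokens, num_new_tokens=50):
    start = time.perf_counter()
    state = model.prefill(prompt_tokens)          # parallel pass over the whole prompt
    ttft = time.perf_counter() - start            # time to first token (prefill-dominated)
    step_times = []
    for _ in range(num_new_tokens):
        t0 = time.perf_counter()
        state = model.decode_step(state)          # one token per autoregressive step
        step_times.append(time.perf_counter() - t0)
    print(f"Time to first token:      {1000 * ttft:.1f} ms")
    print(f"Mean inter-token latency: {1000 * sum(step_times) / len(step_times):.1f} ms")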
Note that decoding cannot be parallelized, since each decoding step for the next token depends on the output of the prior decoding step (called "autoregressive" decoding).
Prefill computations are also more complex than decoding computations. Because prefill processes n tokens at once, whereas decoding processes only one new token, the tensor operations in prefill are matrix-matrix multiplications (i.e., MatMul/GEMM kernels), whereas decoding mainly needs matrix-vector multiplications (i.e., GEMV/VMM kernels). This difference explains why a prefill pass can be slower than a single decoding step.
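A rough way to see this difference is to compare one batched matrix-matrix multiplication against the equivalent sequence of matrix-vector multiplications. The sketch below uses arbitrary toy dimensions and NumPy; the same effect is far more pronounced on GPUs, where the weight reuse of a GEMM drives much higher arithmetic intensity.

import time
import numpy as np

n, d = 1024, 2048                      # toy prompt length and hidden size
W = np.random.randn(d, d).astype(np.float32)
X = np.random.randn(n, d).astype(np.float32)

t0 = time.perf_counter()
Y_prefill = X @ W                      # one GEMM over all n tokens (prefill-style)
gemm_time = time.perf_counter() - t0

t0 = time.perf_counter()
Y_decode = np.stack([X[i] @ W for i in range(n)])   # n separate GEMVs (decode-style)
gemv_time = time.perf_counter() - t0

assert np.allclose(Y_prefill, Y_decode, rtol=1e-2, atol=1e-2)
print(f"GEMM (prefill-style): {gemm_time:.3f}s, {n} x GEMV (decode-style): {gemv_time:.3f}s")

The main approaches to optimizing the prefill phase include: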
- Avoiding prefill with KV caching. The main output of prefill is the KV cache, but this computation can be skipped entirely if the data is already cached. Various types of KV caching can almost completely avoid the prefill phase: context caching, prefix KV caching, and other types of KV caching.
- Optimizing prefill parallelization and deployment. The prefill phase has different performance characteristics to the decoding algorithm. Hence, the optimization methods may be significantly different, and some research has been done on this. Speeding up prefill is also important to get a good "latency" or "fast response time" from an AI engine, rather than a delay before the first token is output.
- Chunked prefill. One type of improved parallelization of prefill is to break the token sequence into fixed-size "chunks." This idea is similar to "batching" and "continuous batching" as general inference optimizations, but applied specifically to the prefill phase.
- Disaggregating prefill and decoding phases (phase splitting). Various research has examined separating prefill from the decoding phase, so as to allow better parallelization of each. Prefill is a very heavy parallel computation, whereas decoding is an iterative and autoregressive method that often underutilizes the GPU. Hence, there is a benefit in scheduling these two separate phases onto different GPUs with different characteristics.
KV Caching Optimizations
The main purpose of the prefill phase is to generate the KV cache for use in the decoding phase. Hence, if you can pre-compute and cache the KV cache data, then the need for the prefill phase disappears!
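As a minimal sketch of this idea (with placeholder names and a placeholder compute_prefill function, not any specific engine's API), the KV data can be stored under a key derived from the prompt's token prefix, so that a repeated prompt skips prefill entirely:

import hashlib

kv_cache_store = {}                         # maps prompt-prefix key -> precomputed KV data

def prefix_key(token_ids):
    return hashlib.sha256(repr(tuple(token_ids)).encode("utf-8")).hexdigest()

def get_kv_cache(token_ids, compute_prefill):
    key = prefix_key(token_ids)
    if key in kv_cache_store:
        return kv_cache_store[key]          # cache hit: the prefill phase is avoided
    kv = compute_prefill(token_ids)         # cache miss: run the full (expensive) prefill pass
    kv_cache_store[key] = kv
    return kv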
Various types of caching in Transformers include:
- KV caching
- KV caching in early exit
- KV cache compression
- KV cache sparsity
- KV cache token pruning
- KV cache eviction policies
- KV cache quantization
- KV cache layer fusion
- KV cache layer pruning
- KV cache reuse
- KV cache global (multi-query KV caching)
- Prefix KV cache
- Session KV cache (multi-turn KV caching)
- Substring KV cache (lengthwise-fused KV caching)
Chunked Prefill
Chunked prefill is the optimization of breaking the prompt up into fixed-size chunks and then computing prefill on one chunk at a time. This gives a more predictable and easily schedulable unit of computation that does not depend on the unpredictable length of the input prompt.
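The sketch below is a toy illustration of the control flow only (random weights, NumPy, causal masking within a chunk omitted): the prompt is processed in fixed-size chunks, with each chunk attending to the KV entries of all earlier chunks, so the scheduler sees uniform units of work rather than one unpredictable burst.

import numpy as np

CHUNK_SIZE = 4                           # toy chunk length (real systems use e.g. 512 tokens)
d = 16                                   # toy hidden size
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def chunked_prefill(prompt_embeddings):
    K_cache = np.empty((0, d))
    V_cache = np.empty((0, d))
    last = None
    for start in range(0, len(prompt_embeddings), CHUNK_SIZE):
        X = prompt_embeddings[start:start + CHUNK_SIZE]     # one fixed-size chunk of tokens
        K_cache = np.vstack([K_cache, X @ Wk])              # extend the KV cache
        V_cache = np.vstack([V_cache, X @ Wv])
        scores = (X @ Wq) @ K_cache.T / np.sqrt(d)          # chunk attends to all prior KV
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        last = (w @ V_cache)[-1]
    return (K_cache, V_cache), last

kv, state = chunked_prefill(np.random.randn(10, d))         # 10 prompt tokens in 3 chunks

Research papers on chunked prefill include: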
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- vLLM, 2024, Performance and Tuning: Chunked Prefill, https://docs.vllm.ai/en/v0.4.2/models/performance.html
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Jul 2024, Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU, https://arxiv.org/abs/2407.05858
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 31 Aug 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, https://arxiv.org/abs/2308.16369 (Examines the different GPU costs of prefill vs decoding phases, and optimizes decoding by "piggybacking" off the more intense computation during prefill.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- SGLang, July 2024, Chunked prefill #800, https://github.com/sgl-project/sglang/pull/800
- Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- vLLM, 2024, Performance and Tuning, https://docs.vllm.ai/en/latest/models/performance.html
- Melisa Russak, Umar Jamil, Christopher Bryant, Kiran Kamble, Axel Magnuson, Mateusz Russak, Waseem AlShikh, 27 Aug 2024, Writing in the Margins: Better Inference Pattern for Long Context Retrieval, https://arxiv.org/abs/2408.14906 https://github.com/writer/writing-in-the-margins
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Byron (Pin-Lun)Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
- Don Moon, Aug 28, 2024, Chunked Prefill and Decode-Maximal Batching https://medium.com/byte-sized-ai/llm-inference-optimizations-2-chunked-prefill-764407b3a67a
- Z Zeng, Q Guo, X Liu, Z Yin, W Shu, M Huang, B Wang, 2024, Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21021–21034, November 12-16, 2024, https://aclanthology.org/2024.emnlp-main.1169.pdf
- Markus Rabe, Carl Case, November 14, 2024, Rethinking LLM Inference: Why Developer AI Needs a Different Approach, https://www.augmentcode.com/blog/rethinking-llm-inference-why-developer-ai-needs-a-different-approach
Disaggregated Prefill and Decoding Phases (Phase Splitting)
Disaggregation of prefill and decoding computations, also known as "phase splitting," is an optimization method that uses different scheduling and serving for prefill versus decoding computations. Prefill is known to be computation-intensive (i.e., compute-bound), whereas decoding is memory-bound and less computation-intensive. The different characteristics of these two major phases enable optimizations that separate them, so as to better manage resources in the overall scheduling of GPU and memory operations. The different tasks can be split across:
- Multiple GPUs (same server)
- Multiple distributed servers
The general idea is to run the different types of computation on different hardware setups, optimized for either memory-bound or compute-bound algorithms.
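The sketch below is a toy illustration of this split, using placeholder prefill and decode functions, with a thread and an in-process queue standing in for separate GPUs or servers and for the KV cache transfer channel (which in real deployments is typically a high-bandwidth network link such as RDMA):

import queue
import threading

def prefill(prompt_tokens):                  # placeholder for the compute-bound prefill pass
    return {"kv": list(prompt_tokens)}, prompt_tokens[-1]

def decode_step(state, kv_cache):            # placeholder for one memory-bound decoding step
    kv_cache["kv"].append(state)
    return kv_cache, state + 1

kv_transfer = queue.Queue()                  # stands in for the KV cache transfer channel

def prefill_worker(requests):
    for prompt in requests:
        kv, state = prefill(prompt)          # runs on prefill-optimized (compute-heavy) hardware
        kv_transfer.put((kv, state))         # ship the KV cache to the decoding tier

def decode_worker(num_requests, max_new_tokens=5):
    for _ in range(num_requests):
        kv, state = kv_transfer.get()        # wait for a prefilled request and its KV cache
        for _ in range(max_new_tokens):      # autoregressive decoding loop
            kv, state = decode_step(state, kv)

requests = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]                 # token IDs of three queued prompts
worker = threading.Thread(target=prefill_worker, args=(requests,))
worker.start()
decode_worker(len(requests))
worker.join()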
There are several improvements and generalizations of the idea:
- Chunked, disaggregated prefill. The idea of chunked prefill can be used for the prefill computation as part of a phase-splitting setup. The two optimizations are orthogonal and natural to use together.
- Communication of the KV cache data. If prefill and decoding are split across different GPUs or servers, there is a problem: the decoding phase needs the KV cache data that was generated by the prefill phase. Hence, there is a communication cost of transmitting the KV data from wherever it is computed to wherever the decoding phase will run.
- Scheduling optimizations. The distinct properties of memory-bound versus compute-bound allow more precise use of scheduling algorithms, such as chunks of prefill or the sub-computations to emit a single token in decoding.
- Overlapping optimizations. The computation of the KV cache data, its transmission over communication protocols, and its use in the decoding phase are sequential computations that can be partially pipelined via "overlapping" methods. For example, chunks of prefill computation can create the KV cache for that chunk, which can then be sent to the decoding phase, while other prefill computations are occurring. The decoding phase can do less overlapping, as it must wait for the full KV cache data in order to use it in the attention phase.
- Granular sublayer disaggregation. Even deeper disaggregation of computation is possible within the subcomponents of each layer. Some research splits the computation further, based on the observation that the FFN matrix computations in the decoding phase are usually also compute-bound (i.e., like prefill), whereas the memory-bound nature of the decoding phase arises from the attention kernel's matrix computations and the processing of the KV cache.
Research on Phase Splitting
Research papers on disaggregating prefill and decoding phases:
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao, 23 Jun 2024 (v2), Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Disaggregated the KV cache between prefill and decoding tokens, since the KV cache size is known for prefill, thereby reducing memory fragmentation, and also applying kernel fusion to several modules including the scaled dot product attention.)
- Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini, 20 May 2024 (v2), Splitwise: Efficient generative LLM inference using phase splitting, https://arxiv.org/abs/2311.18677 (Disaggregating the prefill prompt processing phase from the decoding phase, with network transfer of the KV cache generated by prefill.)
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- A Aali, A Cardoza, M Capo, 2024, Splitwiser: Efficient LLM Inference with Constrained Resources, The University of Texas at Austin, https://asadaali.com/assets/pdf/paper_splitwiser.pdf (Splits prefill and decoding phases of inference using CUDA MPS APIs.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- PS Aishwarya, PA Nair, Y Samaga, T Boyd, S Kumar, 2024, Tandem Transformers for Inference Efficient LLMs, https://www.prateekjain.org/publications/all_papers/NairSBKJN24.pdf (Separates prefill from decoding phase into a "tandem transformer" in combination with speculative decoding.)
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations, with separate prefill/encoding and decoding phase.)
- Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
- Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 19 Mar 2024 (v2), DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670 (Optimizing LLMs differently in the prefill and decoding phases.)
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar, 23 Oct 2024, POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference, https://arxiv.org/abs/2410.18038
- Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang, 24 Oct 2024, BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching, https://arxiv.org/abs/2410.18701
- Yinmin Zhong, Junda Chen, Shengyu Liu, Yibo Zhu, Xin Jin, Hao Zhang, March 17, 2024, Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation, https://hao-ai-lab.github.io/blogs/distserve/
- Z Zeng, Q Guo, X Liu, Z Yin, W Shu, M Huang, B Wang, 2024, Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21021–21034, November 12-16, 2024, https://aclanthology.org/2024.emnlp-main.1169.pdf
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
Mini-Prefill Computation
The main purpose of the prefill computation is to calculate the KV cache for all the prompt tokens. This is a full prefill computation.
There are several types of inference optimizations whereby a KV cache computation is skipped, but its results are still needed for decoding the next token. Hence, the idea is to do a "mini-prefill" computation of the KV cache for the prior token, or for a prior sequence of multiple skipped tokens. This need arises in optimizations such as early exit and other layer-skipping methods.
Note that there are various ways to avoid a full re-calculation of the skipped KV cache, such as propagation of the prior values or KV cache fusion (see early exit KV cache research for details of these sub-optimizations).
An important point about these optimizations is that the speedup can be obtained without accuracy loss by computing the missing KV cache in parallel with the inference for the next token. Early exit has traditionally skipped the KV computation of the exited layers. However, there is an overlapping or pipelining idea whereby early exit triggers a token to be emitted, allowing the autoregressive decoding of the next token to start, while the skipped layers are still executed to create the KV cache in parallel with the next token's generation. The KV cache data is thus available before it is needed by the layers of the next token's generation.
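A minimal sketch of this overlapping idea, using placeholder functions and a background thread pool (not any specific framework's API), might look like the following: the KV entries of the skipped layers are computed asynchronously, and the next token's decoding only waits on them if it reaches a layer whose entries are not yet ready.

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)     # background "mini-prefill" worker

def compute_layer_kv(layer, token_state):        # placeholder for one layer's K/V projections
    return ("K", layer, token_state), ("V", layer, token_state)

def emit_token_with_early_exit(token_state, exit_layer, num_layers):
    # Layers above the exit point were skipped: schedule their KV computation
    # asynchronously instead of blocking the start of the next decoding step.
    return {
        layer: executor.submit(compute_layer_kv, layer, token_state)
        for layer in range(exit_layer, num_layers)
    }

def next_token_layer(layer, pending_kv, kv_cache):
    if layer in pending_kv and layer not in kv_cache:
        kv_cache[layer] = pending_kv[layer].result()     # waits only if not yet finished
    # ... normal attention for this layer would read kv_cache[layer] here ...

kv_cache = {}
pending = emit_token_with_early_exit(token_state=42, exit_layer=2, num_layers=4)
for layer in range(4):                           # decoding of the next token proceeds meanwhile
    next_token_layer(layer, pending, kv_cache)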
Research on Prefill Optimizations
Various papers have examined the prefill phase in more detail:
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, 9 Nov 2022, Efficiently Scaling Transformer Inference, https://arxiv.org/abs/2211.05102 (The paper that seems to have coined the term "prefill" and examines some aspects of prefill vs decoding phases in optimization.)
- Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 19 Mar 2024 (v2), DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670 (Optimizing LLMs differently in the prefill and decoding phases.)
- Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136 (Deployment of LLMs on heterogenous GPUs and also differences between the two phases of decoder-only Transformers: prefill and decoding computations.)
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (Analysis of prefill vs decoding phases, where prefill phase does matrix-matrix multiplications but the decoding phase does matrix-vector multiplications.)
- Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman, 2 Feb 2024, Decoding Speculative Decoding, https://arxiv.org/abs/2402.01528 (Analysis of throughput versus acceptance rates, with a draft model for Llama-65B, including some coverage of prefill issues in speculative decoding.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations that mentions prefill.)
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
- Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache. Does different optimizations for prefill and decoding phases in a decoder-only architecture.)
- Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, Ricardo Bianchini, 30 Nov 2023, Splitwise: Efficient generative LLM inference using phase splitting, https://arxiv.org/abs/2311.18677 (Separates the two Transformer phases of initial prompt computation or prefill to generate the KV cache, and the token generation phase or decoding algorithm onto two machines.)
- Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 31 Aug 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, https://arxiv.org/abs/2308.16369 (Examines the different GPU costs of prefill vs decoding phases, and optimizes decoding by "piggybacking" off the more intense computation during prefill.)
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations, with separate prefill/encoding and decoding phase.)
- Yushuo Chen, Tianyi Tang, Erge Xiang, Linjiang Li, Wayne Xin Zhao, Jing Wang, Yunpeng Chai, Ji-Rong Wen, 17 Apr 2024, Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models, https://arxiv.org/abs/2404.11502 (Detailed inference profiling of prefill and decoding phases. Benchmarks the performance of various Transformer inference frameworks: Transformers, vLLM, DeepSpeed-MII, TGI, TensorRT-LLM, llama.cpp, LightLLM, LMDeploy, StreamingLLM.)
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- PS Aishwarya, PA Nair, Y Samaga, T Boyd, S Kumar, 2024, Tandem Transformers for Inference Efficient LLMs, https://www.prateekjain.org/publications/all_papers/NairSBKJN24.pdf (Separates prefill from decoding phase into a "tandem transformer" in combination with speculative decoding.)
- Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen, 21 May 2024, Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression, https://arxiv.org/abs/2405.12591
- Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan, 17 May 2024, Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, https://arxiv.org/abs/2405.10480
- Vikranth Srivatsa∗, Zijian He∗, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evalulates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- A Aali, A Cardoza, M Capo, 2024, Splitwiser: Efficient LLM Inference with Constrained Resources, The University of Texas at Austin, https://asadaali.com/assets/pdf/paper_splitwiser.pdf (Splits prefill and decoding phases of inference using CUDA MPS APIs.)
- Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
- Minsik Cho, Mohammad Rastegari, Devang Naik, 8 May 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://arxiv.org/abs/2405.05329 (Parallelization of KV cache generation in prefill phase.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Joao Gante, 2023, Assisted generation: a new direction toward low-latency text generation, Hugging Face, DOI: 10.57967/hf/0638, https://huggingface.co/datasets/joaogante/assisted_generation (Using a model's forward pass to validate a sequence of multiple tokens, analogous to verification in speculative decoding.)
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
- vLLM, 2024, Performance and Tuning: Chunked Prefill, https://docs.vllm.ai/en/v0.4.2/models/performance.html
- Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controling transfer of KV caches from prefill to decoders.)
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
- Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Jul 2024, Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU, https://arxiv.org/abs/2407.05858
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras, 24 Apr 2024, BASS: Batched Attention-optimized Speculative Sampling, https://arxiv.org/abs/2404.15778 (Optimizes batched multi-query use of speculative decoding with consideration of GPU utilization in prefill and decoding phases.)
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Seungrok Jung. 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Lilly Kumari, Anthony Rowe, Shengjie Wang, Jeff Bilmes, 2024, BumbleBee: Dynamic KV-Cache Streaming Submodular Summarization for Infinite-Context Transformers, COLM 2024, https://openreview.net/pdf?id=8w0RApM5yG (KV cache compression via "summaries" of the KV cache data.)
- Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
- Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li, 23 Sep 2024, A-VL: Adaptive Attention for Large Vision-Language Models, https://arxiv.org/abs/2409.14846 (Separate handling of text and image attention modules.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao, 4 Oct 2024, Compute Or Load KV Cache? Why Not Both? https://arxiv.org/abs/2410.03065
- Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He, 4 Oct 2024, SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation, https://arxiv.org/abs/2410.03960
- Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- Yvan Fafchamps, Apr 5, 2024, How to benchmark and optimize LLM inference performance (for data scientists), https://medium.com/@yvan.fafchamps/how-to-benchmark-and-optimize-llm-inference-performance-for-data-scientists-1dbacdc7412a
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Z Zeng, Q Guo, X Liu, Z Yin, W Shu, M Huang, B Wang, 2024, Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21021–21034, November 12-16, 2024, https://aclanthology.org/2024.emnlp-main.1169.pdf
- R. Li, D. Fu, C. Shi, Z. Huang and G. Lu, "Efficient LLMs Training and Inference: An Introduction," in IEEE Access, doi: 10.1109/ACCESS.2024.3501358. https://ieeexplore.ieee.org/abstract/document/10756602 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10756602
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the context in vision model.)
- Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh, 7 Dec 2024, Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression, https://arxiv.org/abs/2412.05693 (KV cache compression in prefill or prompt processing phase.)