Aussie AI
Speculative Decoding: Types and Optimizations
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
What is Speculative Decoding?
Speculative decoding is a multi-model ensemble architecture where a small model generates some possible output tokens ("speculating" on the correct answer via its decoder outputs), and a larger model verifies whether the output of the smaller model is correct. The two models are:
- Draft model — the smaller "speculating" model.
- Verifier model — a much larger, smarter model.
The idea is to use a much smaller model, because it runs much faster. Some implementations use a draft model with 100 to 1,000 times fewer parameters than the larger verifier model. Inference cost scales roughly with the number of parameters, since all weights are typically used in each inference phase, so the smaller draft model runs proportionally faster.
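As a concrete starting point, here is a minimal sketch of pairing a draft model with a verifier model using the Hugging Face Transformers "assisted generation" feature (the library's built-in form of speculative decoding; see the Hugging Face reference later in this article). The model names are only illustrative, and the exact parameters may differ between library versions.

```python
# Minimal sketch: draft/verifier pairing via Hugging Face "assisted generation".
# Model names are illustrative; any small/large pair sharing a tokenizer will do.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")
verifier = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")    # large verifier model
drafter = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")    # small draft model

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
outputs = verifier.generate(
    **inputs,
    assistant_model=drafter,   # enables draft-then-verify decoding
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```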
Types of Speculative Decoding
There are now various subtypes and extensions of the basic speculative decoding algorithm.
- Generalized speculative decoding (the drafter can be anything!)
- Heuristic speculative decoding (non-LLM drafters)
- Self-speculative decoding (using the same model as drafter and verifier)
- Tree speculative decoding (akin to combining beam search and speculative decoding)
- Retrieval lookup decoding (using RAG chunk text as the draft)
- Prompt lookup decoding (using prompt text as the draft)
- Aggressive decoding (for grammatical error correction)
- Lookahead decoding
- Multi-token speculative decoding (parallel drafting and n-gram decoding)
- Superposed decoding
- Hierarchical speculative decoding
- Sequential speculative decoding (faster but not parallel!)
Also closely related are these research areas:
- Parallel decoding
- Multi-token models
There's probably some more that I've missed! There is an endless stream of speculative decoding research papers.
Draft But Verify?
Draft and verify. Speculative decoding uses a "draft-then-verify" strategy. This optimizes inference speed because it is faster for a large model to verify the correctness of suggested output tokens than to fully generate its own new tokens. If the small model predicts poorly, the bigger model vetoes the suggested token and has to "backtrack," which makes the whole process slower. However, the smaller model should be correct most of the time, giving an overall speedup across all of the tokens while staying very close to the accuracy of the bigger model.
Note that the draft has to be two or more tokens. Having a small model simply suggest a single next token is not a speedup, because we're simply then running the big model autoregressively on each suggestion, which is effectively the same as just running the big model in its normal mode. The method is only faster if two or more speculated tokens can be verified in parallel.
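To make the draft-then-verify loop concrete, here is a toy sketch of the outer algorithm using greedy decoding. The helpers draft_model.next_token() and verify_parallel() are hypothetical stand-ins for a single cheap draft-model step and one batched forward pass of the large model (a sketch of the verification step itself appears further below).

```python
def speculative_decode(prompt_tokens, draft_model, large_model, k=4, max_new=128):
    """Toy draft-then-verify loop (greedy decoding, hypothetical helper functions)."""
    tokens = list(prompt_tokens)
    while len(tokens) < len(prompt_tokens) + max_new:
        # 1. Draft: the small model autoregressively speculates k candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model.next_token(ctx)   # cheap sequential step
            draft.append(t)
            ctx.append(t)
        # 2. Verify: one parallel forward pass of the large model checks all k drafts,
        #    returning the accepted prefix plus one correction/bonus token.
        accepted, correction = verify_parallel(large_model, tokens, draft)
        tokens.extend(accepted)
        tokens.append(correction)   # at least one verifier-approved token is always emitted
    return tokens
```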
Theoretical Basis of Speculative Decoding
Speculative Decoding vs Big-Little Architectures. Speculative decoding is technically a subtype of the "big-little architecture," but a very specialized one compared to standard two-model big-little designs. Another type of big-little architecture uses a heuristic to detect "easy" queries, which are routed to a small model, versus "hard" queries, which are routed to the big model. Speculative decoding differs from those big-little methods because all queries go first to the small model and are then checked by the larger model, which sometimes overrides the small model's suggestions and re-generates its own tokens.
Easy vs Hard Queries. Speculative decoding is based on the "easy vs hard" query architectural issue. The idea is that the small draft model can generate accurate predicted tokens for "easy" cases, but only the large model will correctly handle "hard" tokens. Note that standard inference by a large model is not faster on easy cases by default, but simply implements the same computations across all the layers, regardless of whether an analysis is easy or difficult. Hence, efficiency would be improved if the easy cases can be computed by a smaller model instead.
Generalized speculative decoding. The idea of "generalized speculative decoding" is to generalize the "speculator" model to (a) any other type of cut-down model (e.g. early-exit, quantization, pruning, etc.), and (b) any non-LLM heuristic that can make a prediction of the next token. The idea is generalized to any possible method of "speculating" about the next token. See generalized speculative decoding research.
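As an illustration of how far "generalized" can go, the sketch below uses a drafter that is not a model at all: it speculates the next few tokens by copying whatever followed the most recent matching n-gram earlier in the context, which is the core idea behind prompt lookup decoding. This is a toy example, not any particular paper's implementation.

```python
def ngram_lookup_draft(tokens, k=4, ngram=3):
    """Speculate up to k tokens by copying what followed the last earlier
    occurrence of the trailing n-gram in the token sequence."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Search backwards for an earlier occurrence of the same n-gram.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            follow = tokens[start + ngram : start + ngram + k]
            if follow:
                return follow
    return []   # no match: fall back to ordinary decoding for this step

print(ngram_lookup_draft([5, 9, 2, 7, 1, 5, 9, 2], k=3))  # -> [7, 1, 5]
```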
Speedup from Speculative Decoding
Acceptance rates of draft tokens. The overall efficiency of speculative decoding depends on the "acceptance rate" of the speculated tokens by the verifier. This depends on how "smart" the smaller draft model is at predicting the same tokens as the large model would choose. Acceptance in verification directly impacts performance, because accepted tokens avoid extra sequential large-model decoding steps, whereas rejected tokens cause a "rollback" to a full large-model inference phase. Some papers report low acceptance rates of 20%, whereas others report 80-90% for verification by 13B large models and around 50% for 70B verifier models.
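The usual way to quantify this (following the analysis in the Leviathan et al. paper cited below) is to assume each draft token is accepted independently with probability alpha. With gamma draft tokens per cycle, the expected number of output tokens per large-model verification pass is (1 - alpha^(gamma+1)) / (1 - alpha). A quick calculation under that simplifying assumption:

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass, assuming i.i.d. acceptance
    with probability alpha and gamma draft tokens per cycle."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

for alpha in (0.2, 0.5, 0.8, 0.9):
    print(f"acceptance {alpha:.0%}: ~{expected_tokens_per_cycle(alpha, 4):.2f} tokens per verify pass")
```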
Parallel verification methods. The larger model has to "verify" the multiple draft tokens from the smaller model in parallel. How does it do so? We are used to a large model emitting a single validated token at a time, but it can't use this approach for the "verify" phase, because that would be just as inefficient as simply running the large model on every token step (in fact, worse, due to the overhead of speculation). Instead, the large model verifies all of the draft tokens at once, in a single parallel execution of full inference.
Why is this faster? The speculator has produced more than one token candidate. For a single candidate, we'd just run the big model on the sequence and see whether the probability of the suggested candidate is high enough. That wouldn't be a speedup.
But if there are two or more candidate tokens, we can test each one in parallel, rather than doing one-at-a-time as we'd do in autoregressive decoding.
As an example, consider the case where 2 candidates are generated. The 1st and 2nd candidates are tested in parallel. From the prior tokens alone, the big model produces a next-token probability distribution, which we can check against the 1st suggestion. In parallel, we can also run the model on the prior tokens plus the 1st suggestion, and check whether the 2nd suggestion has a high enough probability. Hence, we are testing each one individually, much like autoregressive mode, but it's not really autoregressive, because we can check the 2nd suggestion without having to await the results of the 1st computation.
After we've run these in parallel, we check the probabilities for the 1st, 2nd, and any further tokens. This final check is done sequentially, but it's a tiny computation compared to the model inference evaluations done in parallel. In the best case, the final check approves all of the tokens, and we output them. In the worst case, it rejects all of the suggestions, in which case we can still output one token, which isn't one of the suggestions, but is the prediction given by the big model on the original sequence. The intermediate case is where the final check approves some but not all of the speculative suggestions, in which case we output the approved tokens plus one more: the correction token given by the big model at the rejected position. Hence, we always output at least one token, and often many more.
When the large model rejects one (or more) of the suggestions, we have to stop there. Even though we have verified predictions for tokens later in the sequence, they were computed using the now-rejected earlier token as part of the tested sequence. So we have to throw away the later tokens, and output only the single correction token at that position.
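Here is a sketch of that verification step under the simplest "greedy agreement" rule, assuming the large model is a decoder-only Transformer callable that returns logits of shape (1, sequence length, vocabulary size) in the style of a Hugging Face causal LM. Real lossless speculative sampling uses a probabilistic accept/reject test against the two models' distributions rather than strict argmax agreement; this sketch only shows the parallel-scoring structure.

```python
import torch

def verify_parallel(large_model, tokens, draft):
    """Greedy verification sketch: one forward pass scores all draft tokens at once."""
    input_ids = torch.tensor([tokens + draft])
    with torch.no_grad():
        logits = large_model(input_ids).logits[0]   # shape (seq_len, vocab)
    n_ctx = len(tokens)
    accepted = []
    for i, speculated in enumerate(draft):
        # Position n_ctx - 1 + i predicts the token that should follow the
        # context plus the first i draft tokens.
        predicted = int(torch.argmax(logits[n_ctx - 1 + i]))
        if predicted == speculated:
            accepted.append(speculated)      # big model agrees: accept the draft token
        else:
            return accepted, predicted       # reject: emit the big model's correction and stop
    # All drafts accepted: the final position yields one extra "bonus" token for free.
    bonus = int(torch.argmax(logits[n_ctx - 1 + len(draft)]))
    return accepted, bonus
```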
Accept or reject criteria. Note that when we run the big model's verification of speculated tokens in parallel, we get a set of logits back for each step. Hence, we get a predicted probability for each of the speculated tokens, and many others.
We have to decide: do we accept or reject the speculated token? The criteria can vary. For example, we could reject the token if it is not the single suggestion with the highest probability. This is a "greedy" or "top-1" criterion. Alternatively, we could accept the token if it's within the top-50 probabilities. Since a high acceptance rate makes the algorithm very fast, a more relaxed criterion like this is preferable for speed.
Note that we can't try to choose a different higher-probability token in the middle and then still accept other tokens in the sequence. Well, actually, we probably could in some cases, but it's hard to know which ones, so the algorithm has to discard tokens later in the sequence. This is basically the rejection of a token.
One nuance here is that, for the very last token in the speculated suggestions, we could simply choose the highest-probability token. There's no cost to rejecting the very last suggested token, as there aren't any later tokens in the sequence to discard.
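The two criteria mentioned above can be written as small predicates over the verifier's logits at a given position. These are illustrative only: the strict top-1 rule preserves the large model's greedy output exactly, while a relaxed top-k rule raises the acceptance rate at some cost to fidelity (and lossless speculative sampling replaces both with a probabilistic test).

```python
import torch

def accept_top1(verifier_logits_at_pos, speculated_token):
    """Greedy/top-1 criterion: accept only the verifier's single best token."""
    return int(torch.argmax(verifier_logits_at_pos)) == speculated_token

def accept_topk(verifier_logits_at_pos, speculated_token, k=50):
    """Relaxed criterion: accept if the draft token is within the verifier's top-k."""
    topk = torch.topk(verifier_logits_at_pos, k).indices
    return speculated_token in topk.tolist()
```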
More Computations, Not Less
Speculative decoding actually does more total model computation (i.e., more FLOPs), not less. But it gets a speedup by doing more of that work in parallel. The wall-clock speed of inference, which is basically user latency, is improved, but the overall GPU processing cost is greater.
Computation resources. Note that the verifying large model doesn't do less computation overall compared to a basic autoregressive run of the large model. It still runs a full large-model inference for the 1st and 2nd tokens (and any further speculated tokens). But it can do those computations in parallel, because it no longer needs to wait for each next token to be generated (as it would in autoregressive mode). In fact, the method does more computation overall, because it not only runs the big model (in parallel) but also runs the smaller model (sequentially) beforehand. And it can do much more computation again if many tokens are rejected.
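A back-of-the-envelope calculation makes the trade-off explicit. The numbers below are illustrative relative costs only, and assume the best case where every draft token is accepted and the parallel verification pass costs roughly one sequential step of wall-clock time on unsaturated hardware.

```python
d, B, k = 1.0, 50.0, 4   # illustrative per-token costs: draft model, big model, draft length

# Plain autoregressive decoding of k+1 tokens: k+1 sequential big-model steps.
baseline_flops = (k + 1) * B
baseline_latency = (k + 1) * B

# Speculative decoding: k sequential cheap draft steps, then one big-model pass that
# scores k+1 positions in parallel (similar total FLOPs, far less sequential waiting).
spec_flops = k * d + (k + 1) * B
spec_latency = k * d + B

print(f"FLOPs:   baseline {baseline_flops:.0f} vs. speculative {spec_flops:.0f}")
print(f"Latency: baseline {baseline_latency:.0f} vs. speculative {spec_latency:.0f}")
```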
On-device execution. Speculative decoding may not be a good candidate for low-resource platforms such as phones or edge devices, because it can't speed up sequential execution. Speculative decoding on a sequential platform actually increases computations overall, which affects battery depletion and heat issues. Also, these platforms may not have a lot of spare computation resources to run the large model in parallel. This method works better on larger platforms where the parallel computations of the large model can be farmed out to multiple GPUs.
Related Research Areas
Similar research areas: Another area of optimization of Transformers that is similar to speculative decoding is the non-autoregressive Transformer architecture. The general class of "multi-AI models" where two or more neural networks are used is called "ensemble methods". Some of the other research areas related to speculative decoding include:
- Collaborative decoding (collaborative inference)
- Big-little architecture
- Easy vs hard queries
- Mutually-guided decoding
- Aggressive decoding
- Parallel decoding
- Shallow fusion
- Consensus-based decoding
- Bidirectional decoding
- Shallow decoder architectures
- Non-autoregressive algorithms
Speculative execution is the overarching general area of Computer Science theory from which speculative decoding is derived. Various algorithms benefit from speculatively executing one pathway in parallel with another. A particular example is "branch prediction" in hardware execution of low-level machine code.
Survey Papers on Speculative Decoding
Survey papers. Broad review or survey papers that cover speculative decoding:
- Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (A survey that's just on speculative decoding!)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
General Research Papers on Speculative Decoding
Papers that specifically focus on the speculative decoding technique:
- Leviathan, Y., Kalman, M., and Matias, Y., Fast inference from transformers via speculative decoding, May 2023, https://arxiv.org/abs/2211.17192
- D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
- Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, Sep 2023 (original Feb 2023), Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 (Separates a "fallback policy" when the smaller model detects it needs the bigger model, and a "rollback policy" when the bigger model vetos output and intervenes, both for deciding when the bigger model controls.)
- Yaniv Leviathan, Matan Kalman, and Yossi Matias. May 2023, Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023. https://arxiv.org/abs/2211.17192
- Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. DeepMind, arXiv preprint arXiv:2302.01318, 2023. https://arxiv.org/abs/2302.01318
- Heming Xia, Tao Ge, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Lossless speedup of autoregressive translation. Openreview, 2022. https://openreview.net/forum?id=H-VlwsYvVi
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487 (Not pure speculative decoding, but an analogous method.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia, Aug 2023, Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023. https://arxiv.org/abs/2305.09781, Code: https://github.com/flexflow/FlexFlow/tree/inference
- Burton, F. W., 1985, Speculative computation, parallelism, and functional programming. IEEE Transactions on Computers, C-34(12):1190–1193, 1985. doi: 10.1109/TC.1985.6312218. https://ieeexplore.ieee.org/document/6312218 (Algorithmic theory of "speculative computation" from 1985.)
- Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Amsterdam, 5th edition, 2012. ISBN 978-0-12-383872-8. https://dl.acm.org/doi/book/10.5555/1999263 (Includes coverage of speculative algorithms.)
- T. Ge, H. Xia, X. Sun, S. Chen, and F. Wei. Lossless acceleration for seq2seq generation with aggressive decoding. ArXiv, abs/2205.10350, 2022. https://arxiv.org/abs/2205.10350, Code: https://github.com/microsoft/unilm/tree/master/decoding (The generalized aggressive decoding method has a "draft-and-verify" algorithm that is similar to speculative decoding.)
- M. Stern, N. Shazeer, and J. Uszkoreit. Blockwise parallel decoding for deep autoregressive models. CoRR, abs/1811.03115, 2018. https://arxiv.org/abs/1811.03115 (Generates various output in parallel and using a scoring method to confirm them.)
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
- Kaya Y., Hong S., Dumitras T., Shallow-deep networks: Understanding and mitigating network overthinking Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model is analogous to speculative decoding.)
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
- Sen Yang, Shujian Huang, Xinyu Dai, Jiajun Chen, 12 Jan 2024, Multi-Candidate Speculative Decoding, https://arxiv.org/abs/2401.06706 (Draft model generates multiple candidates, which can each have their acceptance or rejection.)
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
- S Kim, K Mangalam, S Moon, J Malik, MW Mahoney, 2023, Speculative Decoding with Big Little Decoder, https://openreview.net/pdf?id=EfMyf9MC3t Code: https://github.com/kssteven418/BigLittleDecoder
- Qidong Su, Christina Giannoula, Gennady Pekhimenko, Oct 2023, The Synergy of Speculative Decoding and Batching in Serving Large Language Models, https://arxiv.org/abs/2310.18813 (Optimizing by adapting dynamically the length of the speculated sequence in batches.)
- Haim Barad, Ekaterina Aidova, Yury Gorbachev, Nov 2023, Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO, https://arxiv.org/abs/2311.04951 Code: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/266-speculative-sampling
- Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, Hao Zhang, Oct 2023, Online Speculative Decoding, https://arxiv.org/abs/2310.07177
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, Nov 2023 (accessed), Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, https://sites.google.com/view/medusa-llm
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 4 Feb 2024 (v2), EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, https://arxiv.org/abs/2401.15077 (Speculative decoding at the "feature" level rather than the token level.)
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131 (Speculative decoding with a draft model that creates n-gram multiple tokens.)
- Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 26 Mar 2024 (v3), Tandem Transformers for Inference Efficient LLMs, https://arxiv.org/abs/2402.08644 (A two-model architecture with a small autoregressive model and a larger model with non-autoregressive block decoding, which is similar to big-little inference and speculative decoding methods.)
- Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman, 2 Feb 2024, Decoding Speculative Decoding, https://arxiv.org/abs/2402.01528 (Analysis of throughput versus acceptance rates, with a draft model for Llama-65B.)
- Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, 21 Feb 2024, Ouroboros: Speculative Decoding with Large Model Enhanced Drafting, https://arxiv.org/abs/2402.13720 Code: https://github.com/thunlp/Ouroboros
- Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, Yi Liu, Zhongzhi Luan, Depei Qian, 24 Feb 2024, Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding, https://arxiv.org/abs/2402.15678 (Speculative decoding with a focus on improving poor acceptance rates using majority-voting consensus decoding with multiple small draft models, achieving 80-90% acceptance with OPT-13B, and around 50% acceptance for Llama-70B, and also pipelining of inference computations for speedup.)
- Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Cen Chen, 24 Feb 2024, Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens, https://arxiv.org/abs/2402.15758 (Speculative decoding speedups by splitting inputs into small sequences of a few tokens for a faster draft model computation.)
- Daniel Warfield, Dec 16, 2023, Towards Data Science, Speculative Sampling — Intuitively and Exhaustively Explained, https://towardsdatascience.com/speculative-sampling-intuitively-and-exhaustively-explained-2daca347dbb9
- Dakota Goldberg, Nov. 16, 2023, Accelerating large model inference with speculative decoding, 6.S898 Deep Learning Blogs 2023, MIT, https://deep-learning-mit.github.io/staging/blog/2023/speculative-decoding/ (Conducts an analysis of speculative decoding and an advanced method of layering called "hierarchical speculative decoding.")
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang, 3 Jun 2024, Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching, https://arxiv.org/abs/2406.01733 Code: https://github.com/horseee/learning-to-cache (Layer skipping in diffusion transformers via layer caching.)
- Kaixuan Huang, Xudong Guo, Mengdi Wang, 30 May 2024, SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths, https://arxiv.org/abs/2405.19715 (Training the draft model in speculative decoding to decide how far ahead to predict draft tokens.)
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin, 29 May 2024, Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, https://arxiv.org/abs/2405.19325 (Merging of RALM and speculative decoding.)
- Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar, 29 May 2024, Faster Cascades via Speculative Decoding, https://arxiv.org/abs/2405.19261 (A combination of cascades with speculative decoding.)
- Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 28 May 2024, Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 Code: https://github.com/hmarkc/parallel-prompt-decoding (Similar to speculative decoding with extra trained prompt tokens and a tree-structured verification of multiple optional draft sequences.)
- Stern, Mitchell Thomas, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf
- Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You, 3 Feb 2024, GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding, https://arxiv.org/abs/2402.02082 (Allow the draft model to use the KV cache of the large model in choosing its predictions, and extends the drafts to use a top-k set of tokens at each position.)
- Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He, 4 Apr 2024 (v2), REST: Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2311.08252
- Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel, 23 May 2024, Distributed Speculative Inference of Large Language Models, https://arxiv.org/abs/2405.14105 (Speculative decoding with multiple drafter models and distributed processing across multiple threads and GPUs.)
- PS Aishwarya, PA Nair, Y Samaga, T Boyd, S Kumar, 2024, Tandem Transformers for Inference Efficient LLMs, https://www.prateekjain.org/publications/all_papers/NairSBKJN24.pdf (Separates prefill from decoding phase into a "tandem transformer" in combination with speculative decoding.)
- Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 Code: https://github.com/niyunsheng/EMS-SD (Speculative decoding across multiple queries by avoiding padding tokens and optimizing the KV cache.)
- Aman, May 14, 2024, Near-Instant Full-File Edits, Cursor, https://cursor.sh/blog/instant-apply (A type of speculative decoding for code editing called "speculative edits.")
- Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui, 1 May 2024, Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge, https://arxiv.org/abs/2405.00263 (Speculative decoding improvement by extending Medusa to tree attention with a cross attention block.)
- Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa, 29 Apr 2024, Accelerating Production LLMs with Combined Token/Embedding Speculators, IBM Research, https://arxiv.org/abs/2404.19124 Code: https://github.com/foundation-model-stack/fms-fsdp Code: https://huggingface.co/ibm-fms (Extending Medusa architecture with a single multi-headed architecture so the draft model predicts an n-gram with multiple tokens more accurately.)
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras, 24 Apr 2024, BASS: Batched Attention-optimized Speculative Sampling, https://arxiv.org/abs/2404.15778 (Optimizes batched multi-query use of speculative decoding with consideration of GPU utilization in prefill and decoding phases.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 932–949, https://doi.org/10.1145/3620666.3651335 https://dl.acm.org/doi/abs/10.1145/3620666.3651335 Code: https://github.com/flexflow/FlexFlow/
- Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- B. Spector and C. Re, Aug 2023, “Accelerating llm inference with staged speculative decoding,” arXiv preprint arXiv:2308.04623, 2023. https://arxiv.org/abs/2308.04623
- Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang, 27 Feb 2024 (v4), “Cascade speculative drafting for even faster llm inference,” arXiv preprint arXiv:2312.11462, 2023. https://arxiv.org/abs/2312.11462 Code: https://github.com/lfsszd/CS-Drafting (Uses non-LLM draft models based on statistical language models and also uses multiple draft models in a hierarchy.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Joao Gante, 2023, Assisted generation: a new direction toward low-latency text generation, Hugging Face, DOI: 10.57967/hf/0638, https://huggingface.co/datasets/joaogante/assisted_generation (Using a model's forward pass to valid a sequence of multiple tokens, analogous to verification in speculative decoding.)
- Zhuorui Liu, Chen Zhang, Dawei Song, 2024, How Speculative Can Speculative Decoding Be? Beijing Institute of Technology, https://oro.open.ac.uk/97102/1/COLING_2024__How_Speculative_Can_Speculative_Decoding_Be_.pdf Code: https://github.com/ZhuoruiLiu12/SpecGame (Analysis of the accuracy and speed requirements of the drafting model for effective speculative decoding, including finding that draft models can be 60 times smaller than the large model.)
- H Xia, T Ge, P Wang, SQ Chen, F Wei, 2023, Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation, https://arxiv.org/abs/2203.16487 https://aclanthology.org/2023.findings-emnlp.257.pdf Code: https://github.com/hemingkx/SpecDec (Uses a specially optimized deep-encoder shallow-decoder architecture as the drafting model.)
- 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration (Generates multiple tokens on multiple branches for verification, giving a tree-structured approach.)
- Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J. Yadwadkar, Aditya Akella, 27 Oct 2023, MOSEL: Inference Serving Using Dynamic Modality Selection, https://arxiv.org/abs/2310.18481 (Multi-modal model with dynamic selection of modality.)
- Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao, 25 Jan 2024 (v2), BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, https://arxiv.org/abs/2401.12522 Code: https://github.com/linfeng93/BiTA
- 15 Mar 2024, Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh, https://arxiv.org/abs/2403.10444 (Draft a block of tokens for verification.)
- Giovanni Monea, Armand Joulin, Edouard Grave, 22 Nov 2023, PaSS: Parallel Speculative Sampling, https://arxiv.org/abs/2311.13581 (Generates multiple draft tokens using a parallel extension via "look ahead embeddings".)
- Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang, 15 Feb 2024 (v4), PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition, https://arxiv.org/abs/2402.04838 (Use of parallel decoding in Named Entity Recognition use case.)
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 30 Mar 2024, DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference, https://arxiv.org/abs/2404.00242
- Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott, 5 Mar 2024 (v2), Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement, Qualcomm AI Research, https://arxiv.org/abs/2402.14160 (Improvements of an adaptive inference version of a draft token-tree with multiple n-gram paths for speculative decoding.)
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Jie Ou, Yueming Chen, Wenhong Tian, 10 Apr 2024, Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding, https://arxiv.org/abs/2404.08698 (Use an n-gram model as the drafter to create a version of parallel decoding or generalized speculative decoding.)
- Intel, 2024, https://github.com/intel/intel-extension-for-transformers
- Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin, 11 Jun 2024, When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models, https://arxiv.org/abs/2406.07368 Code: https://github.com/GATECH-EIC/Linearized-LLM
- Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia, 2024, Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation, Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024, https://openreview.net/pdf?id=CDnv4vg02f
- Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao, 16 Apr 2024 (v2), Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding, https://arxiv.org/abs/2402.11809 (Semi-autoregressive draft model with parallel verification.)
- Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break ing the sequential dependency of LLM inference using lookahead decoding, November 2023. https://lmsys.org/blog/2023-11-21-lookahead-decoding/
- Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, Hao Zhang, 17 Oct 2023 (v2), Online Speculative Decoding, https://arxiv.org/abs/2310.07177
- Luv Bansal, Apr 8, 2024, Speculative Decoding — Make LLM Inference Faster, https://medium.com/ai-science/speculative-decoding-make-llm-inference-faster-c004501af120
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Natarajan Vaidhyanathan Mar 7, 2024, How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100, https://www.qualcomm.com/developer/blog/2024/03/how-quadruple-llm-decoding-performance-speculative-decoding-spd-and-microscaling-mx-formats
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374, 2024. https://arxiv.org/abs/2402.12374v1
- Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024. https://arxiv.org/abs/2310.15141
- Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Big little transformer decoder. arXiv preprint arXiv:2302.07863, 2023. https://arxiv.org/abs/2302.07863v1
- Kool, W., Van Hoof, H., and Welling, M. Stochastic beams and where to find them: The Gumbel-Top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3499–3508. PMLR, 2019 https://arxiv.org/abs/1903.06059
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding, 2024. https://arxiv.org/abs/2402.05109
- Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
- Chengbo Liu, Yong Zhu, 1 Apr 2024 (v2), SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens, https://arxiv.org/abs/2403.18647 Code: https://github.com/hasuoshenyun/SDSAT
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 10 Jun 2024 (v4), Online Speculative Decoding, https://github.com/LiuXiaoxuanPKU/OSD
- Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet, 16 Jun 2024, Optimized Speculative Sampling for GPU Hardware Accelerators, https://arxiv.org/abs/2406.11016 (Speculative decoding accelerated with multiple GPUs using approaches such as tiling, and uses a fused sigmoid replacing Softmax.)
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-demos.6/ Code: https://github.com/huggingface/transformers
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
- Cuda Mode, 2024, Lecture 22: Hacker's Guide to Speculative Decoding in VLLM, https://www.youtube.com/watch?v=9wNAgpX6z_4
- vLLM, 2024, [RFC]: Automate Speculative Decoding #4565, https://github.com/vllm-project/vllm/issues/4565
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum, 19 Jun 2024, Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive Style, https://arxiv.org/abs/2406.13170 (Applying bi-directional decoding to speculative decoding.)
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- SpecInfer: Accelerating Generative Large Language Model Serving With Speculative Inference and Token Tree Verification, Cheng, Xinhao, Masters Thesis, Carnegie Mellon University, ProQuest Dissertations & Theses, 2023. 30817164. https://www.proquest.com/openview/54f75ff88889b27845efd4a56f3a167a/1?pq-origsite=gscholar&cbl=18750&diss=y
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 24 Jun 2024, EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, https://arxiv.org/abs/2406.16858 (Extends a tree-structured draft to use the logit probabilities of the draft model as an estimate of the likely acceptance rates by the larger model.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun, 5 Jun 2024 (v2), JumpCoder: Go Beyond Autoregressive Coder via Online Modification, https://arxiv.org/abs/2401.07870 Code: https://github.com/Keytoyze/JumpCoder
- Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou, 26 Jun 2024, SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding, https://arxiv.org/abs/2406.18200 (Scheduling and parallelization of multiple draft models across one or multiple queries in tree-based speculative decoding.)
- Jikai Wang, Yi Su, Juntao Li, Qinrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang, 25 Jun 2024, OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure, https://arxiv.org/abs/2406.17276 Code: https://github.com/Jikai0Wang/OPT-Tree
- Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines the idea of not using the larger model to always verify, and when to trust either the smaller or larger models, which is an idea that generalized beyond speculative decoding.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Anonymous, 2024, Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters, https://openreview.net/pdf?id=rNWRnVGfBC
- Ziqin Luo, Haixia Han, Haokun Zhao, Guochao Jiang, Chengyu Du, Tingyun Li, Jiaqing Liang, Deqing Yang, Yanghua Xiao, 26 May 2024, SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation, https://arxiv.org/abs/2405.16552
- Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister, 11 Jul 2024, Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting, https://arxiv.org/abs/2407.08223
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Mikhail Andronov, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, Djork-Arné Clevert, 17 Jul 2024 (v2), Accelerating the inference of string generation-based chemical reaction models for industrial applications, https://arxiv.org/abs/2407.09685
- Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 12 Jul 2024, Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference, https://arxiv.org/abs/2407.09722
- Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu, 27 Jun 2024, Adaptive Draft-Verification for Efficient Large Language Model Decoding, https://arxiv.org/abs/2407.12021 Project: https://anonymous.4open.science/r/ADED-C7D5 (A draft-and-verification method that is similar to speculative decoding, but differs.)
- Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Zhengmian Hu, Heng Huang, 2024, Accelerated Speculative Sampling Based on Tree Monte Carlo, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19216-19251, 2024. https://proceedings.mlr.press/v235/hu24f.html https://openreview.net/forum?id=stMhi1Sn2G PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24f/hu24f.pdf
- Domas Grigaliūnas, Mantas Lukoševičius, 29 Jul 2024, Inference acceleration for large language models using "stairs" assisted greedy generation, https://arxiv.org/abs/2407.19947 (A draft-and-verify method that is similar to speculative decoding.)
- Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
- Amazon, 2024, Optimize model inference with Amazon SageMaker, https://docs.aws.amazon.com/sagemaker/latest/dg/model-optimize.html
- Hongyi Yuan, Keming Lu, Fei Huang, Zheng Yuan, Chang Zhou, 13 Mar 2024 (v2), Speculative Contrastive Decoding, https://arxiv.org/abs/2311.08981
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Weinan Zhang, Yong Yu, 11 Aug 2024, A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems, https://arxiv.org/abs/2408.05676 (Determining when speculative decoding is most beneficial.)
- Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto, 10 Aug 2024, Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion, https://arxiv.org/abs/2408.05636 (Parallel drafting and verification in diffusion models.)
- Kaiqi Zhang, Jing Zhao, Rui Chen, 15 Aug 2024, KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning, https://arxiv.org/abs/2408.08146
- Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che, 16 Aug 2024, Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling, https://arxiv.org/abs/2408.08696
- Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar, 16 Aug 2024, Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models, https://arxiv.org/abs/2408.08470
- Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
- Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari, 16 Jul 2024, PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation, https://arxiv.org/abs/2407.11798 (Optimizes speculative decoding further using pipelining.)
- Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, 13 Aug 2024, Parallel Speculative Decoding with Adaptive Draft Length, https://arxiv.org/abs/2408.11850
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan, 23 Jul 2024, Graph-Structured Speculative Decoding, https://arxiv.org/abs/2407.16207
- Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen, 28 Aug 2024, Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation, https://arxiv.org/abs/2408.15562
- Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu, 28 Aug 2024, Harmonized Speculative Sampling, https://arxiv.org/abs/2408.15766
- David Spuler, June 2024, Aussie AI, Speculative Decoding With Early Exit for Optimized Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901656
- Karl Stratos, 2024, Speculative Decoding https://karlstratos.com/notes/speculative.pdf
- Du Cunxiao, 2024, Towards Faster Inference of Transformers: Strategies for Accelerating Decoding Processes, Ph.D. thesis, Computer Science, School of Computing and Information Systems, Singapore Management University, https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1611&context=etd_coll (Examines non-autoregressive decoding, speculative decoding and attention optimizations.)
- Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet, 24 Sep 2024, Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR, https://arxiv.org/abs/2409.15869
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu, 2 Oct 2024, Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding, https://arxiv.org/abs/2410.01699
- Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua, 8 Oct 2024 (v2), Efficient Inference for Large Language Model-based Generative Recommendation, https://arxiv.org/abs/2410.05165
- Yijie Ding, Yupeng Hou, Jiacheng Li, Julian McAuley, 3 Oct 2024, Inductive Generative Recommendation via Retrieval-based Speculation, https://arxiv.org/abs/2410.02939 https://github.com/Jamesding000/SpecGR
- Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang, 4 Oct 2024, LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding, https://arxiv.org/abs/2410.03355
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
- Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu, 8 Oct 2024, ParallelSpec: Parallel Drafter for Efficient Speculative Decoding, https://arxiv.org/abs/2410.05589 (Multi-token prediction in draft models for speculative decoding.)
- Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen, 14 Oct 2024, Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation, https://arxiv.org/abs/2410.10141 https://github.com/ozyyshr/TempSpec
- Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou, 15 Oct 2024, DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure, https://arxiv.org/abs/2410.11744
- Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu, 15 Oct 2024, QSpec: Speculative Decoding with Complementary Quantization Schemes, https://arxiv.org/abs/2410.11305 (Enhance speculative decoding using quantization to reuse KV cache values and weights.)
- Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang, 17 Oct 2024, Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement, https://arxiv.org/abs/2410.13344
- Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung, 17 Oct 2024, Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding, https://arxiv.org/abs/2410.13839
- Bradley McDanel, 22 Oct 2024, AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration, https://arxiv.org/abs/2410.17375 https://github.com/BradMcDanel/AMUSD/
- Ashish Khisti, M.Reza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos, 23 Oct 2024, Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits, https://arxiv.org/abs/2410.18234
- Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette, 26 Oct 2024, Fast Best-of-N Decoding via Speculative Rejection, https://arxiv.org/abs/2410.20290
- Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao, 27 Oct 2024, FIRP: Faster LLM inference via future intermediate representation prediction, https://arxiv.org/abs/2410.20488
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- Ming Yin, Minshuo Chen, Kaixuan Huang, Mengdi Wang, 30 Oct 2024, A Theoretical Perspective for Speculative Decoding Algorithm, https://arxiv.org/abs/2411.00841
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
- Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
- Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Lorenz K. Müller, Lukas Cavigelli, 8 Nov 2024, SSSD: Simply-Scalable Speculative Decoding, https://arxiv.org/abs/2411.05894
- Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh, 17 Nov 2024, FastDraft: How to Train Your Draft, https://arxiv.org/abs/2411.11055
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
Tree-Based Speculative Decoding
Speculative decoding can track multiple possible draft sequences, and these often share a common prefix. Hence, the drafts form a tree of candidate token sequences, which can be processed using specialized handling of tree structures. (Note that this idea is closely related to "beam search" in regular non-speculative inference.)
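To make the tree idea concrete, below is a minimal Python sketch of greedy verification over several candidate draft paths that share a common prefix, using a single batched call to a hypothetical verifier_greedy_next function (an assumed interface, not any particular library's API). Real tree-based implementations avoid recomputing the shared prefix by using tree-structured attention masks, which this simplified sketch omits.

```python
# Sketch of token-tree verification (greedy acceptance), assuming a hypothetical
# verifier_greedy_next(batch) that returns, for each sequence in the batch, the
# verifier's greedy next-token prediction at every position (one batched pass).

def verify_draft_tree(prefix, draft_paths, verifier_greedy_next):
    """prefix: list of already-accepted token ids.
    draft_paths: list of candidate continuations (lists of token ids) from the
                 draft model, e.g., the branches of a draft tree or beam."""
    batch = [prefix + path for path in draft_paths]
    predictions = verifier_greedy_next(batch)          # same nesting as batch

    best = []
    for path, preds in zip(draft_paths, predictions):
        accepted = []
        for i, tok in enumerate(path):
            verifier_tok = preds[len(prefix) - 1 + i]  # prediction after prefix + path[:i]
            if verifier_tok == tok:
                accepted.append(tok)                   # draft token accepted
            else:
                accepted.append(verifier_tok)          # first mismatch: take verifier's token
                break
        if len(accepted) > len(best):                  # keep the longest accepted branch
            best = accepted
    return prefix + best
```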
- Vieira, T., 2014, Gumbel-max trick and weighted reservoir sampling, https://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/
- Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023. https://lmsys.org/blog/2023-11-21-lookahead-decoding/
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023. https://arxiv.org/abs/2305.09781 https://www.researchgate.net/publication/370841788_SpecInfer_Accelerating_Generative_LLM_Serving_with_Speculative_Inference_and_Token_Tree_Verification
- Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374, 2024. https://arxiv.org/abs/2402.12374v1
- Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024. https://arxiv.org/abs/2310.15141
- Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Big little transformer decoder. arXiv preprint arXiv:2302.07863, 2023. https://arxiv.org/abs/2302.07863v1
- Kool, W., Van Hoof, H., and Welling, M. Stochastic beams and where to find them: The Gumbel-Top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3499–3508. PMLR, 2019 https://arxiv.org/abs/1903.06059
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, and Mahyar Najibi. Speculative streaming: Fast llm inference without auxiliary models. arXiv preprint arXiv:2402.11131, 2024. https://arxiv.org/abs/2402.11131
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023. https://arxiv.org/abs/2308.04623
- Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024. https://arxiv.org/abs/2401.15077
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding, 2024. https://arxiv.org/abs/2402.05109
- Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- SpecInfer: Accelerating Generative Large Language Model Serving With Speculative Inference and Token Tree Verification, Cheng, Xinhao, Masters Thesis, Carnegie Mellon University, ProQuest Dissertations & Theses, 2023. 30817164. https://www.proquest.com/openview/54f75ff88889b27845efd4a56f3a167a/1?pq-origsite=gscholar&cbl=18750&diss=y
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 24 Jun 2024, EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, https://arxiv.org/abs/2406.16858 (Extends a tree-structured draft to use the logit probabilities of the draft model as an estimate of the likely acceptance rates by the larger model.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou, 26 Jun 2024, SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding, https://arxiv.org/abs/2406.18200 (Scheduling and parallelization of multiple draft models across one or multiple queries in tree-based speculative decoding.)
- Jikai Wang, Yi Su, Juntao Li, Qinrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang, 25 Jun 2024, OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure, https://arxiv.org/abs/2406.17276 Code: https://github.com/Jikai0Wang/OPT-Tree
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Zhengmian Hu, Heng Huang, 2024, Accelerated Speculative Sampling Based on Tree Monte Carlo, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19216-19251, 2024. https://proceedings.mlr.press/v235/hu24f.html https://openreview.net/forum?id=stMhi1Sn2G PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24f/hu24f.pdf
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255
- Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che, 16 Aug 2024, Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling, https://arxiv.org/abs/2408.08696
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua, 8 Oct 2024 (v2), Efficient Inference for Large Language Model-based Generative Recommendation, https://arxiv.org/abs/2410.05165
- Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou, 15 Oct 2024, DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure, https://arxiv.org/abs/2410.11744
- Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang, 18 Oct 2024, TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, https://arxiv.org/abs/2410.16033
- Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
Self-Speculative Decoding
Self-speculative decoding is the idea that the smaller draft model, instead of being trained separately from the larger verifier model, can simply be a cut-down version of the bigger model. The advantages include:
- Only one model to train! (No separate draft model)
- Partial computations in the smaller (self) model can be re-used by the verifier model.
- Verifier model effectively "overlaps" its computations (reducing latency).
- Both models automatically have the same token set and vocabulary.
Self-speculative decoding is typically based on an early-exited version of the larger model, which serves as the smaller draft model (some variants use layer skipping instead). The larger model then only needs to compute the remaining layers that the draft pass skipped, and it can do so in parallel across all of the draft tokens.
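As a rough illustration, here is a minimal Python sketch of the early-exit variant, assuming a hypothetical model object with embed, layers, exit_head, and lm_head members (these names are illustrative, not a real API). A production implementation would reuse the KV cache from the early layers computed during drafting, rather than recomputing them during verification as this sketch does.

```python
# Sketch of self-speculative decoding via early exit (greedy verification).
import numpy as np

def draft_with_early_exit(model, tokens, num_draft, exit_layer):
    """Draft tokens autoregressively using only the first exit_layer layers."""
    draft, seq = [], list(tokens)
    for _ in range(num_draft):
        h = model.embed(seq)                                 # (seq_len, hidden_dim)
        for layer in model.layers[:exit_layer]:
            h = layer(h)
        next_tok = int(np.argmax(model.exit_head(h[-1])))    # early-exit prediction
        draft.append(next_tok)
        seq.append(next_tok)
    return draft

def verify_with_full_model(model, tokens, draft):
    """Run the full model once over prefix + draft and keep the accepted prefix."""
    seq = list(tokens) + list(draft)
    h = model.embed(seq)
    for layer in model.layers:                               # all layers, one parallel pass
        h = layer(h)
    preds = np.argmax(model.lm_head(h), axis=-1)             # greedy next token at each position
    accepted = []
    for i, tok in enumerate(draft):
        verifier_tok = int(preds[len(tokens) - 1 + i])
        accepted.append(tok if verifier_tok == tok else verifier_tok)
        if verifier_tok != tok:
            break
    return accepted
```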
Research on self-speculative decoding:
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131
- Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (This paper mentions early exit in relation to generalized speculative decoding.)
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)
- Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip
Lookup Decoding (Retrieval Speculative Decoding)
Lookup decoding is similar to speculative decoding in that it is a parallelization strategy, but instead of a draft model generating the draft tokens, it searches for matching token patterns in a retrieval datastore.
Retrieval speculative decoding involves searching for n-grams of tokens in a separate datastore. If there is a match, the subsequent tokens in that stored sequence become the draft tokens. The datastore is often related to a RAG datastore, but it operates at a lower level than a RAG database of text chunks, because the n-grams can be (and should be) sub-sequences within a RAG chunk. In fact, that's the whole point: the speedup comes from finding a few tokens, already output by the LLM, that match part of a RAG chunk, and then assuming that the model is about to output many more tokens from that chunk.
One of the interesting features of retrieval-based lookup decoding is that there isn't always a draft token sequence. If there's no match in the database, there's no speculative decoding step, and it defaults to autoregressive sequential decoding. Hence, it is only a speedup on average, and relies on a high hit rate for n-grams in the datastore.
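Here is a minimal Python sketch of the datastore idea, with illustrative function names (build_ngram_datastore, propose_draft) rather than any particular library's API: index the n-grams of the retrieved, tokenized chunks, then look up the last few generated tokens to propose a draft continuation, falling back to normal decoding on a miss.

```python
# Sketch of a retrieval datastore for lookup decoding (names are illustrative).
from collections import defaultdict

def build_ngram_datastore(token_chunks, n=3, continuation_len=8):
    """Index every n-gram in the retrieved chunks, mapping it to the tokens
    that follow it within the chunk."""
    store = defaultdict(list)
    for chunk in token_chunks:
        for i in range(len(chunk) - n):
            key = tuple(chunk[i:i + n])
            store[key].append(chunk[i + n:i + n + continuation_len])
    return store

def propose_draft(store, generated_tokens, n=3):
    """Return draft tokens if the last n generated tokens match the datastore;
    otherwise return None and fall back to ordinary autoregressive decoding."""
    if len(generated_tokens) < n:
        return None
    continuations = store.get(tuple(generated_tokens[-n:]))
    return continuations[0] if continuations else None
```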
- Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He, 4 Apr 2024 (v2), REST: Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2311.08252
- Apoorv Saxena, November 2023, Prompt Lookup Decoding, https://github.com/apoorvumang/prompt-lookup-decoding
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
- Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
Prompt Lookup Decoding
Prompt lookup decoding is similar to lookup decoding, but it searches for sequences of tokens in the input prompt itself, rather than in an external datastore. The speedup relies on the assumption that the model will often emit verbatim sub-sequences of the input prompt in its output, as commonly happens in RAG architectures where the model quotes or summarizes the input text.
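A minimal sketch of the core lookup step is below; the function name and parameters are illustrative (vLLM's n-gram speculative mode works along these lines, but this is not its code). The last few generated tokens are matched against the prompt, and the tokens that follow the match are copied out as the draft.

```python
# Sketch of prompt lookup decoding: copy the draft from the prompt itself.
def prompt_lookup_draft(prompt_tokens, generated_tokens, ngram_size=3, max_draft=10):
    if len(generated_tokens) < ngram_size:
        return []
    pattern = generated_tokens[-ngram_size:]
    # Scan the prompt right-to-left for the most recent occurrence of the n-gram.
    for start in range(len(prompt_tokens) - ngram_size, -1, -1):
        if prompt_tokens[start:start + ngram_size] == pattern:
            follow = start + ngram_size
            return prompt_tokens[follow:follow + max_draft]
    return []   # no match: fall back to normal autoregressive decoding
```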
Research on prompt lookup decoding:
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
- vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, 2 Dec 2024, PLD+: Accelerating LLM inference by leveraging Language Model Artifacts, https://arxiv.org/abs/2412.01447
Superposed Decoding for Multi-Draft Generation
The idea of "superposed decoding" is an interesting one, that is not a type of speculative decoding. Rather than using multiple drafts to speed up the computation of one final resulting sequence (verified by a large model), superposed decoding is a fast way to generate multiple reasonable drafts of a multi-token sequence. The key point is to combine the embeddings of the top-k tokens into a single embedding vector using a weighted average, and then validate this with an extra n-gram lookup method (i.e., a heuristic based on n-gram frequency in a text data set). Overall, the method allows the computation of multiple reasonable n-gram token sequences at almost the same cost as a single one.
- Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 Code: https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
Sequential Speculative Decoding
There is relatively little research on speculative decoding for speeding up sequential (non-parallel) computation, such as on-device inference on CPU or NPU hardware.
Research papers include:
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255
- David Spuler, Aug 2024, Sequential Speculative Decoding, Aussie AI Blog, https://www.aussieai.com/blog/sequential-speculative-decoding
- David Spuler, June 2024, Aussie AI, Speculative Decoding With Early Exit for Optimized Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901656
Heuristic Speculative Decoding
This is a relatively new and under-explored area of research. The "drafter" in speculative decoding need not be an LLM; it can be any heuristic that cheaply proposes candidate draft tokens, such as n-gram statistics over the context or precalculated token likelihoods.
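As a toy example of a non-LLM drafter, the sketch below uses a bigram frequency table built from a token corpus to propose draft tokens. This is an illustrative "learning-free" drafter in the spirit of the N-Grammys paper cited below, not that paper's actual algorithm; the verifier-side acceptance logic is unchanged.

```python
# Sketch of a heuristic (non-LLM) drafter based on bigram frequencies.
from collections import Counter, defaultdict

def build_bigram_table(corpus_tokens):
    """Count which tokens most often follow each token in a corpus."""
    table = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        table[prev][nxt] += 1
    return table

def heuristic_draft(table, generated_tokens, num_draft=4):
    """Greedily chain the most frequent bigram continuations as draft tokens."""
    if not generated_tokens:
        return []
    draft, cur = [], generated_tokens[-1]
    for _ in range(num_draft):
        if not table[cur]:
            break
        cur = table[cur].most_common(1)[0][0]
        draft.append(cur)
    return draft
```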
Research papers on heuristic speculative decoding or "heuristic drafters":
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Hugging Face, October 8, 2024, Faster Assisted Generation with Dynamic Speculation, https://huggingface.co/blog/dynamic_speculation_lookahead
- Ryan Sun, Tianyi Zhou, Xun Chen, Lichao Sun, 2024, SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2024.emnlp-main.1148/ https://aclanthology.org/2024.emnlp-main.1148.pdf
More AI Research
Read more about:
- Generalized Speculative Decoding
- Ensemble (Multi-AI)
- Inference Optimizations
- Loop Optimizations
- Code Optimizations