Aussie AI

Generalized Speculative Decoding

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

What is Basic Speculative Decoding?

The original method of "speculative decoding" uses two models: small and large. The smaller model predicts the next token or a short run of tokens (i.e., it "speculates"), and the larger model then confirms their correctness. If the larger model agrees, the token is output, and any further processing of that token by the large model is skipped. If the larger model disagrees, this is a "rollback," and the larger model's own full prediction is used to generate the next output token instead.

Why is this faster? At first glance it seems slower, since not only is the bigger model running, but also an additional smaller model. However, in speculative decoding, the larger model is not doing full sequential predictions, only confirmations (in parallel).

The smaller model produces its output tokens faster, but less accurately. The speedup arises because the large model can confirm a run of speculated tokens in parallel faster than it can generate them one at a time with its own full predictions. Hence, if the small model is mostly correct, and rollbacks are few, this method is faster than having the bigger model predict every token itself.
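
As a minimal sketch of the loop (toy stand-in "models" as plain Python callables over token IDs, greedy exact-match acceptance rather than the probabilistic rejection-sampling rule of the original papers, and illustrative names like draft_model and verify_model), the control flow looks roughly like this:

    # Toy sketch of draft-then-verify speculative decoding (greedy acceptance).
    # A "model" here is just a callable mapping a token sequence to the next token ID.
    from typing import Callable, List

    Model = Callable[[List[int]], int]

    def speculative_decode(draft_model: Model, verify_model: Model,
                           prompt: List[int], num_tokens: int, k: int = 4) -> List[int]:
        tokens = list(prompt)
        while len(tokens) < len(prompt) + num_tokens:
            # 1. Draft: the small model speculates k tokens autoregressively (cheap).
            draft: List[int] = []
            for _ in range(k):
                draft.append(draft_model(tokens + draft))
            # 2. Verify: the big model checks every drafted position; on a GPU this is
            #    a single batched forward pass, simulated here with a Python loop.
            accepted: List[int] = []
            for i in range(k):
                target = verify_model(tokens + draft[:i])
                if target == draft[i]:
                    accepted.append(draft[i])   # speculation confirmed, this token is "free"
                else:
                    accepted.append(target)     # rollback: keep the big model's token, drop the rest
                    break
            tokens.extend(accepted)
        return tokens[:len(prompt) + num_tokens]

    if __name__ == "__main__":
        # Toy models: the big model counts up by one; the draft model gets multiples of 5 wrong.
        big = lambda seq: seq[-1] + 1
        small = lambda seq: 0 if (seq[-1] + 1) % 5 == 0 else seq[-1] + 1
        print(speculative_decode(small, big, prompt=[0], num_tokens=12))  # [0, 1, 2, ..., 12]

Real implementations also typically take one extra "bonus" token from the verifier when all k drafts are accepted, and use sampling-aware acceptance rules so that the output distribution matches the big model exactly.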

What is Generalized Speculative Decoding?

Some bright spark realized that the "smaller model" does not need to be a full model. It can be a cut-down model. It need not even be a model at all.

The speculating component only has to be some method that can "speculate" as to what the next token probably will be. Any method that is correct more often than not can yield a speedup. Hence, generalized speculative decoding uses various other methods as the "speculator":

  • Smaller model — the basic type of speculative decoding
  • Early exit of the large model — effectively a smaller model running inside the full large model.
  • Non-autoregressive decoding algorithms.
  • Aggressive decoding (for editing)
  • Blockwise parallel decoding (generate multiple predicted tokens in parallel with a big model)
  • Multi-token decoding algorithms (e.g. efficient drafting of multiple token candidates).
  • Tree-structured drafting methods (akin to beam search followed by verification).
  • Non-Transformer small models (e.g. RNNs as draft model).
  • Non-LLM heuristics — any coding method that predicts the next token in a sequence (generally difficult to achieve without a model!).

Hence, the method is generalized to any cut-down model and any non-model heuristic. In general, any LLM optimization that creates a model that is faster but less accurate can be considered (e.g., pruning, quantization, early exit, and many more). Faster decoding algorithms, such as non-autoregressive models or blockwise parallel decoding, also meet this criterion by generating multiple suggested tokens quickly, which can then be verified in parallel. In the special case of editing and grammatical error correction, the input prompt itself can be considered a form of draft text (because the edited output is usually similar to the input), so "aggressive decoding" is effectively a type of generalized speculative decoding. Furthermore, any non-LLM coding heuristic that can guess the next tokens of a sequence can also act as the speculator.
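
One way to picture the generalization is as an interface: the verification loop only needs a "drafter" callable that proposes a short run of candidate tokens, so a pruned or quantized model, an early-exit path, or a plain heuristic can all be dropped into the same slot. The sketch below uses assumed names (Drafter, Verifier, verify_step) and greedy acceptance, not any particular library's API:

    # Sketch: the verify loop is agnostic about where the drafted tokens come from.
    from typing import Callable, List

    Drafter = Callable[[List[int], int], List[int]]   # (context, k) -> up to k proposed tokens
    Verifier = Callable[[List[int]], int]             # (context) -> next token (the big model)

    def verify_step(drafter: Drafter, verifier: Verifier,
                    tokens: List[int], k: int = 4) -> List[int]:
        """One iteration: draft up to k tokens, keep the longest verified prefix."""
        draft = drafter(tokens, k)
        accepted: List[int] = []
        for i, tok in enumerate(draft):
            target = verifier(tokens + draft[:i])   # in practice, one parallel forward pass
            if target == tok:
                accepted.append(tok)
            else:
                accepted.append(target)             # keep the verifier's correction, drop the rest
                break
        return accepted

Any drafter that proposes several tokens at a time and is right more often than not speeds up the loop; a drafter that is usually wrong just degenerates into ordinary big-model decoding plus wasted draft work.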

Note that it's not enough for the speculator to suggest just one token. The method is only faster if the verifier model can check two or more speculated tokens in parallel, so there's little value in a simple heuristic that predicts only the single next token. For example, we could try a heuristic that detects the start of a sentence (e.g., the prior token is ".") and then predicts a comma after words like "However" and "Hence". But this isn't a speedup, because it only ever predicts a single token (the comma). It's much more difficult to come up with heuristics that predict multiple tokens in a row. Might as well use a model!
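
As a toy illustration of a non-LLM heuristic that does propose multiple tokens (in the spirit of the editing case, where the output largely copies the input), here is a sketch of a prompt-lookup drafter that matches the trailing n-gram of the context against the prompt and, on a hit, copies the tokens that followed it. The function name and parameters are illustrative, not from any particular paper:

    from typing import Callable, List

    def prompt_lookup_drafter(prompt: List[int], ngram: int = 3) -> Callable[[List[int], int], List[int]]:
        """Drafter that copies continuations from the prompt whenever the trailing
        n-gram of the context also appears in the prompt (useful mainly for
        editing-style tasks where the output mostly repeats the input)."""
        def drafter(context: List[int], k: int) -> List[int]:
            key = context[-ngram:]
            if len(key) < ngram:
                return []
            for i in range(len(prompt) - ngram):
                if prompt[i:i + ngram] == key:
                    return prompt[i + ngram : i + ngram + k]   # copy up to k following tokens
            return []                                          # no match: nothing to speculate
        return drafter

A drafter like this plugs straight into a verification loop such as the one sketched earlier; when it finds no match, a real implementation simply falls back to one ordinary big-model decoding step.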

On-device execution is problematic. Like basic speculative decoding, generalized speculative decoding doesn't work well on platforms without spare computation capacity. All types of speculative decoding gain speed by parallelizing computation, farming the extra work out to spare GPU capacity (or extra GPUs), rather than by reducing the overall amount of computation, so they don't help much on phones and AI PC platforms.

In fact, speculative decoding increases the total amount of computation, because it runs not only the big model on every token but also the smaller draft model, and both run wastefully whenever the big model rejects drafted tokens. Hence, it is a tradeoff: more total processing, spread across parallel GPU capacity, in exchange for a wall-clock speedup.
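
As a rough, illustrative calculation (assumed numbers, not benchmarks): suppose one draft-model step costs a tenth of a big-model step, drafts come in runs of four, and on average three of each four drafted tokens are accepted. Each iteration then yields about three tokens for the latency of four cheap draft steps plus one parallel verification pass, instead of three sequential big-model steps, so wall-clock latency drops substantially. But the total work goes up: four draft steps plus a verification pass that computes all four positions, some of which are then discarded on rejection.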

Survey Papers on Generalized Speculative Decoding

Survey research papers include:

  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (A survey paper on speculative decoding, which has a section on generalized speculative decoding.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234 (General survey on inference speedups, with a section on generalized speculative decoding.)

Generalized Speculative Decoding Research

Research papers on the various forms of generalized speculative decoding are grouped into the subsections below.

Early Exit in Generalized Speculative Decoding

One of the ways to do generalized speculative decoding is to use the large model as the smaller draft model, simply by doing early exiting of layers dynamically within the large model.

The advantage of using early exit of the large model as the drafter is that no computation is wasted if a draft token is rejected, because the full-size model's inference can continue where the shallow draft pass left off inside the big model. The downside of this approach is that each layer of a large model contains many weights, so even a few layers of a big model may be larger and more computationally intensive than a truly small, compact model (even when all the layers of the small model are run).
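
A minimal sketch of the idea, with toy "layers" as plain functions and assumed names like exit_layer (real systems such as LayerSkip-style self-speculative decoding also share the KV cache between the draft and verify passes):

    # Sketch: the drafter is just a shallow prefix of the verifier's own layer stack.
    from typing import Callable, List

    Layer = Callable[[float], float]           # toy "layer": refines a hidden state
    Head = Callable[[float], int]              # toy output head: hidden state -> token ID
    Embed = Callable[[List[int]], float]       # toy embedding: tokens -> hidden state

    def early_exit_drafter(layers: List[Layer], head: Head, embed: Embed, exit_layer: int):
        """Drafter that runs only the first `exit_layer` layers of the big model."""
        def drafter(context: List[int], k: int) -> List[int]:
            draft: List[int] = []
            for _ in range(k):
                h = embed(context + draft)
                for layer in layers[:exit_layer]:   # shallow pass: cheap draft prediction
                    h = layer(h)
                draft.append(head(h))
            return draft
        return drafter

    def full_verifier(layers: List[Layer], head: Head, embed: Embed):
        """Verifier: the same layer stack, run all the way through."""
        def verifier(context: List[int]) -> int:
            h = embed(context)
            for layer in layers:                    # full pass: accurate verification
                h = layer(h)
            return head(h)
        return verifier

Because the draft pass is literally a prefix of the verify pass, a rejected draft wastes relatively little work: the full model can pick up from the shared early layers rather than starting again from scratch.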

  • Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (This paper mentions early exit in relation to generalized speculative decoding.)
  • Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)
  • Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
  • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
  • Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
  • Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
  • Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131
  • Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
  • Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
  • Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
  • Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
  • Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip

Non-Autoregressive Decoding in Generalized Speculative Decoding

  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)

Multi-Token Decoding in Generalized Speculative Decoding

  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)

Non-LLM Heuristics in Generalized Speculative Decoding

Research papers on non-LLM approaches for the draft model in generalized speculative decoding:

Hierarchical Speculative Decoding

Another way to generalize speculative decoding is to do it more than once, in layers. The smaller drafting model could itself be accelerated by another even smaller model. In this way, there are layers of drafters. This idea has been called "hierarchical speculative decoding."

This method is largely orthogonal to the other variants of speculative decoding, which could all be applied in this multi-layer manner.
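
A sketch of the layering idea, with assumed names (verified_drafter, mid_verifier) and greedy acceptance: the medium model verifies the tiny model's drafts, and the accepted tokens become the draft handed up to the big model, which does the final verification. Real systems such as TriForce layer further concerns (e.g., KV cache management) on top of this basic structure:

    # Sketch: wrap a (drafter, verifier) pair so the pair itself acts as a drafter
    # for the next level up in the hierarchy.
    from typing import Callable, List

    Drafter = Callable[[List[int], int], List[int]]
    Verifier = Callable[[List[int]], int]

    def verified_drafter(inner_drafter: Drafter, mid_verifier: Verifier) -> Drafter:
        def drafter(context: List[int], k: int) -> List[int]:
            draft: List[int] = []
            while len(draft) < k:
                proposed = inner_drafter(context + draft, k - len(draft))
                accepted: List[int] = []
                for i, tok in enumerate(proposed):
                    target = mid_verifier(context + draft + proposed[:i])
                    if target == tok:
                        accepted.append(tok)        # tiny model's guess confirmed by the medium model
                    else:
                        accepted.append(target)     # medium model overrides; restart drafting from here
                        break
                if not accepted:                    # no proposals: take one plain medium-model token
                    accepted = [mid_verifier(context + draft)]
                draft.extend(accepted)
            return draft[:k]
        return drafter

The wrapped drafter can then be handed to the same outer verification loop with the big model as the final verifier, and in principle the wrapping can be repeated for more than two levels.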

Research papers on hierarchical speculative decoding:

  • Dakota Goldberg, Nov. 16, 2023, Accelerating large model inference with speculative decoding, 6.S898 Deep Learning Blogs 2023, MIT, https://deep-learning-mit.github.io/staging/blog/2023/speculative-decoding/ (Conducts an analysis of speculative decoding and an advanced method of layering called "hierarchical speculative decoding.")
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
