Aussie AI
Speculative Decoding: Types and Optimizations
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
What is Speculative Decoding?
Speculative decoding is a multi-model ensemble architecture where a small model generates some possible output tokens ("speculating" on the correct answer via its decoder outputs), and a larger model verifies whether the output of the smaller model is correct. The two models are:
- Draft model — the smaller "speculating" model.
- Verifier model — a much larger, smarter model.
The idea is to use a much smaller model, because it runs much faster. Some implementations use a draft model with 100 to 1,000 times fewer parameters than the larger verifier model. Inference cost scales roughly with the number of parameters, since all weights are typically used in each inference phase, so the smaller draft model runs proportionally faster.
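As a concrete starting point, here is a minimal sketch of pairing a draft model with a verifier model using the Hugging Face Transformers "assisted generation" feature (the library's built-in form of speculative decoding; see the Hugging Face reference later in this article). The model names are only illustrative, and the exact parameters may differ between library versions.

```python
# Minimal sketch: draft/verifier pairing via Hugging Face "assisted generation".
# Model names are illustrative; any small/large pair sharing a tokenizer will do.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")
verifier = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")    # large verifier model
drafter = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")    # small draft model

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
outputs = verifier.generate(
    **inputs,
    assistant_model=drafter,   # enables draft-then-verify decoding
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```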
Types of Speculative Decoding
There are now various subtypes and extensions of the basic speculative decoding algorithm.
- Generalized speculative decoding (the drafter can be anything!)
- Heuristic speculative decoding (non-LLM drafters)
- Self-speculative decoding (using the same model as drafter and verifier)
- Tree speculative decoding (akin to combining beam search and speculative decoding)
- Retrieval lookup decoding (using RAG chunk text as the draft)
- Prompt lookup decoding (using prompt text as the draft)
- Aggressive decoding (for grammatical error correction)
- Lookahead decoding
- Multi-token speculative decoding (parallel drafting and n-gram decoding)
- Superposed decoding
- Hierarchical speculative decoding
- Sequential speculative decoding (faster but not parallel!)
Also closely related are these research areas:
- Parallel decoding
- Multi-token models
There's probably some more that I've missed! There is an endless stream of speculative decoding research papers.
Draft But Verify?
Draft and verify. Speculative decoding uses a "draft-then-verify" strategy. This optimizes inference speed because it is faster for a large model to verify the correctness of suggested output tokens than to fully generate its own new tokens. If the small model predicts poorly, the bigger model vetoes the suggested token and has to "backtrack," which makes the whole process slower. However, the smaller model should be correct most of the time, giving an overall speedup across all of the tokens while staying very close to the accuracy of the bigger model.
Note that the draft has to be two or more tokens. Having a small model simply suggest a single next token is not a speedup, because we're simply then running the big model autoregressively on each suggestion, which is effectively the same as just running the big model in its normal mode. The method is only faster if two or more speculated tokens can be verified in parallel.
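To make the draft-then-verify loop concrete, here is a toy sketch of the outer algorithm using greedy decoding. The helpers draft_model.next_token() and verify_parallel() are hypothetical stand-ins for a single cheap draft-model step and one batched forward pass of the large model (a sketch of the verification step itself appears further below).

```python
def speculative_decode(prompt_tokens, draft_model, large_model, k=4, max_new=128):
    """Toy draft-then-verify loop (greedy decoding, hypothetical helper functions)."""
    tokens = list(prompt_tokens)
    while len(tokens) < len(prompt_tokens) + max_new:
        # 1. Draft: the small model autoregressively speculates k candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model.next_token(ctx)   # cheap sequential step
            draft.append(t)
            ctx.append(t)
        # 2. Verify: one parallel forward pass of the large model checks all k drafts,
        #    returning the accepted prefix plus one correction/bonus token.
        accepted, correction = verify_parallel(large_model, tokens, draft)
        tokens.extend(accepted)
        tokens.append(correction)   # at least one verifier-approved token is always emitted
    return tokens
```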
Theoretical Basis of Speculative Decoding
Speculative Decoding vs Big-Little Architectures. Speculative decoding is technically a subtype of the "big-little architecture," but a very specialized one compared to standard two-model big-little designs. Another type of big-little architecture uses a heuristic to detect "easy" queries, which are routed to a small model, versus "hard" queries, which are routed to the big model. Speculative decoding differs from those big-little methods because all queries go first to the small model and are then checked by the larger model, which sometimes overrides the small model's suggestions and re-generates its own tokens.
Easy vs Hard Queries. Speculative decoding is based on the "easy vs hard" query architectural issue. The idea is that the small draft model can generate accurate predicted tokens for "easy" cases, but only the large model will correctly handle "hard" tokens. Note that standard inference by a large model is not faster on easy cases by default, but simply implements the same computations across all the layers, regardless of whether an analysis is easy or difficult. Hence, efficiency would be improved if the easy cases can be computed by a smaller model instead.
Generalized speculative decoding. The idea of "generalized speculative decoding" is to generalize the "speculator" model to (a) any other type of cut-down model (e.g. early-exit, quantization, pruning, etc.), and (b) any non-LLM heuristic that can make a prediction of the next token. The idea is generalized to any possible method of "speculating" about the next token. See generalized speculative decoding research.
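As an illustration of how far "generalized" can go, the sketch below uses a drafter that is not a model at all: it speculates the next few tokens by copying whatever followed the most recent matching n-gram earlier in the context, which is the core idea behind prompt lookup decoding. This is a toy example, not any particular paper's implementation.

```python
def ngram_lookup_draft(tokens, k=4, ngram=3):
    """Speculate up to k tokens by copying what followed the last earlier
    occurrence of the trailing n-gram in the token sequence."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Search backwards for an earlier occurrence of the same n-gram.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            follow = tokens[start + ngram : start + ngram + k]
            if follow:
                return follow
    return []   # no match: fall back to ordinary decoding for this step

print(ngram_lookup_draft([5, 9, 2, 7, 1, 5, 9, 2], k=3))  # -> [7, 1, 5]
```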
Speedup from Speculative Decoding
Acceptance rates of draft tokens. The overall efficiency of speculative decoding depends on the "acceptance rate" of the speculated tokens by the verifier. This depends on how "smart" the smaller draft model is at predicting the same tokens as the large model would choose. Acceptance in verification directly impacts performance, because accepted tokens avoid extra sequential large-model decoding steps, whereas rejected tokens cause a "rollback" to a full large-model inference phase. Some papers report low acceptance rates of 20%, whereas others report 80-90% for verification by 13B large models and around 50% for 70B verifier models.
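The usual way to quantify this (following the analysis in the Leviathan et al. paper cited below) is to assume each draft token is accepted independently with probability alpha. With gamma draft tokens per cycle, the expected number of output tokens per large-model verification pass is (1 - alpha^(gamma+1)) / (1 - alpha). A quick calculation under that simplifying assumption:

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass, assuming i.i.d. acceptance
    with probability alpha and gamma draft tokens per cycle."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

for alpha in (0.2, 0.5, 0.8, 0.9):
    print(f"acceptance {alpha:.0%}: ~{expected_tokens_per_cycle(alpha, 4):.2f} tokens per verify pass")
```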
Parallel verification methods. The larger model has to "verify" the multiple draft tokens from the smaller model in parallel. How does it do so? We are used to a large model emitting a single validated token at a time, but it can't use this approach for the "verify" phase, because that would be just as inefficient as simply running the large model on every token step (in fact, worse, due to the overhead of speculation). Instead, the large model verifies all of the draft tokens at once, in a single parallel execution of full inference.
Why is this faster? The speculator has produced more than one token candidate. For a single candidate, we'd just run the big model on the sequence and see whether the probability of the suggested candidate is high enough. That wouldn't be a speedup.
But if there are two or more candidate tokens, we can test each one in parallel, rather than doing one-at-a-time as we'd do in autoregressive decoding.
As an example, consider the case where 2 candidates are generated. The 1st and 2nd candidates are tested in parallel. From the prior tokens alone, the big model produces a next-token probability distribution, which we can check against the 1st suggestion. In parallel, we can also run the model on the prior tokens plus the 1st suggestion, and check whether the 2nd suggestion has a high enough probability. Hence, we are testing each one individually, much like autoregressive mode, but it's not really autoregressive, because we can check the 2nd suggestion without having to await the results of the 1st computation.
After we've run these in parallel, we check the probabilities for the 1st, 2nd, and any further tokens. This final check is done sequentially, but it's a tiny computation compared to the model inference evaluations done in parallel. In the best case, the final check approves all of the tokens, and we output them. In the worst case, it rejects all of the suggestions, in which case we can still output one token, which isn't one of the suggestions, but is the prediction given by the big model on the original sequence. The intermediate case is where the final check approves some but not all of the speculative suggestions, in which case we output the approved tokens plus one more: the correction token given by the big model at the rejected position. Hence, we always output at least one token, and often many more.
When the large model rejects one (or more) of the suggestions, we have to stop there. Even though we have verified predictions for tokens later in the sequence, they were computed using the now-rejected earlier token as part of the tested sequence. So we have to throw away the later tokens, and output only the single correction token at that position.
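Here is a sketch of that verification step under the simplest "greedy agreement" rule, assuming the large model is a decoder-only Transformer callable that returns logits of shape (1, sequence length, vocabulary size) in the style of a Hugging Face causal LM. Real lossless speculative sampling uses a probabilistic accept/reject test against the two models' distributions rather than strict argmax agreement; this sketch only shows the parallel-scoring structure.

```python
import torch

def verify_parallel(large_model, tokens, draft):
    """Greedy verification sketch: one forward pass scores all draft tokens at once."""
    input_ids = torch.tensor([tokens + draft])
    with torch.no_grad():
        logits = large_model(input_ids).logits[0]   # shape (seq_len, vocab)
    n_ctx = len(tokens)
    accepted = []
    for i, speculated in enumerate(draft):
        # Position n_ctx - 1 + i predicts the token that should follow the
        # context plus the first i draft tokens.
        predicted = int(torch.argmax(logits[n_ctx - 1 + i]))
        if predicted == speculated:
            accepted.append(speculated)      # big model agrees: accept the draft token
        else:
            return accepted, predicted       # reject: emit the big model's correction and stop
    # All drafts accepted: the final position yields one extra "bonus" token for free.
    bonus = int(torch.argmax(logits[n_ctx - 1 + len(draft)]))
    return accepted, bonus
```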
Accept or reject criteria. Note that when we run the big model's verification of speculated tokens in parallel, we get a set of logits back for each step. Hence, we get a predicted probability for each of the speculated tokens, and many others.
We have to decide: do we accept or reject the speculated token? The criteria can vary. For example, we could reject the token if it is not the single suggestion with the highest probability. This is a "greedy" or "top-1" criterion. Alternatively, we could accept the token if it's within the top-50 probabilities. Since a high acceptance rate makes the algorithm very fast, a more relaxed criterion like this is preferable for speed.
Note that we can't try to choose a different higher-probability token in the middle and then still accept other tokens in the sequence. Well, actually, we probably could in some cases, but it's hard to know which ones, so the algorithm has to discard tokens later in the sequence. This is basically the rejection of a token.
One nuance here is that, for the very last token in the speculated suggestions, we could simply choose the highest-probability token. There's no cost to rejecting the very last suggested token, as there aren't any later tokens in the sequence to discard.
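The two criteria mentioned above can be written as small predicates over the verifier's logits at a given position. These are illustrative only: the strict top-1 rule preserves the large model's greedy output exactly, while a relaxed top-k rule raises the acceptance rate at some cost to fidelity (and lossless speculative sampling replaces both with a probabilistic test).

```python
import torch

def accept_top1(verifier_logits_at_pos, speculated_token):
    """Greedy/top-1 criterion: accept only the verifier's single best token."""
    return int(torch.argmax(verifier_logits_at_pos)) == speculated_token

def accept_topk(verifier_logits_at_pos, speculated_token, k=50):
    """Relaxed criterion: accept if the draft token is within the verifier's top-k."""
    topk = torch.topk(verifier_logits_at_pos, k).indices
    return speculated_token in topk.tolist()
```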
More Computations, Not Less
Speculative decoding actually does more total model computation (i.e., more FLOPs), not less. But it gets a speedup by doing more of that work in parallel. The wall-clock speed of inference, which is basically user latency, is improved, but the overall GPU processing cost is greater.
Computation resources. Note that the verifying large model doesn't do less computation overall compared to a basic autoregressive run of the large model. It still runs a full large-model inference for the 1st and 2nd tokens (and any further speculated tokens). But it can do those computations in parallel, because it no longer needs to wait for each next token to be generated (as it would in autoregressive mode). In fact, the method does more computation overall, because it not only runs the big model (in parallel) but also runs the smaller model (sequentially) beforehand. And it can do much more computation again if many tokens are rejected.
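A back-of-the-envelope calculation makes the trade-off explicit. The numbers below are illustrative relative costs only, and assume the best case where every draft token is accepted and the parallel verification pass costs roughly one sequential step of wall-clock time on unsaturated hardware.

```python
d, B, k = 1.0, 50.0, 4   # illustrative per-token costs: draft model, big model, draft length

# Plain autoregressive decoding of k+1 tokens: k+1 sequential big-model steps.
baseline_flops = (k + 1) * B
baseline_latency = (k + 1) * B

# Speculative decoding: k sequential cheap draft steps, then one big-model pass that
# scores k+1 positions in parallel (similar total FLOPs, far less sequential waiting).
spec_flops = k * d + (k + 1) * B
spec_latency = k * d + B

print(f"FLOPs:   baseline {baseline_flops:.0f} vs. speculative {spec_flops:.0f}")
print(f"Latency: baseline {baseline_latency:.0f} vs. speculative {spec_latency:.0f}")
```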
On-device execution. Speculative decoding may not be a good candidate for low-resource platforms such as phones or edge devices, because it can't speed up sequential execution. Speculative decoding on a sequential platform actually increases computations overall, which affects battery depletion and heat issues. Also, these platforms may not have a lot of spare computation resources to run the large model in parallel. This method works better on larger platforms where the parallel computations of the large model can be farmed out to multiple GPUs.
Related Research Areas
Similar research areas: Another area of optimization of Transformers that is similar to speculative decoding is the non-autoregressive Transformer architecture. The general class of "multi-AI models" where two or more neural networks are used is called "ensemble methods". Some of the other research areas related to speculative decoding include:
- Collaborative decoding (collaborative inference)
- Big-little architecture
- Easy vs hard queries
- Mutually-guided decoding
- Aggressive decoding
- Parallel decoding
- Shallow fusion
- Consensus-based decoding
- Bidirectional decoding
- Shallow decoder architectures
- Non-autoregressive algorithms
Speculative execution is the overarching general area of Computer Science theory from which speculative decoding is derived. Various algorithms benefit from speculatively executing one pathway in parallel with another. A particular example is "branch prediction" in hardware execution of low-level machine code.
Survey Papers on Speculative Decoding
Survey papers. Broad review or survey papers that cover speculative decoding:
- Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (A survey that's just on speculative decoding!)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
General Research Papers on Speculative Decoding
Papers that specifically focus on the speculative decoding technique:
- Leviathan, Y., Kalman, M., and Matias, Y., Fast inference from transformers via speculative decoding, May 2023, https://arxiv.org/abs/2211.17192
- D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
- Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, Sep 2023 (original Feb 2023), Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 (Separates a "fallback policy" when the smaller model detects it needs the bigger model, and a "rollback policy" when the bigger model vetos output and intervenes, both for deciding when the bigger model controls.)
- Yaniv Leviathan, Matan Kalman, and Yossi Matias. May 2023, Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023. https://arxiv.org/abs/2211.17192
- Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. DeepMind, arXiv preprint arXiv:2302.01318, 2023. https://arxiv.org/abs/2302.01318
- Heming Xia, Tao Ge, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Lossless speedup of autoregressive translation. Openreview, 2022. https://openreview.net/forum?id=H-VlwsYvVi
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487 (Not pure speculative decoding, but an analogous method.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia, Aug 2023, Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023. https://arxiv.org/abs/2305.09781, Code: https://github.com/flexflow/FlexFlow/tree/inference
- Burton, F. W., 1985, Speculative computation, parallelism, and functional programming. IEEE Transactions on Computers, C-34(12):1190–1193, 1985. doi: 10.1109/TC.1985.6312218. https://ieeexplore.ieee.org/document/6312218 (Algorithmic theory of "speculative computation" from 1985.)
- Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Amsterdam, 5th edition, 2012. ISBN 978-0-12-383872-8. https://dl.acm.org/doi/book/10.5555/1999263 (Includes coverage of speculative algorithms.)
- T. Ge, H. Xia, X. Sun, S. Chen, and F. Wei. Lossless acceleration for seq2seq generation with aggressive decoding. ArXiv, abs/2205.10350, 2022. https://arxiv.org/abs/2205.10350, Code: https://github.com/microsoft/unilm/tree/master/decoding (The generalized aggressive decoding method has a "draft-and-verify" algorithm that is similar to speculative decoding.)
- M. Stern, N. Shazeer, and J. Uszkoreit. Blockwise parallel decoding for deep autoregressive models. CoRR, abs/1811.03115, 2018. https://arxiv.org/abs/1811.03115 (Generates various output in parallel and using a scoring method to confirm them.)
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
- Kaya Y., Hong S., Dumitras T., Shallow-deep networks: Understanding and mitigating network overthinking Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model is analogous to speculative decoding.)
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
- Sen Yang, Shujian Huang, Xinyu Dai, Jiajun Chen, 12 Jan 2024, Multi-Candidate Speculative Decoding, https://arxiv.org/abs/2401.06706 (Draft model generates multiple candidates, which can each have their acceptance or rejection.)
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
- S Kim, K Mangalam, S Moon, J Malik, MW Mahoney, 2023, Speculative Decoding with Big Little Decoder, https://openreview.net/pdf?id=EfMyf9MC3t Code: https://github.com/kssteven418/BigLittleDecoder
- Qidong Su, Christina Giannoula, Gennady Pekhimenko, Oct 2023, The Synergy of Speculative Decoding and Batching in Serving Large Language Models, https://arxiv.org/abs/2310.18813 (Optimizing by adapting dynamically the length of the speculated sequence in batches.)
- Haim Barad, Ekaterina Aidova, Yury Gorbachev, Nov 2023, Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO, https://arxiv.org/abs/2311.04951 Code: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/266-speculative-sampling
- Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, Hao Zhang, Oct 2023, Online Speculative Decoding, https://arxiv.org/abs/2310.07177
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, Nov 2023 (accessed), Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, https://sites.google.com/view/medusa-llm
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 4 Feb 2024 (v2), EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, https://arxiv.org/abs/2401.15077 (Speculative decoding at the "feature" level rather than the token level.)
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131 (Speculative decoding with a draft model that creates n-gram multiple tokens.)
- Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 26 Mar 2024 (v3), Tandem Transformers for Inference Efficient LLMs, https://arxiv.org/abs/2402.08644 (A two-model architecture with a small autoregressive model and a larger model with non-autoregressive block decoding, which is similar to big-little inference and speculative decoding methods.)
- Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman, 2 Feb 2024, Decoding Speculative Decoding, https://arxiv.org/abs/2402.01528 (Analysis of throughput versus acceptance rates, with a draft model for Llama-65B.)
- Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, 21 Feb 2024, Ouroboros: Speculative Decoding with Large Model Enhanced Drafting, https://arxiv.org/abs/2402.13720 Code: https://github.com/thunlp/Ouroboros
- Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, Yi Liu, Zhongzhi Luan, Depei Qian, 24 Feb 2024, Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding, https://arxiv.org/abs/2402.15678 (Speculative decoding with a focus on improving poor acceptance rates using majority-voting consensus decoding with multiple small draft models, achieving 80-90% acceptance with OPT-13B, and around 50% acceptance for Llama-70B, and also pipelining of inference computations for speedup.)
- Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Cen Chen, 24 Feb 2024, Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens, https://arxiv.org/abs/2402.15758 (Speculative decoding speedups by splitting inputs into small sequences of a few tokens for a faster draft model computation.)
- Daniel Warfield, Dec 16, 2023, Towards Data Science, Speculative Sampling — Intuitively and Exhaustively Explained, https://towardsdatascience.com/speculative-sampling-intuitively-and-exhaustively-explained-2daca347dbb9
- Dakota Goldberg, Nov. 16, 2023, Accelerating large model inference with speculative decoding, 6.S898 Deep Learning Blogs 2023, MIT, https://deep-learning-mit.github.io/staging/blog/2023/speculative-decoding/ (Conducts an analysis of speculative decoding and an advanced method of layering called "hierarchical speculative decoding.")
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang, 3 Jun 2024, Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching, https://arxiv.org/abs/2406.01733 Code: https://github.com/horseee/learning-to-cache (Layer skipping in diffusion transformers via layer caching.)
- Kaixuan Huang, Xudong Guo, Mengdi Wang, 30 May 2024, SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths, https://arxiv.org/abs/2405.19715 (Training the draft model in speculative decoding to decide how far ahead to predict draft tokens.)
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin, 29 May 2024, Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, https://arxiv.org/abs/2405.19325 (Merging of RALM and speculative decoding.)
- Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar, 29 May 2024, Faster Cascades via Speculative Decoding, https://arxiv.org/abs/2405.19261 (A combination of cascades with speculative decoding.)
- Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 28 May 2024, Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 Code: https://github.com/hmarkc/parallel-prompt-decoding (Similar to speculative decoding with extra trained prompt tokens and a tree-structured verification of multiple optional draft sequences.)
- Stern, Mitchell Thomas, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf
- Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You, 3 Feb 2024, GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding, https://arxiv.org/abs/2402.02082 (Allow the draft model to use the KV cache of the large model in choosing its predictions, and extends the drafts to use a top-k set of tokens at each position.)
- Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He, 4 Apr 2024 (v2), REST: Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2311.08252
- Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel, 23 May 2024, Distributed Speculative Inference of Large Language Models, https://arxiv.org/abs/2405.14105 (Speculative decoding with multiple drafter models and distributed processing across multiple threads and GPUs.)
- PS Aishwarya, PA Nair, Y Samaga, T Boyd, S Kumar, 2024, Tandem Transformers for Inference Efficient LLMs, https://www.prateekjain.org/publications/all_papers/NairSBKJN24.pdf (Separates prefill from decoding phase into a "tandem transformer" in combination with speculative decoding.)
- Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 Code: https://github.com/niyunsheng/EMS-SD (Speculative decoding across multiple queries by avoiding padding tokens and optimizing the KV cache.)
- Aman, May 14, 2024, Near-Instant Full-File Edits, Cursor, https://cursor.sh/blog/instant-apply (A type of speculative decoding for code editing called "speculative edits.")
- Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui, 1 May 2024, Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge, https://arxiv.org/abs/2405.00263 (Speculative decoding improvement by extending Medusa to tree attention with a cross attention block.)
- Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa, 29 Apr 2024, Accelerating Production LLMs with Combined Token/Embedding Speculators, IBM Research, https://arxiv.org/abs/2404.19124 Code: https://github.com/foundation-model-stack/fms-fsdp Code: https://huggingface.co/ibm-fms (Extending Medusa architecture with a single multi-headed architecture so the draft model predicts an n-gram with multiple tokens more accurately.)
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras, 24 Apr 2024, BASS: Batched Attention-optimized Speculative Sampling, https://arxiv.org/abs/2404.15778 (Optimizes batched multi-query use of speculative decoding with consideration of GPU utilization in prefill and decoding phases.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 932–949, https://doi.org/10.1145/3620666.3651335 https://dl.acm.org/doi/abs/10.1145/3620666.3651335 Code: https://github.com/flexflow/FlexFlow/
- Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- B. Spector and C. Re, Aug 2023, “Accelerating llm inference with staged speculative decoding,” arXiv preprint arXiv:2308.04623, 2023. https://arxiv.org/abs/2308.04623
- Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang, 27 Feb 2024 (v4), “Cascade speculative drafting for even faster llm inference,” arXiv preprint arXiv:2312.11462, 2023. https://arxiv.org/abs/2312.11462 Code: https://github.com/lfsszd/CS-Drafting (Uses non-LLM draft models based on statistical language models and also uses multiple draft models in a hierarchy.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Joao Gante, 2023, Assisted generation: a new direction toward low-latency text generation, Hugging Face, DOI: 10.57967/hf/0638, https://huggingface.co/datasets/joaogante/assisted_generation (Using a model's forward pass to valid a sequence of multiple tokens, analogous to verification in speculative decoding.)
- Zhuorui Liu, Chen Zhang, Dawei Song, 2024, How Speculative Can Speculative Decoding Be? Beijing Institute of Technology, https://oro.open.ac.uk/97102/1/COLING_2024__How_Speculative_Can_Speculative_Decoding_Be_.pdf Code: https://github.com/ZhuoruiLiu12/SpecGame (Analysis of the accuracy and speed requirements of the drafting model for effective speculative decoding, including finding that draft models can be 60 times smaller than the large model.)
- H Xia, T Ge, P Wang, SQ Chen, F Wei, 2023, Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation, https://arxiv.org/abs/2203.16487 https://aclanthology.org/2023.findings-emnlp.257.pdf Code: https://github.com/hemingkx/SpecDec (Uses a specially optimized deep-encoder shallow-decoder architecture as the drafting model.)
- 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration (Generates multiple tokens on multiple branches for verification, giving a tree-structured approach.)
- Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J. Yadwadkar, Aditya Akella, 27 Oct 2023, MOSEL: Inference Serving Using Dynamic Modality Selection, https://arxiv.org/abs/2310.18481 (Multi-modal model with dynamic selection of modality.)
- Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao, 25 Jan 2024 (v2), BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, https://arxiv.org/abs/2401.12522 Code: https://github.com/linfeng93/BiTA
- 15 Mar 2024, Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh, https://arxiv.org/abs/2403.10444 (Draft a block of tokens for verification.)
- Giovanni Monea, Armand Joulin, Edouard Grave, 22 Nov 2023, PaSS: Parallel Speculative Sampling, https://arxiv.org/abs/2311.13581 (Generates multiple draft tokens using a parallel extension via "look ahead embeddings".)
- Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang, 15 Feb 2024 (v4), PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition, https://arxiv.org/abs/2402.04838 (Use of parallel decoding in Named Entity Recognition use case.)
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 30 Mar 2024, DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference, https://arxiv.org/abs/2404.00242
- Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott, 5 Mar 2024 (v2), Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement, Qualcomm AI Research, https://arxiv.org/abs/2402.14160 (Improvements of an adaptive inference version of a draft token-tree with multiple n-gram paths for speculative decoding.)
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Jie Ou, Yueming Chen, Wenhong Tian, 10 Apr 2024, Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding, https://arxiv.org/abs/2404.08698 (Use an n-gram model as the drafter to create a version of parallel decoding or generalized speculative decoding.)
- Intel, 2024, https://github.com/intel/intel-extension-for-transformers
- Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin, 11 Jun 2024, When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models, https://arxiv.org/abs/2406.07368 Code: https://github.com/GATECH-EIC/Linearized-LLM
- Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia, 2024, Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation, Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024, https://openreview.net/pdf?id=CDnv4vg02f
- Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao, 16 Apr 2024 (v2), Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding, https://arxiv.org/abs/2402.11809 (Semi-autoregressive draft model with parallel verification.)
- Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break ing the sequential dependency of LLM inference using lookahead decoding, November 2023. https://lmsys.org/blog/2023-11-21-lookahead-decoding/
- Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, Hao Zhang, 17 Oct 2023 (v2), Online Speculative Decoding, https://arxiv.org/abs/2310.07177
- Luv Bansal, Apr 8, 2024, Speculative Decoding — Make LLM Inference Faster, https://medium.com/ai-science/speculative-decoding-make-llm-inference-faster-c004501af120
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Natarajan Vaidhyanathan Mar 7, 2024, How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100, https://www.qualcomm.com/developer/blog/2024/03/how-quadruple-llm-decoding-performance-speculative-decoding-spd-and-microscaling-mx-formats
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374, 2024. https://arxiv.org/abs/2402.12374v1
- Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024. https://arxiv.org/abs/2310.15141
- Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Big little transformer decoder. arXiv preprint arXiv:2302.07863, 2023. https://arxiv.org/abs/2302.07863v1
- Kool, W., Van Hoof, H., and Welling, M. Stochastic beams and where to find them: The Gumbel-Top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3499–3508. PMLR, 2019 https://arxiv.org/abs/1903.06059
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding, 2024. https://arxiv.org/abs/2402.05109
- Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
- Chengbo Liu, Yong Zhu, 1 Apr 2024 (v2), SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens, https://arxiv.org/abs/2403.18647 Code: https://github.com/hasuoshenyun/SDSAT
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 10 Jun 2024 (v4), Online Speculative Decoding, https://github.com/LiuXiaoxuanPKU/OSD
- Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet, 16 Jun 2024, Optimized Speculative Sampling for GPU Hardware Accelerators, https://arxiv.org/abs/2406.11016 (Speculative decoding accelerated with multiple GPUs using approaches such as tiling, and uses a fused sigmoid replacing Softmax.)
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-demos.6/ Code: https://github.com/huggingface/transformers
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
- Cuda Mode, 2024, Lecture 22: Hacker's Guide to Speculative Decoding in VLLM, https://www.youtube.com/watch?v=9wNAgpX6z_4
- vLLM, 2024, [RFC]: Automate Speculative Decoding #4565, https://github.com/vllm-project/vllm/issues/4565
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum, 19 Jun 2024, Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive Style, https://arxiv.org/abs/2406.13170 (Applying bi-directional decoding to speculative decoding.)
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- SpecInfer: Accelerating Generative Large Language Model Serving With Speculative Inference and Token Tree Verification, Cheng, Xinhao, Masters Thesis, Carnegie Mellon University, ProQuest Dissertations & Theses, 2023. 30817164. https://www.proquest.com/openview/54f75ff88889b27845efd4a56f3a167a/1?pq-origsite=gscholar&cbl=18750&diss=y
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 24 Jun 2024, EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, https://arxiv.org/abs/2406.16858 (Extends a tree-structured draft to use the logit probabilities of the draft model as an estimate of the likely acceptance rates by the larger model.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun, 5 Jun 2024 (v2), JumpCoder: Go Beyond Autoregressive Coder via Online Modification, https://arxiv.org/abs/2401.07870 Code: https://github.com/Keytoyze/JumpCoder
- Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou, 26 Jun 2024, SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding, https://arxiv.org/abs/2406.18200 (Scheduling and parallelization of multiple draft models across one or multiple queries in tree-based speculative decoding.)
- Jikai Wang, Yi Su, Juntao Li, Qinrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang, 25 Jun 2024, OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure, https://arxiv.org/abs/2406.17276 Code: https://github.com/Jikai0Wang/OPT-Tree
- Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines the idea of not using the larger model to always verify, and when to trust either the smaller or larger models, which is an idea that generalized beyond speculative decoding.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Anonymous, 2024, Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters, https://openreview.net/pdf?id=rNWRnVGfBC
- Ziqin Luo, Haixia Han, Haokun Zhao, Guochao Jiang, Chengyu Du, Tingyun Li, Jiaqing Liang, Deqing Yang, Yanghua Xiao, 26 May 2024, SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation, https://arxiv.org/abs/2405.16552
- Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister, 11 Jul 2024, Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting, https://arxiv.org/abs/2407.08223
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Mikhail Andronov, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, Djork-Arné Clevert, 17 Jul 2024 (v2), Accelerating the inference of string generation-based chemical reaction models for industrial applications, https://arxiv.org/abs/2407.09685
- Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 12 Jul 2024, Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference, https://arxiv.org/abs/2407.09722
- Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu, 27 Jun 2024, Adaptive Draft-Verification for Efficient Large Language Model Decoding, https://arxiv.org/abs/2407.12021 Project: https://anonymous.4open.science/r/ADED-C7D5 (A draft-and-verification method that is similar to speculative decoding, but differs.)
- Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Zhengmian Hu, Heng Huang, 2024, Accelerated Speculative Sampling Based on Tree Monte Carlo, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19216-19251, 2024. https://proceedings.mlr.press/v235/hu24f.html https://openreview.net/forum?id=stMhi1Sn2G PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24f/hu24f.pdf
- Domas Grigaliūnas, Mantas Lukoševičius, 29 Jul 2024, Inference acceleration for large language models using "stairs" assisted greedy generation, https://arxiv.org/abs/2407.19947 (A draft-and-verify method that is similar to speculative decoding.)
- Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh, https://openreview.net/pdf?id=OWwc8eOIm8
- Amazon, 2024, Optimize model inference with Amazon SageMaker, https://docs.aws.amazon.com/sagemaker/latest/dg/model-optimize.html
- Hongyi Yuan, Keming Lu, Fei Huang, Zheng Yuan, Chang Zhou, 13 Mar 2024 (v2), Speculative Contrastive Decoding, https://arxiv.org/abs/2311.08981
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Weinan Zhang, Yong Yu, 11 Aug 2024, A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems, https://arxiv.org/abs/2408.05676 (Determining when speculative decoding is most beneficial.)
- Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto, 10 Aug 2024, Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion, https://arxiv.org/abs/2408.05636 (Parallel drafting and verification in diffusion models.)
- Kaiqi Zhang, Jing Zhao, Rui Chen, 15 Aug 2024, KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning, https://arxiv.org/abs/2408.08146
- Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che, 16 Aug 2024, Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling, https://arxiv.org/abs/2408.08696
- Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar, 16 Aug 2024, Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models, https://arxiv.org/abs/2408.08470
- Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
- Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari, 16 Jul 2024, PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation, https://arxiv.org/abs/2407.11798 (Optimizes speculative decoding further using pipelining.)
- Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, 13 Aug 2024, Parallel Speculative Decoding with Adaptive Draft Length, https://arxiv.org/abs/2408.11850
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Ahmed Tremo, Aug 6, 2024, How to Efficiently Serve an LLM? https://ahmedtremo.com/posts/How-to-Efficiently-serve-an-llm/
- Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan, 23 Jul 2024, Graph-Structured Speculative Decoding, https://arxiv.org/abs/2407.16207
- Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen, 28 Aug 2024, Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation, https://arxiv.org/abs/2408.15562
- Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu, 28 Aug 2024, Harmonized Speculative Sampling, https://arxiv.org/abs/2408.15766
- David Spuler, June 2024, Aussie AI, Speculative Decoding With Early Exit for Optimized Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901656
- Karl Stratos, 2024, Speculative Decoding https://karlstratos.com/notes/speculative.pdf
- Du Cunxiao, 2024, Towards Faster Inference of Transformers: Strategies for Accelerating Decoding Processes, Ph.D. thesis, Computer Science, School of Computing and Information Systems, Singapore Management University, https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1611&context=etd_coll (Examines non-autoregressive decoding, speculative decoding and attention optimizations.)
- Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet, 24 Sep 2024, Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR, https://arxiv.org/abs/2409.15869
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu, 2 Oct 2024, Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding, https://arxiv.org/abs/2410.01699
- Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua, 8 Oct 2024 (v2), Efficient Inference for Large Language Model-based Generative Recommendation, https://arxiv.org/abs/2410.05165
- Yijie Ding, Yupeng Hou, Jiacheng Li, Julian McAuley, 3 Oct 2024, Inductive Generative Recommendation via Retrieval-based Speculation, https://arxiv.org/abs/2410.02939 https://github.com/Jamesding000/SpecGR
- Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang, 4 Oct 2024, LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding, https://arxiv.org/abs/2410.03355
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
- Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu, 8 Oct 2024, ParallelSpec: Parallel Drafter for Efficient Speculative Decoding, https://arxiv.org/abs/2410.05589 (Multi-token prediction in draft models for speculative decoding.)
- Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen, 14 Oct 2024, Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation, https://arxiv.org/abs/2410.10141 https://github.com/ozyyshr/TempSpec
- Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou, 15 Oct 2024, DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure, https://arxiv.org/abs/2410.11744
- Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu, 15 Oct 2024, QSpec: Speculative Decoding with Complementary Quantization Schemes, https://arxiv.org/abs/2410.11305 (Enhance speculative decoding using quantization to reuse KV cache values and weights.)
- Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang, 17 Oct 2024, Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement, https://arxiv.org/abs/2410.13344
- Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung, 17 Oct 2024, Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding, https://arxiv.org/abs/2410.13839
- Bradley McDanel, 22 Oct 2024, AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration, https://arxiv.org/abs/2410.17375 https://github.com/BradMcDanel/AMUSD/
- Ashish Khisti, M.Reza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos, 23 Oct 2024, Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits, https://arxiv.org/abs/2410.18234
- Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette, 26 Oct 2024, Fast Best-of-N Decoding via Speculative Rejection, https://arxiv.org/abs/2410.20290
- Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao, 27 Oct 2024, FIRP: Faster LLM inference via future intermediate representation prediction, https://arxiv.org/abs/2410.20488
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- Ming Yin, Minshuo Chen, Kaixuan Huang, Mengdi Wang, 30 Oct 2024, A Theoretical Perspective for Speculative Decoding Algorithm, https://arxiv.org/abs/2411.00841
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
- Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
- Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Lorenz K. Müller, Lukas Cavigelli, 8 Nov 2024, SSSD: Simply-Scalable Speculative Decoding, https://arxiv.org/abs/2411.05894
- Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh, 17 Nov 2024, FastDraft: How to Train Your Draft, https://arxiv.org/abs/2411.11055
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
Tree-Based Speculative Decoding
Speculative decoding can track multiple possible draft sequences, and these often share a common prefix. Hence, the drafts form a tree of candidate token sequences, which can be processed using specialized handling of tree structures. (Note that this idea is closely related to "beam search" in regular non-speculative inference.)
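To make the tree idea concrete, below is a minimal Python sketch of greedy verification over several candidate draft paths that share a common prefix, using a single batched call to a hypothetical verifier_greedy_next function (an assumed interface, not any particular library's API). Real tree-based implementations avoid recomputing the shared prefix by using tree-structured attention masks, which this simplified sketch omits.

```python
# Sketch of token-tree verification (greedy acceptance), assuming a hypothetical
# verifier_greedy_next(batch) that returns, for each sequence in the batch, the
# verifier's greedy next-token prediction at every position (one batched pass).

def verify_draft_tree(prefix, draft_paths, verifier_greedy_next):
    """prefix: list of already-accepted token ids.
    draft_paths: list of candidate continuations (lists of token ids) from the
                 draft model, e.g., the branches of a draft tree or beam."""
    batch = [prefix + path for path in draft_paths]
    predictions = verifier_greedy_next(batch)          # same nesting as batch

    best = []
    for path, preds in zip(draft_paths, predictions):
        accepted = []
        for i, tok in enumerate(path):
            verifier_tok = preds[len(prefix) - 1 + i]  # prediction after prefix + path[:i]
            if verifier_tok == tok:
                accepted.append(tok)                   # draft token accepted
            else:
                accepted.append(verifier_tok)          # first mismatch: take verifier's token
                break
        if len(accepted) > len(best):                  # keep the longest accepted branch
            best = accepted
    return prefix + best
```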
- Vieira, T., 2014, Gumbel-max trick and weighted reservoir sampling, https://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/
- Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023. https://lmsys.org/blog/2023-11-21-lookahead-decoding/
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023. https://arxiv.org/abs/2305.09781 https://www.researchgate.net/publication/370841788_SpecInfer_Accelerating_Generative_LLM_Serving_with_Speculative_Inference_and_Token_Tree_Verification
- Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374, 2024. https://arxiv.org/abs/2402.12374v1
- Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024. https://arxiv.org/abs/2310.15141
- Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Big little transformer decoder. arXiv preprint arXiv:2302.07863, 2023. https://arxiv.org/abs/2302.07863v1
- Kool, W., Van Hoof, H., and Welling, M. Stochastic beams and where to find them: The Gumbel-Top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3499–3508. PMLR, 2019 https://arxiv.org/abs/1903.06059
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, and Mahyar Najibi. Speculative streaming: Fast llm inference without auxiliary models. arXiv preprint arXiv:2402.11131, 2024. https://arxiv.org/abs/2402.11131
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023. https://arxiv.org/abs/2308.04623
- Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024. https://arxiv.org/abs/2401.15077
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding, 2024. https://arxiv.org/abs/2402.05109
- Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- SpecInfer: Accelerating Generative Large Language Model Serving With Speculative Inference and Token Tree Verification, Cheng, Xinhao, Masters Thesis, Carnegie Mellon University, ProQuest Dissertations & Theses, 2023. 30817164. https://www.proquest.com/openview/54f75ff88889b27845efd4a56f3a167a/1?pq-origsite=gscholar&cbl=18750&diss=y
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 24 Jun 2024, EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, https://arxiv.org/abs/2406.16858 (Extends a tree-structured draft to use the logit probabilities of the draft model as an estimate of the likely acceptance rates by the larger model.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou, 26 Jun 2024, SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding, https://arxiv.org/abs/2406.18200 (Scheduling and parallelization of multiple draft models across one or multiple queries in tree-based speculative decoding.)
- Jikai Wang, Yi Su, Juntao Li, Qinrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang, 25 Jun 2024, OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure, https://arxiv.org/abs/2406.17276 Code: https://github.com/Jikai0Wang/OPT-Tree
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Zhengmian Hu, Heng Huang, 2024, Accelerated Speculative Sampling Based on Tree Monte Carlo, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19216-19251, 2024. https://proceedings.mlr.press/v235/hu24f.html https://openreview.net/forum?id=stMhi1Sn2G PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24f/hu24f.pdf
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255
- Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che, 16 Aug 2024, Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling, https://arxiv.org/abs/2408.08696
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua, 8 Oct 2024 (v2), Efficient Inference for Large Language Model-based Generative Recommendation, https://arxiv.org/abs/2410.05165
- Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou, 15 Oct 2024, DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure, https://arxiv.org/abs/2410.11744
- Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang, 18 Oct 2024, TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, https://arxiv.org/abs/2410.16033
- Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
Self-Speculative Decoding
Self-speculative decoding is the idea that the smaller draft model, instead of being trained separately from the larger verifier model, can simply be a cut-down version of the bigger model. The advantages include:
- Only one model to train! (No separate draft model)
- Partial computations in the smaller (self) model can be re-used by the verifier model.
- Verifier model effectively "overlaps" its computations (reducing latency).
- Both models automatically have the same token set and vocabulary.
Self-speculative decoding is typically based on an early-exited version of the larger model, which serves as the smaller draft model (some variants use layer skipping instead). The larger model then only needs to compute the remaining layers that the draft pass skipped, and it can do so in parallel across all of the draft tokens.
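As a rough illustration, here is a minimal Python sketch of the early-exit variant, assuming a hypothetical model object with embed, layers, exit_head, and lm_head members (these names are illustrative, not a real API). A production implementation would reuse the KV cache from the early layers computed during drafting, rather than recomputing them during verification as this sketch does.

```python
# Sketch of self-speculative decoding via early exit (greedy verification).
import numpy as np

def draft_with_early_exit(model, tokens, num_draft, exit_layer):
    """Draft tokens autoregressively using only the first exit_layer layers."""
    draft, seq = [], list(tokens)
    for _ in range(num_draft):
        h = model.embed(seq)                                 # (seq_len, hidden_dim)
        for layer in model.layers[:exit_layer]:
            h = layer(h)
        next_tok = int(np.argmax(model.exit_head(h[-1])))    # early-exit prediction
        draft.append(next_tok)
        seq.append(next_tok)
    return draft

def verify_with_full_model(model, tokens, draft):
    """Run the full model once over prefix + draft and keep the accepted prefix."""
    seq = list(tokens) + list(draft)
    h = model.embed(seq)
    for layer in model.layers:                               # all layers, one parallel pass
        h = layer(h)
    preds = np.argmax(model.lm_head(h), axis=-1)             # greedy next token at each position
    accepted = []
    for i, tok in enumerate(draft):
        verifier_tok = int(preds[len(tokens) - 1 + i])
        accepted.append(tok if verifier_tok == tok else verifier_tok)
        if verifier_tok != tok:
            break
    return accepted
```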
Research on self-speculative decoding:
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131
- Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (This paper mentions early exit in relation to generalized speculative decoding.)
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)
- Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip
Lookup Decoding (Retrieval Speculative Decoding)
Lookup decoding is similar to speculative decoding in that it is a parallelization strategy, but instead of a draft model generating the draft tokens, it searches for matching token patterns in a retrieval datastore.
Retrieval speculative decoding involves searching for n-grams of tokens in a separate datastore. If there is a match, the subsequent tokens in that stored sequence become the draft tokens. The datastore is often related to a RAG datastore, but it operates at a lower level than a RAG database of text chunks, because the n-grams can be (and should be) sub-sequences within a RAG chunk. In fact, that's the whole point: the speedup comes from finding a few tokens, already output by the LLM, that match part of a RAG chunk, and then assuming that the model is about to output many more tokens from that chunk.
One of the interesting features of retrieval-based lookup decoding is that there isn't always a draft token sequence. If there's no match in the database, there's no speculative decoding step, and it defaults to autoregressive sequential decoding. Hence, it is only a speedup on average, and relies on a high hit rate for n-grams in the datastore.
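Here is a minimal Python sketch of the datastore idea, with illustrative function names (build_ngram_datastore, propose_draft) rather than any particular library's API: index the n-grams of the retrieved, tokenized chunks, then look up the last few generated tokens to propose a draft continuation, falling back to normal decoding on a miss.

```python
# Sketch of a retrieval datastore for lookup decoding (names are illustrative).
from collections import defaultdict

def build_ngram_datastore(token_chunks, n=3, continuation_len=8):
    """Index every n-gram in the retrieved chunks, mapping it to the tokens
    that follow it within the chunk."""
    store = defaultdict(list)
    for chunk in token_chunks:
        for i in range(len(chunk) - n):
            key = tuple(chunk[i:i + n])
            store[key].append(chunk[i + n:i + n + continuation_len])
    return store

def propose_draft(store, generated_tokens, n=3):
    """Return draft tokens if the last n generated tokens match the datastore;
    otherwise return None and fall back to ordinary autoregressive decoding."""
    if len(generated_tokens) < n:
        return None
    continuations = store.get(tuple(generated_tokens[-n:]))
    return continuations[0] if continuations else None
```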
- Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He, 4 Apr 2024 (v2), REST: Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2311.08252
- Apoorv Saxena, November 2023, Prompt Lookup Decoding, https://github.com/apoorvumang/prompt-lookup-decoding
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
- Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
Prompt Lookup Decoding
Prompt lookup decoding is similar to lookup decoding, but it searches for sequences of tokens in the input prompt itself, rather than in an external datastore. The speedup relies on the assumption that the model will often emit verbatim sub-sequences of the input prompt in its output, as commonly happens in RAG architectures where the model quotes or summarizes the input text.
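A minimal sketch of the core lookup step is below; the function name and parameters are illustrative (vLLM's n-gram speculative mode works along these lines, but this is not its code). The last few generated tokens are matched against the prompt, and the tokens that follow the match are copied out as the draft.

```python
# Sketch of prompt lookup decoding: copy the draft from the prompt itself.
def prompt_lookup_draft(prompt_tokens, generated_tokens, ngram_size=3, max_draft=10):
    if len(generated_tokens) < ngram_size:
        return []
    pattern = generated_tokens[-ngram_size:]
    # Scan the prompt right-to-left for the most recent occurrence of the n-gram.
    for start in range(len(prompt_tokens) - ngram_size, -1, -1):
        if prompt_tokens[start:start + ngram_size] == pattern:
            follow = start + ngram_size
            return prompt_tokens[follow:follow + max_draft]
    return []   # no match: fall back to normal autoregressive decoding
```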
Research on prompt lookup decoding:
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
- vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, 2 Dec 2024, PLD+: Accelerating LLM inference by leveraging Language Model Artifacts, https://arxiv.org/abs/2412.01447
Superposed Decoding for Multi-Draft Generation
The idea of "superposed decoding" is an interesting one, that is not a type of speculative decoding. Rather than using multiple drafts to speed up the computation of one final resulting sequence (verified by a large model), superposed decoding is a fast way to generate multiple reasonable drafts of a multi-token sequence. The key point is to combine the embeddings of the top-k tokens into a single embedding vector using a weighted average, and then validate this with an extra n-gram lookup method (i.e., a heuristic based on n-gram frequency in a text data set). Overall, the method allows the computation of multiple reasonable n-gram token sequences at almost the same cost as a single one.
- Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 Code: https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
Sequential Speculative Decoding
There is relatively little research on speculative decoding for speeding up sequential (non-parallel) computation, such as on-device inference on CPU or NPU hardware.
Research papers include:
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu, Sep 2023, Llmcad: Fast and scalable on-device large language model inference. CoRR, abs/2309.04255, https://arxiv.org/abs/2309.04255
- David Spuler, Aug 2024, Sequential Speculative Decoding, Aussie AI Blog, https://www.aussieai.com/blog/sequential-speculative-decoding
- David Spuler, June 2024, Aussie AI, Speculative Decoding With Early Exit for Optimized Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901656
Heuristic Speculative Decoding
This is a relatively new and under-explored area of research. The "drafter" in speculative decoding need not be an LLM; it can be any heuristic that cheaply proposes candidate draft tokens, such as n-gram statistics over the context or precalculated token likelihoods.
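As a toy example of a non-LLM drafter, the sketch below uses a bigram frequency table built from a token corpus to propose draft tokens. This is an illustrative "learning-free" drafter in the spirit of the N-Grammys paper cited below, not that paper's actual algorithm; the verifier-side acceptance logic is unchanged.

```python
# Sketch of a heuristic (non-LLM) drafter based on bigram frequencies.
from collections import Counter, defaultdict

def build_bigram_table(corpus_tokens):
    """Count which tokens most often follow each token in a corpus."""
    table = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        table[prev][nxt] += 1
    return table

def heuristic_draft(table, generated_tokens, num_draft=4):
    """Greedily chain the most frequent bigram continuations as draft tokens."""
    if not generated_tokens:
        return []
    draft, cur = [], generated_tokens[-1]
    for _ in range(num_draft):
        if not table[cur]:
            break
        cur = table[cur].most_common(1)[0][0]
        draft.append(cur)
    return draft
```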
Research papers on heuristic speculative decoding or "heuristic drafters":
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Hugging Face, October 8, 2024, Faster Assisted Generation with Dynamic Speculation, https://huggingface.co/blog/dynamic_speculation_lookahead
- Ryan Sun, Tianyi Zhou, Xun Chen, Lichao Sun, 2024, SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, https://aclanthology.org/2024.emnlp-main.1148/ https://aclanthology.org/2024.emnlp-main.1148.pdf
More AI Research
Read more about:
- Generalized Speculative Decoding
- Ensemble (Multi-AI)
- Inference Optimizations
- Loop Optimizations
- Code Optimizations