Aussie AI

Generalized Speculative Decoding

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

What is Basic Speculative Decoding?

The original method of "speculative decoding" uses two models: small and large. The smaller model predicts the next token or a short run of tokens (i.e., it "speculates"), and the larger model then confirms their correctness. If the larger model agrees, the token is output, and any further processing of that token by the large model is skipped. If the larger model disagrees, this is a "rollback," and the larger model's own full prediction is used to generate the next output token instead.

Why is this faster? At first glance it seems slower, since not only is the bigger model running, but also an additional smaller model. However, in speculative decoding, the larger model is not doing full sequential predictions, only confirmations (in parallel).

The smaller model produces its output tokens faster, but less accurately. The speedup arises because the large model can confirm a run of speculated tokens in parallel faster than it can generate them one at a time with its own full predictions. Hence, if the small model is mostly correct, and rollbacks are few, this method is faster than having the bigger model predict every token itself.
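
As a minimal sketch of the loop (toy stand-in "models" as plain Python callables over token IDs, greedy exact-match acceptance rather than the probabilistic rejection-sampling rule of the original papers, and illustrative names like draft_model and verify_model), the control flow looks roughly like this:

    # Toy sketch of draft-then-verify speculative decoding (greedy acceptance).
    # A "model" here is just a callable mapping a token sequence to the next token ID.
    from typing import Callable, List

    Model = Callable[[List[int]], int]

    def speculative_decode(draft_model: Model, verify_model: Model,
                           prompt: List[int], num_tokens: int, k: int = 4) -> List[int]:
        tokens = list(prompt)
        while len(tokens) < len(prompt) + num_tokens:
            # 1. Draft: the small model speculates k tokens autoregressively (cheap).
            draft: List[int] = []
            for _ in range(k):
                draft.append(draft_model(tokens + draft))
            # 2. Verify: the big model checks every drafted position; on a GPU this is
            #    a single batched forward pass, simulated here with a Python loop.
            accepted: List[int] = []
            for i in range(k):
                target = verify_model(tokens + draft[:i])
                if target == draft[i]:
                    accepted.append(draft[i])   # speculation confirmed, this token is "free"
                else:
                    accepted.append(target)     # rollback: keep the big model's token, drop the rest
                    break
            tokens.extend(accepted)
        return tokens[:len(prompt) + num_tokens]

    if __name__ == "__main__":
        # Toy models: the big model counts up by one; the draft model gets multiples of 5 wrong.
        big = lambda seq: seq[-1] + 1
        small = lambda seq: 0 if (seq[-1] + 1) % 5 == 0 else seq[-1] + 1
        print(speculative_decode(small, big, prompt=[0], num_tokens=12))  # [0, 1, 2, ..., 12]

Real implementations also typically take one extra "bonus" token from the verifier when all k drafts are accepted, and use sampling-aware acceptance rules so that the output distribution matches the big model exactly.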

What is Generalized Speculative Decoding?

Some bright spark realized that the "smaller model" does not need to be a full model. It can be a cut-down model. It need not even be a model at all.

The speculating component only has to be some method that can "speculate" as to what the next token probably will be. Any method that is correct more often than not can yield a speedup. Hence, generalized speculative decoding uses various other methods as the "speculator":

  • Smaller model — the basic type of speculative decoding
  • Early exit of the large model — effectively a smaller model running inside the full large model.
  • Non-autoregressive decoding algorithms.
  • Aggressive decoding (for editing)
  • Blockwise parallel decoding (generate multiple predicted tokens in parallel with a big model)
  • Multi-token decoding algorithms (e.g. efficient drafting of multiple token candidates).
  • Tree-structured drafting methods (akin to beam search followed by verification).
  • Non-Transformer small models (e.g. RNNs as draft model).
  • Non-LLM heuristics — any coding method that predicts the next token in a sequence (generally difficult to achieve without a model!).

Hence, the method is generalized to any cut-down model and any non-model heuristic. In general, any LLM optimization that creates a model that is faster but less accurate can be considered (e.g., pruning, quantization, early exit, and many more). Faster decoding algorithms, such as non-autoregressive models or blockwise parallel decoding, also meet this criterion by generating multiple suggested tokens quickly, which can then be verified in parallel. In the special case of editing and grammatical error correction, the input prompt itself can be considered a form of draft text (because the edited output is usually similar to the input), so "aggressive decoding" is effectively a type of generalized speculative decoding. Furthermore, any non-LLM coding heuristic that can guess the next tokens of a sequence can also act as the speculator.
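
One way to picture the generalization is as an interface: the verification loop only needs a "drafter" callable that proposes a short run of candidate tokens, so a pruned or quantized model, an early-exit path, or a plain heuristic can all be dropped into the same slot. The sketch below uses assumed names (Drafter, Verifier, verify_step) and greedy acceptance, not any particular library's API:

    # Sketch: the verify loop is agnostic about where the drafted tokens come from.
    from typing import Callable, List

    Drafter = Callable[[List[int], int], List[int]]   # (context, k) -> up to k proposed tokens
    Verifier = Callable[[List[int]], int]             # (context) -> next token (the big model)

    def verify_step(drafter: Drafter, verifier: Verifier,
                    tokens: List[int], k: int = 4) -> List[int]:
        """One iteration: draft up to k tokens, keep the longest verified prefix."""
        draft = drafter(tokens, k)
        accepted: List[int] = []
        for i, tok in enumerate(draft):
            target = verifier(tokens + draft[:i])   # in practice, one parallel forward pass
            if target == tok:
                accepted.append(tok)
            else:
                accepted.append(target)             # keep the verifier's correction, drop the rest
                break
        return accepted

Any drafter that proposes several tokens at a time and is right more often than not speeds up the loop; a drafter that is usually wrong just degenerates into ordinary big-model decoding plus wasted draft work.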

Note that it's not enough for the speculator to suggest just one token. The method is only faster if the verifier model can check two or more speculated tokens in parallel, so there's little value in a simple heuristic that predicts only the single next token. For example, we could try a heuristic that detects the start of a sentence (e.g., the prior token is ".") and then predicts a comma after words like "However" and "Hence". But this isn't a speedup, because it only ever predicts a single token (the comma). It's much more difficult to come up with heuristics that predict multiple tokens in a row. Might as well use a model!
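
As a toy illustration of a non-LLM heuristic that does propose multiple tokens (in the spirit of the editing case, where the output largely copies the input), here is a sketch of a prompt-lookup drafter that matches the trailing n-gram of the context against the prompt and, on a hit, copies the tokens that followed it. The function name and parameters are illustrative, not from any particular paper:

    from typing import Callable, List

    def prompt_lookup_drafter(prompt: List[int], ngram: int = 3) -> Callable[[List[int], int], List[int]]:
        """Drafter that copies continuations from the prompt whenever the trailing
        n-gram of the context also appears in the prompt (useful mainly for
        editing-style tasks where the output mostly repeats the input)."""
        def drafter(context: List[int], k: int) -> List[int]:
            key = context[-ngram:]
            if len(key) < ngram:
                return []
            for i in range(len(prompt) - ngram):
                if prompt[i:i + ngram] == key:
                    return prompt[i + ngram : i + ngram + k]   # copy up to k following tokens
            return []                                          # no match: nothing to speculate
        return drafter

A drafter like this plugs straight into a verification loop such as the one sketched earlier; when it finds no match, a real implementation simply falls back to one ordinary big-model decoding step.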

On-device execution is problematic. Like basic speculative decoding, generalized speculative decoding doesn't work well on platforms without spare computation capacity. All types of speculative decoding gain speed by parallelizing computation, farming the extra work out to spare GPU capacity (or extra GPUs), rather than by reducing the overall amount of computation, so they don't help much on phones and AI PC platforms.

In fact, speculative decoding increases the total amount of computation, because it runs not only the big model on every token but also the smaller draft model, and both run wastefully whenever the big model rejects drafted tokens. Hence, it is a tradeoff: more total processing, spread across parallel GPU capacity, in exchange for a wall-clock speedup.
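
As a rough, illustrative calculation (assumed numbers, not benchmarks): suppose one draft-model step costs a tenth of a big-model step, drafts come in runs of four, and on average three of each four drafted tokens are accepted. Each iteration then yields about three tokens for the latency of four cheap draft steps plus one parallel verification pass, instead of three sequential big-model steps, so wall-clock latency drops substantially. But the total work goes up: four draft steps plus a verification pass that computes all four positions, some of which are then discarded on rejection.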

Survey Papers on Generalized Speculative Decoding

Survey research papers include:

  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (A survey paper on speculative decoding, which has a section on generalized speculative decoding.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234 (General survey on inference speedups, with a section on generalized speculative decoding.)

Generalized Speculative Decoding Research

Research papers on the various forms of generalized speculative decoding are grouped into the subsections below.

Early Exit in Generalized Speculative Decoding

One of the ways to do generalized speculative decoding is to use the large model as the smaller draft model, simply by doing early exiting of layers dynamically within the large model.

The advantage of using early exit of the large model as the drafter is that no computation is wasted if a draft token is rejected, because the full-size model's inference can continue where the shallow draft pass left off inside the big model. The downside of this approach is that each layer of a large model contains many weights, so even a few layers of a big model may be larger and more computationally intensive than a truly small, compact model (even when all the layers of the small model are run).
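
A minimal sketch of the idea, with toy "layers" as plain functions and assumed names like exit_layer (real systems such as LayerSkip-style self-speculative decoding also share the KV cache between the draft and verify passes):

    # Sketch: the drafter is just a shallow prefix of the verifier's own layer stack.
    from typing import Callable, List

    Layer = Callable[[float], float]           # toy "layer": refines a hidden state
    Head = Callable[[float], int]              # toy output head: hidden state -> token ID
    Embed = Callable[[List[int]], float]       # toy embedding: tokens -> hidden state

    def early_exit_drafter(layers: List[Layer], head: Head, embed: Embed, exit_layer: int):
        """Drafter that runs only the first `exit_layer` layers of the big model."""
        def drafter(context: List[int], k: int) -> List[int]:
            draft: List[int] = []
            for _ in range(k):
                h = embed(context + draft)
                for layer in layers[:exit_layer]:   # shallow pass: cheap draft prediction
                    h = layer(h)
                draft.append(head(h))
            return draft
        return drafter

    def full_verifier(layers: List[Layer], head: Head, embed: Embed):
        """Verifier: the same layer stack, run all the way through."""
        def verifier(context: List[int]) -> int:
            h = embed(context)
            for layer in layers:                    # full pass: accurate verification
                h = layer(h)
            return head(h)
        return verifier

Because the draft pass is literally a prefix of the verify pass, a rejected draft wastes relatively little work: the full model can pick up from the shared early layers rather than starting again from scratch.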

  • Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (This paper mentions early exit in relation to generalized speculative decoding.)
  • Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)
  • Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
  • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
  • Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
  • Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
  • Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 16 Feb 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, https://arxiv.org/abs/2402.11131
  • Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
  • Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
  • Hyun Ryu, Eric Kim, 20 Nov 2024, Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding, https://arxiv.org/abs/2411.13157
  • Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
  • Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip

Non-Autoregressive Decoding in Generalized Speculative Decoding

  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)

Multi-Token Decoding in Generalized Speculative Decoding

  • Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui, 15 Jan 2024, Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, https://arxiv.org/abs/2401.07851 (Survey paper has coverage of this type of speculative decoding.)

Non-LLM Heuristics in Generalized Speculative Decoding

Research papers on non-LLM approaches for the draft model in generalized speculative decoding:

Hierarchical Speculative Decoding

Another way to generalize speculative decoding is to do it more than once, in layers. The smaller drafting model could itself be accelerated by another even smaller model. In this way, there are layers of drafters. This idea has been called "hierarchical speculative decoding."

This method is largely orthogonal to the other variants of speculative decoding, which could all be applied in this multi-layer manner.
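
A sketch of the layering idea, with assumed names (verified_drafter, mid_verifier) and greedy acceptance: the medium model verifies the tiny model's drafts, and the accepted tokens become the draft handed up to the big model, which does the final verification. Real systems such as TriForce layer further concerns (e.g., KV cache management) on top of this basic structure:

    # Sketch: wrap a (drafter, verifier) pair so the pair itself acts as a drafter
    # for the next level up in the hierarchy.
    from typing import Callable, List

    Drafter = Callable[[List[int], int], List[int]]
    Verifier = Callable[[List[int]], int]

    def verified_drafter(inner_drafter: Drafter, mid_verifier: Verifier) -> Drafter:
        def drafter(context: List[int], k: int) -> List[int]:
            draft: List[int] = []
            while len(draft) < k:
                proposed = inner_drafter(context + draft, k - len(draft))
                accepted: List[int] = []
                for i, tok in enumerate(proposed):
                    target = mid_verifier(context + draft + proposed[:i])
                    if target == tok:
                        accepted.append(tok)        # tiny model's guess confirmed by the medium model
                    else:
                        accepted.append(target)     # medium model overrides; restart drafting from here
                        break
                if not accepted:                    # no proposals: take one plain medium-model token
                    accepted = [mid_verifier(context + draft)]
                draft.extend(accepted)
            return draft[:k]
        return drafter

The wrapped drafter can then be handed to the same outer verification loop with the big model as the final verifier, and in principle the wrapping can be repeated for more than two levels.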

Research papers on hierarchical speculative decoding:

  • Dakota Goldberg, Nov. 16, 2023, Accelerating large model inference with speculative decoding, 6.S898 Deep Learning Blogs 2023, MIT, https://deep-learning-mit.github.io/staging/blog/2023/speculative-decoding/ (Conducts an analysis of speculative decoding and an advanced method of layering called "hierarchical speculative decoding.")
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
