Aussie AI

Parallel Decoding

  • Last Updated 17 November, 2025
  • by David Spuler, Ph.D.

What is Parallel Decoding?

Parallel decoding is an LLM optimization method that produces two or more tokens per decoding step. This is faster than the vanilla LLM "autoregressive" decoding method, which outputs only one token at a time, in a sequential manner. Parallel decoding algorithms aim to break this autoregression bottleneck in the decoder. The idea is to emit as many tokens in parallel as possible, which is much faster than greedy decoding or beam search decoding, both of which are autoregressive. For more information on the basics of the sequential decoder, see non-autoregressive decoding algorithms.
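
To make the bottleneck concrete, here is a minimal Python sketch of the standard autoregressive decoding loop, assuming a hypothetical greedy_next_token() function as a toy stand-in for a full LLM forward pass. Each loop iteration runs one forward pass and appends exactly one token; this sequential dependency is what parallel decoding methods try to break.

    def greedy_next_token(tokens: list[int]) -> int:
        """Toy stand-in for an LLM forward pass: returns the next token id."""
        return (sum(tokens) * 31 + len(tokens)) % 1000  # dummy deterministic logic

    def autoregressive_decode(prompt: list[int], max_new_tokens: int) -> list[int]:
        tokens = list(prompt)
        for _ in range(max_new_tokens):  # one full forward pass per output token
            tokens.append(greedy_next_token(tokens))
        return tokens

    print(autoregressive_decode([1, 2, 3], max_new_tokens=5))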

Types of Parallel Decoding Optimizations

There are several types of "parallel decoding" algorithms:

  • Speculative decoding
  • Generalized speculative decoding
  • Aggressive decoding
  • Lookup decoding
  • Prompt lookup decoding
  • Lookahead decoding

Note that the methods above all compute multiple tokens in parallel for a single query within a single model. There are also various ways to parallelize decoding at a higher level by using multiple models, which is called "ensemble decoding" (e.g., big-little decoding, consensus decoding, collaborative decoding).
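
Most of the single-model methods listed above share a draft-then-verify structure: a cheap mechanism proposes several future tokens, and the main model scores all of them in one parallel pass, accepting the longest correct prefix. The Python sketch below is illustrative only; draft_tokens() and verify_tokens() are hypothetical stand-ins for a small draft model (or a lookup table) and a batched forward pass of the large model.

    def draft_tokens(context: list[int], k: int) -> list[int]:
        """Toy drafter: cheaply proposes k candidate future tokens."""
        return [(context[-1] + i + 1) % 1000 for i in range(k)]

    def verify_tokens(context: list[int], draft: list[int]) -> list[int]:
        """Toy verifier: keeps the longest draft prefix it agrees with.
        The even-token rule stands in for checking the large model's logits."""
        accepted = []
        for tok in draft:
            if tok % 2 != 0:
                break
            accepted.append(tok)
        return accepted

    def parallel_decode(prompt: list[int], max_new_tokens: int, k: int = 4) -> list[int]:
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new_tokens:
            draft = draft_tokens(tokens, k)
            accepted = verify_tokens(tokens, draft)
            if not accepted:
                # Guarantee progress; a real verifier would emit its own next token here.
                accepted = [draft[0]]
            tokens.extend(accepted)  # several tokens can be emitted per step
        return tokens[: len(prompt) + max_new_tokens]

    print(parallel_decode([2, 4, 6], max_new_tokens=8))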

Research on Parallel Decoding

Papers on parallel decoding algorithms include (see also non-autoregressive decoding algorithms):

n-gram decoding

N-gram decoding is an LLM optimization that emits a sequence of tokens at each decoding step, rather than only one token at a time. An "n-gram" decoding algorithm is one that generates more than one token (i.e., n tokens in an "n-gram") in a single step. This is usually done with parallel execution, because generating the n tokens sequentially offers little benefit: that is how normal autoregressive decoding produces n-grams anyway, one token at a time.
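
As one concrete example, here is a minimal sketch of n-gram drafting in the style of prompt lookup decoding: if the current suffix of the context appeared earlier, the tokens that followed it are proposed as an n-gram draft, which the model then verifies in a single parallel pass. The find_ngram_continuation() helper is hypothetical, not a specific library API.

    def find_ngram_continuation(context: list[int], suffix_len: int = 2, n: int = 4) -> list[int]:
        """Return up to n tokens that followed an earlier occurrence of the current suffix."""
        if len(context) <= suffix_len:
            return []
        suffix = context[-suffix_len:]
        # Search earlier positions for the same suffix (excluding the trailing occurrence).
        for start in range(len(context) - suffix_len - 1, -1, -1):
            if context[start:start + suffix_len] == suffix:
                continuation = context[start + suffix_len: start + suffix_len + n]
                if continuation:
                    return continuation
        return []

    context = [5, 7, 9, 11, 13, 5, 7]        # the suffix "5 7" occurred earlier
    print(find_ngram_continuation(context))  # -> [9, 11, 13, 5]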

Research on n-gram generation:

  • Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin, 29 May 2024, Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, https://arxiv.org/abs/2405.19325 (Merging of RALM and speculative decoding.)
  • Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
  • Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa, 29 Apr 2024, Accelerating Production LLMs with Combined Token/Embedding Speculators, IBM Research, https://arxiv.org/abs/2404.19124 Code: https://github.com/foundation-model-stack/fms-fsdp Code: https://huggingface.co/ibm-fms (Extending Medusa architecture with a single multi-headed architecture so the draft model predicts an n-gram with multiple tokens more accurately.)
  • Joao Gante, 2023, Assisted generation: a new direction toward low-latency text generation, Hugging Face, DOI: 10.57967/hf/0638, https://huggingface.co/datasets/joaogante/assisted_generation (Using a model's forward pass to validate a sequence of multiple tokens, analogous to verification in speculative decoding.)
  • Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration (Generates multiple tokens on multiple branches for verification, giving a tree-structured approach.)
  • Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao, 25 Jan 2024 (v2), BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, https://arxiv.org/abs/2401.12522 Code: https://github.com/linfeng93/BiTA
  • Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh, 15 Mar 2024, Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, https://arxiv.org/abs/2403.10444 (Draft a block of tokens for verification.)
  • Giovanni Monea, Armand Joulin, Edouard Grave, 22 Nov 2023, PaSS: Parallel Speculative Sampling, https://arxiv.org/abs/2311.13581 (Generates multiple draft tokens using a parallel extension via "look ahead embeddings".)
  • Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
  • Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
  • Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott, 5 Mar 2024 (v2), Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement, Qualcomm AI Research, https://arxiv.org/abs/2402.14160 (Improvements of an adaptive inference version of a draft token-tree with multiple n-gram paths for speculative decoding.)
  • Jie Ou, Yueming Chen, Wenhong Tian, 10 Apr 2024, Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding, https://arxiv.org/abs/2404.08698 (Use an n-gram model as the drafter to create a version of parallel decoding or generalized speculative decoding.)
  • Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, Tie-yan Liu, 6 Jul 2023 (v2), A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, https://arxiv.org/pdf/2204.09269.pdf
  • Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, and Mahyar Najibi. Speculative streaming: Fast llm inference without auxiliary models. arXiv preprint arXiv:2402.11131, 2024. https://arxiv.org/abs/2402.11131
  • Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023. https://arxiv.org/abs/2308.04623
  • Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024. https://arxiv.org/abs/2401.15077
  • Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding, 2024. https://arxiv.org/abs/2402.05109
  • Wang, J., Chen, K., Chen, G., Shou, L., McAuley, J.: Skipbert: Efficient inference with shallow layer skipping. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7287–7301 (2022) https://aclanthology.org/2022.acl-long.503/ (Skips early layers of a model via precomputed lookup tables based on detecting known token n-grams in the prompt.)
  • Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 2 Jun 2024 (v2), Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 https://github.com/hmarkc/parallel-prompt-decoding
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
  • Michael Nuñez, July 4, 2024, Meta drops AI bombshell: Multi-token prediction models now open for research, https://venturebeat.com/ai/meta-drops-ai-bombshell-multi-token-prediction-models-now-open-for-research/
  • Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin, 1 May 2024, DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling, https://arxiv.org/abs/2405.00888 (A model trained to predict multiple tokens ahead.)
  • Daniel Jurafsky, James H. Martin, February 3, 2024 (draft of 3rd edition), N-gram Language Models, chapter 3, Speech and Language Processing, https://web.stanford.edu/~jurafsky/slp3/3.pdf https://web.stanford.edu/~jurafsky/slp3/ https://web.stanford.edu/~jurafsky/slp3/ed3bookfeb3_2024.pdf https://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ (2nd edition)
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Shibaranjani Dasgupta, Chandan Maity, Somdip Mukherjee, Rohan Singh, Diptendu Dutta, Debasish Jana, 14 Dec 2024, HITgram: A Platform for Experimenting with n-gram Language Models, https://arxiv.org/abs/2412.10717
  • Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang, 3 Feb 2025, Scaling Embedding Layers in Language Models, https://arxiv.org/abs/2502.01637 (Using n-gram multi-token embedding layers, because embeddings are cheap to compute, rather than increasing vocabulary size.)
  • Aditya Varre, Gizem Yüce, Nicolas Flammarion, 19 Aug 2025, Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points, https://arxiv.org/abs/2508.12837
  • Jerry Li, Evangelos Papalexakis, 3 Sep 2025, Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection, https://arxiv.org/abs/2509.05360
  • Mohsinul Kabir, Tasfia Tahsin, Sophia Ananiadou, 17 Sep 2025, From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling, https://arxiv.org/abs/2505.12381
  • Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty, 26 Sep 2025, Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity, https://arxiv.org/abs/2509.22641

Blockwise Parallel Decoding

Blockwise parallel decoding is an LLM inference optimization whereby the decoder outputs multiple tokens at once, by processing in blocks. It is a type of parallel decoding that improves efficiency beyond the vanilla LLM decoder, which emits one token at a time in an autoregressive, sequential manner.
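
Below is a minimal sketch of the blockwise idea, in the spirit of the Stern et al. work cited below: auxiliary heads guess the next k tokens in one step, and the base model then verifies those guesses in parallel, keeping the longest matching prefix. Both predict_block() and base_model_next() are toy stand-ins, not a real model API.

    def base_model_next(tokens: list[int]) -> int:
        """Toy base model: the 'correct' next token."""
        return (tokens[-1] * 7 + 3) % 100

    def predict_block(tokens: list[int], k: int) -> list[int]:
        """Toy block heads: guess k future tokens at once (later heads are less accurate)."""
        guesses, t = [], list(tokens)
        for i in range(k):
            guess = base_model_next(t) if i < 2 else 0  # heads 3+ guess wrong on purpose
            guesses.append(guess)
            t.append(guess)
        return guesses

    def blockwise_decode(prompt: list[int], steps: int, k: int = 4) -> list[int]:
        tokens = list(prompt)
        for _ in range(steps):
            block = predict_block(tokens, k)
            # Verification: check each guess against the base model, conditioning on the
            # previously accepted guesses (done as one batched pass in a real system).
            t = list(tokens)
            for guess in block:
                if guess != base_model_next(t):
                    break
                t.append(guess)
            accepted = t[len(tokens):] or [base_model_next(tokens)]  # always make progress
            tokens.extend(accepted)
        return tokens

    print(blockwise_decode([42], steps=3))  # accepts roughly two tokens per step here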

Research on blockwise parallel decoding:

  • Stern, Mitchell Thomas, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf
  • Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 2024, Exploring and Improving Drafts in Blockwise Parallel Decoding, https://openreview.net/pdf?id=KtnUTS1f91

Lookahead Decoding

Lookahead decoding is a type of parallel decoding where the algorithm attempts to "look ahead" at more than one future token in parallel. This allows the output of two or more tokens at a time, which is more efficient than the standard autoregressive decoding algorithms.
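
Here is a minimal sketch of the Jacobi-iteration idea behind lookahead decoding, under toy assumptions: an initial guess for a whole window of future tokens is refined in parallel until it stops changing (a fixed point), rather than being generated one token at a time. The next_token() function is a deterministic toy stand-in for the model; a real implementation batches all window positions into a single forward pass per iteration.

    def next_token(tokens: list[int]) -> int:
        """Toy deterministic model: the next token is the previous one plus one."""
        return (tokens[-1] + 1) % 50

    def jacobi_decode_window(prompt: list[int], window: int, max_iters: int = 20) -> list[int]:
        guess = [0] * window                  # arbitrary initial guesses for future tokens
        for _ in range(max_iters):
            # Update every position "in parallel" from the current guess.
            new_guess = [next_token(prompt + guess[:i]) for i in range(window)]
            if new_guess == guess:            # fixed point: all positions are self-consistent
                break
            guess = new_guess
        return guess

    print(jacobi_decode_window([10], window=5))  # converges to [11, 12, 13, 14, 15]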

MEDUSA Decoding

Medusa decoding is an LLM optimization that uses multiple decoding heads to emit tokens in parallel. This method avoids the sequential bottleneck of emitting one token at a time, as in the standard autoregressive LLM decoding algorithm.
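
The sketch below illustrates the Medusa-style idea with toy NumPy matrices as the extra heads: several decoding heads read the same final hidden state, and each predicts a token one position further ahead, so a single forward pass yields a multi-token candidate (or a small candidate tree) for verification. This is illustrative only, not the real Medusa architecture or its training recipe.

    import numpy as np

    rng = np.random.default_rng(0)
    HIDDEN, VOCAB, NUM_HEADS = 16, 100, 3

    base_head = rng.standard_normal((HIDDEN, VOCAB))       # predicts position +1
    medusa_heads = [rng.standard_normal((HIDDEN, VOCAB))
                    for _ in range(NUM_HEADS)]              # positions +2, +3, +4

    def propose_candidates(hidden_state: np.ndarray, top_k: int = 2) -> list[list[int]]:
        """One forward pass: the base head and each Medusa head emit top-k token ids
        for successive positions; their cross-product forms a small candidate tree."""
        all_logits = [hidden_state @ base_head] + [hidden_state @ h for h in medusa_heads]
        return [np.argsort(logits)[-top_k:][::-1].tolist() for logits in all_logits]

    hidden = rng.standard_normal(HIDDEN)   # stand-in for the LLM's final hidden state
    print(propose_candidates(hidden))      # top-k token ids for positions +1 .. +4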

More Research on Decoding Algorithms

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: