Aussie AI
Parallel Decoding
-
Last Updated 8 December, 2024
-
by David Spuler, Ph.D.
What is Parallel Decoding?
Parallel decoding algorithms aim to break the autoregression bottleneck in the decoder's output phase. The idea is to emit as many tokens in parallel as possible, which is much faster than greedy decoding or beam search decoding, both of which are autoregressive and emit only one token per model pass. For more information on the basics of the sequential decoder, see non-autoregressive decoding algorithms.
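As a rough illustration of the difference, here is a minimal Python sketch (not taken from any of the papers below) contrasting one-token-per-pass autoregressive decoding with a simple draft-then-verify parallel loop. The next_token and draft_tokens functions are hypothetical stand-ins for a full LLM forward pass and a cheap drafter; in a real engine the k verification calls would be batched into a single parallel pass.

# Toy sketch: autoregressive decoding vs. draft-then-verify parallel decoding.
# next_token() and draft_tokens() are hypothetical stand-ins, not a real model.

def next_token(context):
    """Stand-in for a full LLM forward pass over integer token IDs (toy model)."""
    return (sum(context) * 31 + 7) % 100

def autoregressive_decode(prompt, n):
    """Baseline: n sequential model calls for n output tokens."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))      # each token waits for the previous one
    return out[len(prompt):]

def draft_tokens(context, k):
    """Cheap drafter proposing k tokens at once (here it happens to mimic the model)."""
    out = list(context)
    for _ in range(k):
        out.append(next_token(out))
    return out[len(context):]

def parallel_decode(prompt, n, k=4):
    """Draft k tokens, verify them against the main model, keep the accepted prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        draft = draft_tokens(out, k)
        accepted = []
        for tok in draft:
            # In a real engine these k verification calls run as one batched pass.
            if next_token(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # Always make progress: fall back to one real token if nothing was accepted.
        out += accepted if accepted else [next_token(out)]
    return out[len(prompt):len(prompt) + n]

assert autoregressive_decode([1, 2, 3], 8) == parallel_decode([1, 2, 3], 8)

When the draft is mostly correct, one verification pass accepts several tokens at once; when it is wrong, the loop degrades gracefully to ordinary one-token decoding.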
Types of Parallel Decoding Optimizations
There are several types of "parallel decoding" algorithms:
- Speculative decoding
- Generalized speculative decoding
- Aggressive decoding
- Lookup decoding
- Prompt lookup decoding
- Lookahead decoding
Note that the above methods all compute multiple tokens in parallel for a single query within a single model. There are also various ways to parallelize decoding at a higher level by using multiple models, which is called "ensemble decoding" (e.g., big-little decoding, consensus decoding, collaborative decoding).
Research on Parallel Decoding
Papers on parallel decoding algorithms include (see also non-autoregressive decoding algorithms):
- Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà, May 2023, Accelerating Transformer Inference for Translation via Parallel Decoding, https://arxiv.org/abs/2305.10427
- A Elgam, Y Peretz, Y Pinhasi, 2023, Enhancing MIMO Spatial-Multiplexing and Parallel-Decoding under Interference by Computational Feedback, Electronics, https://www.mdpi.com/2079-9292/12/3/761
- Marjan Ghazvininejad, Omer Levy, Yinhan Liu, Luke Zettlemoyer, 2019, Mask-Predict: Parallel Decoding of Conditional Masked Language Models, https://arxiv.org/abs/1904.09324
- Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee, July 2023, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, https://arxiv.org/abs/2307.05908
- X Xia, Y Zheng, T Gu, November 2019, FTrack: Parallel decoding for LoRa transmissions, SenSys '19: Proceedings of the 17th Conference on Embedded Networked Sensor Systems, Pages 192–204, https://dl.acm.org/doi/abs/10.1145/3356250.3360024 PDF: https://web.comp.polyu.edu.hk/csyqzheng/papers/FTrack_Sensys19.pdf
- M Sagong, Y Shin, S Kim, S Park, 2019, Pepsi: Fast image inpainting with parallel decoding network, https://openaccess.thecvf.com/content_CVPR_2019/papers/Sagong_PEPSI__Fast_Image_Inpainting_With_Parallel_Decoding_Network_CVPR_2019_paper.pdf
- M Stern, N Shazeer, J Uszkoreit, 2018, Blockwise parallel decoding for deep autoregressive models https://arxiv.org/abs/1811.03115, https://proceedings.neurips.cc/paper/2018/hash/c4127b9194fe8562c64dc0f5bf2c93bc-Abstract.html PDF: https://proceedings.neurips.cc/paper/2018/file/c4127b9194fe8562c64dc0f5bf2c93bc-Paper.pdf
- Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf, 19 Mar 2024, Encode Once and Decode in Parallel: Efficient Transformer Decoding, https://arxiv.org/abs/2403.13112
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
- Xuefei Ning, Zinan Lin, November 17, 2023, Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/ Code: https://github.com/imagination-research/sot/
- Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang, May 2024, Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation, ICLR 2024, https://www.microsoft.com/en-us/research/publication/skeleton-of-thought-large-language-models-can-do-parallel-decoding/ https://neurips2023-enlsp.github.io/papers/paper_33.pdf Code: https://github.com/imagination-research/sot/
- Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 28 May 2024, Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 Code: https://github.com/hmarkc/parallel-prompt-decoding (Similar to speculative decoding with extra trained prompt tokens and a tree-structured verification of multiple optional draft sequences.)
- Stern, Mitchell Thomas, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf
- Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737
- Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
- Abhijeet Awasthi, Sunita Sarawagi, Rasna Goyal, Sabyasachi Ghosh, Vihari Piratla, 15 May 2020 (v2), Parallel Iterative Edit Models for Local Sequence Transduction, https://arxiv.org/abs/1910.02893 (Transforms the input text into a sequence of edits and then does parallel optimizations.)
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, https://arxiv.org/abs/2310.12072 (Speculatively executes multiple future tokens in parallel with the current token, using high-probability tokens from the early layers of the current token's inference; this allows speculative inference of the next tokens to begin before the current token is finished.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration (Generates multiple tokens on multiple branches for verification, giving a tree-structured approach.)
- Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J. Yadwadkar, Aditya Akella, 27 Oct 2023, MOSEL: Inference Serving Using Dynamic Modality Selection, https://arxiv.org/abs/2310.18481 (Multi-modal model with dynamic selection of modality.)
- Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao, 25 Jan 2024 (v2), BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, https://arxiv.org/abs/2401.12522 Code: https://github.com/linfeng93/BiTA
- Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh, 15 Mar 2024, Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, https://arxiv.org/abs/2403.10444 (Draft a block of tokens for verification.)
- Giovanni Monea, Armand Joulin, Edouard Grave, 22 Nov 2023, PaSS: Parallel Speculative Sampling, https://arxiv.org/abs/2311.13581 (Generates multiple draft tokens using a parallel extension via "look ahead embeddings".)
- Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang, 15 Feb 2024 (v4), PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition, https://arxiv.org/abs/2402.04838 (Use of parallel decoding in Named Entity Recognition use case.)
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 30 Mar 2024, DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference, https://arxiv.org/abs/2404.00242
- Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott, 5 Mar 2024 (v2), Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement, Qualcomm AI Research, https://arxiv.org/abs/2402.14160 (Improvements of an adaptive inference version of a draft token-tree with multiple n-gram paths for speculative decoding.)
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Jie Ou, Yueming Chen, Wenhong Tian, 10 Apr 2024, Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding, https://arxiv.org/abs/2404.08698 (Use an n-gram model as the drafter to create a version of parallel decoding or generalized speculative decoding.)
- Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, Nov 2023, Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263
- Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
- Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao, 16 Apr 2024 (v2), Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding, https://arxiv.org/abs/2402.11809 (Semi-autoregressive draft model with parallel verification.)
- Shuzhang Zhong, Zebin Yang, Meng Li, Ruihao Gong, Runsheng Wang, Ru Huang, 21 Feb 2024, Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding, https://arxiv.org/abs/2402.13485 (Parallel decoding with dynamic token tree pruning.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, Y. Dong, 2024, APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding, arXiv preprint arXiv:2401.06761, https://arxiv.org/abs/2401.06761
- Benjamin Spector, Chris Re, 2023, Accelerating LLM Inference with Staged Speculative Decoding, arXiv preprint arXiv:2308.04623, https://arxiv.org/abs/2308.04623
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv preprint arXiv:2401.15077, https://arxiv.org/abs/2401.15077
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon, 2024, Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding, https://arxiv.org/abs/2402.05109
- Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, Hao Zhang, 8 Mar 2024 (v3), CLLMs: Consistency Large Language Models, https://arxiv.org/abs/2403.00835
- Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 26 Mar 2024 (v3), Tandem Transformers for Inference Efficient LLMs, https://arxiv.org/abs/2402.08644
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
- Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun, 5 Jun 2024 (v2), JumpCoder: Go Beyond Autoregressive Coder via Online Modification, https://arxiv.org/abs/2401.07870 Code: https://github.com/Keytoyze/JumpCoder
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Michael Nuñez, July 4, 2024, Meta drops AI bombshell: Multi-token prediction models now open for research, https://venturebeat.com/ai/meta-drops-ai-bombshell-multi-token-prediction-models-now-open-for-research/
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Federico Berto, Chuanbo Hua, Laurin Luttmann, Jiwoo Son, Junyoung Park, Kyuree Ahn, Changhyun Kwon, Lin Xie, Jinkyoo Park, 5 Sep 2024, PARCO: Learning Parallel Autoregressive Policies for Efficient Multi-Agent Combinatorial Optimization, https://arxiv.org/abs/2409.03811 https://github.com/ai4co/parco
- Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet, 24 Sep 2024, Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR, https://arxiv.org/abs/2409.15869
- Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu, 2 Oct 2024, Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding, https://arxiv.org/abs/2410.01699
- Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu, 8 Oct 2024, ParallelSpec: Parallel Drafter for Efficient Speculative Decoding, https://arxiv.org/abs/2410.05589 (Multi-token prediction in draft models for speculative decoding.)
- Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang, 17 Oct 2024, Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement, https://arxiv.org/abs/2410.13344
- Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang, 2 Dec 2024, RandAR: Decoder-only Autoregressive Visual Generation in Random Orders, https://arxiv.org/abs/2412.01827 https://rand-ar.github.io/ (Attempt to parallelize image generation decoding by randomizing the order at which to create patches of an image.)
- Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, Bohan Zhuang, 5 Dec 2024, ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality, https://arxiv.org/abs/2412.04062
n-gram Decoding
An "n-gram" decoding algorithm is one that generates more than one token (i.e., n tokens in an "n-gram") in one single sequence. This is usually done in parallel execution, because it isn't much of an optimization to run this sequentially, because that's how normal autoregressive decoding generates n-grams, too.
Research on n-gram generation:
- Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin, 29 May 2024, Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, https://arxiv.org/abs/2405.19325 (Merging of RALM and speculative decoding.)
- Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
- Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa, 29 Apr 2024, Accelerating Production LLMs with Combined Token/Embedding Speculators, IBM Research, https://arxiv.org/abs/2404.19124 Code: https://github.com/foundation-model-stack/fms-fsdp Code: https://huggingface.co/ibm-fms (Extending Medusa architecture with a single multi-headed architecture so the draft model predicts an n-gram with multiple tokens more accurately.)
- Joao Gante, 2023, Assisted generation: a new direction toward low-latency text generation, Hugging Face, DOI: 10.57967/hf/0638, https://huggingface.co/datasets/joaogante/assisted_generation (Using a model's forward pass to valid a sequence of multiple tokens, analogous to verification in speculative decoding.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration (Generates multiple tokens on multiple branches for verification, giving a tree-structured approach.)
- Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao, 25 Jan 2024 (v2), BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, https://arxiv.org/abs/2401.12522 Code: https://github.com/linfeng93/BiTA
- Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh, 15 Mar 2024, Optimal Block-Level Draft Verification for Accelerating Speculative Decoding, https://arxiv.org/abs/2403.10444 (Draft a block of tokens for verification.)
- Giovanni Monea, Armand Joulin, Edouard Grave, 22 Nov 2023, PaSS: Parallel Speculative Sampling, https://arxiv.org/abs/2311.13581 (Generates multiple draft tokens using a parallel extension via "look ahead embeddings".)
- Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng, 22 Mar 2024 (v2), Recurrent Drafter for Fast Speculative Decoding in Large Language Models, https://arxiv.org/abs/2403.09919 (Use of small RNN as the drafting model for speculative decoding.)
- Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
- Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott, 5 Mar 2024 (v2), Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement, Qualcomm AI Research, https://arxiv.org/abs/2402.14160 (Improvements of an adaptive inference version of a draft token-tree with multiple n-gram paths for speculative decoding.)
- Jie Ou, Yueming Chen, Wenhong Tian, 10 Apr 2024, Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding, https://arxiv.org/abs/2404.08698 (Use an n-gram model as the drafter to create a version of parallel decoding or generalized speculative decoding.)
- Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, Tie-yan Liu, 6 Jul 2023 (v2), A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, https://arxiv.org/pdf/2204.09269.pdf
- Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi, 2024, Speculative Streaming: Fast LLM Inference without Auxiliary Models, arXiv preprint arXiv:2402.11131, https://arxiv.org/abs/2402.11131
- Benjamin Spector, Chris Re, 2023, Accelerating LLM Inference with Staged Speculative Decoding, arXiv preprint arXiv:2308.04623, https://arxiv.org/abs/2308.04623
- Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv preprint arXiv:2401.15077, https://arxiv.org/abs/2401.15077
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon, 2024, Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding, https://arxiv.org/abs/2402.05109
- Wang, J., Chen, K., Chen, G., Shou, L., McAuley, J.: Skipbert: Efficient inference with shallow layer skipping. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7287–7301 (2022) https://aclanthology.org/2022.acl-long.503/ (Skips early layers of a model via precomputed lookup tables based on detecting known token n-grams in the prompt.)
- Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 2 Jun 2024 (v2), Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 https://github.com/hmarkc/parallel-prompt-decoding
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Michael Nuñez, July 4, 2024, Meta drops AI bombshell: Multi-token prediction models now open for research, https://venturebeat.com/ai/meta-drops-ai-bombshell-multi-token-prediction-models-now-open-for-research/
- Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin, 1 May 2024, DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling, https://arxiv.org/abs/2405.00888 (A model trained to predict multiple tokens ahead.)
- Daniel Jurafsky, James H. Martin, February 3, 2024 (draft of 3rd edition), N-gram Language Models, Chapter 3 of Speech and Language Processing, https://web.stanford.edu/~jurafsky/slp3/3.pdf https://web.stanford.edu/~jurafsky/slp3/ https://web.stanford.edu/~jurafsky/slp3/ed3bookfeb3_2024.pdf https://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ (2nd edition)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
Blockwise Parallel Decoding
- Stern, Mitchell Thomas, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf
- Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 2024, Exploring and Improving Drafts in Blockwise Parallel Decoding, https://openreview.net/pdf?id=KtnUTS1f91
Lookahead Decoding
Lookahead decoding is a type of parallel decoding in which the algorithm attempts to "look ahead" and refine several future tokens in parallel.
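The sketch below is a simplified, toy version of the Jacobi-iteration idea behind lookahead decoding (the next_token function and the window size W are illustrative assumptions, not any paper's exact algorithm): keep a window of W guessed future tokens, recompute all W positions in one step that could run in parallel, and commit the window once the guesses stop changing.

# Toy sketch of Jacobi-style lookahead decoding.
W = 4   # lookahead window size

def next_token(context):
    """Hypothetical toy model: next token is a deterministic function of the context."""
    return (sum(context) * 7 + 3) % 50

def lookahead_decode(prompt, num_tokens):
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        guesses = [0] * W                  # initial guesses for the next W tokens
        while True:
            # Jacobi step: every position is recomputed from the *previous* guesses,
            # so the W model calls are independent of each other and can run in parallel.
            new = [next_token(out + guesses[:i]) for i in range(W)]
            if new == guesses:             # fixed point: the window matches greedy decoding
                break
            guesses = new
        out += guesses                     # commit the converged window
    return out[len(prompt):len(prompt) + num_tokens]

print(lookahead_decode([5, 9], 8))

Real lookahead decoding does not wait for full convergence; it collects candidate n-grams from the Jacobi trajectory and verifies them against the base model, much like speculative decoding.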
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration
- Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
MEDUSA Decoding
- Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum, 19 Jun 2024, Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive Style, https://arxiv.org/abs/2406.13170 (Applying bi-directional decoding to speculative decoding.)
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, September 11, 2023, Medusa: Simple framework for accelerating LLM generation with multiple decoding heads, https://www.together.ai/blog/medusa
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, Tri Dao, 2024, Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv preprint arXiv:2401.10774, https://arxiv.org/abs/2401.10774
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon, 2024, Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding, https://arxiv.org/abs/2402.05109
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 29 May 2024 (v2), DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference, https://arxiv.org/abs/2404.00242 https://openreview.net/forum?id=HqfLHoX8bR
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Kaiqi Zhang, Jing Zhao, Rui Chen, 15 Aug 2024, KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning, https://arxiv.org/abs/2408.08146
- Karl Stratos, 2024, Speculative Decoding, https://karlstratos.com/notes/speculative.pdf
- Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Hongen Shao, Xiaofeng Zou, 18 Apr 2024 (v2), Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens, https://arxiv.org/abs/2402.15758
- Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa, 6 Jun 2024 (v2), Accelerating Production LLMs with Combined Token/Embedding Speculators, https://arxiv.org/abs/2404.19124 https://github.com/foundation-model-stack/fms-fsdp https://huggingface.co/ibm-fms
- Wei Zhong, Manasa Bharadwaj, 1 Jun 2024 (v2), S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314
- Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet, 24 Sep 2024, Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR, https://arxiv.org/abs/2409.15869
- Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 14 Oct 2024, DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads, https://arxiv.org/abs/2410.10819 https://github.com/mit-han-lab/duo-attention
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
- Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding
- Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding
- Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
Read more about: