Aussie AI

Parallel Decoding

  • Last Updated 8 December, 2024
  • by David Spuler, Ph.D.

What is Parallel Decoding?

Parallel decoding algorithms aim to break the autoregression bottleneck in the decoder's output phase. The idea is to emit as many tokens in parallel as possible, which is much faster than greedy decoding or beam search, both of which are autoregressive and emit one token per step. For more information on the basics of the sequential decoder, see non-autoregressive decoding algorithms.
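
To see why autoregression is a bottleneck, consider this minimal Python sketch of greedy decoding, where next_token() is a hypothetical stand-in for a full model forward pass; each loop iteration depends on the previous iteration's output, so the forward passes cannot run in parallel:

    def next_token(tokens):
        # Hypothetical stand-in for a full LLM forward pass,
        # returning one predicted token id.
        return (sum(tokens) + len(tokens)) % 50000

    def greedy_decode(prompt_tokens, n_new):
        tokens = list(prompt_tokens)
        for _ in range(n_new):                 # n_new sequential passes:
            tokens.append(next_token(tokens))  # step i needs step i-1's output
        return tokens

    print(greedy_decode([101, 2023], 8))

Parallel decoding methods attack exactly this loop, by proposing and verifying several future tokens per forward pass.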

Types of Parallel Decoding Optimizations

There are several types of "parallel decoding" algorithms:

  • Speculative decoding
  • Generalized speculative decoding
  • Aggressive decoding
  • Lookup decoding
  • Prompt lookup decoding
  • Lookahead decoding

Note that the methods above all compute multiple tokens in parallel for a single query within a single model. There are also various ways to parallelize decoding at a higher level by using multiple models, which is called "ensemble decoding" (e.g., big-little decoding, consensus decoding, collaborative decoding).
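
Most of the single-model methods in the list above share a "draft-then-verify" structure. Below is a hedged Python sketch of that pattern, in the style of speculative decoding; draft_token() and target_token() are hypothetical toy stand-ins for a small draft model and the large target model, not real APIs:

    def draft_token(tokens):
        # Hypothetical stand-in for a small, cheap draft model.
        return (sum(tokens) * 3 + 7) % 50000

    def target_token(tokens):
        # Hypothetical stand-in for the large target model; this toy
        # formula mostly agrees with the draft model, but not always.
        s = sum(tokens)
        return (s * 3 + 7) % 50000 if s % 5 else (s + 1) % 50000

    def speculative_decode(prompt, n_new, k=4):
        tokens = list(prompt)
        while len(tokens) < len(prompt) + n_new:
            # 1. Draft k tokens sequentially with the cheap model.
            draft = list(tokens)
            for _ in range(k):
                draft.append(draft_token(draft))
            # 2. Verify the k drafted positions against the target model.
            #    A real implementation scores all k positions in ONE
            #    batched forward pass; this loop only mimics that check.
            accepted = 0
            for i in range(k):
                if target_token(draft[:len(tokens) + i]) == draft[len(tokens) + i]:
                    accepted += 1
                else:
                    break
            # 3. Keep the verified prefix plus one token from the target,
            #    so every round makes progress even if nothing is accepted.
            tokens = draft[:len(tokens) + accepted]
            tokens.append(target_token(tokens))
        return tokens[:len(prompt) + n_new]

    print(speculative_decode([101, 2023], 10))

Because verification only accepts tokens matching the target model's own greedy choice, this sketch produces the same output as plain greedy decoding with the target model, but with fewer target-model passes whenever the draft is usually right.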

Research on Parallel Decoding

Papers on parallel decoding algorithms include (see also non-autoregressive decoding algorithms):

N-gram Decoding

An "n-gram" decoding algorithm is one that generates more than one token (i.e., n tokens in an "n-gram") in one single sequence. This is usually done in parallel execution, because it isn't much of an optimization to run this sequentially, because that's how normal autoregressive decoding generates n-grams, too.

Research on n-gram generation:

Blockwise Parallel Decoding

  • Mitchell Thomas Stern, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 (PDF: https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf)
  • Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 2024, Exploring and Improving Drafts in Blockwise Parallel Decoding, https://openreview.net/pdf?id=KtnUTS1f91

Lookahead Decoding

Lookahead decoding is a type of parallel decoding where the algorithm attempts to "look ahead" by generating and verifying a window of future tokens in parallel, rather than decoding one token at a time.
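
A simplified Python sketch of the core idea, using Jacobi-style fixed-point iteration over a window of guessed future tokens; next_token() is the same hypothetical model stand-in as in the earlier sketches:

    def next_token(tokens):
        # Hypothetical stand-in for a model forward pass.
        return (sum(tokens) + len(tokens)) % 50000

    def jacobi_decode(prompt, n_new, max_rounds=50):
        tokens = list(prompt)
        guesses = [0] * n_new                # arbitrary initial guesses
        for _ in range(max_rounds):
            # One batched pass refines EVERY guessed position at once,
            # each conditioned on the current guesses before it.
            new = [next_token(tokens + guesses[:i]) for i in range(n_new)]
            if new == guesses:               # fixed point reached: output is
                break                        # identical to greedy decoding
            guesses = new
        return tokens + guesses

    print(jacobi_decode([101, 2023], 8))

Real lookahead decoding adds an n-gram cache and a verification branch on top of this refinement loop; the sketch shows only the parallel "look ahead" step.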

MEDUSA Decoding

MEDUSA decoding attaches several extra decoding heads to the base model, each predicting a token at a later future position; the candidate continuations from these heads are then verified in parallel against the base model using tree attention.

