Aussie AI

Decoding Algorithms

  • Last Updated 26 March, 2025
  • by David Spuler, Ph.D.

What are Decoding Algorithms?

The decoding algorithm in Transformer AI engines is the method whereby the decoder emits tokens for the output message. At the end of each decoding step, the output is a vector of "logits" containing the predicted probabilities for each candidate next token. The algorithm by which the decoder decides to output one token, or multiple tokens, and which ones, is called the decoding algorithm.

Logits vs Activations

Each decoding step is aimed at producing a single token (i.e., the next word to output). The output of a decoding phase for one token is actually a two-step process:

  1. The "activation vector" or "activations" are computed (numbers representing "embeddings"), and then
  2. The "logits" are computed from the activation vector (called "unembedding").

These two vectors are not usually the same size:

  • Activations vector — size is the "hidden" model dimension (e.g., 4096)
  • Logits vector — size is the model vocabulary size (e.g., 50,000 unique tokens).

Logits are very closely related to tokens, and there is one logit value per token. Each logit value represents the LLM's prediction of how likely it would be to output this token as the next one. We could simply take the logit with the highest probability, which is the most likely token according to the LLM, and output that token. This is called "greedy decoding."

Note that most of the LLM's processing does not use logits, but activation vectors. The logits appear only at the very end of a decoding phase. In all the interim steps, which are usually multiple layers of computations inside the model, we use an "embedding space" representation called an activation vector. We don't actually work on "tokens" or their probabilities, but on the probabilities of what I call the "signals" in the embedding space, as stored in the activation vector.

The activations are a vector of numbers representing the likelihood of each "signal" in the embeddings. For example, a signal might be something like a "noun" or "adjective" signal, but there are literally thousands of them, and it's not well understood what every value in an embedding actually represents in reality. But the LLM sure knows!

This vector of numbers is the "activation vector" and is usually shortened to "activations." These values represent the extent to which each signal has been "activated" in each neuron. Each layer of computation modifies the activations, and we get the final activations after the final model layer.

At this very final phase, we take these "activations," whose length is the model's internal dimension (e.g., 4096), representing how many signals it's tracking, and convert them to logit probabilities, one per token. There are perhaps 50,000 tokens (it depends on the "vocabulary size," but 50,000 or 100,000 is common). Hence, we need to convert a 4096-length vector of numbers ("activations") into a 50,000-length vector of numbers ("logits").

The "unembedding matrix" is what we use. Multiplying the activation vector by this matrix, which is large and rectangular (e.g., 4096x50,000), is how this is done, This converts the 4096-vector into a 50,000-vector. The embedding matrix is large, and expensive to use, which is why we only do this once per decoding phase, rather than once per layer.

Anyway, the computation of activations is not the decoding algorithm. Nor is the multiplication by the unembedding matrix to get the logits vector. Rather, the decoding algorithm is the final phase, which operates on the logits vector of probabilities for each of the 50,000 tokens, and thereby chooses the next token to output.

Types of Decoding Algorithms

There are several possible decoding algorithms for the basic situation of choosing one token to output from a vector of probabilities for each token:

  • Greedy decoding — always choose the highest-probability token.
  • Top-k sampling (random sampling) — choose from k most likely tokens.
  • Top-p sampling (nucleus sampling) — a refinement of top-k decoding based on cumulative probabilities.
  • Beam search decoding — a more complex "tree" search of multiple token sequences.
  • Edit decoding — using the input context to help decode the output (e.g., grammar checking).

The above are all variations on a theme: take a vector of token probabilities as the input, and analyze these probabilities to choose exactly one of the tokens as the output.
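
To make the first two concrete, here is a minimal sketch of greedy decoding and top-k sampling over a single logits vector; the logits are random placeholders and the value of k is an arbitrary choice:

    # Sketch of greedy decoding and top-k sampling over a logits vector (placeholder data).
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    logits = rng.standard_normal(50000)   # pretend logits, one per vocabulary token

    # Greedy decoding: always output the single most likely token.
    greedy_token = int(np.argmax(logits))

    # Top-k sampling: keep only the k most likely tokens, then sample among them.
    k = 50
    top_k_ids = np.argsort(logits)[-k:]        # indices of the k highest logits
    top_k_probs = softmax(logits[top_k_ids])   # renormalize over just those k
    sampled_token = int(rng.choice(top_k_ids, p=top_k_probs))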

At a higher level, there are more advanced options, and the main classes of decoding algorithms are:

  • Autoregressive decoding
  • Non-Autoregressive (NAR) decoding
  • Parallel decoding
  • Multi-token output

Other issues for decoding algorithms include:

  • Prefill phase (runs before decoding)
  • Temperature (scaling hyper-parameter that affects decoding; see the sketch below)
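
For example, temperature is conventionally applied by dividing the logits by the temperature value before the softmax, as in this small sketch with toy placeholder values:

    # Sketch of temperature scaling before sampling (toy logits).
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy logits for a 4-token vocabulary

    for temperature in (0.5, 1.0, 2.0):
        probs = softmax(logits / temperature)
        # Lower temperature sharpens the distribution (closer to greedy);
        # higher temperature flattens it (more random output).
        print(temperature, probs.round(3))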

Parallel Decoding Algorithms

There are several types of parallel optimizations for decoding:

Multi-model decoding algorithms have also been examined:

Hybrid Decoding Optimizations

The decoding algorithm may also be combined with other optimizations that improve the decoding process, such as:

Beam Search Decoding

Beam search decoding is an advanced type of decoding that works on a tree of potential output sequences. It searches a more complex space, keeping multiple candidate token sequences in reserve until it chooses the best one. Beam search can look ahead a few tokens, and then backtrack to choose a different final output.
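
As a rough illustration of the mechanics, here is a compact beam search sketch; the next_token_logprobs() function is a hypothetical stand-in for one LLM decoding step, not a real model or library API:

    # Compact beam search sketch over a hypothetical next_token_logprobs() function.
    import numpy as np

    def next_token_logprobs(sequence):
        # Hypothetical stand-in for one decoding step; returns log-probs over a tiny vocabulary.
        rng = np.random.default_rng(len(sequence))
        logits = rng.standard_normal(10)
        m = logits.max()
        return logits - (m + np.log(np.exp(logits - m).sum()))

    def beam_search(prompt, beam_width=3, steps=5):
        # Each beam is (token_sequence, cumulative_log_probability).
        beams = [(list(prompt), 0.0)]
        for _ in range(steps):
            candidates = []
            for seq, score in beams:
                logprobs = next_token_logprobs(seq)
                for tok in np.argsort(logprobs)[-beam_width:]:   # expand with the top tokens
                    candidates.append((seq + [int(tok)], score + float(logprobs[tok])))
            # Keep only the best beam_width candidate sequences.
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]
        return beams[0][0]   # the highest-scoring sequence

    print(beam_search([1, 2, 3]))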

Research papers on beam search:

Phrase Banning

Phrase banning is a feature extension for LLM decoding to disallow selected words or phrases, rather than a speed optimization. The idea is to block the LLM from outputting certain words or phrases, rather than post-processing LLM output to remove words. Words or phrases can be "banned" at the decoder level, forcing the LLM decoding phase to backtrack whenever it tries to emit a disallowed word or phrase. If a model has whole-word tokenization, then individual words can be banned at the current decoding step, by modifying simple decoding algorithms like greedy or top-k/top-p decoding. However, banning multi-word phrases or other multi-token sequences requires backtracking similar to beam search decoding. In fact, it makes sense to merge the phrase banning algorithm into beam search or other tree decoding methods. Banning phrases is usually efficient, because it has only a small token search cost to detect the phrases, and although backtracking is expensive, hopefully it is a relatively rare condition.
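
For the simple single-token case, here is a minimal sketch of banning by masking logits before the usual token selection; the banned token IDs are hypothetical placeholders:

    # Sketch of single-token banning: mask banned tokens before choosing the next token.
    import numpy as np

    rng = np.random.default_rng(0)
    logits = rng.standard_normal(50000)       # placeholder logits
    banned_token_ids = [1234, 20555, 48021]   # hypothetical IDs of banned whole-word tokens

    masked = logits.copy()
    masked[banned_token_ids] = -np.inf        # banned tokens can never be selected

    next_token = int(np.argmax(masked))       # greedy decoding over the remaining tokens
    # Multi-token phrases need backtracking (as in beam search) rather than simple masking.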

Research papers on phrase banning:

Tree Decoding

Tree decoding is the use of alternative pathways in decoding, in the form of a hierarchical tree. This idea is a generalization of beam search decoding. One of the applications of tree decoding is in the attempt to mimic Chain-of-Thought reasoning in a single inference step using a tree of pathways in CoT decoding.

Research papers on tree decoding:

Contrastive Decoding

Contrastive decoding is a method whereby the probabilities of two or more outputs are "contrasted" to choose the best token to output. This can be done by examining prior layers during inference, or it can be done with multiple models.
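
For the two-model variant, a minimal sketch is shown below: the "expert" model's log-probabilities are contrasted against a weaker "amateur" model's, with a simple plausibility cutoff; both logits vectors here are random placeholders rather than real model outputs:

    # Sketch of contrastive decoding: prefer tokens the expert likes much more than the amateur.
    import numpy as np

    def log_softmax(x):
        m = x.max()
        return x - (m + np.log(np.exp(x - m).sum()))

    rng = np.random.default_rng(0)
    expert_logits = rng.standard_normal(50000)    # placeholder logits from the larger model
    amateur_logits = rng.standard_normal(50000)   # placeholder logits from the smaller model

    expert_lp = log_softmax(expert_logits)
    amateur_lp = log_softmax(amateur_logits)

    # Plausibility constraint: only consider tokens the expert itself rates reasonably highly.
    plausible = expert_lp >= (expert_lp.max() + np.log(0.1))

    scores = np.where(plausible, expert_lp - amateur_lp, -np.inf)
    next_token = int(np.argmax(scores))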

Flash Decoding

Flash decoding is a memory-reducing decoding algorithm introduced by the research team better known for "flash attention" (versions 1, 2, and 3 so far). It applies similar memory access reductions to the decoding phase.

Top-p Decoding

Top-p decoding is a longstanding decoding method that examines the cumulative probabilities of the top candidate tokens. Top-p is usually combined with top-k decoding into a hybrid top-k top-p decoding algorithm.
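
Here is a minimal sketch of top-p selection, assuming the common formulation of keeping the smallest set of tokens whose cumulative probability reaches p; the logits and the p value are placeholders:

    # Sketch of top-p (nucleus) sampling: sample from the smallest set of tokens
    # whose cumulative probability reaches p (placeholder data).
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    probs = softmax(rng.standard_normal(50000))
    p = 0.9

    order = np.argsort(probs)[::-1]                     # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # smallest nucleus reaching probability p

    nucleus_ids = order[:cutoff]
    nucleus_probs = probs[nucleus_ids] / probs[nucleus_ids].sum()
    next_token = int(rng.choice(nucleus_ids, p=nucleus_probs))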

Research papers on Top-p decoding:

Min-P Decoding

Min-p decoding is a newer minor decoding modification that mainly improves accuracy (rather than efficiency), but doesn't reduce efficiency either. Similar to top-p decoding, min-p tries to avoid sampling tokens with too-low probabilities, so top-p and min-p have the same goal. However, min-p uses a minimum probability threshold that changes dynamically, scaled by the probability of the most likely token. The discovery of "min-p" was a nice piece of research work, since it is a small coding change that improves accuracy without sacrificing latency.
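
A minimal sketch of the min-p idea, assuming the usual formulation where the cutoff is a base value scaled by the top token's probability; the base value and logits are placeholders:

    # Sketch of min-p sampling: the probability cutoff scales with the top token's probability.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    probs = softmax(rng.standard_normal(50000))
    p_base = 0.05                         # placeholder base threshold

    cutoff = p_base * probs.max()         # dynamic threshold relative to the best token
    keep = np.where(probs >= cutoff)[0]

    kept_probs = probs[keep] / probs[keep].sum()
    next_token = int(rng.choice(keep, p=kept_probs))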

Research on min-p decoding:

Constrained Decoding

Constrained decoding is an optimization of the decoding algorithm where there are extra constraints on the token that can be output. This extra information can be used to either force inclusion of a particular token, or to exclude a subset of the tokens from consideration. Examples where there is extra information to use in decoding include:

  • Programming language syntax (code generation)
  • Parts-of-speech identification

For example, if you're programming an LLM decoding algorithm to output C++ code, then you know that the token 'if' is always followed by a token '(' in the code syntax. Hence, there's not really any need for a full LLM computation after an 'if' token, because a simple heuristic can emit the '(' token instead. This is using the "constraint" of the language syntax to do "constrained decoding."
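
Here is a minimal sketch of the logits-masking form of constrained decoding; the allowed-token IDs are hypothetical, standing in for whatever the grammar permits at the current position:

    # Sketch of constrained decoding: restrict the choice to grammar-allowed tokens.
    import numpy as np

    rng = np.random.default_rng(0)
    logits = rng.standard_normal(50000)      # placeholder logits
    allowed_token_ids = [87, 1502, 40311]    # hypothetical tokens the grammar permits here

    if len(allowed_token_ids) == 1:
        # Only one legal token (e.g., '(' after 'if'): no need to consult the logits at all.
        next_token = allowed_token_ids[0]
    else:
        masked = np.full_like(logits, -np.inf)
        masked[allowed_token_ids] = logits[allowed_token_ids]
        next_token = int(np.argmax(masked))  # greedy decoding over legal tokens only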

Clearly, that heuristic would be much faster, and easily coded. However, it's not all strawberries and cream, because the next token won't have a KV cache for the current token, if we use this heuristic. Hence, the next token would need to do a "mini-prefill" computation to calculate the KV cache, which means there's almost no point in avoiding the current token's computation (i.e., we are simply pushing the current token's computation onto the next token).

However, we've seen this issue of a "missing KV cache" before in early exit or layer skipping optimizations, where the KV cache is missing for any skipped layers (see KV caching). And there are various tricks to avoid fully re-computing the KV cache, such as propagation of the prior one or fusion with another layer. Similar ideas can be used when constrained decoding skips an LLM computation and the next token's KV cache is thereby absent.

Overlapped parallel computation can be used to address the missing KV cache, as is also possible for early exit. The constraints of the language grammar allow the second token's inference to start almost immediately, possibly via a heuristic that does not even involve LLM layer execution. However, the computation of the current token's KV cache can still be completed, in parallel to the next token's decoding cycle, by ensuring that the next token's layers are staggered a little behind the current token's KV cache computation. This overlaps the next token's decoding phase with the current token's KV cache computation.

Research papers on constrained decoding:

Multi-Token Decoding

Multi-token decoding is an optimization whereby two or more tokens are output in a single decoding step. The idea of multi-token decoding is to train a special type of model so that it predicts not just the next token, but also the one after that (and possibly more). This improves on autoregressive decoding because the output is no longer one-at-a-time.
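
The change to the decoding loop itself is small: each step appends several tokens instead of one. A minimal sketch, where predict_next_tokens() is a hypothetical stand-in for a model trained with multiple prediction heads:

    # Sketch of a multi-token decoding loop using a hypothetical multi-token predictor.
    def predict_next_tokens(sequence, n):
        # Hypothetical stand-in for a model with n prediction heads;
        # here it just fabricates token IDs for illustration.
        return [(sum(sequence) + i) % 50000 for i in range(1, n + 1)]

    def generate(prompt, max_tokens=12, tokens_per_step=3):
        output = list(prompt)
        while len(output) - len(prompt) < max_tokens:
            output.extend(predict_next_tokens(output, tokens_per_step))  # several tokens per step
        return output

    print(generate([101, 2023, 2003]))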

Stop Tokens

Stop tokens are one way whereby LLMs can be trained to control the length of their output. The idea is that stop tokens are incorporated at the end of answers during the training phase, and when they occur in an inference phase, they cause the LLM to stop outputting further tokens at that point.
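
The check in the decoding loop is simple, as in this sketch; the stop token ID and the next_token() step function are hypothetical placeholders:

    # Sketch of stop-token handling in a decoding loop (hypothetical token IDs and model step).
    STOP_TOKEN_ID = 50256            # hypothetical end-of-sequence token ID

    def next_token(sequence):
        # Hypothetical stand-in for one decoding step of a real model.
        return STOP_TOKEN_ID if len(sequence) >= 8 else len(sequence)

    def generate(prompt, max_tokens=100):
        output = list(prompt)
        for _ in range(max_tokens):
            tok = next_token(output)
            if tok == STOP_TOKEN_ID:  # trained stop token: end the output here
                break
            output.append(tok)
        return output

    print(generate([1, 2, 3]))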

Research papers with coverage of stop token techniques:

General Research on Decoding Algorithms

Papers on the various decoding methods include:

  • S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf, Code: https://github.com/raymin0223/fast_robust_early_exit (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher, 2018, Non-Autoregressive Neural Machine Translation, International Conference on Learning Representations, https://arxiv.org/abs/1711.02281 (Parallel decoding early paper.)
  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 6111–6120. Association for Computational Linguistics. https://arxiv.org/abs/1904.09324
  • Jiatao Gu and Xiang Kong. 2021. Fully non-autoregressive neural machine translation: Tricks of the trade. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 120–133, https://arxiv.org/abs/2012.15833
  • Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. 2022. Step-unrolled denoising autoencoders for text generation. International Conference on Learning Representations. https://arxiv.org/abs/2112.06749
  • Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodolà. May 2023. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 12336–12355. Association for Computational Linguistics. https://arxiv.org/abs/2305.10427
  • Y Zhang, Y Zhang, L Cui, G Fu, Oct 2023, Non-autoregressive Text Editing with Copy-aware Latent Alignments, arXiv preprint arXiv:2310.07821, https://arxiv.org/pdf/2310.07821.pdf
  • Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 13, 2023, Flash-Decoding for long-context inference, PyTorch Blog, https://pytorch.org/blog/flash-decoding/
  • Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher, Sep 2019, CTRL: A Conditional Transformer Language Model for Controllable Generation, https://arxiv.org/abs/1909.05858, Code: https://github.com/salesforce/ctrl
  • Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
  • Ning Gong, Nianmin Yao, June 2023, A generalized decoding method for neural text generation, Computer Speech & Language, Volume 81, 101503, https://www.sciencedirect.com/science/article/abs/pii/S0885230823000220
  • Cohere, 2023, Temperature, https://docs.cohere.com/docs/temperature
  • GC Garbacea, 2023, Neural Language Generation for Content Adaptation: Explainable, Efficient Low-Resource Text Simplification and Evaluation, Ph.D. thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/178028/garbacea_1.pdf?sequence=1 (Broad thesis with sections on beam search decoding optimizations and AI safety issues such as bias.)
  • Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558, 2020. https://arxiv.org/abs/2005.00558
  • Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8- 13, 2020, pages 4506–4520. International Committee on Computational Linguistics. https://arxiv.org/abs/2005.10283
  • Haoran Yang, Deng Cai, Huayang Li, Wei Bi, Wai Lam, Shuming Shi, May 2023, A Frustratingly Simple Decoding Method for Neural Text Generation, https://arxiv.org/abs/2305.12675
  • Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2022. Typical decoding for natural language generation. arXiv preprint arXiv:2202.00666, https://arxiv.org/abs/2202.00666 (The "typical sampling" decoding algorithm.)
  • Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. 2022. A contrastive framework for neural text generation. Advances in Neural Information Processing Systems, https://arxiv.org/abs/2202.06417 (The "contrastive search" decoding algorithm.)
  • Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2104.08821 (A "contrastive" decoding algorithm.)
  • John Hewitt, Christopher D. Manning, and Percy Liang. 2022. Truncation sampling as language model desmoothing. In Findings of the Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP). https://arxiv.org/abs/2210.15191 (The "truncation sampling" decoding algorithm.)
  • Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2022. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, https://arxiv.org/abs/2210.15097 (A "contrastive decoding" algorithm.)
  • Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, Yejin Choi, 2018, Learning to Write with Cooperative Discriminators, https://arxiv.org/abs/1805.06087
  • Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin, 2020, Language GANs falling short, International Conference on Learning Representations. https://arxiv.org/abs/1811.02549
  • Moin Nadeem, Tianxing He, Kyunghyun Cho, and James Glass, 2020, A systematic characterization of sampling algorithms for open-ended language generation, Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 334–346. https://arxiv.org/abs/2009.07243, Code: https://github.com/moinnadeem/characterizing-sampling-algorithms
  • Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan, 2021, Trading off diversity and quality in natural language generation, EACL 2021, p. 25, https://arxiv.org/abs/2004.10450
  • Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang, 22 Mar 2024, Hierarchical Skip Decoding for Efficient Autoregressive Text Generation, https://arxiv.org/abs/2403.14919 (A new decoding algorithm called Hierarchical Skip Decoding involving layer skipping.)
  • Yassir Fathullah, Puria Radmard, Adian Liusie, Mark J. F. Gales, 2024, Who Needs Decoders? Efficient Estimation of Sequence-Level Attributes with Proxies, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics Volume 1: Long Papers, pages 1478–1496 March 17-22, 2024, https://aclanthology.org/2024.eacl-long.89.pdf (Non-autoregressive decoding methods in special use cases such as machine language translation.)
  • Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
  • Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo, 4 Jun 2024 (v2), Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution, https://arxiv.org/abs/2406.00059 Code: https://github.com/conveyor-sys/conveyor (Speeding up inference by partially running tools in parallel to the LLM query processing, rather than sequentially after the LLM request, by detecting tool requests deep inside the decoding algorithm and starting them off immediately, before the LLM has finished generating the fully decoded output.)
  • Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 28 May 2024, Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 Code: https://github.com/hmarkc/parallel-prompt-decoding (Similar to speculative decoding with extra trained prompt tokens and a tree-structured verification of multiple optional draft sequences.)
  • Maxime Peyrard, Martin Josifoski, Robert West, 21 Mar 2024, The Era of Semantic Decoding, https://arxiv.org/abs/2403.14562
  • Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
  • Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan, 17 May 2024, Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, https://arxiv.org/abs/2405.10480
  • Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen, 15 May 2024, Spectral Editing of Activations for Large Language Model Alignment, https://arxiv.org/pdf/2405.09719 Code: https://github.com/yfqiu-nlp/sea-llm
  • D Shin, May 8, 2024, Multi-User Language Model Resource Allocation Using Contextual Pause Token Aware Transformers, Technical Disclosure Commons, https://www.tdcommons.org/dpubs_series/6981/ PDF: https://www.tdcommons.org/cgi/viewcontent.cgi?article=8121&context=dpubs_series (Interesting idea of training a model how and when to pause during inference, so it can be pre-empted if needed, and thus the overall system can schedule batching of multiple queries more optimally.)
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
  • Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin, 1 May 2024, DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling, https://arxiv.org/abs/2405.00888 (A model trained to predict multiple tokens ahead.)
  • Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, 31 Oct 2018, Weakly Supervised Grammatical Error Correction using Iterative Decoding, https://arxiv.org/abs/1811.01710
  • Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
  • Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 31 Aug 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, https://arxiv.org/abs/2308.16369 (Examines the different GPU costs of prefill vs decoding phases, and optimizes decoding by "piggybacking" off the more intense computation during prefill.)
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, Ricardo Bianchini, 30 Nov 2023, Splitwise: Efficient generative LLM inference using phase splitting, https://arxiv.org/abs/2311.18677 (Separates the two Transformer phases of initial prompt computation or prefill to generate the KV cache, and the token generation phase or decoding algorithm onto two machines.)
  • Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, Jan 2024, Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Yang Song, Chenlin Meng, Renjie Liao, Stefano Ermon, 2021, Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving, Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021, https://proceedings.mlr.press/v139/song21a/song21a.pdf
  • Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
  • N Varshney, A Chatterjee, M Parmar, C Baral, Oct 2023, arXiv preprint arXiv:2310.18581, Accelerating LLM Inference by Enabling Intermediate Layer Decoding, https://arxiv.org/pdf/2310.18581.pdf (Dynamic confidence-based early exiting analysis on LLama models.)
  • Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
  • Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
  • Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
  • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
  • C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
  • Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search, 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
  • Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
  • David Spuler, March 2024, Chapter 26. Decoding Algorithms, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • S Yang, G Lee, J Cho, D Papailiopoulos, 2023, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, https://arxiv.org/abs/2307.05908
  • Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, kangdi chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Trenton Bricken, November 20, 2019, Tail Free Sampling A new way to sample from language models for text generation, https://www.trentonbricken.com/Tail-Free-Sampling/ (Alternative to top-k/top-p decoding.)
  • Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
  • Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun, 5 Jun 2024 (v2), JumpCoder: Go Beyond Autoregressive Coder via Online Modification, https://arxiv.org/abs/2401.07870 Code: https://github.com/Keytoyze/JumpCoder
  • Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King, 25 Jun 2024, Entropy-Based Decoding for Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2406.17519 (Enhanced decoding algorithm for multi-document RAG processing.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
  • Jiaao He, Kezhao Huang, Jidong Zhai, July 2024, FASTDECODE: High-Throughput LLM Serving through Disaggregating Attention Computation, https://openreview.net/pdf?id=GahfuPsGw2 (Distributing KV caches to multiple nodes.)
  • Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu, 27 Jun 2024, Adaptive Draft-Verification for Efficient Large Language Model Decoding, https://arxiv.org/abs/2407.12021 Project: https://anonymous.4open.science/r/ADED-C7D5 (A draft-and-verification method that is similar to speculative decoding, but differs.)
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Weinan Zhang, Yong Yu, 11 Aug 2024, A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems, https://arxiv.org/abs/2408.05676 (Determining when speculative decoding is most beneficial.)
  • Sidharth Mudgal, Jong Lee, Harish Ganapathy, Yaguang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami, July 2024, Controlled Decoding from Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:36486-36503, 2024, https://proceedings.mlr.press/v235/mudgal24a.html
  • Wenhong Zhu, Hongkun Hao, Zhiwei He, Yiming Ai, Rui Wang, July 2024, Improving Open-Ended Text Generation via Adaptive Decoding, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62386-62404, 2024, https://proceedings.mlr.press/v235/zhu24d.html
  • Chenhan Yuan, Fei Huang, Ru Peng, Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou, 20 Aug 2024, Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model, https://arxiv.org/abs/2408.10764 Code: https://github.com/chenhan97/Otter (Inference intervention in the decoding algorithm.)
  • Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong, Integrative Decoding: Improve Factuality via Implicit Self-consistency, 3 Oct 2024 (v2), https://arxiv.org/abs/2410.01556 (Prepends a previous response to improve decoding accuracy.)
  • Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
  • K Ahmed, KW Chang, G Van den Broeck, Oct 2024, Controllable Generation via Locally Constrained Resampling, Neurips Safe Generative AI Workshop 2024, https://openreview.net/pdf?id=v091fzXTu0
  • Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang, 17 Oct 2024, Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement, https://arxiv.org/abs/2410.13344
  • Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
  • Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
  • Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
  • Y Li, K Livescu, J Zhou, Dec 2024, Beyond Token Generation: Adaptive Chunk-Distilled Language Modeling, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://neurips2024-enlsp.github.io/papers/paper_90.pdf (Generate multiple tokens in decoding by inserting RAG chunks directly into the decoding output.)
  • Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  • Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
  • Jianyi Zhang, Da-Cheng Juan, Cyrus Rashtchian, Chun-Sung Ferng, Heinrich Jiang, Yiran Chen, 27 Nov 2024 (v2), SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models, https://arxiv.org/abs/2411.02433 https://jayzhang42.github.io/sled_page/ (Decoding algorithm that compares logit values in the final layer with those from earlier layers.)
  • Yuval Shalev, Amir Feder, Ariel Goldstein, 19 Jun 2024, Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning, https://arxiv.org/abs/2406.13858 (Using embeddings from intermediate model layers in decoding to mimic reasoning pathways.)
  • Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson, 14 Oct 2024 (v2), Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries, https://arxiv.org/abs/2406.12775 (Backpatching prior layers using embeddings from the current activations to mimic multi-step reasoning.)
  • Jacob Pfau, William Merrill, Samuel R. Bowman, 24 Apr 2024, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, https://arxiv.org/abs/2404.15758 (Use of dummy "filler tokens" similar to "pause tokens" or "reasoning tokens" to aid multi-step reasoning in decoding.)
  • Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
  • Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019, Learning deep transformer models for machine translation. In Proc. of ACL, 2019. https://arxiv.org/abs/1906.01787
  • Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard H. Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proc. of EMNLP, 2019. https://arxiv.org/abs/1909.02480.
  • Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2020, Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proc. of AAAI, 2020. https://arxiv.org/abs/1908.07181
  • Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang, 1 Feb 2025, Estimating LLM Uncertainty with Logits, https://arxiv.org/abs/2502.00290
  • Zeyu Tang, Zhenhao Chen, Loka Li, Xiangchen Song, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang, 5 Feb 2025, Reflection-Window Decoding: Text Generation with Selective Refinement, https://arxiv.org/abs/2502.03678 (Combination of sliding window attention with pausing.)
  • Weihua Du, Yiming Yang, Sean Welleck, 7 Feb 2025, Optimizing Temperature for Language Models with Multi-Sample Inference, https://arxiv.org/abs/2502.05234 https://github.com/StigLidu/TURN
