Aussie AI

Aggressive Decoding

  • Last Updated 30 August, 2024
  • by David Spuler, Ph.D.

Aggressive Decoding for Editing

Aggressive decoding is mostly relevant to the "editing" or "grammatical error correction" use case. The idea is to compare the model's output tokens against the input tokens, and where they match, emit more than one token at a time. This is more efficient than a basic greedy decoding algorithm.

How is this faster? The aggressive decoding algorithm assumes that the input sequence is largely correct, and then runs inference in parallel on the sub-sequences ending at each of several future token positions. Hence, the model runs a full inference pass for each of the future tokens, so as to compare its predictions against the input. The model can then accept or reject each of the tokens. If all tokens are accepted, the whole sequence is emitted at once. Rejection is more difficult, because all tokens after the first rejected one must also be rejected.
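
As a concrete illustration, here is a minimal Python sketch of one verification step, assuming a toy predict_next(prefix) function that returns the model's greedy next token. The names and the sequential loop are illustrative only; the real algorithm issues these predictions in parallel.

    def aggressive_decode_step(predict_next, output_so_far, input_tokens):
        # Verify the remaining input tokens against the model's predictions.
        # Conceptually, the predictions below are computed in parallel,
        # one full inference pass per candidate position.
        accepted = []
        for i, candidate in enumerate(input_tokens):
            prefix = output_so_far + input_tokens[:i]
            predicted = predict_next(prefix)   # greedy top-1 prediction
            if predicted == candidate:
                accepted.append(candidate)     # match: accept the input token
            else:
                # First mismatch: keep the model's own correction and discard
                # all later predictions, since they assumed a wrong prefix.
                accepted.append(predicted)
                break
        return accepted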

Acceptance criteria. A token may be accepted according to different metrics. The simplest is to accept it only if it has maximal probability, which reproduces greedy decoding. Less strict acceptance criteria based on top-k and/or top-p decoding appear in the literature, and other criteria seem possible.
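
For example, the acceptance test might look something like this Python sketch, assuming the model supplies a probability distribution probs over the vocabulary at the candidate position (the function names and parameter defaults are illustrative):

    import numpy as np

    def accept_greedy(probs, candidate):
        # Strictest: accept only the argmax token, which exactly
        # reproduces greedy decoding (lossless).
        return candidate == int(np.argmax(probs))

    def accept_top_k(probs, candidate, k=10):
        # Looser: accept if the candidate is among the k most likely tokens.
        top_k = set(np.argsort(probs)[-k:].tolist())
        return candidate in top_k

    def accept_top_p(probs, candidate, p=0.9):
        # Nucleus criterion: accept if the candidate lies in the smallest
        # set of tokens whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, p)) + 1
        return candidate in set(order[:cutoff].tolist())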

GPU parallelization vs on-device inference. Note that this is faster in terms of wall clock time, but it is not less computation. The aggressive decoding algorithm runs just as many inference passes as the autoregressive mode, but it can do many of them in parallel across multiple GPUs (or any other way of parallelizing multiple inference queries). In fact, it can perform more computation overall when a token is rejected. Hence, this approach suits data center GPU architectures where parallelization is available, but not low-resource on-device platforms.
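
As a rough worked example: suppose the input is 100 tokens and verification proceeds in 10 rounds of 10 parallel predictions each. Plain autoregressive decoding needs about 100 sequential model passes, whereas aggressive decoding needs only about 10 sequential rounds of latency, roughly a 10x wall clock speedup. But each round still performs 10 full inference passes, so the total computation remains at least those 100 passes, plus more wherever rejections force positions to be recomputed.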

Lossless decoding. Also important is that this is not an approximation. Using the input text should still produce exactly the result of the slower autoregressive decoding, only faster due to parallel computation. If the assumption that the edited output is similar to the input holds, the overall text generation is faster due to parallelization. However, it can be slower if that assumption fails, such as for an output with many corrections that ends up very different from the original input.

Editing vs other use cases. This speedup only works when the output will be similar to the input. This means that aggressive decoding can only realistically be used for the "editing" use case, although it can possibly be used for some types of "transformations" that are close to editing (e.g., re-write this paragraph in an optimistic tone) if the number of words changed is not large. Most other use cases have outputs very different from their inputs (e.g., completions, classification, summarization, Q&A, chatting, translation, and so on). Aggressive decoding will still work correctly on these other use cases, generating the same output, but it will be slower because almost every predicted token will be rejected.

Re-aligning after rejection. In practice, the aggressive decoding process becomes difficult at rejected tokens, when the corrections push the output out of phase with the input. In this case, the aggressive decoding method either needs to re-align, or in the worst case, it loses the correspondence between input and output, and the algorithm reverts to sequential autoregressive decoding (i.e., with no speedup compared to the normal autoregressive method). The algorithm can match up again at some points using additional techniques, such as a "suffix match," as sketched below.
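
A minimal sketch of such a re-alignment heuristic, assuming a simple suffix match over the last few generated tokens (the window size is an illustrative parameter, not taken from the papers):

    def realign(input_tokens, generated_tail, window=4):
        # Try to find where verification can resume: look for the last
        # `window` generated tokens reappearing in the input, and resume
        # verifying just past that match. Returns None if no alignment
        # is found, in which case the algorithm falls back to plain
        # sequential autoregressive decoding.
        if len(generated_tail) < window:
            return None
        pattern = generated_tail[-window:]
        for start in range(len(input_tokens) - window, -1, -1):
            if input_tokens[start:start + window] == pattern:
                return start + window
        return None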

Aggressive decoding vs speculative decoding. Note that there is much overlap between aggressive decoding and speculative decoding. The input text used by aggressive decoding can be considered akin to the "draft model" used in speculative decoding. Hence, a speculated sequence is available at almost zero cost (i.e., the input), rather than having to run a small model as the speculator to generate a candidate sequence. This allows the same parallelization techniques as used in speculative decoding. Note that speculative decoding does not actually reduce the total computation performed; it only parallelizes it across multiple GPUs for a wall clock speedup. Hence, this idea is of limited relevance to low-resource devices without much spare computation capacity, such as on-device inference on phones or PCs.
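
The correspondence is easy to see in code: the two methods differ only in where the draft tokens come from. In this sketch, small_model_generate is a hypothetical draft-model call, not a real library API:

    def draft_from_input(input_tokens, position, k):
        # Aggressive decoding: the next k draft tokens are simply read
        # off the input text, at essentially zero cost.
        return input_tokens[position:position + k]

    def draft_from_small_model(small_model_generate, prefix, k):
        # Speculative decoding: a cheap draft model must actually be
        # run to propose the next k tokens.
        return small_model_generate(prefix, max_new_tokens=k)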

KV Caching. When we notice that this method is computing several subsequences of the same sequence at once (in parallel), we naturally wonder about re-using some of the computations along the way. For example, the longest subsequence contains the KV cache data of the shorter prior sequences. Hence, the idea becomes to serialize some of these computations to take advantage of computation reuse. However, in doing so, we have simply converted back to a normal sequential autoregressive decoding order, losing the parallelization speedup. This is the well-known idea of KV caching in a standard autoregressive Transformer, which is a valid optimization, but not a new one, and not specific to editing or aggressive decoding methods.
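
For reference, here is a minimal sketch of that standard KV-cached autoregressive loop, assuming a hypothetical model.forward(tokens, past_kv) interface that returns the last position's logits plus the updated cache:

    def _argmax(logits):
        return max(range(len(logits)), key=logits.__getitem__)

    def generate_with_kv_cache(model, prompt_tokens, max_new_tokens):
        # Prefill: one pass over the whole prompt builds the initial cache.
        logits, past_kv = model.forward(prompt_tokens, past_kv=None)
        token = _argmax(logits)
        output = [token]
        for _ in range(max_new_tokens - 1):
            # Decode: each step feeds only the newest token, reusing the
            # cached keys/values of all earlier tokens. This loop is
            # inherently sequential, the opposite of parallel verification.
            logits, past_kv = model.forward([token], past_kv=past_kv)
            token = _argmax(logits)
            output.append(token)
        return output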

Shallow decoder architecture. One of the main papers combined aggressive decoding with the "shallow decoder" architecture. This is effectively early exiting inside the decoder, but not in the encoder (i.e., "deep encoder, shallow decoder"), an optimization also considered in other papers. Alternatively, in a decoder-only architecture, the "prefill" phase might be deep while decoding stays shallow. The shallowness of the decoder stack usually makes the output an approximation of the full-stack decoder, but this use case (editing) may be less prone to errors because the output mostly matches the input sequence (unless the input is extremely poorly written and needs many corrections).
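
For instance, the deep-encoder/shallow-decoder shape can be expressed directly with PyTorch's built-in Transformer module; the layer counts below are illustrative, not the paper's exact configuration:

    import torch.nn as nn

    # Deep encoder runs once over the whole input; the shallow decoder
    # runs once per generated token, so its depth dominates decoding
    # latency, and shrinking it is where the savings come from.
    model = nn.Transformer(
        d_model=512,
        nhead=8,
        num_encoder_layers=12,  # deep: most capacity spent on encoding
        num_decoder_layers=1,   # shallow: cheap per-token decoding
    )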

Research on Aggressive Decoding

Papers on editing and the aggressive decoding algorithm include:

  • Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. Instantaneous grammatical error correction with shallow aggressive decoding. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5937–5947, 2021. https://arxiv.org/abs/2106.04970, Code: https://github.com/AutoTemp/Shallow-Aggressive-Decoding (Aggressive decoding emits as many tokens as possible, combined with a shallow decoder architecture.)
  • T. Ge, H. Xia, X. Sun, S. Chen, and F. Wei, 2022, Lossless acceleration for seq2seq generation with aggressive decoding, ArXiv abs/2205.10350, Microsoft Research, https://arxiv.org/abs/2205.10350, Code: https://github.com/microsoft/unilm/tree/master/decoding (A method called "input-guided aggressive decoding". Aggressive decoding means emitting multiple tokens at a time, reducing autoregression. This method also generalizes beyond matching greedy top-1 decoding to allow acceptance of tokens according to top-k and top-p decoding parameters. This paper is effectively doing speculative decoding with a smaller model as a second improvement.)
  • Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn, 26 Oct 2022 (v2), EdiT5: Semi-Autoregressive Text-Editing with T5 Warm-Start, https://arxiv.org/abs/2205.12209, Code: https://edit5.page.link/code
  • Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, 10 Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487
  • Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, Wenlai Zhao, Guangwen Yang, 8 Aug 2023 (v2), RecycleGPT: An Autoregressive Language Model with Recyclable Module, https://arxiv.org/abs/2308.03421 (Uses an advanced architectural idea based on predicting the next token based on only a few preceding tokens by using extra layers inside a Transformer.)
  • Bohdan Didenko, Andrii Sameliuk, 19 Sep 2023, RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans, https://arxiv.org/abs/2309.10898
  • Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, 31 Oct 2018, Weakly Supervised Grammatical Error Correction using Iterative Decoding, https://arxiv.org/abs/1811.01710 (Not parallel aggressive decoding, but has some relevance.)
  • Wang, J., Chen, K., Chen, G., Shou, L., McAuley, J.: Skipbert: Efficient inference with shallow layer skipping. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7287–7301 (2022) https://aclanthology.org/2022.acl-long.503/ (Skips early layers of a model via precomputed lookup tables based on detecting known token n-grams in the prompt.)
  • Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. SpecTr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36, 2024. https://arxiv.org/abs/2310.15141
  • Yinghao Li, Xuebo Liu, Shuo Wang, Peiyuan Gong, Derek F. Wong, Yang Gao, Heyan Huang, and Min Zhang. 2023. TemplateGEC: Improving Grammatical Error Correction with Detection Template. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6878–6892, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.380/ https://aclanthology.org/2023.acl-long.380.pdf
