Advanced Decoding Algorithms

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The top-k and top-p decoding algorithms are simple in the sense that they output only a single token at a time. More advanced decoding algorithms are often used in practice.
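
As a concrete illustration, here is a minimal C++ sketch of single-token top-k sampling, assuming the model has already produced a vector of logits over the vocabulary. The function name, the choice of k, the softmax details, and the RNG handling are simplifications for this sketch, not taken from any particular engine.

```cpp
#include <vector>
#include <algorithm>
#include <random>
#include <cmath>

// Pick one token id from the k highest-scoring logits (assumes 0 < k <= vocab size).
int topk_sample(const std::vector<float>& logits, int k, std::mt19937& rng) {
    // Index the vocabulary and move the k largest logits to the front.
    std::vector<int> idx(logits.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the top-k candidates only (shifted by the max for stability).
    std::vector<double> probs(k);
    double denom = 0.0;
    for (int i = 0; i < k; ++i) {
        probs[i] = std::exp((double)logits[idx[i]] - logits[idx[0]]);
        denom += probs[i];
    }
    for (int i = 0; i < k; ++i) probs[i] /= denom;

    // Sample one candidate according to its renormalized probability.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return idx[dist(rng)];
}
```

Top-p (nucleus) sampling works the same way, except the candidate set is cut off by cumulative probability mass rather than by a fixed count k.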

Beam Search Decoding: The beam search algorithm is an improvement on greedy decoding that looks ahead a little further. Beam search maintains several alternative candidate sequences of output tokens, which is analogous to considering a number of different words and phrases before finally deciding which to output. This means it is not easily tripped up by choices between a single-word and a multi-word output.
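
The following is a simplified beam search skeleton in C++. The model call is a placeholder stub returning log-probabilities, and the Beam struct, the fixed number of new tokens, and the scoring by summed log-probability are illustrative assumptions rather than a production implementation.

```cpp
#include <vector>
#include <algorithm>
#include <cmath>

struct Beam {
    std::vector<int> tokens;   // token ids generated so far
    double logprob = 0.0;      // cumulative log-probability of the sequence
};

// Hypothetical stub: log-probabilities over the vocabulary for the next token.
std::vector<double> model_next_logprobs(const std::vector<int>& /*context*/) {
    return std::vector<double>(50000, std::log(1.0 / 50000.0));  // placeholder
}

std::vector<int> beam_search(const std::vector<int>& prompt, int beam_width, int max_new_tokens) {
    std::vector<Beam> beams(1);  // start with a single empty beam
    for (int step = 0; step < max_new_tokens; ++step) {
        std::vector<Beam> candidates;
        for (const Beam& b : beams) {
            std::vector<int> context = prompt;
            context.insert(context.end(), b.tokens.begin(), b.tokens.end());
            std::vector<double> lp = model_next_logprobs(context);

            // Expand this beam with its top-scoring next tokens.
            std::vector<int> idx(lp.size());
            for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
            std::partial_sort(idx.begin(), idx.begin() + beam_width, idx.end(),
                [&](int a, int b2) { return lp[a] > lp[b2]; });
            for (int i = 0; i < beam_width; ++i) {
                Beam nb = b;
                nb.tokens.push_back(idx[i]);
                nb.logprob += lp[idx[i]];
                candidates.push_back(nb);
            }
        }
        // Keep only the best beam_width sequences by cumulative log-probability.
        std::sort(candidates.begin(), candidates.end(),
            [](const Beam& a, const Beam& b) { return a.logprob > b.logprob; });
        candidates.resize(beam_width);
        beams = candidates;
    }
    return beams.front().tokens;  // highest-scoring sequence
}
```

Note that with a beam width of 1 this degenerates to greedy decoding: the wider the beam, the more alternatives are kept alive at each step.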

Parallel Decoding: The class of parallel decoding algorithms aims to break the autoregression bottleneck in decoder output algorithms. The idea is to output as many tokens in parallel as possible, which is much faster than greedy decoding or beam search, both of which are autoregressive.
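
A rough sketch of the verification step shared by many parallel decoding schemes follows: a block of drafted tokens is checked against the model's own predictions in a single forward pass, and the longest agreeing prefix is accepted. The stub model call, the greedy (argmax) acceptance rule, and the function names here are assumptions for illustration only.

```cpp
#include <vector>

// Hypothetical stub: one forward pass over the whole context returns, for every
// position, the token the model would choose next (argmax) at that position.
// A real engine would run the full Transformer here.
std::vector<int> model_argmax_per_position(const std::vector<int>& context) {
    return std::vector<int>(context.size(), 0);  // placeholder
}

// Accept the longest prefix of "draft" that matches the model's own predictions.
// Assumes a non-empty context (at least one prompt token).
int verify_draft(std::vector<int>& context, const std::vector<int>& draft) {
    std::vector<int> trial = context;
    trial.insert(trial.end(), draft.begin(), draft.end());
    std::vector<int> predicted = model_argmax_per_position(trial);  // one model call

    int accepted = 0;
    for (size_t i = 0; i < draft.size(); ++i) {
        // The prediction at position (context.size() + i - 1) is the model's
        // choice for the token at position (context.size() + i).
        int model_choice = predicted[context.size() + i - 1];
        if (model_choice != draft[i]) break;
        ++accepted;
    }
    context.insert(context.end(), draft.begin(), draft.begin() + accepted);
    return accepted;  // several tokens may be accepted from a single forward pass
}
```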

Aggressive Decoding: Aggressive decoding is an optimization that is mostly relevant to the “editing” or “error correction” use case. The idea is to compare the output tokens to the input tokens, and if they are similar, the decoder can emit more than one token at a time. This is more efficient than a basic greedy decoding algorithm, but it only works when the output text is going to be very similar to the input prompt.
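
Below is a small sketch of the drafting side of aggressive decoding, which proposes the next chunk of the input tokens as the draft to be verified (verification works the same way as in the parallel decoding sketch above). The position-matching heuristic, function name, and block size are illustrative assumptions, not a specific published method.

```cpp
#include <vector>

// Propose the next chunk of the *input* tokens as the draft, since in the editing
// use case the output is expected to closely track the input text.
std::vector<int> propose_draft_from_input(const std::vector<int>& input_tokens,
                                          const std::vector<int>& output_so_far,
                                          int block_size) {
    // Find where the last emitted token occurs in the input, then propose the
    // tokens that follow it (crude alignment heuristic for illustration).
    size_t pos = 0;
    if (!output_so_far.empty()) {
        int last = output_so_far.back();
        for (size_t i = 0; i < input_tokens.size(); ++i) {
            if (input_tokens[i] == last) pos = i + 1;  // take the latest match
        }
    }
    std::vector<int> draft;
    for (size_t i = pos; i < input_tokens.size() && (int)draft.size() < block_size; ++i) {
        draft.push_back(input_tokens[i]);
    }
    return draft;  // verified in one model call, as in the parallel decoding sketch
}
```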

Non-Autoregressive Decoding

One of the biggest obstacles to fast inference of Large Language Models (LLMs) is that they emit one token at a time (e.g. one word at a time). This limits parallelism and means that the entire model must be re-run multiple times, once for each word (or subword token).

Why is there a bottleneck? The reason for this limitation is that the next word to output inherently relies on the prior word, which is kind of an unavoidable property of human language. But in LLM coding circles, this is called the “autoregression” problem, possibly because researchers tend to like big words.

Why is this slow? Because of this interdependency between words, the Transformer's overall algorithm is designed so that when the decoder emits a word, that word is appended to the input prompt for the model's next iteration, to help it emit the next word (a minimal version of this loop is sketched in the code after the list below). This makes sense, because most humans also speak one word at a time. But it's slow for multiple reasons:

  • The model runs for every token.
  • The model never produces 2 tokens (or more) in parallel.
  • The model cannot start working on the 2nd token before finishing the 1st token, which limits pipelining (a type of parallelism).
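
Here is a minimal sketch of that autoregressive loop in C++, with the model call as a stub; greedy argmax selection and the end-of-sequence check are simplifications for illustration.

```cpp
#include <vector>
#include <algorithm>

// Hypothetical stub: a full forward pass returning next-token logits.
std::vector<float> model_forward(const std::vector<int>& context) {
    return std::vector<float>(50000, 0.0f);  // placeholder logits
}

std::vector<int> greedy_decode(std::vector<int> context, int max_new_tokens, int eos_token) {
    std::vector<int> output;
    for (int i = 0; i < max_new_tokens; ++i) {
        std::vector<float> logits = model_forward(context);  // entire model runs again
        int next = (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
        if (next == eos_token) break;                         // stop at end-of-sequence
        output.push_back(next);
        context.push_back(next);  // autoregression: the new token feeds the next iteration
    }
    return output;
}
```

Every iteration of this loop is a complete forward pass of the model, which is exactly the cost that parallel and non-autoregressive decoding methods try to reduce.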

There is a variety of research on fixing this decoding bottleneck and achieving more parallelism; this research area is called “non-autoregressive” optimizations. And, to be fair, you probably could Neuralink two words at a time, because your brain is usually far ahead of your mouth.
