Aussie AI
26. Decoding Algorithms
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“No, I'm not interested in developing a powerful brain.”
— Alan Turing
What is Decoding?
The decoding algorithm is the method whereby the decoder emits tokens for the output message. At the end of each decoder sequence, the output from the final layer is a vector of “logits” containing a predicted score for every token in the vocabulary. The algorithm by which the decoder decides to output one token, or multiple tokens, and which ones, is called the “decoding algorithm.”
The decoding algorithm is a relatively simple piece of code. There are no tensors or matrix multiplications involved. The input is a vector of probabilities, and the output is a single token. In some advanced decoding algorithms, it is possible to output multiple tokens at once, but the basic method is to output only a single token, and then go around again to start predicting the next one.
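To make the flow concrete, here is a minimal C++ sketch of that outer autoregressive loop, with the model and the decoding algorithm passed in as hypothetical callbacks (these stand in for a real inference engine and a real decoder, which are not shown):

    #include <vector>
    #include <functional>

    // Hypothetical model step: takes the token context, returns next-token logits.
    using ModelFn = std::function<std::vector<float>(const std::vector<int>&)>;
    // Hypothetical decoding algorithm: maps a logit vector to one chosen token.
    using DecodeFn = std::function<int(const std::vector<float>&)>;

    // Basic autoregressive generation loop: emit one token, append it to the
    // context, and run the model again to predict the next token.
    std::vector<int> generate(std::vector<int> tokens, int max_new_tokens,
                              const ModelFn& run_model, const DecodeFn& decode_one,
                              int end_of_text_id)
    {
        for (int i = 0; i < max_new_tokens; i++) {
            std::vector<float> logits = run_model(tokens);
            int next = decode_one(logits);
            if (next == end_of_text_id) break;  // stop at the end-of-sequence token
            tokens.push_back(next);             // the output feeds back into the input
        }
        return tokens;
    }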
Note that the output of the decoding algorithm is a sequence of tokens, emitted one number at a time. To actually output the text from that, you need to “untokenize” or “decode” the tokens into printable letters. Some special internal-use tokens might also need to be removed from the output, rather than shown to users. This is discussed in the tokenization chapter.
There are two main classes of decoding algorithm:
- Autoregressive decoding
- Parallel decoding (non-autoregressive)
There are several possible decoding algorithms:
- Greedy decoding
- Top-k sampling
- Top-p sampling
- Beam search decoding
- Aggressive decoding
Multi-model decoding algorithms have also been examined where two or more AI engines assist in choosing words:
- Speculative decoding
- Supervised decoding (“big-little” architectures)
- Ensemble decoding
The decoding algorithm may also be combined with other optimizations that improve the decoding process, such as:
- Non-autoregressive decoding
- Token pruning
- Prompt compression
Greedy Decoding
Greedy decoding is a simplistic type of decoding, where the decoder simply picks the output token with the highest predicted probability. This algorithm is very efficient, but regarded as unreliable in terms of accuracy and the quality of output text.
One problem is that greedy decoding is deterministic, which means that the AI engine is certain to always output the same token from the same inputs. This leads to an unimpressive lack of creativity and variety, and in the worst cases, can lead to loops of repetitive phrasing and poor flow of natural language text. This repetition is called “neural text degeneration.”
Greedy decoding doesn't look ahead by even one word, so it can be tripped up when a two-word phrase or longer sequence would have been more accurate. It is also an autoregressive decoding algorithm, because the single output token is added to the inputs for the next phase of decoding, meaning that it is architecturally slow at a high level, even though the decoding component itself runs fast.
Optimizing Greedy Decoding: Speed is a major advantage of greedy decoding. The algorithm to choose a token is simply the “max” function on the vector of logits. And this can be further optimized by noticing that there's no longer any need to use Softmax to convert the logit outputs from log-scale to full probabilities. The exponential function is a monotonically increasing function, so the token with the highest logit value will also have the highest probability. Hence, the incoming logit vector can simply be scanned for its maximum, skipping the entire Softmax calculation.
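As a small illustration of this optimization, here is a minimal C++ sketch of greedy decoding as a simple argmax scan over the raw logits, with no Softmax call at all:

    #include <vector>

    // Greedy decoding without Softmax: since the exponential is monotonically
    // increasing, the token with the largest logit also has the largest
    // probability, so a plain scan for the maximum logit is sufficient.
    int greedy_decode(const std::vector<float>& logits)
    {
        int best = 0;
        for (int i = 1; i < (int)logits.size(); i++) {
            if (logits[i] > logits[best]) best = i;
        }
        return best;  // token id with the highest logit (and highest probability)
    }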
Neural Text Degeneration: The problem of repetitious and looping text decoding is called neural text degeneration in research papers. This occurs primarily in deterministic decoding algorithms such as greedy decoding, and is largely resolved by stochastic methods such as top-k decoding.
Top-k Decoding
Top-k decoding is a generalization of greedy decoding, where the output token is chosen from the k tokens with the highest probabilities. Greedy decoding is simply top-k decoding with k=1. Top-k decoding chooses the output token randomly, so this is a stochastic algorithm that intentionally injects unpredictability in order to increase variety and quality of the output.
Top-k is regarded as an improvement over greedy search, largely fixing the “neural text degeneration” problem of repetitive output. However, like greedy decoding, top-k decoding only considers a single token at a time, so it can miss cases where a pair of tokens would have been a better output in the current situation. Hence, top-k decoding isn't as accurate as beam search decoding, which has more lookahead.
Example: Top-k Decoding Algorithm: As an example, the basic algorithm for top-50 decoding on a vocabulary size of 50,000 is:
- Softmax-normalize all the 50,000 logits into probabilities.
- Top-k sort the array of 50,000 probabilities.
- Choose the top k items (i.e. the top 50 probabilities).
- Randomly pick one of these 50 tokens to output.
In the last step, randomly choosing from the top 50 items doesn't just choose each token with a 1-out-of-50 probability. Instead, the choice is weighted according to the probability distribution of those 50 tokens, so that the higher-probability tokens among the 50 are more likely to be output.
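As a rough illustration, here is a C++ sketch of this basic top-k decoding algorithm, using std::partial_sort for the top-k step and std::discrete_distribution for the probability-weighted random choice (a production engine would likely use faster Softmax and top-k kernels, as discussed later in this chapter):

    #include <vector>
    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <utility>

    // Sketch of basic top-k decoding (e.g. k=50 over a 50,000-token vocabulary):
    // Softmax-normalize the logits, keep the k highest probabilities, then pick
    // one of those k tokens at random, weighted by their probabilities.
    int topk_decode(const std::vector<float>& logits, int k, std::mt19937& rng)
    {
        int n = (int)logits.size();

        // 1. Softmax-normalize all logits (max-subtraction for numerical stability).
        float maxlogit = *std::max_element(logits.begin(), logits.end());
        std::vector<float> prob(n);
        float denom = 0.0f;
        for (int i = 0; i < n; i++) {
            prob[i] = std::exp(logits[i] - maxlogit);
            denom += prob[i];
        }
        for (int i = 0; i < n; i++) prob[i] /= denom;

        // 2 & 3. Top-k: partially sort (probability, token) pairs, keep the k best.
        std::vector<std::pair<float,int>> items(n);
        for (int i = 0; i < n; i++) items[i] = { prob[i], i };
        if (k > n) k = n;
        std::partial_sort(items.begin(), items.begin() + k, items.end(),
            [](const auto& a, const auto& b) { return a.first > b.first; });

        // 4. Randomly choose one of the k tokens, weighted by its probability.
        std::vector<float> weights(k);
        for (int i = 0; i < k; i++) weights[i] = items[i].first;
        std::discrete_distribution<int> dist(weights.begin(), weights.end());
        return items[dist(rng)].second;  // the chosen token id
    }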
Top-p Decoding
Top-p sampling, also called “nucleus sampling,” is a sampling method that can be used on its own, but is more usually combined with top-k sampling. Top-p uses a single number p, which is a hyper-parameter that refers to a cumulative probability threshold. The number is usually in the range 0.70 to 0.90, which is 70% to 90%.
Why use top-p? One of the problems with top-k sampling is that it always picks a fixed number of the top probabilities (e.g. 50 tokens), regardless of their probabilities. This means that some words with very low probabilities can still get a chance to appear. Top-p sampling aims to exclude such very unlikely words from the output by reducing the 50 tokens from top-k if the cumulative probabilities are already high.
How does top-p work? When choosing tokens, their probabilities are added up until they reach the threshold percentage, p. For example, top-p with p=0.85 means that we add up the combined probabilities (from largest to smallest) until they reach 85%, and then we exclude the rest, even if top-k had chosen them. For example, the 50 tokens from top-k might then be reduced to 40 tokens, if the first 40 tokens have combined probabilities over 0.85 (85%). In this way, rather than a fixed 50 tokens to randomly choose, the lower-end tokens may be excluded, giving a smaller pool of candidate output tokens.
Top-p can be used on its own by simply choosing all tokens until they reach a total of 85% probability. But in cases where no tokens have high probabilities, this will select too many tokens, possibly hundreds or even thousands. Hence, top-p is often combined with top-k, where top-k selects a fixed number of tokens, and then top-p is used to possibly cull some of them at the tail end of the probabilities. Top-p never culls the highest-probability tokens, only the low-probability ones.
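Here is a small C++ sketch of that top-p culling step, assuming the top-k candidates have already been selected and are sorted in descending probability order (using the unrescaled probabilities from the full Softmax):

    #include <vector>
    #include <utility>

    // Top-p culling of an already-selected top-k candidate list, sorted by
    // descending probability. Tokens after the cumulative probability first
    // reaches p are removed from the candidate pool.
    void topp_cull(std::vector<std::pair<float,int>>& candidates, float p)
    {
        float cumulative = 0.0f;
        for (size_t i = 0; i < candidates.size(); i++) {
            cumulative += candidates[i].first;
            if (cumulative >= p) {
                candidates.resize(i + 1);  // keep up to and including this token
                return;
            }
        }
        // Cumulative probability never reached p: keep all candidates.
    }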
Should you rescale the probabilities of the top 50 tokens before top-p? Well, you can, but it may not be desirable: rescaling ensures that the 50 tokens add up to 100%, so top-p will then always cut some of the tokens (e.g. the lowest 15% of the rescaled probability mass if p=0.85). It can be simpler to track the cumulative probabilities of the top-k selected tokens without rescaling them.
Can you reorder the top-k selection and the top-p restriction? Sure, if you like (a la Hitchhiker's Guide to the Galaxy). It actually makes no difference to the output. In the top-p first case, you limit the selection to a set of tokens adding up to 0.85 probability (85%), which might be more or less than 50 tokens, but then select at most the top 50 using top-k. In the top-p second case, you firstly limit the selection to 50 tokens using top-k, and then add up their probabilities to cull those after the total is 85%. The result of both sequences is actually the same number of tokens, because the probabilities are added up the same, from highest to lowest (i.e. assuming you aren't rescaling the probabilities in between).
Optimizing Top-k Decoding
The implementation of top-k decoding has two main costs: the Softmax normalization function and the top-k array computation. In the example above, the two computations over 50,000 elements are expensive, and any operations on only 50 elements (0.1%) are almost irrelevant. Usually, the decoding algorithm is not a major bottleneck compared to other parts of the inference engine (e.g. matrix multiplications), but it can still be optimized to squeeze out a little more speed.
Softmax is a non-linear normalization that converts “logits” from log-scale values to their actual probabilities. It has to be computed across the vector of the size of the vocabulary (e.g. 50,000 tokens). Softmax requires a call to the “exp” or “expf” exponential function for all tokens, which is a slow non-linear function, although this can be improved by fast table lookup or hardware-accelerated exponentiation.
Can Softmax be skipped? For greedy decoding, we could just take the maximum logit instead of converting to probabilities, thereby avoiding Softmax completely. Can we just take the top k logits from the incoming array, without converting them using Softmax? Conceptually, yes, the top k logit values will correspond to the top k probabilities after Softmax normalization (because the exponential function is monotonically increasing). However, there are at least four problems this causes:
1. The “temperature” hyper-parameter cannot be used in its usual way, since this is applied as a scaling factor inside Softmax. (Can the temperature parameter be modified to apply directly to logits?)
2. The “top-p” sampling method cannot be easily combined with top-k. The top-p method is compatible with top-k, generally regarded as improving it, but top-p requires summing probabilities, and the non-Softmax logits are logs-of-probabilities. (Maybe this top-p summation can be adjusted to log-scale somehow?)
3. The random selection of tokens has the wrong distribution. Similarly, the random choice of one of the top k tokens using the logits has a log-scale distribution, rather than a probability distribution. The log-scale contracts the logits non-linearly, with the effect that the higher-probability tokens won't get as large a share of the randomness, and will be selected less often than if the full linear-scale probabilities were used. The effect is somewhat analogous to a higher value of the “temperature” parameter, since there will be more variety, with less common tokens chosen more often. Possibly this doesn't matter, or can be acceptable?
4. Negative values. The logits can be negative, whereas the outputs from Softmax are all non-negative. The Softmax normalization would change those negatives to tiny positive numbers, close to zero. This is only a minor problem, but it requires an extra pass over the set of top-k candidates to remove any negatives from the array.
Overall, it looks likely that Softmax could be removed, and, with some modifications, still have a reasonable approximation of the top-k/top-p decoding algorithm.
Optimizing the Top-k Array Calculation: Top-k computation on a vector is the other major cost of top-k decoding. Like Softmax, top-k is not a cheap primitive operation, and needs to be computed on a vector of size equal to the vocabulary (e.g. 50,000 is a common size). The top-k algorithm can conceptually be implemented as a sorting operation followed by selection of the k top elements from the sorted array. Sorting can be implemented using sequential code (e.g. std::sort or the qsort function), or there are GPU-based array sorting methods in the research. Alternatively, there are faster top-k algorithms that don't use sorting, for both sequential and GPU platforms. Various optimizations of top-k were examined in Chapter 23.
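As one example of avoiding a full sort, here is a C++ sketch of top-k selection using std::nth_element, which partitions the array around the k-th largest element in O(n) average time, rather than the O(n log n) of sorting all 50,000 entries:

    #include <vector>
    #include <algorithm>
    #include <utility>

    // Top-k selection without a full sort: std::nth_element places the k largest
    // (value, index) pairs at the front of the array in O(n) average time; only
    // those k survivors are then sorted, at a negligible O(k log k) cost.
    std::vector<std::pair<float,int>> top_k(const std::vector<float>& v, int k)
    {
        std::vector<std::pair<float,int>> items(v.size());
        for (size_t i = 0; i < v.size(); i++) items[i] = { v[i], (int)i };
        if (k > (int)items.size()) k = (int)items.size();
        auto greater_first =
            [](const auto& a, const auto& b) { return a.first > b.first; };
        std::nth_element(items.begin(), items.begin() + k, items.end(), greater_first);
        items.resize(k);                                       // keep only the top k
        std::sort(items.begin(), items.end(), greater_first);  // order them (optional)
        return items;
    }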
Fused Softmax-Top-k? Another thought for optimization of top-k decoding is whether the Softmax and top-k calculation algorithms can be fused? Both are scanning the same vector, and doing nasty non-linear things. This troublesome pair sounds like a good candidate for kernel operator fusion. (But I haven't found a paper or a coded example of this as yet.)
Reordered Softmax and Top-k. One optimization to consider is the Apollo 13 idea: reverse it. Do Top-k first, and then Softmax. If we're doing k=50, then Top-k reduces the logit vector from vocabulary size (e.g. 50,000) down to 50 (i.e. 0.1%). The subsequent processing of those 50 logits with Softmax or top-p algorithms is almost a nil cost compared to top-k on 50,000 items.
Does reversing it work? It changes the Softmax scaling to have a different denominator, based on the sum over only the top 50 logits rather than the whole vocabulary. The temperature scaling parameter inside Softmax probably still works on those 50 items. And all 50 values are scaled the same way, so from the perspective of choosing one of these items randomly according to their probability distribution (over the subset), this reordering optimization actually seems to still work.
There is one problem with reordering if the extra top-p modification is being used in combination with top-k. The logic of top-p reductions of the top-50 elements might need to be revisited and adjusted. If we are using p=0.85, so that the top-k candidates are cut off once their cumulative probability exceeds 85%, then after reordering the cut is based on the sum of probabilities of the subset, with those probabilities scaled differently. Cutting the top-p set at 85% of the newly scaled top-50 items will probably reduce the candidate set more than cutting based on 85% of the entire distribution of 50,000 probabilities. It's not clear whether this is even a problem in terms of accuracy/perplexity, but it does seem to reduce the likelihood of uncommon words getting emitted into the text, making the output less creative and more predictable. A simple adjustment might be to use a higher value of p when using this reordering optimization.
Overall, the idea of a “post-Softmax” reordering of top-k decoding to do the Softmax after the top-k computation seems workable. It also seems to have fewer problems than just dropping the Softmax operation completely.
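Here is a rough C++ sketch of this reordered approach, assuming a top-k helper like the one sketched above that returns the k largest (logit, token) pairs in descending order; Softmax with temperature is then applied only to those k survivors, followed by the weighted random choice:

    #include <vector>
    #include <cmath>
    #include <random>
    #include <utility>

    // Assumed helper (like the sketch above): returns the k largest
    // (logit, token id) pairs from the full logit vector, in descending order.
    std::vector<std::pair<float,int>> top_k(const std::vector<float>& v, int k);

    // Reordered decoding: top-k on the raw logits first (50,000 items), then
    // Softmax with temperature over only the k survivors (e.g. 50), and finally
    // a random choice weighted by the rescaled probabilities of the subset.
    int topk_then_softmax_decode(const std::vector<float>& logits, int k,
                                 float temperature,  // assumed > 0
                                 std::mt19937& rng)
    {
        std::vector<std::pair<float,int>> top = top_k(logits, k);

        // Softmax over only the k selected logits (max-subtraction for stability).
        float maxlogit = top[0].first;  // first element is the largest logit
        std::vector<float> weights(top.size());
        float denom = 0.0f;
        for (size_t i = 0; i < top.size(); i++) {
            weights[i] = std::exp((top[i].first - maxlogit) / temperature);
            denom += weights[i];
        }
        for (size_t i = 0; i < top.size(); i++) weights[i] /= denom;

        // Weighted random choice among the k candidates.
        std::discrete_distribution<int> dist(weights.begin(), weights.end());
        return top[dist(rng)].second;
    }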
Top-k Vector Algorithm
The top-k algorithm on an array of numbers is a well-known problem in Computer Science theory. Longstanding theory examines sequential top-k algorithms, with both sorting and non-sorting versions. More recent work has examined parallel top-k numeric algorithms using GPU-accelerated execution.
Sequential top-k algorithms have a significant body of theory. The simplest algorithm is to sort the array first, and then choose the first k elements from the sorted array. Sorting an array is an algorithm that also has its own huge body of theory, and the better sorting algorithms can be chosen. This has complexity O(n log n) for most sorting algorithms, with only an additional O(k) cost for selecting the first k elements from the sorted array. However, sorting the entire array is somewhat wasteful and unnecessary, since we only want a subset, and there are several efficient top-k algorithms without sorting that optimize this overhead away.
Parallelizing top-k is an interesting area of algorithm research. Two approaches have been examined: with sorting and without. Sorting algorithms have fascinated researchers for over fifty years, and more recent papers have generalized this to GPU sorting, methods that could be used in a top-k algorithm. There has also been recent research attention on optimizing the standard top-k algorithms to run in parallel using GPU acceleration, but without using a full sort algorithm.
Advanced Decoding Algorithms
The top-k and top-p decoding algorithms are simple in the sense that only a single token is output, one at a time. More advanced decoding algorithms are often used in practice.
Beam Search Decoding: The beam search algorithm is an advancement upon greedy decoding that looks ahead a little further. Beam search maintains some alternative possibilities for sequences of tokens to output, which is analogous to having it consider a number of different words and phrases, before it finally makes a decision on which to output. This means that it is not easily tripped up by decisions between a single-word or multiple-word output.
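As a rough illustration of the idea, here is a minimal C++ sketch of beam search over a hypothetical next-token log-probability function (a real implementation would also handle end-of-sequence tokens, expand only a few top tokens per beam, and batch the model calls):

    #include <vector>
    #include <algorithm>
    #include <functional>

    // Hypothetical scorer: given a partial token sequence, returns the
    // log-probabilities of every possible next token (a vocab-sized vector).
    using LogProbFn = std::function<std::vector<float>(const std::vector<int>&)>;

    struct Beam {
        std::vector<int> tokens;  // candidate output sequence so far
        float score;              // cumulative log-probability
    };

    // Minimal beam search: keep the best beam_width partial sequences, extend
    // each with every possible next token, and retain the highest-scoring
    // extensions for the next step.
    std::vector<int> beam_search(const std::vector<int>& prompt, int steps,
                                 int beam_width, const LogProbFn& next_logprobs)
    {
        std::vector<Beam> beams = { Beam{ prompt, 0.0f } };
        for (int step = 0; step < steps; step++) {
            std::vector<Beam> expanded;
            for (const Beam& b : beams) {
                std::vector<float> logprobs = next_logprobs(b.tokens);
                for (int tok = 0; tok < (int)logprobs.size(); tok++) {
                    Beam nb = b;
                    nb.tokens.push_back(tok);
                    nb.score += logprobs[tok];
                    expanded.push_back(std::move(nb));
                }
            }
            // Keep only the best beam_width candidate sequences.
            size_t keep = std::min((size_t)beam_width, expanded.size());
            std::partial_sort(expanded.begin(), expanded.begin() + keep, expanded.end(),
                [](const Beam& a, const Beam& b) { return a.score > b.score; });
            expanded.resize(keep);
            beams = std::move(expanded);
        }
        return beams[0].tokens;  // highest-scoring sequence (includes the prompt)
    }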
Parallel Decoding: The class of parallel decoding algorithms aims to break the autoregression bottleneck in decoder output algorithms. The idea is to output as many tokens in parallel as possible, which is much faster than greedy decoding or beam search decoding, which are both autoregressive.
Aggressive Decoding: Aggressive decoding is an optimization that is mostly relevant to the “editing” or “error correction” use case. The idea is that it compares the output tokens to the input tokens, and if they are similar, then this can output more than one token at a time. This is more efficient than a basic greedy decoding algorithm. However, it only works when the output text is going to be very similar to the input prompt.
Non-Autoregressive Decoding
One of the biggest obstacles to fast inference of Large Language Models (LLMs) is that they emit one token at a time (e.g. one word at a time). This limits parallelism and means that the entire model must be re-run multiple times, once for each word (or subword token).
Why is there a bottleneck? The reason for this limitation is that the next word to output inherently relies on the prior word, which is kind of an unavoidable property of human language. But in LLM coding circles, this is called the “autoregression” problem, possibly because researchers tend to like big words.
Why is this slow? Because of this interdependency between words, the Transformer's overall algorithm is designed so that when the decoder emits a word, that word is then appended to the input prompt that goes into the model's next iteration to help it emit the next word. This makes sense, because most humans also speak one word at a time. But that's slow for multiple reasons:
- The model runs for every token.
- The model never produces 2 tokens (or more) in parallel.
- The model cannot start working on the 2nd token before finishing the 1st token, which limits pipelining (a type of parallelism).
There is various research on fixing this decoding bottleneck, and achieving more parallelism. The research area is called “non-autoregression” optimizations. And, to be fair, you probably could Neuralink two words at a time, because your brain is usually far ahead of your mouth.
Tokens and Non-Autoregression
Although much of the research into autoregression is major surgery to the LLM architecture, there's a simpler way to mitigate the inefficiency: bigger tokens.
If the tokens are longer, then fewer are emitted for each piece of work done by the AI engine. So, the model can run faster in terms of fewer iterations if the tokenizer chooses whole words rather than sub-words, or maybe even handles two-word common phrases as separate single tokens (i.e. multi-word tokens). Longer tokens therefore reduce inefficiencies from autoregression, but also reduce the total length of the input sequence, which also further reduces model execution time, since the Transformer's attention algorithm is well-known to be quadratic in the size of the input sequence.
The downside to using longer tokens is that it means more unique tokens, which increases the vocabulary size. And the model's complexity is somewhat dependent on the vocabulary size, so this increase with longer tokens means that the whole model is larger, and it runs slower. However, it's not that clear cut, because the model's size is dependent on the embeddings size, not the actual vocabulary count, so not all components are directly affected by a larger vocabulary.
Therefore, longer tokens reduce the latency time in terms of reducing the autoregression issue, but increase latency time by making the model larger overall. Maybe there's some happy trade-off here? Most of the current models seem to use a vocabulary of around 50,000 words, where the vocabulary size becomes one of the meta-parameters of the model.