Aussie AI

Top-k and Top-p Decoding

  • Last Updated 14 January, 2025
  • by David Spuler, Ph.D.

What is Top-k Decoding?

Top-k decoding is a generalization of greedy decoding, where the output token is chosen from the k tokens with the highest probabilities. Greedy decoding is simply top-k decoding with k=1. Top-k decoding chooses the output token randomly from those k candidates, so it is a stochastic algorithm that intentionally injects unpredictability in order to increase the variety and quality of the output.

Top-k is regarded as an improvement over greedy search, largely fixing the "neural text degeneration" problem of repetitive output. However, like greedy decoding, top-k decoding only considers a single token at a time, so it can miss cases where a better pair of two tokens should be output in the current situation. Hence, top-k decoding isn't as accurate as beam search decoding, which has more lookahead.

Example: Top-k Decoding

As an example, the basic algorithm for top-50 decoding on a vocabulary size of 50,000 is:

  • Softmax-normalize all the 50,000 logits into probabilities.
  • Top-k sort the array of 50,000 probabilities.
  • Choose the top k items (i.e. the top 50 probabilities).
  • Randomly pick one of these 50 tokens to output.

In the last step, randomly choosing from the top 50 items doesn't just choose each token with a 1-out-of-50 probability. Instead, it makes a weighted random choice according to their probability distribution. This ensures that the randomness follows the top 50 probabilities, so that higher-probability tokens among those 50 are more likely to be output.
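Putting these steps together, here is a minimal C++ sketch of top-k sampling. The function names and structure are illustrative only, not taken from any particular inference engine:

    // A minimal sketch of top-k sampling over a logits vector (illustrative
    // names; not production code from any particular engine).
    #include <vector>
    #include <algorithm>
    #include <cmath>
    #include <random>

    // Returns the chosen token id.
    int top_k_sample(const std::vector<float>& logits, int k, std::mt19937& rng) {
        int n = (int)logits.size();

        // Step 1: Softmax-normalize all logits into probabilities.
        float maxv = *std::max_element(logits.begin(), logits.end());
        std::vector<float> probs(n);
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            probs[i] = std::exp(logits[i] - maxv);  // max-subtraction for stability
            sum += probs[i];
        }
        for (float& p : probs) p /= sum;

        // Steps 2 and 3: top-k sort so the k highest probabilities come first.
        std::vector<int> idx(n);
        for (int i = 0; i < n; ++i) idx[i] = i;
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
            [&](int a, int b) { return probs[a] > probs[b]; });

        // Step 4: weighted random pick among the top k candidates, so that
        // higher-probability tokens are chosen more often.
        std::vector<float> weights(k);
        for (int i = 0; i < k; ++i) weights[i] = probs[idx[i]];
        std::discrete_distribution<int> pick(weights.begin(), weights.end());
        return idx[pick(rng)];
    }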

Top-p and Top-k Decoding

Top-p sampling, also called "nucleus sampling", is a sampling method that can be used on its own, but is more usually added to top-k sampling. Top-p uses a single number p, which is a hyper-parameter that refers to a cumulative probability threshold. The number is usually in the range 0.70 to 0.90, which is 70% to 90%.

Why use top-p? One of the problems with top-k sampling is that it always picks a fixed number of the top probabilities (e.g. 50 tokens), regardless of their probabilities. This means that some words with very low probabilities can still get a chance to appear. Top-p sampling aims to exclude such very unlikely words from the output by trimming the 50 tokens from top-k when the cumulative probabilities are already high.

How does top-p work? When choosing tokens, their probabilities are added up until they reach the threshold percentage, p. For example, top-p with p=0.85 means that we add up the combined probabilities (from largest to smallest) until they reach 85%, and then exclude the rest, even if top-k had chosen them. In practice, the 50 tokens from top-k might be reduced to 40 tokens, if the first 40 tokens have combined probabilities over 0.85 (85%). In this way, rather than a fixed 50 tokens to randomly choose from, the lower-end tokens may be excluded, giving a smaller pool of candidate output tokens.

Top-p can be used on its own by simply choosing tokens until they reach a total of 85% probability. But in cases where no tokens have high probabilities, this will select too many tokens, possibly hundreds or even thousands. Hence, top-p is often combined with top-k, where top-k selects a fixed number of tokens, and then top-p is used to possibly cull some of these at the tail end of the probabilities. Top-p never culls the top-probability tokens, only the low-probability ones.
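As an illustration of this combination, the following sketch applies the top-p cull to a candidate list that top-k has already sorted into descending probability order (the function name is illustrative):

    // Sketch of the top-p cull applied to an already top-k-sorted candidate
    // list. 'probs' holds the descending probabilities of the top-k tokens,
    // taken from the full Softmax distribution without rescaling.
    #include <vector>

    // Returns how many of the leading candidates to keep for sampling.
    int top_p_cutoff(const std::vector<float>& probs, float p /* e.g. 0.85f */) {
        float cumulative = 0.0f;
        for (int i = 0; i < (int)probs.size(); ++i) {
            cumulative += probs[i];
            if (cumulative >= p) {
                return i + 1;             // keep tokens 0..i, cull the tail
            }
        }
        return (int)probs.size();         // threshold never reached: keep all k
    }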

Should you rescale the probabilities of the top 50 tokens before top-p? You can, but it may not be desirable, because rescaling means top-p will always cut some of the tokens (e.g. roughly the lowest 15% of the rescaled probability mass if p=0.85). It can be simpler to track the cumulative probabilities of the top-k selected tokens without rescaling them.

Can you reorder the top-k selection and the top-p restriction? Sure, if you like, because it actually makes no difference (a la Hitchhiker's Guide to the Galaxy). In the top-p-first case, you limit the selection to a set of tokens adding up to 0.85 probability (85%), which might be more or fewer than 50 tokens, and then select at most the top 50 using top-k. In the top-p-second case, you first limit the selection to 50 tokens using top-k, and then add up their probabilities, culling those after the total reaches 85%. Both orderings produce the same number of tokens, because the probabilities are added up the same way, from highest to lowest (assuming you aren't rescaling the probabilities in between).

Optimizing Top-k Decoding

The implementation of top-k decoding has two main costs: the Softmax normalization function and the top-k array computation. In the example above, the two computations over 50,000 elements are expensive, and any operations on only 50 elements (0.1%) are almost irrelevant. Usually the decoding algorithm is not a major bottleneck compared to other parts of the inference engine (e.g. matrix multiplications), but it can still be optimized to squeeze out a little more speed.

Softmax is a non-linear normalization that converts "logits" from log-scale values to their actual probabilities. It has to be computed across the vector of the size of the vocabulary (e.g. 50,000 tokens). Softmax requires a call to the "exp" or "expf" exponential function for all tokens, which is a slow non-linear function, although this can be approximated by fast table lookup or possibly hardware-accelerated exponentiation.
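For reference, here is a minimal sketch of Softmax with the usual temperature scaling applied to the logits (the max-subtraction for numerical stability and the function name are illustrative choices):

    // Sketch of Softmax with the usual temperature scaling (the function
    // name and max-subtraction trick are illustrative choices).
    #include <vector>
    #include <algorithm>
    #include <cmath>

    std::vector<float> softmax_with_temperature(const std::vector<float>& logits,
                                                float temperature /* e.g. 1.0f */) {
        float maxv = *std::max_element(logits.begin(), logits.end());
        std::vector<float> probs(logits.size());
        float sum = 0.0f;
        for (int i = 0; i < (int)logits.size(); ++i) {
            // One slow expf call per vocabulary entry (e.g. 50,000 of them).
            probs[i] = std::exp((logits[i] - maxv) / temperature);
            sum += probs[i];
        }
        for (float& p : probs) p /= sum;  // normalize so probabilities sum to 1
        return probs;
    }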

Can Softmax be skipped? For greedy decoding, we could just take the maximum logit instead of converting to probabilities, thereby avoiding Softmax completely. Can we just take the top k logits from the incoming array, without converting them using Softmax? Conceptually, yes, the top k logit values will correspond to the top k probabilities after Softmax normalization (because the exponential function is monotonically increasing). However, there are at least four problems this causes:

  • 1. The "temperature" hyper-parameter cannot be used in its usual way, since this is applied as a scaling factor inside Softmax. (Can the temperature parameter be modified to apply directly to logits?)
  • 2. The "top-p" sampling method cannot be easily combined with top-k. The top-p method is compatible with top-k, generally regarded as improving it, but top-p requires summing probabilities, and the non-Softmax logits are logs-of-probabilities. (Maybe this top-p summation can be adjusted to log-scale somehow?)
  • 3. The random selection of tokens has the wrong distribution. Similarly, the random choice of one of the top k tokens using the logits has a log-scale distribution, rather than a probability distribution. The log-scale contracts the logits non-linearly, with the effect that the higher probability tokens won't get as high of a share of the randomness, and will get selected less often compared to if the full linear-scale probabilities were used. The effect of this is somewhat analogous to a higher value for the "temperature" parameter, since there will be more variety with less common tokens chosen more often. Possibly this doesn't matter or can be acceptable?
  • 4. Negative values. The logits can be negative, whereas the outputs from Softmax are all non-negative. The Softmax normalization would change those negatives into tiny positive numbers, close to zero. This is only a minor problem, but it requires an extra pass over the set of top-k candidates to remove the negatives from the array.
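For the greedy case mentioned above, skipping Softmax really is trivial, because the exponential preserves ordering. A minimal sketch (the function name is illustrative):

    // Greedy decoding can skip Softmax entirely: the largest logit
    // corresponds to the largest probability, because exp() is
    // monotonically increasing.
    #include <vector>
    #include <algorithm>

    int greedy_argmax(const std::vector<float>& logits) {
        return (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
    }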

Optimizing the Top-k Array Calculation: Top-k computation on a vector is the other major cost of top-k decoding. Like Softmax, top-k is not a cheap primitive operation, and needs to be computed on a vector of size equal to the vocabulary (e.g. a size of 50,000 is common). The top-k algorithm can conceptually be implemented as a sorting operation followed by selection of the k top elements from the sorted array. Sorting can be implemented using sequential code (e.g. the C/C++ qsort function or C++ std::sort), or there are GPU-based array sorting methods in the research. Alternatively, there are faster top-k algorithms that don't use sorting, for both sequential and GPU platforms.
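As one example of avoiding a full sort, this sketch uses the C++ standard library's partition-based selection; a production engine might instead use a specialized sequential or GPU top-k kernel (the function name is illustrative):

    // Sketch of a sort-free top-k selection using standard C++ partitioning.
    #include <vector>
    #include <algorithm>

    // Returns the indices of the k largest logits, in descending order.
    std::vector<int> top_k_indices(const std::vector<float>& logits, int k) {
        std::vector<int> idx(logits.size());
        for (int i = 0; i < (int)idx.size(); ++i) idx[i] = i;

        // nth_element partitions in O(n) average time: the k largest elements
        // end up first, but in arbitrary order...
        std::nth_element(idx.begin(), idx.begin() + k, idx.end(),
            [&](int a, int b) { return logits[a] > logits[b]; });
        // ...then only those k leading elements need sorting, in O(k log k).
        std::sort(idx.begin(), idx.begin() + k,
            [&](int a, int b) { return logits[a] > logits[b]; });

        idx.resize(k);
        return idx;
    }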

Fused Softmax-Top-k? Another thought for optimizing top-k decoding is whether the Softmax and top-k calculation algorithms can be fused. Both are scanning the same vector, and doing nasty non-linear things. This troublesome pair sounds like a good candidate for kernel operator fusion. (But I haven't found a paper or a coded example of this as yet.)

Reordered Softmax and Top-k. One optimization to consider is the Apollo 13 idea: reverse it. Do top-k first, and then Softmax. If we're using k=50, then top-k reduces the logit vector from the vocabulary size (e.g. 50,000) down to 50 (i.e. 0.1%). The subsequent processing of those 50 logits with the Softmax or top-p algorithms is a negligible cost compared to top-k on 50,000 items.

Does reversing it work? It changes the Softmax scaling to have a different denominator, now based on the sum over only the top 50 exponentiated logits (rather than all 50,000). The temperature scaling parameter inside Softmax probably still works on those 50 items. And all 50 values are scaled the same way, so from the perspective of choosing one of these items randomly according to their probability distribution (of the subset), this reordering optimization actually seems to still work.
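Here is a self-contained sketch of this reordering, doing top-k on the raw logits and then Softmax over only the survivors (names and structure are illustrative; a full implementation would also return the surviving token indices to the caller):

    // Reordered "top-k first, Softmax second" sketch.
    #include <vector>
    #include <algorithm>
    #include <cmath>

    std::vector<float> top_k_then_softmax(const std::vector<float>& logits, int k) {
        // Select indices of the k largest raw logits; valid because exp()
        // preserves their ordering.
        std::vector<int> idx(logits.size());
        for (int i = 0; i < (int)idx.size(); ++i) idx[i] = i;
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
            [&](int a, int b) { return logits[a] > logits[b]; });

        // Softmax over only the k survivors: the denominator is now the sum
        // of k exponentiated logits, not all 50,000.
        std::vector<float> probs(k);
        float maxv = logits[idx[0]];        // largest logit, for stability
        float sum = 0.0f;
        for (int i = 0; i < k; ++i) {
            probs[i] = std::exp(logits[idx[i]] - maxv);
            sum += probs[i];
        }
        for (float& p : probs) p /= sum;
        return probs;                       // probs[i] is for token idx[i]
    }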

There is one problem with reordering if the extra top-p modification is being used in combination with top-k. The logic of the top-p reduction of the top-50 elements might need to be revisited and adjusted. If we are using p=0.85, the original logic stops adding top-k candidates once their probabilities sum to more than 85% of the full distribution; after reordering, the same 85% threshold applies to the rescaled probabilities of the subset instead. Cutting the top-p set at 85% of the newly scaled top-50 items will probably cull more candidates than cutting at 85% of the entire distribution of 50,000 probabilities. It's not clear whether this is even a problem in terms of accuracy/perplexity, but it does seem to reduce the likelihood of uncommon words being emitted into the text, making the output less creative and more predictable. A possible adjustment is simply to use a higher value of p when using this reordering optimization.

Overall, the idea of a "post-Softmax" reordering of top-k decoding to do the Softmax after the top-k computation seems workable. It also seems to have fewer problems than just dropping the Softmax operation completely.

Research on Top-k Decoding

Research papers on the use of top-k as a decoding algorithm are below. See also the many research papers on the top-k algorithm in general, further below. Top-K decoding papers for AI models include:

Top-k Vector Algorithm Research

There is a lot of extra research on the computation of the top-k elements in a vector. See also: Top-k Vector Algorithms

Research on Top-P Decoding

Research papers specific to Top-p decoding:

More Research on Decoding Algorithms

More AI Research

Read more about: