
Optimizing Top-k Decoding

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The implementation of top-k decoding has two main costs: the Softmax normalization function and the top-k array computation. In the example above, the two computations over 50,000 elements are expensive, whereas any operations on only 50 elements (0.1%) are almost negligible by comparison. Usually, the decoding algorithm is not a major bottleneck compared to other parts of the inference engine (e.g. matrix multiplications), but it can still be optimized to squeeze out a little more speed.

Softmax is a non-linear normalization that converts “logits” from log-scale values into their actual probabilities. It has to be computed across a vector of the size of the vocabulary (e.g. 50,000 tokens). Softmax requires a call to the “exp” or “expf” exponential function for every token, which is a slow non-linear operation, although this can be improved by fast table lookups or hardware-accelerated exponentiation.
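
For reference, here is a minimal sketch of a basic Softmax over a logits vector in C++, with the usual max-subtraction trick for numerical stability and an optional temperature scaling parameter (the function names are illustrative, not taken from any particular engine):

    #include <cmath>
    #include <vector>

    // Basic Softmax: convert logits into probabilities that sum to 1.
    std::vector<float> softmax_with_temperature(const std::vector<float>& logits,
                                                float temperature = 1.0f)
    {
        std::vector<float> probs(logits.size());
        // Find the maximum logit (subtracting it avoids overflow in expf).
        float maxlogit = logits[0];
        for (float x : logits) if (x > maxlogit) maxlogit = x;
        // Exponentiate each scaled logit and accumulate the denominator.
        float denom = 0.0f;
        for (size_t i = 0; i < logits.size(); ++i) {
            probs[i] = expf((logits[i] - maxlogit) / temperature);
            denom += probs[i];
        }
        // Normalize so that the outputs sum to 1 (true probabilities).
        for (float& p : probs) p /= denom;
        return probs;
    }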

Can Softmax be skipped? For greedy decoding, we could just take the maximum logit instead of converting to probabilities, thereby avoiding Softmax completely (see the sketch after this list). Can we just take the top k logits from the incoming array, without converting them using Softmax? Conceptually, yes: the top k logit values will correspond to the top k probabilities after Softmax normalization, because the exponential function is monotonically increasing. However, this causes at least four problems:

    1. The “temperature” hyper-parameter cannot be used in its usual way, since this is applied as a scaling factor inside Softmax. (Can the temperature parameter be modified to apply directly to logits?)

    2. The “top-p” sampling method cannot be easily combined with top-k. The top-p method is compatible with top-k, and is generally regarded as improving it, but top-p requires summing probabilities, whereas the non-Softmax logits are logs-of-probabilities. (Maybe this top-p summation can be adjusted to log-scale somehow?)

    3. The random selection of tokens has the wrong distribution. Similarly, the random choice of one of the top k tokens using the logits follows a log-scale distribution, rather than a probability distribution. The log-scale contracts the logits non-linearly, with the effect that the higher-probability tokens won't get as large a share of the randomness, and will be selected less often than if the full linear-scale probabilities were used. The effect is somewhat analogous to a higher value of the “temperature” parameter, since there will be more variety, with less common tokens chosen more often. Possibly this doesn't matter, or is acceptable?

    4. Negative values. The logits can be negative, whereas the outputs from Softmax are all non-negative. The Softmax normalization would change those negatives into tiny positive numbers, close to zero. This is only a minor problem, but it requires an extra pass over the set of top-k candidates to remove (or adjust) any negative values in the array.
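
For the simpler greedy decoding case mentioned above, here is a minimal sketch of choosing a token straight from the raw logits, skipping Softmax entirely; the argmax of the logits is the same token as the argmax of the probabilities, because the exponential function is monotonically increasing (illustrative code only):

    #include <cstddef>
    #include <vector>

    // Greedy decoding without Softmax: pick the token with the largest logit.
    size_t greedy_decode_from_logits(const std::vector<float>& logits)
    {
        size_t best = 0;
        for (size_t i = 1; i < logits.size(); ++i) {
            if (logits[i] > logits[best]) best = i;  // track the largest logit
        }
        return best;  // token id with the highest logit (and highest probability)
    }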

Overall, it looks likely that Softmax could be removed and, with some modifications, the result would still be a reasonable approximation of the top-k/top-p decoding algorithm.

Optimizing the Top-k Array Calculation: Top-k computation on a vector is the other major cost of top-k decoding. Like Softmax, top-k is not a cheap primitive operation, and needs to be computed on a vector of size equal to the vocabulary (e.g. a size of 50,000 is common). The top-k algorithm can conceptually be implemented as a sorting operation followed by selection of the k top elements from the sorted array. Sorting can be implemented using sequential code (e.g. std::sort or the qsort function), or there are GPU-based array sorting methods in the research literature. Alternatively, there are faster top-k algorithms that don't use sorting, for both sequential and GPU platforms. Various optimizations of top-k were examined in Chapter 23.
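
As a simple illustration (not the fastest method; see Chapter 23 for better ones), here is a sketch of top-k selection on the logits vector using std::partial_sort over an index array, which avoids fully sorting all 50,000 elements:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Return the indices of the k largest logits, in descending order of logit.
    std::vector<int> topk_indices(const std::vector<float>& logits, int k)
    {
        std::vector<int> idx(logits.size());
        std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ..., vocabulary-1
        // Partially sort so that the k largest logits come first.
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
            [&](int a, int b) { return logits[a] > logits[b]; });
        idx.resize(k);   // keep only the top k token ids
        return idx;
    }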

Fused Softmax-Top-k? Another thought for optimizing top-k decoding is whether the Softmax and top-k calculations can be fused. Both are scanning the same vector, and doing nasty non-linear things. This troublesome pair sounds like a good candidate for kernel operator fusion. (But I haven't found a paper or a coded example of this as yet.)
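
As a purely speculative sketch of what a sequential fused pass might look like, the following code computes the Softmax denominator (using an online rescaling trick whenever a new maximum is found) and tracks the top-k candidates in a bounded min-heap, all in a single sweep over the logits. This is an illustration of the idea only, not a published or optimized kernel:

    #include <cmath>
    #include <queue>
    #include <utility>
    #include <vector>

    struct FusedResult {
        std::vector<std::pair<float,int>> topk;  // (logit, token id) pairs
        float maxlogit;   // maximum logit seen
        float denom;      // Softmax denominator in stable form: sum of expf(x - maxlogit)
    };

    FusedResult fused_softmax_topk(const std::vector<float>& logits, int k)
    {
        // Min-heap on logit value: the smallest of the current top-k sits on top.
        auto cmp = [](const std::pair<float,int>& a, const std::pair<float,int>& b) {
            return a.first > b.first;
        };
        std::priority_queue<std::pair<float,int>,
                            std::vector<std::pair<float,int>>, decltype(cmp)> heap(cmp);
        float maxlogit = -INFINITY;
        float denom = 0.0f;
        for (int i = 0; i < (int)logits.size(); ++i) {
            float x = logits[i];
            // Online Softmax denominator: rescale the running sum on a new maximum.
            if (x > maxlogit) {
                denom = denom * expf(maxlogit - x) + 1.0f;  // old terms rescaled; expf(x-x)==1
                maxlogit = x;
            } else {
                denom += expf(x - maxlogit);
            }
            // Top-k tracking with the bounded min-heap.
            if ((int)heap.size() < k) {
                heap.push({x, i});
            } else if (x > heap.top().first) {
                heap.pop();
                heap.push({x, i});
            }
        }
        FusedResult r;
        r.maxlogit = maxlogit;
        r.denom = denom;
        while (!heap.empty()) { r.topk.push_back(heap.top()); heap.pop(); }
        return r;
    }
    // Each candidate's probability is then expf(logit - r.maxlogit) / r.denom.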

Reordered Softmax and Top-k. One optimization to consider is the Apollo 13 idea: reverse it. Do top-k first, and then Softmax. If we're doing k=50, then top-k reduces the logit vector from vocabulary size (e.g. 50,000) down to 50 (i.e. 0.1%). The subsequent processing of those 50 logits with Softmax or top-p algorithms is almost a nil cost compared to top-k on 50,000 items.

Does reversing it work? It changes the Softmax scaling to have a different denominator, which is based on the sum over only the top k tokens (0.1% of the vocabulary) rather than over all of them. The temperature scaling parameter inside Softmax probably still works on those 50 items. And all 50 candidates are scaled the same, so from the perspective of choosing one of these items randomly according to their probability distribution (within the subset), this reordering optimization actually seems to still work.
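
Here is a minimal sketch of the reordered pipeline, reusing the topk_indices and softmax_with_temperature sketches above: top-k is selected from the raw logits first, and Softmax is then run over only the k survivors (50 items instead of 50,000):

    #include <vector>

    // Reordered decoding: top-k on raw logits, then Softmax on the survivors.
    std::vector<float> reordered_topk_softmax(const std::vector<float>& logits,
                                              int k, float temperature = 1.0f)
    {
        std::vector<int> idx = topk_indices(logits, k);   // top-k on raw logits
        std::vector<float> sublogits(k);
        for (int i = 0; i < k; ++i) sublogits[i] = logits[idx[i]];
        // The Softmax denominator is now the sum over the top-k subset only,
        // rather than over the full vocabulary.
        return softmax_with_temperature(sublogits, temperature);
    }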

There is one problem with reordering if the extra top-p modification is being used in combination with top-k. The logic of the top-p reduction of the top-50 elements might need to be revisited and adjusted. Suppose we are using p=0.85, so that the candidate set is cut off once the cumulative probability of the candidates exceeds 85%. After reordering, that cutoff applies to the sum-of-probabilities of the subset, but with the probabilities scaled differently (they now sum to 100% over only 50 tokens). Cutting the top-p set at 85% of the newly scaled top-50 items will probably reduce the candidate set more than cutting based on 85% of the entire distribution of 50,000 probabilities. It's not clear whether this is even a problem in terms of accuracy/perplexity, but it does seem to reduce the likelihood of uncommon words getting emitted into the text, making the output less creative and more predictable. A possible adjustment is simply to use a higher value of p when using this reordering optimization.
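
For illustration, a top-p cutoff over the reordered subset might look something like this sketch, which assumes the subset probabilities are already sorted in descending order (as they are when produced by the partial-sort-based topk_indices above); as discussed, a higher p value may be needed to compensate for the renormalization:

    #include <vector>

    // Return how many of the (descending, renormalized) candidate probabilities
    // to keep so that their cumulative sum reaches the top-p threshold.
    int top_p_cutoff(const std::vector<float>& sorted_probs, float p)
    {
        float cumulative = 0.0f;
        int count = 0;
        for (float prob : sorted_probs) {
            cumulative += prob;
            ++count;
            if (cumulative >= p) break;  // stop once probability mass p is covered
        }
        return count;  // number of candidate tokens to keep for random selection
    }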

Overall, the idea of a “post-Softmax” reordering of top-k decoding to do the Softmax after the top-k computation seems workable. It also seems to have fewer problems than just dropping the Softmax operation completely.

 
