Aussie AI

Top-p Decoding

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Top-p sampling, also called “nucleus sampling,” is a sampling method that can be used on its own, but is more commonly combined with top-k sampling. Top-p uses a single hyper-parameter, p, which is a cumulative probability threshold. The value is usually in the range 0.70 to 0.90, which is 70% to 90%.

Why use top-p? One of the problems with top-k sampling is that it always picks a fixed number of the top probabilities (e.g. 50 tokens), regardless of how small those probabilities are. This means that some words with very low probabilities can still get a chance to appear. Top-p sampling aims to exclude such very unlikely words from the output by reducing the pool below the 50 tokens from top-k when the cumulative probability of the leading tokens is already high.

How does top-p work? When choosing tokens, their probabilities are added up, from largest to smallest, until they reach the threshold percentage, p. For example, top-p with p=0.85 means that we add up the combined probabilities until they reach 85%, and then we exclude the rest, even if top-k had chosen them. For example, the 50 tokens from top-k might be reduced to 40 tokens, if the first 40 tokens have combined probabilities over 0.85 (85%). In this way, rather than a fixed pool of 50 tokens to choose from randomly, the lower-probability tokens may be excluded, giving a smaller pool of candidate output tokens.
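Here is a minimal C++ sketch of this cumulative cutoff, assuming the candidate probabilities are already sorted in descending order (the function name and types are illustrative only, not from any particular library):

    #include <cstddef>
    #include <vector>

    // Sketch: given probabilities sorted in descending order (e.g. the
    // top-k candidates), return how many survive a top-p cutoff at p.
    size_t top_p_count(const std::vector<float>& sorted_probs, float p)
    {
        float cumulative = 0.0f;
        size_t count = 0;
        for (float prob : sorted_probs) {
            cumulative += prob;
            ++count;
            if (cumulative >= p) break;  // threshold reached; cull the rest
        }
        return count;  // number of candidate tokens to keep
    }

For example, calling top_p_count with p=0.85 on the 50 sorted top-k probabilities returns the size of the reduced pool (e.g. 40), and the random selection is then made only from those tokens.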

Top-p can be used on its own by simply choosing tokens until they reach a total of 85% probability. But in some cases where there are no tokens with high probabilities, this will select too many tokens, possibly hundreds or even thousands. Hence, top-p is often combined with top-k, where top-k selects a fixed number of tokens, and then top-p is used to possibly cull some of these at the tail end of the probabilities. Top-p never culls the highest-probability tokens, only the low-probability ones.
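For illustration, a combined top-k then top-p pipeline might look like the following C++ sketch; the function name, types, and structure here are assumptions for this example, not code from any specific engine:

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Sketch: apply top-k, then top-p, to a full probability distribution.
    // Returns the surviving (token id, probability) pairs, sorted descending.
    std::vector<std::pair<int, float>>
    top_k_top_p(const std::vector<float>& probs, size_t k, float p)
    {
        // Pair each probability with its token id.
        std::vector<std::pair<int, float>> cand;
        cand.reserve(probs.size());
        for (size_t i = 0; i < probs.size(); ++i)
            cand.push_back({ static_cast<int>(i), probs[i] });

        // Top-k: keep the k highest-probability tokens, sorted descending.
        size_t kk = std::min(k, cand.size());
        std::partial_sort(cand.begin(), cand.begin() + kk, cand.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
        cand.resize(kk);

        // Top-p: cull the tail once the cumulative probability reaches p.
        float cumulative = 0.0f;
        size_t keep = 0;
        for (const auto& c : cand) {
            cumulative += c.second;
            ++keep;
            if (cumulative >= p) break;
        }
        cand.resize(keep);
        return cand;  // smaller pool to sample the next token from
    }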

Should you rescale the probabilities of the top 50 tokens before applying top-p? Well, you can, but this may not be desirable, because rescaling ensures that the 50 tokens will add up to 100%, and so top-p will always cut some of the tokens (e.g. roughly the lowest 15% of the probability mass if p=0.85). It can be simpler to track the cumulative probabilities of the top-k selected tokens without rescaling them.
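If you did want to rescale, it would look something like this hypothetical helper; note that after normalization the k probabilities sum to exactly 1.0, so a top-p threshold of 0.85 is guaranteed to drop part of the tail:

    #include <vector>

    // Sketch: rescale the top-k probabilities so they sum to 1.0.
    // After this, a top-p cutoff of 0.85 always culls some tail tokens,
    // because the full set now adds up to 100%.
    void rescale_probs(std::vector<float>& topk_probs)
    {
        float total = 0.0f;
        for (float prob : topk_probs) total += prob;
        if (total > 0.0f) {
            for (float& prob : topk_probs) prob /= total;
        }
    }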

Can you reorder the top-k selection and the top-p restriction? Sure, if you like (à la Hitchhiker's Guide to the Galaxy). It actually makes no difference to the output. In the top-p first case, you limit the selection to a set of tokens adding up to 0.85 probability (85%), which might be more or fewer than 50 tokens, and then select at most the top 50 using top-k. In the top-p second case, you first limit the selection to 50 tokens using top-k, and then add up their probabilities, culling those after the total reaches 85%. Both orderings produce the same set of tokens, because the probabilities are added up the same way, from highest to lowest (i.e. assuming you aren't rescaling the probabilities in between).
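Here is a small self-contained C++ sketch that checks this equivalence for both orderings; the helper names and example probabilities are made up for illustration:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Count surviving tokens when top-k is applied before top-p,
    // over probabilities sorted in descending order.
    size_t k_then_p(const std::vector<float>& sorted, size_t k, float p)
    {
        float cum = 0.0f;
        size_t keep = 0;
        for (size_t i = 0; i < sorted.size() && i < k; ++i) {
            cum += sorted[i];
            ++keep;
            if (cum >= p) break;   // top-p cutoff within the top-k pool
        }
        return keep;
    }

    // Count surviving tokens when top-p is applied before top-k.
    size_t p_then_k(const std::vector<float>& sorted, size_t k, float p)
    {
        float cum = 0.0f;
        size_t keep = 0;
        for (float prob : sorted) {
            cum += prob;
            ++keep;
            if (cum >= p) break;   // top-p cutoff over the full distribution
        }
        return keep < k ? keep : k;  // then top-k caps the pool at k
    }

    int main()
    {
        std::vector<float> probs = { 0.40f, 0.25f, 0.15f, 0.10f, 0.05f, 0.05f };
        assert(k_then_p(probs, 3, 0.85f) == p_then_k(probs, 3, 0.85f));  // both keep 3
        assert(k_then_p(probs, 5, 0.85f) == p_then_k(probs, 5, 0.85f));  // both keep 4
        return 0;
    }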

 
