Aussie AI
Tokenization and Inference Latency
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Tokenization and Inference Latency
The tokenizer affects the latency (speed) of inference of a model in several ways. Firstly, the tokenization algorithm determines the vocabulary size. A larger vocabulary only indirectly affects the main model weights in the layers, because models use “embeddings” rather than raw token vectors as a tensor dimension. However, a large vocabulary does directly increase the size of the embedding matrix, which is a set of weights used before all of those layers.
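As a rough illustration (not taken from the book), here is a small C++ sketch of how the embedding matrix footprint grows with vocabulary size. The embedding dimension, weight precision, and the sample vocabulary sizes are all assumptions chosen for the example.

    // Sketch: embedding matrix memory as a function of vocabulary size.
    // All sizes below are assumptions for illustration only.
    #include <cstdio>

    int main() {
        const long long embedding_dim = 4096;   // assumed model dimension
        const long long bytes_per_weight = 2;   // e.g. FP16 weights
        const long long vocab_sizes[] = { 32000, 50257, 100000, 256000 };

        for (long long vocab : vocab_sizes) {
            // Embedding matrix is vocab_size x embedding_dim weights.
            long long bytes = vocab * embedding_dim * bytes_per_weight;
            printf("Vocab %6lld -> embedding matrix %.2f MB\n",
                   vocab, bytes / (1024.0 * 1024.0));
        }
        return 0;
    }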
Hence, the tokenization algorithm has both a direct and an indirect impact on overall inference latency. If the tokenizer allows longer tokens, then there are more unique tokens, and the vocabulary is larger. For example, GPT has a vocabulary of around 50,000 tokens (words or subwords), whereas there are over 100,000 words in the English language, although not all are in common usage.
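To see why longer token units imply a larger vocabulary, here is a toy C++ sketch (not from the book) that counts the unique single-character tokens versus the unique whole-word tokens in a sample string; the sample text is invented for the example.

    // Sketch: longer token units mean more distinct tokens.
    #include <cstdio>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        std::string text = "the cat sat on the mat and the dog sat on the rug";

        // Character-level tokens: tiny vocabulary.
        std::set<char> unique_chars(text.begin(), text.end());

        // Word-level tokens: many more distinct entries.
        std::set<std::string> unique_words;
        std::istringstream iss(text);
        std::string word;
        while (iss >> word) unique_words.insert(word);

        printf("Unique character tokens: %zu\n", unique_chars.size());
        printf("Unique word tokens:      %zu\n", unique_words.size());
        return 0;
    }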
Secondly, the tokenization method affects the ratio of words to tokens, which determines the token sequence length for a given input text. A longer sequence of tokens generated from the prompt text will cause longer latency in the inference phase. The transformer attention algorithm is known to be quadratic in input length, so having fewer tokens reduces the overall processing time. Furthermore, an input with fewer tokens also helps reduce the cost of multiple model executions that arise from the “autoregression” problem, which is another LLM bottleneck.
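The quadratic scaling can be illustrated with a back-of-the-envelope C++ sketch (not from the book). The tokens-per-word ratios below are assumed values, not measurements, and the attention cost is modeled simply as proportional to the square of the token count.

    // Sketch: why a lower token count matters for attention cost.
    #include <cstdio>

    int main() {
        const double words_in_prompt = 1000.0;
        // Assumed tokenizer ratios: tokens produced per word of input text.
        const double tokens_per_word_subword = 1.3;  // subword tokenizer (e.g. BPE)
        const double tokens_per_word_word = 1.0;     // word-level tokenizer

        double n_subword = words_in_prompt * tokens_per_word_subword;
        double n_word = words_in_prompt * tokens_per_word_word;

        // Relative attention cost scales quadratically with sequence length.
        printf("Subword tokens: %.0f, relative attention cost: %.0f\n",
               n_subword, n_subword * n_subword);
        printf("Word tokens:    %.0f, relative attention cost: %.0f\n",
               n_word, n_word * n_word);
        printf("Cost ratio (subword/word): %.2f\n",
               (n_subword * n_subword) / (n_word * n_word));
        return 0;
    }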
Therefore, a tokenizer that uses whole words (longer tokens) rather than subwords (shorter tokens) will increase the vocabulary size (increasing latency) but reduce the input sequence length (reducing latency). So, the tokenization algorithm and the resulting vocabulary size introduce an interesting performance trade-off.
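The trade-off can be made concrete with a toy cost model, sketched below in C++ (again, not from the book). It treats the per-token output projection over the vocabulary as growing with vocabulary size V, and attention as growing quadratically with token count n; the vocabulary sizes, tokens-per-word ratios, and embedding dimension are all assumptions, and the counts are rough operation counts rather than real latency figures.

    // Sketch: toy cost model for the vocabulary-size vs. sequence-length trade-off.
    #include <cstdio>

    struct TokenizerChoice {
        const char* name;
        double vocab_size;       // V: number of tokens in the vocabulary
        double tokens_per_word;  // how many tokens one word of text becomes
    };

    int main() {
        const double words = 1000.0;          // assumed prompt length in words
        const double embedding_dim = 4096.0;  // assumed model dimension

        TokenizerChoice choices[] = {
            { "subword (small vocab)",    32000.0,  1.3 },
            { "word-level (large vocab)", 100000.0, 1.0 },
        };

        for (const TokenizerChoice& t : choices) {
            double n = words * t.tokens_per_word;
            // Attention: roughly n^2 * d operations.
            double attention_cost = n * n * embedding_dim;
            // Output logits over the vocabulary: roughly n * d * V operations.
            double unembedding_cost = n * embedding_dim * t.vocab_size;
            printf("%-26s tokens=%6.0f attention=%14.0f unembedding=%16.0f\n",
                   t.name, n, attention_cost, unembedding_cost);
        }
        return 0;
    }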