Aussie AI
Tokens and Non-Autoregression
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Although much of the research into removing autoregression involves major surgery on the LLM architecture, there's a simpler way to mitigate the inefficiency: bigger tokens.
If the tokens are longer, then fewer are emitted for each piece of work done by the AI engine. So the model runs faster, in the sense of needing fewer decoding iterations, if the tokenizer chooses whole words rather than sub-words, or even handles common two-word phrases as single tokens (i.e., multi-word tokens); a small sketch of this idea appears below. Longer tokens therefore reduce the inefficiency of autoregression, and they also shorten the input sequence, which further reduces model execution time, since the Transformer's attention algorithm is well-known to be quadratic in the length of the input sequence.
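To make the idea concrete, here is a minimal C++ sketch of a greedy tokenizer that prefers two-word tokens over single-word tokens. The vocabulary, token IDs, and phrase list are invented for illustration only; real tokenizers (e.g., BPE) are far more elaborate, but the point is that the same text becomes fewer tokens, and hence fewer decoding steps.

    // Sketch only: greedy longest-match tokenizer that prefers
    // hypothetical multi-word tokens over single-word tokens.
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Toy vocabulary: common two-word phrases map to a single token ID.
    static const std::unordered_map<std::string, int> g_vocab = {
        {"machine learning", 1001}, {"neural network", 1002},
        {"machine", 17}, {"learning", 18}, {"neural", 19}, {"network", 20},
        {"a", 1}, {"is", 2}, {"model", 21},
    };

    // Greedy match: try the two-word phrase first, then the single word.
    std::vector<int> tokenize(const std::vector<std::string>& words) {
        std::vector<int> tokens;
        for (size_t i = 0; i < words.size(); ) {
            if (i + 1 < words.size()) {
                auto it = g_vocab.find(words[i] + " " + words[i + 1]);
                if (it != g_vocab.end()) {
                    tokens.push_back(it->second);
                    i += 2;   // consumed a two-word token
                    continue;
                }
            }
            auto it = g_vocab.find(words[i]);
            tokens.push_back(it != g_vocab.end() ? it->second : 0);  // 0 = <unk>
            ++i;
        }
        return tokens;
    }

    int main() {
        std::vector<std::string> words =
            {"a", "neural", "network", "is", "a", "machine", "learning", "model"};
        std::vector<int> toks = tokenize(words);
        std::cout << "Words: " << words.size()
                  << ", tokens: " << toks.size() << "\n";
        // Prints 8 words but only 6 tokens -- fewer autoregressive iterations.
    }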
The downside of longer tokens is that there are more unique tokens, which increases the vocabulary size. The model's size is partly dependent on the vocabulary size, so this increase from longer tokens makes the whole model larger, and it runs slower. However, it's not that clear cut, because most of the model's internal weights depend on the embedding dimension rather than the vocabulary count; only the embedding matrix and the final output (unembedding) layer scale with the vocabulary size, so not all components are directly affected by a larger vocabulary.
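A rough back-of-the-envelope calculation shows why. The formulas below are simplified assumptions (a standard-shaped Transformer with a 4x feed-forward expansion), not the parameter counts of any particular model, but they illustrate that doubling the vocabulary only doubles the embedding matrices, while the per-layer attention and feed-forward weights are unchanged.

    // Approximate parameter counts, under assumed formulas for illustration.
    #include <cstdio>

    long long embedding_params(long long vocab, long long dim) {
        return 2 * vocab * dim;   // input embedding + output unembedding
    }

    long long layer_params(long long dim, long long layers) {
        // Per layer: ~4*d^2 for attention (Q,K,V,O) + ~8*d^2 for a 4x FFN.
        return layers * (4 * dim * dim + 8 * dim * dim);
    }

    int main() {
        long long dim = 1024, layers = 24;   // hypothetical model shape
        for (long long vocab : {50000LL, 100000LL}) {
            std::printf("vocab=%lld: embeddings=%lldM, other layers=%lldM\n",
                        vocab,
                        embedding_params(vocab, dim) / 1000000,
                        layer_params(dim, layers) / 1000000);
        }
        // Doubling the vocabulary doubles only the embedding matrices;
        // the attention and FFN weights stay the same size.
    }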
Therefore, longer tokens reduce latency by easing the autoregression issue, but increase latency by making the model larger overall. Maybe there's some happy trade-off here? Most current models seem to use a vocabulary of around 50,000 tokens, with the vocabulary size being one of the meta-parameters of the model.