Untokenization
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
An important practical point is that you need reverse tokenization in your AI engine. The result of inference is a sequence of token numbers to output, which must be converted back into printable text. This is not technically difficult, since each token number in the vocabulary represents a unique sequence of characters. The main point is that you still need the text strings for all 50,000 tokens, even if you've used an automaton to hide them all behind obscure numbers in your lexer.
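As a minimal sketch (my illustration, not code from any particular engine), assume the vocabulary is stored as a std::vector<std::string> indexed by token number; untokenization is then just a table lookup per token:

    #include <string>
    #include <vector>

    // Minimal detokenizer sketch: the vocabulary is assumed to be a
    // vector of strings indexed by token number, so untokenization is
    // a simple table lookup per token.
    std::string detokenize(const std::vector<int>& tokens,
                           const std::vector<std::string>& vocab)
    {
        std::string out;
        for (int id : tokens) {
            if (id >= 0 && id < (int)vocab.size()) {
                out += vocab[id];  // append this token's text
            }
            // else: out-of-range token number; skip or report an error
        }
        return out;
    }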
Another practical matter is the removal of any special non-output tokens. For example, your engine might use “end-of-input” or “unknown token” markers, and the WordPiece tokenization algorithm marks subword continuation tokens with a “##” prefix. Any of these might appear in the output stream from your decoding algorithm, and you have to decide whether to output these oddities, and if so, how.
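Here is one sketch of how that filtering might look, combined with merging of WordPiece “##” pieces; the special token numbers below are hypothetical placeholders, since every engine defines its own:

    #include <string>
    #include <vector>

    // Sketch of filtering special tokens and merging WordPiece pieces.
    const int END_OF_INPUT_ID  = 0;  // hypothetical special token number
    const int UNKNOWN_TOKEN_ID = 1;  // hypothetical special token number

    std::string detokenize_wordpiece(const std::vector<int>& tokens,
                                     const std::vector<std::string>& vocab)
    {
        std::string out;
        for (int id : tokens) {
            if (id == END_OF_INPUT_ID || id == UNKNOWN_TOKEN_ID) {
                continue;  // drop non-output tokens from the stream
            }
            const std::string& piece = vocab[id];  // assumes id is in range
            if (piece.rfind("##", 0) == 0) {
                out += piece.substr(2);  // "##" continuation: join to previous word
            } else {
                if (!out.empty()) out += ' ';  // start of a new word
                out += piece;
            }
        }
        return out;
    }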
Encoding is an issue for the output of text. Personally, I think UTF-8 is best in C++ because it's easy to work with, although it can be longer in terms of bytes. You need to ensure that the encoding of the de-tokenized text matches the encoding you want, and also that the encoding settings of your web page display match the encoding your engine is emitting.
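A simple structural validity check can catch encoding mismatches during debugging. This sketch (the function name is my own) only verifies the UTF-8 byte-sequence structure; it does not reject overlong encodings or surrogate code points, which is usually acceptable for a sanity check:

    #include <string>

    // Minimal structural UTF-8 check: verifies lead and continuation
    // byte patterns only.
    bool is_valid_utf8(const std::string& s)
    {
        size_t i = 0;
        while (i < s.size()) {
            unsigned char c = (unsigned char)s[i];
            size_t extra;
            if (c < 0x80)                extra = 0;  // 1-byte ASCII
            else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
            else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
            else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
            else return false;                       // invalid lead byte
            if (i + extra >= s.size()) return false; // truncated sequence
            for (size_t j = 1; j <= extra; j++) {
                if (((unsigned char)s[i + j] & 0xC0) != 0x80)
                    return false;                    // bad continuation byte
            }
            i += extra + 1;
        }
        return true;
    }

If the output is displayed in a web page, the page's declared charset (e.g., a UTF-8 meta tag in the HTML) must also match what your engine emits.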
Image models have a different issue in de-tokenization: what does each token represent? A single pixel, or more commonly an image patch? This is a practical coding matter in terms of emitting the image data and also adding the header and formatting bytes for the chosen image file format.