Aussie AI
Tokenizer Algorithms
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Some of the earlier tokenizer algorithms included:
- Single characters/bytes. With byte-level tokenization there are only 256 possible tokens, and this was used only in early neural network models. Maybe we'd still be doing this if it weren't for GPUs. (A minimal byte-level sketch appears after this list.)
- Whole words. This is like the old programming language tokenizers: it doesn't produce part-word tokens, and it isn't commonly used in modern AI models.
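
As an illustration of the byte-level approach, here is a minimal C++ sketch; the function name tokenize_bytes is hypothetical rather than any particular library's API. The point is simply that every input byte maps directly to a token ID, so the vocabulary is fixed at 256 entries.

// Byte-level tokenization: every byte of the input becomes one token ID (0..255).
#include <iostream>
#include <string>
#include <vector>

std::vector<int> tokenize_bytes(const std::string& text) {
    std::vector<int> tokens;
    tokens.reserve(text.size());
    for (unsigned char c : text)
        tokens.push_back(static_cast<int>(c));  // token ID == byte value
    return tokens;
}

int main() {
    for (int id : tokenize_bytes("Hello"))
        std::cout << id << ' ';   // prints: 72 101 108 108 111
    std::cout << '\n';
    return 0;
}
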
The more up-to-date tokenizer algorithms used in AI models include:
- Byte-Pair Encoding (BPE), from Gage (1994), was originally a data compression technique and is now a longstanding tokenization method in neural networks. (A sketch of a single BPE merge step appears after this list.)
- WordPiece, from Wu et al. (2016), is like whole words, but it falls back to a greedy subword tokenization algorithm for whole words that are not in the vocabulary. (A greedy longest-match sketch appears after this list.) Google has open-sourced the code.
- SentencePiece, from Kudo and Richardson (2018), with an open-source codebase from Google; this is the tokenizer used by Llama.
- Unigram (Kudo, 2018). This is another method that supports part-word tokens.
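
The core of BPE training is easy to sketch: count every adjacent pair of tokens, merge the most frequent pair into a new token, and repeat. The toy C++ example below performs a single merge step; the helper names and the starting character sequence are invented for illustration, and a real tokenizer would repeat this over a large corpus until the vocabulary reaches its target size.

// One merge step of Byte-Pair Encoding (BPE) on a toy character sequence.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count how often each adjacent pair of tokens occurs.
std::map<std::pair<std::string, std::string>, int>
count_pairs(const std::vector<std::string>& tokens) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (size_t i = 0; i + 1 < tokens.size(); ++i)
        ++counts[{tokens[i], tokens[i + 1]}];
    return counts;
}

// Replace every occurrence of the chosen pair with a single merged token.
std::vector<std::string> merge_pair(const std::vector<std::string>& tokens,
                                    const std::pair<std::string, std::string>& best) {
    std::vector<std::string> out;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i] == best.first && tokens[i + 1] == best.second) {
            out.push_back(best.first + best.second);
            ++i;  // skip the second half of the merged pair
        } else {
            out.push_back(tokens[i]);
        }
    }
    return out;
}

int main() {
    // BPE starts from single characters.
    std::vector<std::string> toks = {"l", "o", "w", "l", "o", "w", "e", "r"};
    auto counts = count_pairs(toks);
    auto best = counts.begin()->first;
    int best_count = 0;
    for (const auto& kv : counts) {
        if (kv.second > best_count) { best = kv.first; best_count = kv.second; }
    }
    toks = merge_pair(toks, best);   // ("l","o") occurs twice, so "lo" is created
    for (const auto& t : toks) std::cout << t << ' ';  // prints: lo w lo w e r
    std::cout << '\n';
    return 0;
}
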
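
At inference time, subword tokenizers typically split an out-of-vocabulary word by greedy longest-match against the vocabulary. Here is a minimal C++ sketch of that idea in the style of WordPiece, using its "##" continuation-marker convention; the tiny vocabulary and the function name wordpiece_split are invented for illustration and are not taken from Google's released code.

// Greedy longest-match subword splitting (WordPiece-style) with a toy vocabulary.
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> wordpiece_split(const std::string& word,
                                         const std::set<std::string>& vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        std::string piece;
        bool found = false;
        // Try the longest remaining substring first, then shrink until a vocabulary hit.
        while (end > start) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) candidate = "##" + candidate;  // mark a continuation piece
            if (vocab.count(candidate)) { piece = candidate; found = true; break; }
            --end;
        }
        if (!found) return {"[UNK]"};   // no piece matched: the whole word is unknown
        pieces.push_back(piece);
        start = end;
    }
    return pieces;
}

int main() {
    std::set<std::string> vocab = {"token", "##izer", "##ize", "##s"};
    for (const auto& p : wordpiece_split("tokenizers", vocab))
        std::cout << p << ' ';   // prints: token ##izer ##s
    std::cout << '\n';
    return 0;
}
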