Aussie AI
Tokenizer Algorithms
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Some of the earlier tokenizer algorithms included:
- Single characters/bytes. With byte-level tokenization there are only 256 possible tokens, and this was used only in early neural network models. Maybe we'd still be doing this if it weren't for GPUs. (A minimal byte-level sketch appears after this list.)
- Whole words. This is like the old programming language tokenizers: it doesn't produce part-word tokens, and it isn't commonly used in modern AI models.
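
As an illustration of the byte-level approach, here is a minimal C++ sketch; the function name tokenize_bytes is hypothetical rather than any particular library's API. The point is simply that every input byte maps directly to a token ID, so the vocabulary is fixed at 256 entries.

// Byte-level tokenization: every byte of the input becomes one token ID (0..255).
#include <iostream>
#include <string>
#include <vector>

std::vector<int> tokenize_bytes(const std::string& text) {
    std::vector<int> tokens;
    tokens.reserve(text.size());
    for (unsigned char c : text)
        tokens.push_back(static_cast<int>(c));  // token ID == byte value
    return tokens;
}

int main() {
    for (int id : tokenize_bytes("Hello"))
        std::cout << id << ' ';   // prints: 72 101 108 108 111
    std::cout << '\n';
    return 0;
}
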
The more up-to-date tokenizer algorithms used in AI models include:
- Byte-Pair Encoding (BPE), from Gage (1994), was originally a data compression technique and is now a longstanding tokenization method in neural networks. (A sketch of a single BPE merge step appears after this list.)
- WordPiece, from Wu et al. (2016), is like whole words, but it falls back to a greedy subword tokenization algorithm for whole words that are not in the vocabulary. (A greedy longest-match sketch appears after this list.) Google has open-sourced the code.
- SentencePiece, from Kudo and Richardson (2018), with an open-source codebase from Google; this is the tokenizer used by Llama.
- Unigram (Kudo, 2018). This is another method that supports part-word tokens.
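
The core of BPE training is easy to sketch: count every adjacent pair of tokens, merge the most frequent pair into a new token, and repeat. The toy C++ example below performs a single merge step; the helper names and the starting character sequence are invented for illustration, and a real tokenizer would repeat this over a large corpus until the vocabulary reaches its target size.

// One merge step of Byte-Pair Encoding (BPE) on a toy character sequence.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count how often each adjacent pair of tokens occurs.
std::map<std::pair<std::string, std::string>, int>
count_pairs(const std::vector<std::string>& tokens) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (size_t i = 0; i + 1 < tokens.size(); ++i)
        ++counts[{tokens[i], tokens[i + 1]}];
    return counts;
}

// Replace every occurrence of the chosen pair with a single merged token.
std::vector<std::string> merge_pair(const std::vector<std::string>& tokens,
                                    const std::pair<std::string, std::string>& best) {
    std::vector<std::string> out;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i] == best.first && tokens[i + 1] == best.second) {
            out.push_back(best.first + best.second);
            ++i;  // skip the second half of the merged pair
        } else {
            out.push_back(tokens[i]);
        }
    }
    return out;
}

int main() {
    // BPE starts from single characters.
    std::vector<std::string> toks = {"l", "o", "w", "l", "o", "w", "e", "r"};
    auto counts = count_pairs(toks);
    auto best = counts.begin()->first;
    int best_count = 0;
    for (const auto& kv : counts) {
        if (kv.second > best_count) { best = kv.first; best_count = kv.second; }
    }
    toks = merge_pair(toks, best);   // ("l","o") occurs twice, so "lo" is created
    for (const auto& t : toks) std::cout << t << ' ';  // prints: lo w lo w e r
    std::cout << '\n';
    return 0;
}
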
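
At inference time, subword tokenizers typically split an out-of-vocabulary word by greedy longest-match against the vocabulary. Here is a minimal C++ sketch of that idea in the style of WordPiece, using its "##" continuation-marker convention; the tiny vocabulary and the function name wordpiece_split are invented for illustration and are not taken from Google's released code.

// Greedy longest-match subword splitting (WordPiece-style) with a toy vocabulary.
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> wordpiece_split(const std::string& word,
                                         const std::set<std::string>& vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        std::string piece;
        bool found = false;
        // Try the longest remaining substring first, then shrink until a vocabulary hit.
        while (end > start) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) candidate = "##" + candidate;  // mark a continuation piece
            if (vocab.count(candidate)) { piece = candidate; found = true; break; }
            --end;
        }
        if (!found) return {"[UNK]"};   // no piece matched: the whole word is unknown
        pieces.push_back(piece);
        start = end;
    }
    return pieces;
}

int main() {
    std::set<std::string> vocab = {"token", "##izer", "##ize", "##s"};
    for (const auto& p : wordpiece_split("tokenizers", vocab))
        std::cout << p << ' ';   // prints: token ##izer ##s
    std::cout << '\n';
    return 0;
}
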