Aussie AI
27. Tokenizer and Vocabulary
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“Words are like weapons; they wound sometimes.”
— Cher, If I Could Turn Back Time, 1989.
What is Tokenization?
C++ programmers are already familiar with tokenization, because that's what your compiler does as its first step. However, the tokenization of a programming language in a parser handles around 50 reserved keywords, whereas an AI tokenizer has to detect 50,000 distinct words in the “vocabulary” of the model. Like much in AI, the concepts are the same, but the scale is greater.
The tokenizer does not receive as much attention in the research literature as other parts of large language models. This is probably because the tokenization phase itself is not a bottleneck in either inference or training, when compared to the many layers of multiplication operations on weights. However, the choice of the tokenizer algorithm, and the resulting size of the vocabulary, has a direct impact on the speed (latency) of model inference, and this issue really warrants greater research attention.
Also, the tokenizer cannot be changed without re-training the model, so choosing the tokenization algorithm is a very important model design decision. The same tokenizer algorithm must be used in both training and inference, and the vocabulary is also fixed. There is an optimization that removes tokens during inference, called “token pruning,” but it cannot be applied to all use cases. You cannot add new tokens to a model except by re-creating the entire model. Well, who doesn't like a good rewrite?
Tokenization and Inference Latency
The tokenizer affects the latency (speed) of inference of a model in several ways. Firstly, the tokenization algorithm decides the vocabulary size. A larger vocabulary size only indirectly affects the main model weights in the layers, because models use “embeddings” rather than token vectors as a tensor dimension. However, a large vocabulary does directly impact the size of the embeddings matrix, which is a set of weights used before all those layers.
Hence, the tokenization algorithm has both a direct and indirect impact on the overall inference latency. If the tokenizer allows longer tokens, then there are more unique tokens, and the vocabulary is larger. For example, GPT has a vocabulary around 50,000 words (or subwords), but there are over 100,000 words in the English language, although they're not all in common usage.
Secondly, the tokenization method affects the ratio of words to tokens, which affects token sequence length for the input text. A longer sequence of tokens generated from the prompt text will cause longer latency in the inference phase. The transformer attention algorithm is known to be quadratic in input length, so having fewer tokens reduces the overall processing time. Furthermore, an input with fewer tokens also helps reduce the cost of multiple model executions that arise from the “autoregression” problem, which is another LLM bottleneck.
Therefore, a tokenizer that uses individual words (longer) rather than subwords (shorter) will increase vocabulary size (increasing latency) but reduce input sequence length (reducing latency). So, the tokenization algorithm and the resulting size of the vocabulary introduce an interesting performance trade-off.
Tokenizer Design Issues
If you think you can whip up a tokenizer in an afternoon, here's some news. Some of the problematic design issues affecting tokenizers include:
- Numbers. There are an infinite number of numbers. Some tokenizers simply treat each digit as a separate token, whereas another approach is to treat common numbers (e.g. 100) as their own tokens, and use digits as tokens for other longer numbers.
- Capitalization. The tokenizer usually needs to distinguish capital letters, as it would otherwise make grammar errors with capitalization. But the need to represent both cases of letters increases the size of the vocabulary.
- Spaces. How should spaces be tokenized? One approach is that a space is its own token, separate from tokens for words (or numbers or punctuation). Another approach is that a space may be part of a subword sequence. For example, it is common for an LLM to have tokens consisting of a space and a prefix. And note that not all written languages use spaces to separate words like English does.
- Multiple Spaces. Should each space get its own token, or should groups of spaces be a token? For example, using multi-space tokens is helpful in understanding the meaning of indentation in Python code. Spaces also have a meaning in terms of layout in other contexts, such as tabular listings of data.
- Hyphenated words. Should these be one token or multiple?
- Contraction words. How should contractions with an apostrophe (e.g. “isn't”) be tokenized?
- Punctuation characters and sequences. Usually, a tokenizer will treat each punctuation character as its own single-byte token. But there are various multi-character punctuation sequences that could be their own tokens.
- Encoding. Will the input sequence be in Latin1 or UTF8 encoding? Or various others like double-byte Unicode. The model will become confused if it was trained on one encoding, but receives input tokenized from another encoding during inference.
- UTF8 and Unicode characters. The vast number of standard byte encodings for obscure symbols makes life difficult for tokenizers. One approach is to ignore this, and simply have a token for each byte that's not part of a known word or other token (i.e. a maximum of 255 of these byte-level tokens). But wouldn't you want your model to know the difference between a love heart and a poop emoji?
- Double-byte character set languages. Several languages such as Chinese and Japanese have a large number of distinct symbols, which increases the size of the tokenizer and its vocabulary.
- European language letters. Even the relatively simple ASCII extensions with values 128..255 to support European letters need to be handled correctly. Note that there are actually more such letters than a single-byte encoding can hold, so a multi-byte encoding such as UTF8 is probably desirable. However, if using UTF8, should the common European letters get their own tokens for their byte-pairs or byte-triples?
- Escape codes. There are various non-printable control characters defined by ASCII. Some encodings have meanings for these, but in other encodings they are undefined. Examples are ASCII byte code 127 (delete) and various bytes in the range 1-31.
- Encoding errors. If using UTF8, there are various byte sequences that are errors that don't properly encode any Unicode number. What should the tokenizer do in this case?
- Null bytes. Should the tokenizer allow zero as a token? This is mainly relevant to binary file tokenization and Unicode encoding formats. Usually the answer is “no,” since using UTF8 rather than a wider Unicode encoding (UTF-16 or UTF-32) avoids this issue.
- Computer programming language tokens. Each major programming language has its own specific set of tokens. Should the LLM tokenizer use these tokens or not? To what extent do you want it to understand code?
- HTML sequences. Should the tokenizer have separate individual tokens for multi-character HTML-specific tokens (even in Latin1 encoding), such as HTML macros (e.g. “<b>” for bold) and HTML entities (e.g. “&mdash;” for em dash)?
- Unknown tokens. The tokenizer must not produce any tokens that the model doesn't know. Otherwise, there's unpredictable behavior during model inference.
- Rare words. How should an unknown word be tokenized? By subwords or syllables? By letters?
- End-of-input token. The tokenizer needs a way to identify the end of the input stream. This can be implemented via a unique marker token, although there are other ways.
- Semantic tokenization and parts of speech. Should the tokenizer attempt to discern some meaning from the words? For example, should it try to distinguish the same word as a different part of speech, such as a different token for a word as a noun or a verb? Or should the tokenizer leave that issue for the model to decide? This is a newer area of research.
Tokenizer Algorithms
Some of the earlier tokenizer algorithms included:
- Single characters/bytes. There are only 256 tokens, and this approach was used in early neural network models. Maybe we'd still be doing this if it weren't for GPUs.
- Whole words. This is like the old programming language tokenizers; it doesn't do part-word tokens, and it isn't commonly used in AI.
The more up-to-date tokenizer algorithms used in AI models include:
- Byte-Pair Encoding (BPE), from Gage (1994), is a longstanding method in neural networks.
- WordPiece, from Wu et al. (2016), is like whole words, but has a greedy algorithm that uses subword tokenization only for unknown whole words. Google has open-sourced the code.
- SentencePiece, from Kudo and Richardson (2018), with an open-source codebase from Google, as used by Llama.
- Unigram (Kudo, 2018). This is another method that supports part-word tokens.
Tokenizer Optimizations
There's not a lot to do in optimizing Byte-Pair Encoding or single character tokenization, because they're already fast. However, if you're using WordPiece or SentencePiece, then there are many tokens that are whole words or part-words. And a common vocabulary size is about 50,000 tokens, so how do you optimize the tokenizer to detect 50,000 different words? And it's also a little tricky with part-words, in terms of detecting the end of a partial word.
Fortunately, we've seen this all before. It's called “tokenization” (surprise!), and programming language compilers have been doing it for decades, since before the days of venerable K&R C programming. Let's have a look at some of the techniques:
Hash tables. The idea of hash tables is to look up the words by hashing up their letters into a number that's an offset into a hash table. It's very fast and very well-known. You could even try to get “perfect hashing” running with 50,000 tokens, or at least you could get to near-perfect hashing. Unfortunately, the whole strategy of hashing is based on whole words where you can detect the end of the word using some logic, and it doesn't work too well with partial word tokens, or indeed with tokens that are a punctuation mark and then a word (e.g. space and a word prefix combined into one token). One approach would be to look up a whole word first, and only if it's unknown, then use the other methods. Overall, that plan is problematic and the hash table idea isn't effective.
Tries. The “trie” is one data structure that works well for prefixes of words, part-words, or punctuation-then-letters tokens. It also works fine for whole words, or even whole words with an extra punctuation mark tacked on the end. So, let's leave that on the list of candidates for optimizing tokenizers.
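For illustration, here's a minimal C++ sketch of a trie with 256-way branching and a longest-match lookup, which is the core operation a subword tokenizer needs. The class and member names are illustrative, not from any particular library:

    // Minimal trie sketch for longest-match token lookup (illustrative only).
    #include <memory>
    #include <string>

    struct TrieNode {
        int token_id = -1;                     // -1 means no token ends at this node
        std::unique_ptr<TrieNode> child[256];  // one slot per possible next byte
    };

    class TokenTrie {
    public:
        // Add one vocabulary entry (the token's text and its numeric id).
        void insert(const std::string& token_text, int token_id) {
            TrieNode* node = &root_;
            for (unsigned char c : token_text) {
                if (!node->child[c]) node->child[c] = std::make_unique<TrieNode>();
                node = node->child[c].get();
            }
            node->token_id = token_id;
        }

        // Longest-match lookup starting at position 'pos' in the input.
        // Returns the token id and advances 'pos' past the matched bytes,
        // or returns -1 (leaving 'pos' unchanged) if nothing matches.
        int longest_match(const std::string& input, size_t& pos) const {
            const TrieNode* node = &root_;
            int best_id = -1;
            size_t best_end = pos;
            for (size_t i = pos; i < input.size(); ++i) {
                node = node->child[static_cast<unsigned char>(input[i])].get();
                if (!node) break;
                if (node->token_id != -1) {    // a token ends at this node
                    best_id = node->token_id;
                    best_end = i + 1;
                }
            }
            if (best_id != -1) pos = best_end;
            return best_id;
        }

    private:
        TrieNode root_;
    };

A tokenizer loop would call longest_match repeatedly, falling back to a single-byte token whenever it returns -1 (the byte-level fallback for rare words and unknown symbols discussed earlier).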
Automata and Finite State Machines. What are these? An automaton or “finite state automaton” is an advanced data structure that performs tokenization using a “finite state machine.” The idea is that the set of tokens is finite, and we can specify them all as patterns of characters.
A finite state machine has “states” that are numbers, and it has “transitions” between the states that represent an input letter. There are a large number of states, which can represent being at the start, middle, or end of a token. The “transitions” are like pointer arcs between the states, specifying which state to move to for each possible next character, so there are 256 transitions from every state. The 256 transitions represent the 255 possible input characters plus the null byte, which represents end-of-input.
Conceptually, a finite state machine is a lot like a “trie” data structure, where the states are analogous to the trie nodes, and the transitions are like pointers between the trie nodes. A trie has an array of 256 pointers, one for each possible next character, and similarly an automaton has 256 numbers that are the new states for each character from any given state. A finite state machine uses numbers, whereas a trie uses allocated nodes.
To implement a finite state machine, the main data structure is a massive two-dimensional array of numbers. One dimension is the “states,” and this dimension is much larger than the 50,000-token vocabulary because the states also represent many mid-token positions. The other dimension is the value of the input character (i.e. the dimension size is 256). The data stored in the array is the next state for the automaton if we're at the current state and the next input letter is that character.
The automaton works by beginning at its start state, and spinning through the list of input characters, one at a time, repeatedly changing to the new state matching the transition (i.e. for each input character). This continues until it reaches a “leaf” state, which means we're at the end of a token, so we emit the token at the state into the token stream, and restart (i.e. go back to the starting state).
This sounds awfully complicated, but there have been tools available to build these automata for decades, of which the most famous is Lex. This is the name of the original Unix tool of “Lex & Yacc” fame, but it has been superseded by the open-source Flex version, which stands for “fast Lex.” The idea of Lex or Flex is a “lexer-generator” tool that takes a set of input token strings and builds a finite state machine for you as C++ source code, which you can compile and use as a lexer (i.e. tokenizer). And rather than building a big matrix of state numbers, Flex can also unroll the massive table of states and transitions into nested switch statements. The result is completely unreadable C++ code, but a blazingly fast tokenizer.
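To make the table-driven approach concrete, here's a minimal sketch of the automaton walk in C++. It assumes the transition table and the token-emission table have already been built offline from the vocabulary, and that single bytes map to the first 256 token ids as a fallback; all names here are illustrative:

    // Table-driven finite state machine tokenizer sketch (illustrative only).
    //   next_state[s][c] -- new state from state s on input byte c (-1 = no transition)
    //   emit_token[s]    -- token id if a token ends at state s, else -1
    #include <array>
    #include <string>
    #include <vector>

    std::vector<int> fsm_tokenize(const std::string& input,
                                  const std::vector<std::array<int, 256>>& next_state,
                                  const std::vector<int>& emit_token,
                                  int start_state)
    {
        std::vector<int> tokens;
        size_t token_start = 0;   // where the current token attempt began
        size_t i = 0;
        int state = start_state;
        int last_token = -1;      // longest complete token found on this attempt
        size_t last_end = 0;      // input position just after that token
        while (i < input.size()) {
            int next = next_state[state][static_cast<unsigned char>(input[i])];
            if (next != -1) {
                state = next;
                ++i;
                if (emit_token[state] != -1) {   // a "leaf" state: a token ends here
                    last_token = emit_token[state];
                    last_end = i;
                }
                continue;
            }
            // No transition: emit the longest token matched so far and restart,
            // or fall back to a single-byte token (assumes byte-value token ids).
            if (last_token != -1) {
                tokens.push_back(last_token);
                token_start = last_end;
            } else {
                tokens.push_back(static_cast<unsigned char>(input[token_start]));
                ++token_start;
            }
            i = token_start;
            state = start_state;
            last_token = -1;
        }
        if (last_token != -1) tokens.push_back(last_token);
        // (Any trailing unmatched bytes would also need the byte fallback.)
        return tokens;
    }

A generated lexer such as the Flex output implements essentially this same loop, just with the tables (or the equivalent switch statements) built for you from the token patterns.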
Untokenization
An important practical point is that you need reverse tokenization in your AI engine. The result of inference is a sequence of tokens to output, which must be converted back from token numbers to printable text. This is not very technically difficult, since each token number in the vocabulary represents a unique sequence of characters. The main point is that you still need the text strings for all 50,000 tokens, even if you've used an automaton to hide them all behind obscure numbers in your lexer.
Another practical matter is the removal of any special non-output tokens. For example, your engine might use “end-of-input” or “unknown token” tokens, and the WordPiece tokenization algorithm has special “##” tokens. Any of these might appear in the output stream from your decoding algorithm, and you have to decide whether to output these oddities, and if so, how.
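Putting those two points together, de-tokenization can be as simple as concatenating the stored text of each output token while skipping the special control tokens. This is only an illustrative sketch, assuming the vocabulary strings are indexed by token id:

    // Minimal de-tokenizer sketch: map output token ids back to text.
    #include <string>
    #include <unordered_set>
    #include <vector>

    std::string detokenize(const std::vector<int>& token_ids,
                           const std::vector<std::string>& vocab,          // text of each token, by id
                           const std::unordered_set<int>& special_tokens)  // end-of-input, unknown, etc.
    {
        std::string text;
        for (int id : token_ids) {
            if (special_tokens.count(id)) continue;   // don't print control tokens
            if (id >= 0 && id < static_cast<int>(vocab.size()))
                text += vocab[id];                    // each id maps to a unique byte sequence
        }
        return text;
    }

Real tokenizers also need a little extra logic here, such as stripping WordPiece's “##” continuation markers before concatenating.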
Encoding is an issue for the output of text. Personally, I think UTF8 is best in C++ because it's easy to work with, although it can be longer in terms of bytes. You need to ensure that the encoding of the de-tokenized tokens matches the encoding you want, and also that the encoding settings in your web page display match the encoding your engine is emitting.
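If your engine works with Unicode code points internally, emitting UTF8 output is a small bit-packing exercise. Here's a minimal sketch of the standard encoding (it doesn't reject the surrogate range, which a strict encoder should):

    // Encode a single Unicode code point as UTF8 bytes (standard bit layout).
    #include <cstdint>
    #include <string>

    std::string utf8_encode(uint32_t cp) {
        std::string out;
        if (cp <= 0x7F) {                                  // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp <= 0x7FF) {                          // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {                         // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {                       // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;   // invalid code points above 0x10FFFF come back empty here
    }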
Image models have a different issue in de-tokenization. What does each token represent? A pixel? This is a practical coding matter in terms of emitting the image and also adding the extra formatting bytes for the chosen image file format.
What are Embeddings?
We've spent all this time talking about tokens, and yet, your Transformer doesn't even use them! Your poor starving tensors never get to taste a single token.
The first step in model inference in the Transformer architecture is to convert an input sequence into numbers called tokens. However, these tokens are not used internally to the model, because the next step of Transformer inference is to immediately convert this sequence of tokens into another internal representation called an “embedding”. An embedding is a vector of numbers that represents the information about the token sequence in very complex ways.
Note that the “embeddings” terminology is unrelated to “embedded” devices such as mobile phones or IoT edge devices. It's simply a different usage of the word.
The mapping from tokens to embeddings is actually learned during model training. The conversion of token vectors into embedding vectors is based on a single matrix multiplication using these learned embedding weights, with an additional step that adds “positional embeddings.” In the Transformer architecture, the vector from the matrix of learned embeddings is combined with the heuristic positional embeddings simply by vector addition.
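Since multiplying the embedding matrix by a one-hot token vector just selects one row, the whole step can be sketched as a row lookup plus a vector addition. This is only an illustrative sketch with assumed names and a simple vector-of-vectors layout, not an optimized implementation:

    // Token ids to embeddings: row lookup (equivalent to the one-hot matrix
    // multiply) plus element-wise addition of the positional embeddings.
    #include <vector>

    using Vec = std::vector<float>;
    using Matrix = std::vector<Vec>;   // [rows][columns]

    // embed_matrix: [vocab_size][embedding_dim]     (learned weights)
    // pos_embed:    [max_positions][embedding_dim]  (positional embeddings)
    std::vector<Vec> embed_tokens(const std::vector<int>& tokens,
                                  const Matrix& embed_matrix,
                                  const Matrix& pos_embed)
    {
        std::vector<Vec> out;
        out.reserve(tokens.size());
        for (size_t pos = 0; pos < tokens.size(); ++pos) {   // assumes tokens fit in max_positions
            Vec v = embed_matrix[tokens[pos]];               // select the row for this token id
            for (size_t j = 0; j < v.size(); ++j)
                v[j] += pos_embed[pos][j];                   // add the positional embedding
            out.push_back(std::move(v));
        }
        return out;
    }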
Embedding Optimizations
Several methods have been examined to optimize the embeddings components. The embedding matrix can be quite large, especially if the token vocabulary size is large. However, this multiplication occurs infrequently compared to other weight matrices, so it is not a latency-critical operation. Nevertheless, the cost of storing a large embedding matrix can be significant.
Embedding Size NAS. A conceptually simple way to reduce embedding cost is to choose a smaller embedding size, which is a model “hyper-parameter” chosen before training. Optimizing this number is a sub-problem of “neural architecture search” (NAS), also called “hyper-parameter optimization” (HPO). The embedding-specific NAS problem has been examined in some research papers.
Embedding Matrix Compression. Memory costs of a large embedding matrix can be significant, especially on smaller platforms. There are several research papers specifically on reducing the storage cost of large embedding matrices. Techniques include hashing vectors and pruning embeddings to create sparsity. Vocabulary size is also closely related to embeddings size, since it is one of the dimensions of the matrix, so a smaller vocabulary can improve both memory usage and computation cost.
Positional Encoding
Positional Encoding (PE) is the algorithm whereby relative position information about the placements of words in relation to each other is encoded into “embeddings” that are input into the AI model. The term is often used synonymously with “positional embeddings,” but technically, positional encoding is the algorithm (i.e. code) used to create a vector of positional embeddings (i.e. data).
The positional encoding algorithm was one of the important parts of the vanilla 2017 Transformer architecture, which used a sinusoidal positional encoding. Various attempts have been made to try other methods of positional encoding for improved model accuracy. In particular, the handling of long context lengths has been found to be better with other positional encoding algorithms, notably Rotary Positional Encoding (RoPE).
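As a reference point, here's a minimal sketch of the original sinusoidal encoding, which fills a table indexed by position and embedding dimension using PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name and layout are illustrative:

    // Sinusoidal positional embeddings from the original Transformer paper.
    #include <cmath>
    #include <vector>

    std::vector<std::vector<float>> sinusoidal_pe(int max_positions, int dim)
    {
        std::vector<std::vector<float>> pe(max_positions, std::vector<float>(dim));
        for (int pos = 0; pos < max_positions; ++pos) {
            for (int i = 0; i < dim; i += 2) {              // even index: sine, odd index: cosine
                double angle = pos / std::pow(10000.0, static_cast<double>(i) / dim);
                pe[pos][i] = static_cast<float>(std::sin(angle));
                if (i + 1 < dim)
                    pe[pos][i + 1] = static_cast<float>(std::cos(angle));
            }
        }
        return pe;
    }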
Some research has attempted to improve the raw speed of positional encoding algorithms. Although positional encoding is not usually a major CPU bottleneck, it can nevertheless be optimized via improved algorithms, approximations (including integer-only versions), and surprisingly, by removing the PE component entirely with a “NoPE” algorithm.