Tokenizer Optimizations
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
There's not a lot to do in optimizing Byte-Pair Encoding or single-character tokenization, because they're already fast. However, if you're using WordPiece or SentencePiece, then many tokens are whole words or part-words. A common vocabulary size is about 50,000 tokens, so how do you optimize the tokenizer to detect 50,000 different words? It's also a little tricky with part-words, in terms of detecting where a partial word ends.
Fortunately, we've seen this all before. It's called “tokenization” (surprise!), and programming language compilers have been doing it for decades, since before the days of venerable K&R C programming. Let's have a look at some of the techniques:
Hash tables. The idea of hash tables is to look up words by hashing their letters into a number that's an offset into a table. It's very fast and very well-known. You could even try to get "perfect hashing" working with 50,000 tokens, or at least get close to it. Unfortunately, the whole strategy of hashing assumes whole words, where you can detect the end of the word with some simple logic, and it doesn't work too well with partial-word tokens, or indeed with tokens that are a punctuation mark followed by a word (e.g. a space and a word prefix combined into one token). One approach would be to look up a whole word first, and only if it's unknown, fall back to the other methods. Overall, that plan is problematic and the hash table idea isn't effective.
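To make the whole-word lookup step concrete, here is a minimal C++ sketch using a std::unordered_map as the hash table. The class name, token IDs, and vocabulary entries are made-up examples for illustration, not the implementation of any particular tokenizer.

    #include <optional>
    #include <string>
    #include <unordered_map>

    // Hypothetical whole-word lookup table (the token IDs and entries are
    // invented examples, not a real 50,000-token vocabulary).
    class WholeWordVocab {
    public:
        WholeWordVocab() {
            vocab_["the"] = 262;      // example entries only
            vocab_["token"] = 30001;
        }

        // Returns the token ID if the whole word is in the vocabulary,
        // or std::nullopt so the caller can fall back to part-word matching.
        std::optional<int> lookup(const std::string& word) const {
            auto it = vocab_.find(word);   // average-case O(1) hash lookup
            return (it != vocab_.end()) ? std::optional<int>(it->second)
                                        : std::nullopt;
        }

    private:
        std::unordered_map<std::string, int> vocab_;  // hash table keyed on whole words
    };

Note that the sketch only handles exact whole-word matches, which is exactly the limitation described above: part-word tokens and punctuation-plus-word tokens still need some other lookup method.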
Tries. The “trie” is one data structure that works well for prefixes of words, part-words, or punctuation-then-letters tokens. It also works fine for whole words, or even whole words with an extra punctuation mark tacked on the end. So, let's leave that on the list of candidates for optimizing tokenizers.
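As a sketch of how a trie supports longest-match token lookup, here is a hand-written C++ version. The node layout, helper function names, and token IDs are illustrative assumptions, not the exact structure used by any particular tokenizer.

    #include <string>

    // Hypothetical trie node: one child pointer per possible next byte.
    struct TrieNode {
        int token_id = -1;                  // -1 means "no token ends here"
        TrieNode* child[256] = {nullptr};
    };

    // Insert a token string with its ID into the trie.
    void trie_insert(TrieNode* root, const std::string& token, int id) {
        TrieNode* node = root;
        for (unsigned char c : token) {
            if (!node->child[c]) node->child[c] = new TrieNode();
            node = node->child[c];
        }
        node->token_id = id;
    }

    // Walk the trie from 'text', remembering the longest token that matched.
    // Returns the token ID (or -1) and sets 'matched_len' to its length.
    int trie_longest_match(const TrieNode* root, const char* text, int* matched_len) {
        const TrieNode* node = root;
        int best_id = -1, best_len = 0;
        for (int i = 0; text[i] != '\0'; ++i) {
            node = node->child[(unsigned char)text[i]];
            if (!node) break;              // no token shares this longer prefix
            if (node->token_id != -1) {    // a token ends at this character
                best_id = node->token_id;
                best_len = i + 1;
            }
        }
        *matched_len = best_len;
        return best_id;
    }

Because each node branches on the raw next byte, the same structure handles whole words, part-words, and punctuation-then-letters tokens without any special end-of-word detection.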
Automata and Finite State Machines. What are these? An automaton or “finite state automaton” is an advanced data structure that performs tokenization using a “finite state machine.” The idea is that the set of tokens is finite, and we can specify them all as patterns of characters.
A finite state machine has "states," which are numbers, and "transitions" between the states, each representing an input character. There is a large number of states, which represent being at the start, middle, or end of a token. The transitions are like pointer arcs between the states, one for each possible next character, so there are 256 transitions from every state: one for each of the 255 non-null input characters, plus the null byte, which represents end-of-input.
Conceptually, a finite state machine is a lot like a “trie” data structure, where the states are analogous to the trie nodes, and the transitions are like pointers between the trie nodes. A trie has an array of 256 pointers, one for each possible next character, and similarly an automaton has 256 numbers that are the new states for each character from any given state. A finite state machine uses numbers, whereas a trie uses allocated nodes.
To implement a finite state machine, the main data structure is a massive two-dimensional array of numbers. One dimension is the states, and it is much larger than the 50,000-token vocabulary because the states also represent many mid-token positions. The other dimension is the value of the input character (i.e. that dimension has size 256). The data stored in the array is the next state for the automaton, given the current state and the next input character.
The automaton works by beginning at its start state and spinning through the input characters one at a time, repeatedly moving to the new state given by the transition for each input character. This continues until it reaches a "leaf" state, which means we're at the end of a token, so we emit that state's token into the token stream and restart (i.e. go back to the start state).
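Here is a minimal, hand-written C++ sketch of that table-and-loop design. The toy machine, its states, and its token IDs are invented for illustration (a real tokenizer would have thousands of states and a 50,000-token vocabulary), and real code would also need proper error handling and backtracking.

    #include <string>
    #include <vector>

    constexpr int kNumStates = 4;      // real tokenizers need many more states
    constexpr int kNumChars  = 256;    // one transition per possible input byte
    constexpr int kDead      = -1;     // "no transition" marker

    // transition[state][ch] = next state (or kDead).
    static int transition[kNumStates][kNumChars];

    // accept_token[state] = token ID emitted if this state ends a token, else -1.
    static int accept_token[kNumStates] = { -1, -1, 1, 2 };

    // Build a toy machine that recognizes "ab" (token 1) and "ac" (token 2).
    void build_toy_machine() {
        for (int s = 0; s < kNumStates; ++s)
            for (int c = 0; c < kNumChars; ++c)
                transition[s][c] = kDead;
        transition[0]['a'] = 1;   // start --'a'--> state 1
        transition[1]['b'] = 2;   // state 1 --'b'--> accepting state for "ab"
        transition[1]['c'] = 3;   // state 1 --'c'--> accepting state for "ac"
    }

    // Run the machine over the input, emitting token IDs as leaf states are reached.
    std::vector<int> tokenize(const std::string& input) {
        std::vector<int> tokens;
        int state = 0;
        for (unsigned char c : input) {
            int next = transition[state][c];
            if (next == kDead) {
                state = 0;                     // toy error handling: just restart
                continue;
            }
            state = next;
            if (accept_token[state] != -1) {   // reached a "leaf" (accepting) state
                tokens.push_back(accept_token[state]);
                state = 0;                     // restart at the start state
            }
        }
        return tokens;
    }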
This sounds awfully complicated, but tools to build these automata have been available for decades, of which the most famous is Lex. This is the original Unix tool of "Lex & Yacc" fame, since superseded by the open-source Flex, which stands for "fast lex." Lex and Flex are "lexer generator" tools: they take a set of token patterns as input and build a finite state machine for you as C++ source code, which you can compile and use as a lexer (i.e. tokenizer). And rather than just building a big matrix of state numbers, Flex can also unroll the massive table of states and transitions into nested switch statements. The result is completely unreadable C++ code, but a blazingly fast tokenizer.
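To show the spirit of a state machine unrolled into switch statements, here is a tiny hand-written example. This is not actual generated lexer output; real generated scanners have thousands of states and are far less readable. The toy recognizer returns token 1 for "ab" and token 2 for "ac".

    // Hand-written illustration of a switch-unrolled state machine (toy example).
    int toy_switch_lexer(const char* s) {
        int state = 0;
        for (;; ++s) {
            switch (state) {
            case 0:  // start state
                if (*s == 'a') { state = 1; break; }
                return -1;  // no token matches
            case 1:  // seen "a"
                switch (*s) {
                case 'b': return 1;   // token "ab"
                case 'c': return 2;   // token "ac"
                default:  return -1;  // no token matches
                }
            }
        }
    }

The advantage of the unrolled form is that each state's dispatch becomes a jump compiled directly into the code, instead of an indexed lookup into a large table.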