
Tokenizer Design Issues

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


If you think you can whip up a tokenizer in an afternoon, here's some news. Some of the problematic design issues affecting tokenizers include:

  • Numbers. There are an infinite number of numbers. Some tokenizers simply treat each digit as a separate token, whereas another approach is to treat common numbers (e.g. 100) as their own tokens and fall back to digit-by-digit tokens for longer or rarer numbers (a number-handling sketch appears after this list).
  • Capitalization. The tokenizer usually needs to distinguish uppercase from lowercase letters, as the model would otherwise make capitalization errors in its output. But the need to represent both cases of letters increases the size of the vocabulary.
  • Spaces. How should spaces be tokenized? One approach is that a space is its own token, separate from tokens for words (or numbers or punctuation). Another approach is that a space may be part of a subword sequence: for example, it is common for an LLM vocabulary to contain tokens consisting of a space followed by a word prefix. And note that not all written languages use spaces to separate words the way English does.
  • Multiple Spaces. Should each space get its own token, or should a run of spaces be a single token? For example, multi-space tokens help the model understand the meaning of indentation in Python code. Spaces also carry layout meaning in other contexts, such as tabular listings of data (see the space-handling sketch after this list).
  • Hyphenated words. Should these be one token or multiple?
  • Contraction words. How should contractions with an apostrophe (e.g. “isn't”) be tokenized?
  • Punctuation characters and sequences. Usually, a tokenizer will treat each punctuation character as its own single-character token. But there are various multi-character punctuation sequences (e.g. “...” or “->”) that could be their own tokens.
  • Encoding. Will the input sequence be in Latin1 or UTF8 encoding? Or another format such as double-byte Unicode (e.g. UTF-16)? The model will become confused if it was trained on one encoding but receives input tokenized from another encoding during inference.
  • UTF8 and Unicode characters. The vast number of standard byte encodings for obscure symbols makes life difficult for tokenizers. One approach is to ignore this, and simply have a token for each byte that's not part of a known word or other token (i.e. a maximum of 256 of these byte-level tokens, or 255 if the null byte is excluded; see the byte-fallback sketch after this list). But wouldn't you want your model to know the difference between a love heart and a poop emoji?
  • Double-byte character set languages. Several languages such as Chinese and Japanese have a large number of distinct symbols, which increases the size of the tokenizer and its vocabulary.
  • European language letters. Even the relatively simple ASCII extensions with values 128..255 to support European letters need to be handled correctly. Note that there are more accented letters and special characters than will fit in a single byte, so a multi-byte encoding such as UTF8 is probably desirable. However, if using UTF8, should the common European letters, which become two- or three-byte sequences, get their own tokens?
  • Escape codes. There are various non-printable control characters defined by ASCII. Some encodings assign meanings to these bytes, but in other encodings they are undefined. Examples are the ASCII DEL character (byte code 127) and various bytes in the range 1-31.
  • Encoding errors. If using UTF8, there are various invalid byte sequences that don't properly encode any Unicode code point. What should the tokenizer do in this case? (A UTF8 validation sketch appears after this list.)
  • Null bytes. Should the tokenizer allow zero as a token? This is mainly relevant to binary file tokenization and to fixed-width Unicode encoding formats. Usually the answer is “no”, since using UTF8 rather than a fixed-width Unicode encoding (e.g. UTF-16 or UTF-32) avoids embedded zero bytes.
  • Computer programming language tokens. Each major programming language has its own specific set of tokens. Should the LLM tokenizer use these tokens or not? To what extent do you want it to understand code?
  • HTML sequences. Should the tokenizer have separate individual tokens for multi-character HTML-specific sequences (even in Latin1 encoding), such as HTML tags (e.g. “<b>” for bold) and HTML entities (e.g. “&mdash;” for an em dash)?
  • Unknown tokens. The tokenizer must not produce any tokens that the model doesn't know. Otherwise, there's unpredictable behavior during model inference.
  • Rare words. How should an unknown word be tokenized? By subwords or syllables? By letters? (A greedy longest-match sketch appears after this list.)
  • End-of-input token. The tokenizer needs a way to identify the end of the input stream. This can be implemented via a unique marker token, although there are other ways.
  • Semantic tokenization and parts of speech. Should the tokenizer attempt to discern some meaning from the words? For example, should it try to distinguish the same word as a different part of speech, such as a different token for a word as a noun or a verb? Or should the tokenizer leave that issue for the model to decide? This is a newer area of research.
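
To make a few of these issues concrete, here are some illustrative C++ sketches. They are not a complete tokenizer, and the vocabularies, token id values, and function names are hypothetical placeholders. This first number-handling sketch shows one option for digit runs: a whole number that is a known common token becomes a single token, and anything else falls back to one token per digit.

    #include <cctype>
    #include <string>
    #include <unordered_map>
    #include <vector>

    typedef int TokenId;

    // Hypothetical number vocabulary: digits plus a few common numbers.
    static const std::unordered_map<std::string, TokenId> g_number_vocab = {
        {"0", 100}, {"1", 101}, {"2", 102}, {"3", 103}, {"4", 104},
        {"5", 105}, {"6", 106}, {"7", 107}, {"8", 108}, {"9", 109},
        {"10", 110}, {"100", 111}, {"1000", 112},
    };

    // Tokenize a run of digits starting at 'pos', advancing 'pos' past them.
    // A known common number becomes one token; otherwise, one token per digit.
    std::vector<TokenId> tokenize_digits(const std::string& text, size_t& pos)
    {
        size_t start = pos;
        while (pos < text.size() && isdigit((unsigned char)text[pos])) ++pos;
        std::string digits = text.substr(start, pos - start);

        auto it = g_number_vocab.find(digits);
        if (it != g_number_vocab.end()) {
            return { it->second };   // whole number is a common token
        }
        std::vector<TokenId> out;    // fallback: digit-by-digit
        for (char c : digits) {
            out.push_back(g_number_vocab.at(std::string(1, c)));
        }
        return out;
    }

With this toy vocabulary, “100” becomes one token, whereas “12345” becomes five digit tokens.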
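
This space-handling sketch shows one way to tokenize runs of spaces, assuming the vocabulary contains a few multi-space tokens; space-prefixed word tokens would be handled by the main word-matching loop, which is not shown here.

    #include <string>
    #include <unordered_map>
    #include <vector>

    typedef int TokenId;

    // Hypothetical vocabulary entries for runs of spaces.
    static const std::unordered_map<std::string, TokenId> g_space_vocab = {
        {" ", 1}, {"  ", 2}, {"    ", 3}, {"        ", 4},  // 1, 2, 4, and 8 spaces
    };

    // Tokenize a run of spaces starting at 'pos', advancing 'pos' past them.
    // Each step emits the longest multi-space token the vocabulary knows.
    std::vector<TokenId> tokenize_spaces(const std::string& text, size_t& pos)
    {
        std::vector<TokenId> out;
        while (pos < text.size() && text[pos] == ' ') {
            size_t run = 0;
            while (pos + run < text.size() && text[pos + run] == ' ') ++run;
            // Shrink the run until it matches a vocabulary entry.
            while (run > 1 && g_space_vocab.find(std::string(run, ' ')) == g_space_vocab.end()) {
                --run;
            }
            out.push_back(g_space_vocab.at(std::string(run, ' ')));  // " " assumed present
            pos += run;
        }
        return out;
    }

For example, a Python indent of eight spaces becomes one token, while five spaces becomes a four-space token followed by a single-space token.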
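
This byte-fallback sketch shows the approach of giving every raw byte its own reserved token id, so that emoji or other symbols without a dedicated token can still be represented; the reserved id range is an assumption.

    #include <cstdint>
    #include <string>
    #include <vector>

    typedef int TokenId;

    const TokenId kByteFallbackBase = 3;  // assumed: ids 3..258 reserved for the 256 raw bytes

    // Map a raw byte value to its reserved fallback token id.
    inline TokenId byte_fallback_token(uint8_t b)
    {
        return kByteFallbackBase + (TokenId)b;
    }

    // Emit byte-fallback tokens for a span that matched no word or subword,
    // e.g. an emoji or other multi-byte UTF8 sequence with no dedicated token.
    std::vector<TokenId> tokenize_unknown_span(const std::string& span)
    {
        std::vector<TokenId> out;
        out.reserve(span.size());
        for (unsigned char c : span) {
            out.push_back(byte_fallback_token(c));
        }
        return out;
    }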
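
This UTF8 validation sketch shows a basic check that a tokenizer could run before (or during) tokenization. It only checks the lead-byte/continuation-byte structure; a production validator would also reject overlong encodings and surrogate code points.

    #include <cstddef>
    #include <string>

    // Return the byte offset of the first invalid UTF8 sequence in 's',
    // or std::string::npos if the whole string is structurally well-formed.
    size_t find_first_invalid_utf8(const std::string& s)
    {
        size_t i = 0;
        while (i < s.size()) {
            unsigned char c = (unsigned char)s[i];
            size_t extra;                                   // continuation bytes expected
            if (c < 0x80)                extra = 0;         // 1-byte ASCII
            else if ((c & 0xE0) == 0xC0) extra = 1;         // 2-byte sequence
            else if ((c & 0xF0) == 0xE0) extra = 2;         // 3-byte sequence
            else if ((c & 0xF8) == 0xF0) extra = 3;         // 4-byte sequence
            else return i;                                  // stray continuation or bad lead byte
            if (i + extra >= s.size()) return i;            // truncated at end of input
            for (size_t k = 1; k <= extra; ++k) {
                if (((unsigned char)s[i + k] & 0xC0) != 0x80) return i;  // bad continuation byte
            }
            i += extra + 1;
        }
        return std::string::npos;
    }

On finding an invalid sequence, the tokenizer might skip the bad byte, substitute the Unicode replacement character (U+FFFD), or emit a byte-fallback token for it.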
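
Finally, this greedy longest-match sketch splits a rare or unknown word into subword tokens, similar to WordPiece-style maximum matching, so that no out-of-vocabulary token is ever produced; the subword vocabulary is a toy example.

    #include <string>
    #include <unordered_map>
    #include <vector>

    typedef int TokenId;

    // Hypothetical subword vocabulary (single letters included as a fallback).
    static const std::unordered_map<std::string, TokenId> g_subwords = {
        {"token", 500}, {"iz", 501}, {"ation", 502},
        {"a", 600}, {"e", 601}, {"i", 602}, {"k", 603},
        {"n", 604}, {"o", 605}, {"t", 606}, {"z", 607},
    };

    // Split a rare or unknown word into subword tokens by greedy longest-match.
    std::vector<TokenId> split_rare_word(const std::string& word)
    {
        std::vector<TokenId> out;
        size_t pos = 0;
        while (pos < word.size()) {
            size_t len = word.size() - pos;
            // Shrink the candidate until it matches a vocabulary entry.
            while (len > 1 && g_subwords.find(word.substr(pos, len)) == g_subwords.end()) {
                --len;
            }
            auto it = g_subwords.find(word.substr(pos, len));
            if (it == g_subwords.end()) {
                // Not even the single character is known: a real tokenizer would
                // fall back to byte-level tokens here (see the earlier sketch).
                break;
            }
            out.push_back(it->second);
            pos += len;
        }
        return out;
    }

With the toy vocabulary above, “tokenization” splits into “token” + “iz” + “ation”.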

 
