
What is Tokenization?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


C++ programmers are already familiar with tokenization, because that's what your compiler does as its first step. However, the tokenizer (lexer) for a programming language handles around 50 reserved keywords, whereas an AI tokenizer has to recognize around 50,000 distinct words in the “vocabulary” of the model. Like much in AI, the concepts are the same, but the scale is greater.
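As a minimal sketch of the same idea in C++, the toy tokenizer below maps whitespace-separated words to integer token IDs using a fixed lookup table. The tokenize function, the three-word vocabulary, and the unknown-word fallback are all illustrative inventions; production LLM tokenizers use subword algorithms (such as byte-pair encoding) over vocabularies of tens of thousands of entries, but the core lookup step looks much like this.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Toy word-level tokenizer: map each whitespace-separated word to an
    // integer token ID, falling back to an "unknown" ID for words that
    // are not in the vocabulary.
    std::vector<int> tokenize(const std::string& text,
                              const std::unordered_map<std::string, int>& vocab,
                              int unk_id)
    {
        std::vector<int> ids;
        std::istringstream stream(text);
        std::string word;
        while (stream >> word) {
            auto it = vocab.find(word);
            ids.push_back(it != vocab.end() ? it->second : unk_id);
        }
        return ids;
    }

    int main()
    {
        // Hypothetical 3-word vocabulary; a real model ships ~50,000 entries.
        std::unordered_map<std::string, int> vocab = {
            {"the", 1}, {"cat", 2}, {"sat", 3}
        };
        for (int id : tokenize("the cat sat down", vocab, /*unk_id=*/0))
            std::cout << id << ' ';   // prints: 1 2 3 0
        std::cout << '\n';
    }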

The tokenizer does not receive as much attention in the research literature as other parts of large language models, probably because the tokenization phase is not a bottleneck in either inference or training, compared with the many layers of multiplication operations on weights. However, the choice of tokenizer algorithm, and the resulting size of the vocabulary, has a direct impact on the speed (latency) of model inference, so the issue warrants greater research attention.
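One way to see the latency connection: an autoregressive model runs one full inference pass per generated token, so a tokenizer that produces fewer tokens for the same text saves whole passes. The sketch below (with an invented example sentence) contrasts character-level tokenization, which needs a vocabulary of only about 100 symbols, with word-level tokenization, which needs a much larger vocabulary but emits far fewer tokens.

    #include <iostream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::string text = "the quick brown fox jumps over the lazy dog";

        // Character-level tokenization: tiny vocabulary (~100 symbols),
        // but one token per character.
        size_t char_tokens = text.size();

        // Word-level tokenization: a vocabulary of tens of thousands of
        // entries, but far fewer tokens for the same text.
        size_t word_tokens = 0;
        std::istringstream stream(text);
        for (std::string w; stream >> w; ) ++word_tokens;

        std::cout << "char-level tokens: " << char_tokens << '\n';  // 43
        std::cout << "word-level tokens: " << word_tokens << '\n';  // 9
        // Fewer tokens per input means fewer decoding steps, reducing
        // latency at the cost of a larger vocabulary (and a bigger
        // embedding matrix).
    }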

Also, the tokenizer cannot be changed without re-training the model, so choosing the tokenization algorithm is an important model design decision. The same tokenizer algorithm must be used in both training and inference, and the vocabulary is also fixed. There is an optimization called “token pruning” that removes tokens during inference, but it cannot be applied to all use cases. You cannot add new tokens to a model except by re-creating the entire model. Well, who doesn't like a good rewrite?
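To make the idea concrete, here is a minimal sketch of token pruning, assuming each token has already been assigned an importance score. In real schemes the scores are computed inside the model, for example from attention weights; the values below are made up for illustration. Tokens scoring below a threshold are simply dropped from the sequence, shortening the remaining computation.

    #include <iostream>
    #include <vector>

    // Minimal token-pruning sketch: keep only the tokens whose
    // importance score meets the threshold, preserving order.
    std::vector<int> prune_tokens(const std::vector<int>& tokens,
                                  const std::vector<float>& importance,
                                  float threshold)
    {
        std::vector<int> kept;
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (importance[i] >= threshold)
                kept.push_back(tokens[i]);
        }
        return kept;
    }

    int main()
    {
        std::vector<int>   tokens = { 17, 203, 5, 894 };
        std::vector<float> scores = { 0.9f, 0.1f, 0.7f, 0.05f };  // hypothetical
        for (int id : prune_tokens(tokens, scores, 0.5f))
            std::cout << id << ' ';   // prints: 17 5
        std::cout << '\n';
    }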

 
