Tokenizer Algorithms

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Some of the earlier tokenizer algorithms included:

  • Single characters/bytes. There are only 256 tokens, and this approach was used only in early neural network models; a minimal sketch appears after this list. Maybe we'd still be doing this if it weren't for GPUs.
  • Whole words. This resembles the tokenizers in old programming language compilers: it produces no part-word tokens, and it isn't commonly used in AI.
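
As a concrete example, here is a minimal sketch in C++ of single-byte tokenization, where every byte is its own token ID. The function name is illustrative, not taken from any particular library:

    #include <string>
    #include <vector>

    // Every byte is its own token, so the vocabulary is fixed at 256 IDs.
    std::vector<int> byte_tokenize(const std::string& text) {
        std::vector<int> tokens;
        tokens.reserve(text.size());
        for (unsigned char c : text)
            tokens.push_back(static_cast<int>(c));  // token ID == byte value (0..255)
        return tokens;
    }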

The more up-to-date tokenizer algorithms used in AI models include:

  • Byte-Pair Encoding (BPE), from Gage (1994), was originally a data compression technique, and has become a longstanding subword tokenization method in neural networks (a merge-step sketch appears after this list).
  • WordPiece, from Wu et al. (2016), is like whole-word tokenization, but uses a greedy algorithm that falls back to subword tokens only for words not in the vocabulary (see the greedy-match sketch below). Google has open-sourced the code.
  • SentencePiece, from Kudo and Richardson (2018), also has an open-source codebase from Google, and is used by models such as Llama.
  • Unigram (Kudo, 2018). This is another method that supports part-word tokens.
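
To make the BPE idea concrete, here is a simplified sketch in C++ of one training merge step: count adjacent token pairs, pick the most frequent pair, and replace every occurrence with a newly allocated token ID. The function name and data representation are assumptions for illustration, not the code of any particular tokenizer library:

    #include <map>
    #include <utility>
    #include <vector>

    // One simplified BPE training step: merge the most frequent adjacent pair.
    std::vector<int> bpe_merge_step(const std::vector<int>& tokens, int next_id) {
        std::map<std::pair<int, int>, int> counts;
        for (size_t i = 0; i + 1 < tokens.size(); ++i)
            ++counts[{tokens[i], tokens[i + 1]}];
        if (counts.empty()) return tokens;

        auto best = counts.begin();
        for (auto it = counts.begin(); it != counts.end(); ++it)
            if (it->second > best->second) best = it;  // most frequent pair

        std::vector<int> merged;
        for (size_t i = 0; i < tokens.size(); ) {
            if (i + 1 < tokens.size()
                && tokens[i] == best->first.first
                && tokens[i + 1] == best->first.second) {
                merged.push_back(next_id);  // the merged pair becomes one new token
                i += 2;
            } else {
                merged.push_back(tokens[i]);
                ++i;
            }
        }
        return merged;
    }

Training repeats this step, allocating a fresh token ID each time, until the vocabulary reaches the desired size.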
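
For WordPiece-style encoding, here is a sketch of the greedy longest-match step applied to a single word: at each position, take the longest vocabulary entry that matches, falling back to a single character. Real WordPiece also marks word-internal pieces (with a "##" prefix) and has an unknown-token fallback, both of which this simplified sketch omits:

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Greedy longest-match subword split of one word against a vocabulary.
    std::vector<std::string> greedy_tokenize(
            const std::string& word,
            const std::unordered_set<std::string>& vocab) {
        std::vector<std::string> pieces;
        size_t pos = 0;
        while (pos < word.size()) {
            size_t len = word.size() - pos;
            while (len > 1 && vocab.count(word.substr(pos, len)) == 0)
                --len;  // shrink until a vocabulary hit (or a single character)
            pieces.push_back(word.substr(pos, len));
            pos += len;
        }
        return pieces;
    }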
