Aussie AI

Tokenizer Research

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

The tokenizer does not receive as much attention in the research literature as other parts of large language models. This is probably because the tokenization phase itself is not a bottleneck in either inference or training, when compared to the many layers of multiplication operations on weights. However, the choice of the tokenizer algorithm, and the resulting size of the vocabulary, has a direct impact on the speed (latency) of model inference.

Also, the tokenizer cannot be changed without re-training the model. The same tokenizer algorithm must be used in both training and inference, and the vocabulary is fixed. About the only way that tokenization can be modified during inference is "token pruning", which removes less important tokens from the input sequence.

Tokenizer and Model Inference Latency

The tokenizer affects the latency (speed) of inference of a model in several ways. Firstly, the tokenization algorithm decides the vocabulary size. A larger vocabulary directly increases the model's overall size (i.e. its number of weights, particularly in the embedding and output layers), and thereby has a big impact on overall inference latency. If the tokenizer allows longer tokens, then there are more unique tokens, and the vocabulary is larger. For example, GPT has a vocabulary of around 50,000 tokens (words or subwords), whereas there are over 100,000 words in the English language, although not all of them are in common usage.
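
As a rough illustration (the vocabulary and hidden sizes below are assumed values, not figures from any particular model), the vocabulary size multiplies directly into the size of the embedding matrix, and many architectures also have an output (unembedding) projection of the same shape:

    // Sketch: how vocabulary size feeds into parameter count (assumed numbers only).
    #include <cstdio>

    int main() {
        const long long vocab_size = 50000;   // e.g. a GPT-style ~50k vocabulary
        const long long hidden_size = 4096;   // assumed embedding dimension
        const long long embedding_params = vocab_size * hidden_size;
        // Many architectures also have an unembedding (output projection) matrix
        // of the same shape, doubling the vocabulary-dependent parameter count.
        printf("Embedding matrix: %lld parameters (~%.1f million)\n",
               embedding_params, embedding_params / 1e6);
        return 0;
    }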

Secondly, the tokenization method affects the ratio of words to tokens, which determines the token sequence length for a given input text. A longer sequence of tokens generated from the prompt text causes longer latency in the inference phase. The transformer attention algorithm is known to be quadratic in input length, so fewer tokens reduce the overall processing. Furthermore, fewer tokens also help reduce the cost of the multiple model executions that arise from the "autoregression" problem, which is another LLM bottleneck.

Therefore, a tokenizer that uses whole words (longer tokens) rather than subwords (shorter tokens) will increase the vocabulary size (increasing latency) but reduce the input sequence length (reducing latency). So the tokenization algorithm and the resulting size of the vocabulary introduce an interesting performance trade-off, as the sketch below illustrates.
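
The following back-of-the-envelope sketch makes the trade-off concrete. All numbers are assumptions for illustration only: a hypothetical word-level tokenizer produces fewer tokens but needs a much larger vocabulary, while a subword tokenizer does the reverse.

    // Sketch: competing costs of sequence length (quadratic attention) versus
    // vocabulary size (embedding matrix). All values are illustrative assumptions.
    #include <cstdio>

    // Each option is one tokenizer design point with assumed characteristics.
    struct Option {
        const char* name;
        double tokens_per_word;  // assumed average tokens produced per word
        double vocab_size;       // assumed vocabulary size for this tokenizer
    };

    int main() {
        const double words = 500.0;    // assumed prompt length in words
        const double hidden = 4096.0;  // assumed model embedding dimension

        const Option options[] = {
            {"word-level", 1.0, 200000.0},  // fewer tokens, much larger vocabulary
            {"subword", 1.3, 50000.0},      // more tokens, smaller vocabulary
        };
        for (const Option& o : options) {
            double n = words * o.tokens_per_word;         // token sequence length
            double attention_cost = n * n;                // quadratic attention term
            double embed_params = o.vocab_size * hidden;  // vocabulary-sized matrix
            printf("%-10s tokens=%5.0f attention~%9.0f embedding~%.0fM params\n",
                   o.name, n, attention_cost, embed_params / 1e6);
        }
        return 0;
    }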

Tokenizer Design Issues

Some of the problematic design issues affecting tokenizers include:

  • Numbers. There are an infinite number of numbers. Some tokenizers simply treat each digit as a separate token, whereas another approach is to treat common numbers (e.g. 100) as their own tokens, and use digits as tokens for other longer numbers.
  • Capitalization. The tokenizer usually needs to distinguish capital letters, as it would otherwise make grammar errors with capitalization. But the need to represent both cases of letters increases the size of the vocabulary.
  • Spaces. How should spaces be tokenized? One approach is that a space is its own token, separate from tokens for words (or numbers or punctuation). Another approach is that a space may be part of a subword sequence. And note that not all written languages use spaces to separate words like English does.
  • Hyphenated words. Should these be 1 token or multiple?
  • Contraction words. How should contractions with an apostrophe (e.g. "isn't") be tokenized?
  • Punctuation characters and sequences. Usually a tokenizer will split punctuation into single-character tokens, but there are various multi-character punctuation sequences that could be their own tokens.
  • Encoding. Will the input sequence be in Latin1 or UTF8 encoding? Or various others like double-byte Unicode. The model will become confused if it was trained on one encoding, but receives input tokenized from another encoding during inference.
  • UTF8 and Unicode characters. The vast number of standard byte encodings for obscure symbols makes life difficult for tokenizers. One approach is to ignore this, and simply have a fallback token for each byte value that is not part of a known word or other token (i.e. at most 256 of these byte-level tokens); see the sketch after this list.
  • Double-byte character set languages. Several languages such as Chinese and Japanese have a large number of distinct symbols, which increases the size of the tokenizer and its vocabulary.
  • European language letters. Even the relatively simple ASCII extensions with values 128..255 to support European letters need to be handled correctly. Note that there are more such letters than fit in a single-byte code page, so a multi-byte encoding such as UTF8 is probably desirable. However, if using UTF8, should the common European letters get their own tokens for their two-byte or three-byte sequences?
  • Escape codes. There are various non-printable control codes defined by ASCII. Some encodings have meanings for these, but in other encodings they are undefined. Examples are ASCII code 127 (delete) and various bytes in the range 0-31.
  • Encoding errors. If using UTF8, there are various invalid byte sequences that do not properly encode any Unicode code point. What should the tokenizer do in this case?
  • Null bytes. Should the tokenizer allow zero as a token? This is mainly relevant to binary file tokenization.
  • Computer programming language tokens. Each major programming language has its own specific set of tokens. Should the LLM tokenizer use these tokens or not?
  • Programming language sequences. Should the tokenizer have separate individual tokens for common multi-character sequences (even in Latin1 encoding), such as HTML tags (e.g. the bold tag "<b>") and HTML entities (e.g. the em dash entity "&mdash;")?
  • Unknown tokens. The tokenizer must not produce any tokens that the model doesn't know. Otherwise, there's unpredictable behavior during model inference.
  • Rare words. How should an unknown word be tokenized? By subwords or syllables? By letters?
  • End-of-input token. The tokenizer needs a way to identify the end of the input stream. This can be implemented via a unique marker token, although there are other ways.
  • Semantic tokenization and parts of speech. Should the tokenizer attempt to discern some meaning from the words? For example, should it try to distinguish the same word as a different part of speech, such as a different token for a word as a noun or a verb? Or should the tokenizer leave that issue for the model to decide? This is a newer area of research.
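
To make the byte-level fallback and unknown-token issues above concrete, here is a minimal sketch of greedy longest-match tokenization with a byte fallback. The tiny vocabulary, the token ID numbering, and the maximum token length are illustrative assumptions, not any real tokenizer.

    // Sketch: greedy longest-match tokenization with byte-level fallback,
    // so that every input byte maps to some token (no "unknown" tokens).
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        // Hypothetical toy vocabulary: token string -> token ID.
        std::unordered_map<std::string, int> vocab = {
            {"hello", 1}, {" world", 2}, {"ing", 3}, {" ", 4}
        };
        const int kByteTokenBase = 1000;  // assume IDs 1000..1255 are raw byte tokens
        const size_t kMaxTokenLen = 16;   // longest string in the vocabulary

        const std::string input = "hello worlds";
        std::vector<int> ids;
        size_t pos = 0;
        while (pos < input.size()) {
            // Greedy longest match: try the longest candidate substring first.
            bool matched = false;
            for (size_t len = std::min(kMaxTokenLen, input.size() - pos); len >= 1; --len) {
                auto it = vocab.find(input.substr(pos, len));
                if (it != vocab.end()) {
                    ids.push_back(it->second);
                    pos += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Byte fallback: emit a single-byte token for anything unrecognized.
                unsigned char b = static_cast<unsigned char>(input[pos++]);
                ids.push_back(kByteTokenBase + b);
            }
        }
        for (int id : ids) printf("%d ", id);  // e.g. prints: 1 2 1115
        printf("\n");
        return 0;
    }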

Tokenizer Algorithms

Some of the existing tokenizer algorithms include:

  • Single characters/bytes (early models)
  • Byte-Pair Encoding (BPE), from Gage (1994), is a longstanding method; see the merge sketch after this list.
  • WordPiece, from Wu et al. (2016), uses subword tokenization, and Google has open-sourced the code.
  • SentencePiece, from Kudo and Richardson (2018), with an open-source codebase from Google, as used by Llama.
  • Unigram (Kudo, 2018)
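
As a concrete illustration of the BPE family of algorithms, here is a minimal sketch of the merge-training loop on a toy corpus. The corpus, the number of merges, and the string-based token representation are simplifying assumptions; production BPE tokenizers typically operate on bytes and far larger corpora.

    // Sketch: BPE merge training. Repeatedly merge the most frequent adjacent
    // token pair, growing the vocabulary by one subword per merge.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using Token = std::string;
    using Word = std::vector<Token>;

    // Count how often each adjacent token pair occurs across the corpus.
    static std::map<std::pair<Token, Token>, int> countPairs(const std::vector<Word>& corpus) {
        std::map<std::pair<Token, Token>, int> counts;
        for (const Word& w : corpus)
            for (size_t i = 0; i + 1 < w.size(); ++i)
                ++counts[{w[i], w[i + 1]}];
        return counts;
    }

    // Replace every occurrence of the chosen pair with its concatenation.
    static void applyMerge(std::vector<Word>& corpus, const std::pair<Token, Token>& pair) {
        for (Word& w : corpus) {
            Word merged;
            for (size_t i = 0; i < w.size(); ++i) {
                if (i + 1 < w.size() && w[i] == pair.first && w[i + 1] == pair.second) {
                    merged.push_back(pair.first + pair.second);
                    ++i;  // skip the second half of the merged pair
                } else {
                    merged.push_back(w[i]);
                }
            }
            w = merged;
        }
    }

    int main() {
        // Toy corpus: each word starts as a sequence of single-character tokens.
        std::vector<Word> corpus = {
            {"l", "o", "w", "e", "r"}, {"l", "o", "w"}, {"l", "o", "w", "e", "s", "t"}
        };
        const int numMerges = 4;  // each merge adds one new subword to the vocabulary
        for (int m = 0; m < numMerges; ++m) {
            auto counts = countPairs(corpus);
            if (counts.empty()) break;
            // The next merge rule is the most frequent adjacent pair.
            auto best = std::max_element(counts.begin(), counts.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            std::cout << "Merge " << m + 1 << ": '" << best->first.first << "' + '"
                      << best->first.second << "'\n";
            applyMerge(corpus, best->first);
        }
        return 0;
    }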

Research on Tokenization

Tokenizers are often barely mentioned in AI papers, but there are some research papers specifically on the design of tokenization algorithms:

Vocabulary Size Research

Papers on vocabulary size and vocabulary-related issues:

  • Finding the Optimal Vocabulary Size for Neural Machine Translation, Thamme Gowda and Jonathan May, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3955–3964, November 16-20, 2020, PDF: https://aclanthology.org/2020.findings-emnlp.352.pdf
  • Welin Chen, David Grangier, and Michael Auli. 2016. Strategies for training large vocabulary neural language models. In Proc. ACL. https://arxiv.org/abs/1512.04906
  • S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014, https://arxiv.org/abs/1412.2007
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Examines vocabulary size impact on training efficiency. Note: code uses deprecated nvFuser compiler.)
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used a 256k vocabulary with SentencePiece tokenizer.)
  • Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić, 13 May 2024, Zero-Shot Tokenizer Transfer, https://arxiv.org/abs/2405.07883 (Overcoming the limitation that the tokenizer is fixed for the model, by training the tokenizer to embeddings mapping so as to use different tokenizers, including effective input token pruning reducing tokens in the input with a larger vocabulary.)
  • Yoon Kim, Yacine Jernite, David A. Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI, pp. 2741–2749. AAAI Press, 2016, https://arxiv.org/abs/1508.06615
  • David Spuler, March 2024, Chapter 27. Tokenizer and Vocabulary, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
  • Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 18 Jul 2024, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies, https://arxiv.org/abs/2407.13623
  • J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
  • HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, Huda Khayrallah, 12 Oct 2024, Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? https://arxiv.org/abs/2410.09644
  • Yangyifan Xu, Jinliang Lu, Jiajun Zhang, 15 Apr 2024, Bridging the Gap between Different Vocabularies for LLM Ensemble, https://arxiv.org/abs/2404.09492 (Addressing the particular problem with two LLMs have been trained on different vocabularies, but must be used together in an ensemble architecture.)
  • Matthew Durward and Christopher Thomson, 2024, Evaluating Vocabulary Usage in LLMs, Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications, pages 266–282, June 20, 2024, https://aclanthology.org/2024.bea-1.22/ https://aclanthology.org/2024.bea-1.22.pdf
  • Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl, 13 Nov 2024, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009 https://github.com/apple/ml-cross-entropy
  • Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni, 15 Feb 2024, Fast Vocabulary Transfer for Language Model Compression, https://arxiv.org/abs/2402.09977
  • Vilém Zouhar, 29 Jan 2024, Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing, https://arxiv.org/abs/2401.16055
  • Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne, Felix Hieber, July 2022, The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, https://aclanthology.org/2022.naacl-main.136/ https://aclanthology.org/2022.naacl-main.136.pdf

Tokenization for Machine Vision

Tokenization for images and machine vision is different from text analysis:

  • Shengju Qian; Yi Zhu; Wenbo Li; Mu Li; Jiaya Jia, What Makes for Good Tokenizers in Vision Transformer? 22 December 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.1 - 13, https://arxiv.org/abs/2212.11115
  • T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, "Early convolutions help transformers see better," in NeurIPS, 2021, https://arxiv.org/abs/2106.14881
  • L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in ICCV, 2021, https://arxiv.org/abs/2101.11986
  • X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised visual transformers,” in ICCV, 2021, https://arxiv.org/abs/2104.02057
  • C.-F. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” arXiv preprint arXiv:2103.14899, 2021, https://arxiv.org/abs/2103.14899
  • W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pvtv2: Improved baselines with pyramid vision transformer,” arXiv preprint arXiv:2106.13797, 2021, https://arxiv.org/abs/2106.13797
  • T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan, “Pnp-detr: Towards efficient visual analysis with transformers,” in ICCV, 2021, https://arxiv.org/abs/2109.07036
  • M. S. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova, “Tokenlearner: What can 8 learned tokens do for images and videos?” in NeurIPS, 2021, https://arxiv.org/abs/2106.11297
  • X. Yue, S. Sun, Z. Kuang, M. Wei, P. Torr, W. Zhang, and D. Lin, “Vision transformer with progressive sampling,” in ICCV, 2021, https://arxiv.org/abs/2108.01684
  • Z. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, and J. Feng, “All tokens matter: Token labeling for training better vision transformers,” in NeurIPS, 2021, https://arxiv.org/abs/2104.10858

Semantic Tokenization

Papers on semantic tokenization, such as identifying the part-of-speech of a word:

Tokenization of Non-English Languages

Various non-English double-byte languages cause extra difficulties in tokenization:

  • Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun, "Sub-Character Tokenization for Chinese Pretrained Language Models", Transactions of the Association for Computational Linguistics, vol.11, pp.469, 2023, https://arxiv.org/abs/2106.00400, Code: https://github.com/thunlp/SubCharTokenization
  • Jonathan J. Webster and Chunyu Kit. 1992. Tokenization as the initial phase in NLP. In COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics, https://dl.acm.org/doi/10.3115/992424.992434, DOI: https://doi.org/10.3115/992424.992434
  • N. Venkatesan, N. Arulanand, "Implications of Tokenizers in BERT Model for Low-Resource Indian Language", Journal of Soft Computing Paradigm, vol.4, no.4, pp.264, 2023, https://irojournals.com/jscp/article/view/4/4/5
  • Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik, "Impact of Tokenization on Language Models: An Analysis for Turkish", ACM Transactions on Asian and Low-Resource Language Information Processing, vol.22, no.4, pp.1, 2023, https://arxiv.org/abs/2204.08832
  • Sumalatha Bandari, Vishnu Vardhan Bulusu, BERT Tokenization and Hybrid-Optimized Deep Recurrent Neural Network for Hindi Document Summarization January 2022, International Journal of Fuzzy System Applications 11(1):1-28, DOI:10.4018/IJFSA.313601, http://dx.doi.org/10.4018/IJFSA.313601

More Research

Read more about: