Aussie AI

Tokenizer Research

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

The tokenizer does not receive as much attention in the research literature as other parts of large language models. This is probably because the tokenization phase itself is not a bottleneck in either inference or training, when compared to the many layers of multiplication operations on weights. However, the choice of the tokenizer algorithm, and the resulting size of the vocabulary, has a direct impact on the speed (latency) of model inference.

Also, the tokenizer cannot be changed without re-training the model. The same tokenizer algorithm must be used in both training and inference, and the vocabulary is fixed. About the only way that tokenization can be modified during inference is "token pruning", which removes less important tokens from the input sequence.

Tokenizer and Model Inference Latency

The tokenizer affects the latency (speed) of inference of a model in several ways. Firstly, the tokenization algorithm decides the vocabulary size. A larger vocabulary directly increases the model's overall size (i.e. its number of weights, particularly in the embedding and output layers), and thereby has a big impact on overall inference latency. If the tokenizer allows longer tokens, then there are more unique tokens, and the vocabulary is larger. For example, GPT has a vocabulary of around 50,000 tokens (words or subwords), whereas there are over 100,000 words in the English language, although not all of them are in common usage.
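
As a rough illustration (the vocabulary and hidden sizes below are assumed values, not figures from any particular model), the vocabulary size multiplies directly into the size of the embedding matrix, and many architectures also have an output (unembedding) projection of the same shape:

    // Sketch: how vocabulary size feeds into parameter count (assumed numbers only).
    #include <cstdio>

    int main() {
        const long long vocab_size = 50000;   // e.g. a GPT-style ~50k vocabulary
        const long long hidden_size = 4096;   // assumed embedding dimension
        const long long embedding_params = vocab_size * hidden_size;
        // Many architectures also have an unembedding (output projection) matrix
        // of the same shape, doubling the vocabulary-dependent parameter count.
        printf("Embedding matrix: %lld parameters (~%.1f million)\n",
               embedding_params, embedding_params / 1e6);
        return 0;
    }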

Secondly, the tokenization method affects the ratio of words to tokens, which determines the token sequence length for a given input text. A longer sequence of tokens generated from the prompt text causes longer latency in the inference phase. The transformer attention algorithm is known to be quadratic in input length, so fewer tokens reduce the overall processing. Furthermore, fewer tokens also help reduce the cost of the multiple model executions that arise from the "autoregression" problem, which is another LLM bottleneck.

Therefore, a tokenizer that uses whole words (longer tokens) rather than subwords (shorter tokens) will increase the vocabulary size (increasing latency) but reduce the input sequence length (reducing latency). So the tokenization algorithm and the resulting size of the vocabulary introduce an interesting performance trade-off, as the sketch below illustrates.
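
The following back-of-the-envelope sketch makes the trade-off concrete. All numbers are assumptions for illustration only: a hypothetical word-level tokenizer produces fewer tokens but needs a much larger vocabulary, while a subword tokenizer does the reverse.

    // Sketch: competing costs of sequence length (quadratic attention) versus
    // vocabulary size (embedding matrix). All values are illustrative assumptions.
    #include <cstdio>

    // Each option is one tokenizer design point with assumed characteristics.
    struct Option {
        const char* name;
        double tokens_per_word;  // assumed average tokens produced per word
        double vocab_size;       // assumed vocabulary size for this tokenizer
    };

    int main() {
        const double words = 500.0;    // assumed prompt length in words
        const double hidden = 4096.0;  // assumed model embedding dimension

        const Option options[] = {
            {"word-level", 1.0, 200000.0},  // fewer tokens, much larger vocabulary
            {"subword", 1.3, 50000.0},      // more tokens, smaller vocabulary
        };
        for (const Option& o : options) {
            double n = words * o.tokens_per_word;         // token sequence length
            double attention_cost = n * n;                // quadratic attention term
            double embed_params = o.vocab_size * hidden;  // vocabulary-sized matrix
            printf("%-10s tokens=%5.0f attention~%9.0f embedding~%.0fM params\n",
                   o.name, n, attention_cost, embed_params / 1e6);
        }
        return 0;
    }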

Tokenizer Design Issues

Some of the problematic design issues affecting tokenizers include:

  • Numbers. There are an infinite number of numbers. Some tokenizers simply treat each digit as a separate token, whereas another approach is to treat common numbers (e.g. 100) as their own tokens, and use digits as tokens for other longer numbers.
  • Capitalization. The tokenizer usually needs to distinguish capital letters, as it would otherwise make grammar errors with capitalization. But the need to represent both cases of letters increases the size of the vocabulary.
  • Spaces. How should spaces be tokenized? One approach is that a space is its own token, separate from tokens for words (or numbers or punctuation). Another approach is that a space may be part of a subword sequence. And note that not all written languages use spaces to separate words like English does.
  • Hyphenated words. Should these be 1 token or multiple?
  • Contraction words. How should contractions with an apostrophe (e.g. "isn't") be tokenized?
  • Punctuation characters and sequences. Usually a tokenizer will split punctuation into single-character tokens, but there are various multi-character punctuation sequences that could be their own tokens.
  • Encoding. Will the input sequence be in Latin1 or UTF8 encoding? Or various others like double-byte Unicode. The model will become confused if it was trained on one encoding, but receives input tokenized from another encoding during inference.
  • UTF8 and Unicode characters. The vast number of standard byte encodings for obscure symbols makes life difficult for tokenizers. One approach is to ignore this, and simply have a fallback token for each byte value that is not part of a known word or other token (i.e. at most 256 of these byte-level tokens); see the sketch after this list.
  • Double-byte character set languages. Several languages such as Chinese and Japanese have a large number of distinct symbols, which increases the size of the tokenizer and its vocabulary.
  • European language letters. Even the relatively simple ASCII extensions with values 128..255 to support European letters need to be handled correctly. Note that there are more such letters than fit in a single-byte code page, so a multi-byte encoding such as UTF8 is probably desirable. However, if using UTF8, should the common European letters get their own tokens for their two-byte or three-byte sequences?
  • Escape codes. There are various non-printable control codes defined by ASCII. Some encodings have meanings for these, but in other encodings they are undefined. Examples are ASCII code 127 (delete) and various bytes in the range 0-31.
  • Encoding errors. If using UTF8, there are various invalid byte sequences that do not properly encode any Unicode code point. What should the tokenizer do in this case?
  • Null bytes. Should the tokenizer allow zero as a token? This is mainly relevant to binary file tokenization.
  • Computer programming language tokens. Each major programming language has its own specific set of tokens. Should the LLM tokenizer use these tokens or not?
  • Programming language sequences. Should the tokenizer have separate individual tokens for common multi-character sequences (even in Latin1 encoding), such as HTML tags (e.g. the bold tag "<b>") and HTML entities (e.g. the em dash entity "&mdash;")?
  • Unknown tokens. The tokenizer must not produce any tokens that the model doesn't know. Otherwise, there's unpredictable behavior during model inference.
  • Rare words. How should an unknown word be tokenized? By subwords or syllables? By letters?
  • End-of-input token. The tokenizer needs a way to identify the end of the input stream. This can be implemented via a unique marker token, although there are other ways.
  • Semantic tokenization and parts of speech. Should the tokenizer attempt to discern some meaning from the words? For example, should it try to distinguish the same word as a different part of speech, such as a different token for a word as a noun or a verb? Or should the tokenizer leave that issue for the model to decide? This is a newer area of research.
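
To make the byte-level fallback and unknown-token issues above concrete, here is a minimal sketch of greedy longest-match tokenization with a byte fallback. The tiny vocabulary, the token ID numbering, and the maximum token length are illustrative assumptions, not any real tokenizer.

    // Sketch: greedy longest-match tokenization with byte-level fallback,
    // so that every input byte maps to some token (no "unknown" tokens).
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        // Hypothetical toy vocabulary: token string -> token ID.
        std::unordered_map<std::string, int> vocab = {
            {"hello", 1}, {" world", 2}, {"ing", 3}, {" ", 4}
        };
        const int kByteTokenBase = 1000;  // assume IDs 1000..1255 are raw byte tokens
        const size_t kMaxTokenLen = 16;   // longest string in the vocabulary

        const std::string input = "hello worlds";
        std::vector<int> ids;
        size_t pos = 0;
        while (pos < input.size()) {
            // Greedy longest match: try the longest candidate substring first.
            bool matched = false;
            for (size_t len = std::min(kMaxTokenLen, input.size() - pos); len >= 1; --len) {
                auto it = vocab.find(input.substr(pos, len));
                if (it != vocab.end()) {
                    ids.push_back(it->second);
                    pos += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Byte fallback: emit a single-byte token for anything unrecognized.
                unsigned char b = static_cast<unsigned char>(input[pos++]);
                ids.push_back(kByteTokenBase + b);
            }
        }
        for (int id : ids) printf("%d ", id);  // e.g. prints: 1 2 1115
        printf("\n");
        return 0;
    }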

Tokenizer Algorithms

Some of the existing tokenizer algorithms include:

  • Single characters/bytes (early models)
  • Byte-Pair Encoding (BPE), from Gage (1994), is a longstanding method; see the merge sketch after this list.
  • WordPiece, from Wu et al. (2016), uses subword tokenization, and Google has open-sourced the code.
  • SentencePiece, from Kudo and Richardson (2018), with an open-source codebase from Google, as used by Llama.
  • Unigram (Kudo, 2018)
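
As a concrete illustration of the BPE family of algorithms, here is a minimal sketch of the merge-training loop on a toy corpus. The corpus, the number of merges, and the string-based token representation are simplifying assumptions; production BPE tokenizers typically operate on bytes and far larger corpora.

    // Sketch: BPE merge training. Repeatedly merge the most frequent adjacent
    // token pair, growing the vocabulary by one subword per merge.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using Token = std::string;
    using Word = std::vector<Token>;

    // Count how often each adjacent token pair occurs across the corpus.
    static std::map<std::pair<Token, Token>, int> countPairs(const std::vector<Word>& corpus) {
        std::map<std::pair<Token, Token>, int> counts;
        for (const Word& w : corpus)
            for (size_t i = 0; i + 1 < w.size(); ++i)
                ++counts[{w[i], w[i + 1]}];
        return counts;
    }

    // Replace every occurrence of the chosen pair with its concatenation.
    static void applyMerge(std::vector<Word>& corpus, const std::pair<Token, Token>& pair) {
        for (Word& w : corpus) {
            Word merged;
            for (size_t i = 0; i < w.size(); ++i) {
                if (i + 1 < w.size() && w[i] == pair.first && w[i + 1] == pair.second) {
                    merged.push_back(pair.first + pair.second);
                    ++i;  // skip the second half of the merged pair
                } else {
                    merged.push_back(w[i]);
                }
            }
            w = merged;
        }
    }

    int main() {
        // Toy corpus: each word starts as a sequence of single-character tokens.
        std::vector<Word> corpus = {
            {"l", "o", "w", "e", "r"}, {"l", "o", "w"}, {"l", "o", "w", "e", "s", "t"}
        };
        const int numMerges = 4;  // each merge adds one new subword to the vocabulary
        for (int m = 0; m < numMerges; ++m) {
            auto counts = countPairs(corpus);
            if (counts.empty()) break;
            // The next merge rule is the most frequent adjacent pair.
            auto best = std::max_element(counts.begin(), counts.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            std::cout << "Merge " << m + 1 << ": '" << best->first.first << "' + '"
                      << best->first.second << "'\n";
            applyMerge(corpus, best->first);
        }
        return 0;
    }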

Research on Tokenization

Tokenizers are often barely mentioned in AI papers, but there are some research papers specifically on the design of tokenization algorithms:

Vocabulary Size Research

Papers on vocabulary size and vocabulary-related issues:

  • Finding the Optimal Vocabulary Size for Neural Machine Translation, Thamme Gowda and Jonathan May, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3955–3964, November 16-20, 2020, PDF: https://aclanthology.org/2020.findings-emnlp.352.pdf
  • Welin Chen, David Grangier, and Michael Auli. 2016. Strategies for training large vocabulary neural language models. In Proc. ACL. https://arxiv.org/abs/1512.04906
  • S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014, https://arxiv.org/abs/1412.2007
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Examines vocabulary size impact on training efficiency. Note: code uses deprecated nvFuser compiler.)
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used a 256k vocabulary with SentencePiece tokenizer.)
  • Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić, 13 May 2024, Zero-Shot Tokenizer Transfer, https://arxiv.org/abs/2405.07883 (Overcoming the limitation that the tokenizer is fixed for the model, by training the tokenizer to embeddings mapping so as to use different tokenizers, including effective input token pruning reducing tokens in the input with a larger vocabulary.)
  • Yoon Kim, Yacine Jernite, David A. Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI, pp. 2741–2749. AAAI Press, 2016, https://arxiv.org/abs/1508.06615
  • David Spuler, March 2024, Chapter 27. Tokenizer and Vocabulary, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
  • Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 18 Jul 2024, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies, https://arxiv.org/abs/2407.13623
  • J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
  • HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, Huda Khayrallah, 12 Oct 2024, Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? https://arxiv.org/abs/2410.09644
  • Yangyifan Xu, Jinliang Lu, Jiajun Zhang, 15 Apr 2024, Bridging the Gap between Different Vocabularies for LLM Ensemble, https://arxiv.org/abs/2404.09492 (Addressing the particular problem with two LLMs have been trained on different vocabularies, but must be used together in an ensemble architecture.)
  • Matthew Durward and Christopher Thomson, 2024, Evaluating Vocabulary Usage in LLMs, Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications, pages 266–282, June 20, 2024, https://aclanthology.org/2024.bea-1.22/ https://aclanthology.org/2024.bea-1.22.pdf
  • Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl, 13 Nov 2024, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009 https://github.com/apple/ml-cross-entropy
  • Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni, 15 Feb 2024, Fast Vocabulary Transfer for Language Model Compression, https://arxiv.org/abs/2402.09977
  • Vilém Zouhar, 29 Jan 2024, Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing, https://arxiv.org/abs/2401.16055
  • Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne, Felix Hieber, July 2022, The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, https://aclanthology.org/2022.naacl-main.136/ https://aclanthology.org/2022.naacl-main.136.pdf

Tokenization for Machine Vision

Tokenization for images and machine vision is different from text analysis:

  • Shengju Qian; Yi Zhu; Wenbo Li; Mu Li; Jiaya Jia, What Makes for Good Tokenizers in Vision Transformer? 22 December 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.1 - 13, https://arxiv.org/abs/2212.11115
  • T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, "Early convolutions help transformers see better," in NeurIPS, 2021, https://arxiv.org/abs/2106.14881
  • L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in ICCV, 2021, https://arxiv.org/abs/2101.11986
  • X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised visual transformers,” in ICCV, 2021, https://arxiv.org/abs/2104.02057
  • C.-F. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” arXiv preprint arXiv:2103.14899, 2021, https://arxiv.org/abs/2103.14899
  • W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pvtv2: Improved baselines with pyramid vision transformer,” arXiv preprint arXiv:2106.13797, 2021, https://arxiv.org/abs/2106.13797
  • T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan, “Pnp-detr: Towards efficient visual analysis with transformers,” in ICCV, 2021, https://arxiv.org/abs/2109.07036
  • M. S. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova, “Tokenlearner: What can 8 learned tokens do for images and videos?” in NeurIPS, 2021, https://arxiv.org/abs/2106.11297
  • X. Yue, S. Sun, Z. Kuang, M. Wei, P. Torr, W. Zhang, and D. Lin, “Vision transformer with progressive sampling,” in ICCV, 2021, https://arxiv.org/abs/2108.01684
  • Z. Jiang, Q. Hou, L. Yuan, D. Zhou, Y. Shi, X. Jin, A. Wang, and J. Feng, “All tokens matter: Token labeling for training better vision transformers,” in NeurIPS, 2021, https://arxiv.org/abs/2104.10858

Semantic Tokenization

Papers on semantic tokenization, such as identifying the part-of-speech of a word:

Tokenization of Non-English Languages

Various non-English double-byte languages cause extra difficulties in tokenization:

  • Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun, "Sub-Character Tokenization for Chinese Pretrained Language Models", Transactions of the Association for Computational Linguistics, vol.11, pp.469, 2023, https://arxiv.org/abs/2106.00400, Code: https://github.com/thunlp/SubCharTokenization
  • Jonathan J. Webster and Chunyu Kit. 1992. Tokenization as the initial phase in NLP. In COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics, https://dl.acm.org/doi/10.3115/992424.992434, DOI: https://doi.org/10.3115/992424.992434
  • N. Venkatesan, N. Arulanand, "Implications of Tokenizers in BERT Model for Low-Resource Indian Language", Journal of Soft Computing Paradigm, vol.4, no.4, pp.264, 2023, https://irojournals.com/jscp/article/view/4/4/5
  • Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik, "Impact of Tokenization on Language Models: An Analysis for Turkish", ACM Transactions on Asian and Low-Resource Language Information Processing, vol.22, no.4, pp.1, 2023, https://arxiv.org/abs/2204.08832
  • Sumalatha Bandari, Vishnu Vardhan Bulusu, BERT Tokenization and Hybrid-Optimized Deep Recurrent Neural Network for Hindi Document Summarization January 2022, International Journal of Fuzzy System Applications 11(1):1-28, DOI:10.4018/IJFSA.313601, http://dx.doi.org/10.4018/IJFSA.313601

More Research

Read more about: