Aussie AI
Vocabulary Expansion
Last Updated 7 December, 2024
by David Spuler, Ph.D.
Vocabulary expansion, or vocabulary extension, is increasing the size of the LLM's vocabulary. The model then has more distinct tokens available, so each individual token can encode a more specific state or output.
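For illustration, here is a minimal sketch of the mechanics, assuming the Hugging Face transformers library and GPT-2 as the model (neither is specified in this article): new tokens are added to the tokenizer, and the model's token embedding matrix is resized to match.

```python
# Minimal sketch of vocabulary expansion, assuming the Hugging Face
# "transformers" library and GPT-2 as an illustrative model choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print("Original vocabulary size:", len(tokenizer))  # 50257 for GPT-2

# Add hypothetical new tokens (e.g., domain terms or foreign-language words).
tokenizer.add_tokens(["vocabulary_expansion", "token_merging"])

# Grow the embedding matrix so each new token gets its own embedding row;
# the new rows start untrained and would normally be learned via fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print("Expanded vocabulary size:", len(tokenizer))
```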
LLM vocabulary expansion can be performed for increased accuracy, such as for foreign languages with a much greater range of words and symbols (e.g., Unicode and double-byte character set (DBCS) languages). An increased LLM vocabulary can also be used for improved efficiency, because input sequences can often be encoded in fewer tokens, which makes the technique similar to token merging.
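The efficiency effect can be seen by comparing token counts before and after expansion. The sketch below (same assumed transformers tokenizer as above, with one hypothetical long word) shows the same text encoding into fewer tokens once the whole word is a single vocabulary entry.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "antidisestablishmentarianism"

# Before expansion, the rare word is split into several subword tokens.
print(tokenizer.tokenize(text))       # multiple subword pieces
print(len(tokenizer.tokenize(text)))

# After adding the whole word as one token, the same text encodes as
# a single token, so the input sequence becomes shorter.
tokenizer.add_tokens([text])
print(tokenizer.tokenize(text))       # ['antidisestablishmentarianism']
print(len(tokenizer.tokenize(text)))  # 1
```

The trade-off is a larger embedding matrix and output softmax layer, which is the cost-versus-benefit question examined by the vocabulary scaling-law paper cited below.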
Most of the research on vocabulary expansion relates to foreign language translation, via the research area of Neural Machine Translation (NMT). Much of this work predates the LLM literature and often uses non-LLM types of AI models, so there is a need for more research on vocabulary extension with LLMs.
Related areas of LLM inference optimization include:
- Embeddings
- Tokenization
- Vocabulary trimming
- Token pruning
- Embeddings pruning
- Shortlisting
- Funnel transformer
Research on Vocabulary Expansion
Research papers on increasing the size of the LLM vocabulary in tokenization:
- J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111, August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
- Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras, 16 Sep 2024 (v2), How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? https://arxiv.org/abs/2406.11477
- HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, Huda Khayrallah, 12 Oct 2024, Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? https://arxiv.org/abs/2410.09644
- Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 1 Nov 2024 (v3), Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies, https://arxiv.org/abs/2407.13623 https://github.com/sail-sg/scaling-with-vocab https://hf.co/spaces/sail/scaling-with-vocab-demo