Aussie AI
Vocabulary Trimming
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Vocabulary trimming in LLMs is an inference optimization that reduces the size of the token vocabulary. A smaller vocabulary shrinks the embedding and unembedding matrices (which each have one row or column per token), thereby reducing both the computation cost and the memory size of the model weights.
On the downside, vocabulary size reduction generally means that the same text may need to be expressed in more tokens. The token sequence length therefore increases for some input prompts, so inference cost along the sequence-length dimension gets worse even as the vocabulary dimension improves. Hence, there are important tradeoffs in this approach.
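The core mechanics can be sketched in a few lines: keep only a shortlist of token ids, slice the corresponding rows out of the embedding matrix, and remap old token ids to new ones. This is a minimal NumPy illustration, not any particular paper's method; the sizes and the shortlist below are purely hypothetical.

```python
import numpy as np

# Hypothetical full model: 50,000-token vocabulary, embedding dimension 512.
VOCAB_SIZE = 50_000
EMBED_DIM = 512
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

# Shortlist of token ids to keep (in practice chosen by frequency statistics
# over a corpus; these particular ids are illustrative only).
shortlist = np.array([0, 1, 2, 5, 9, 42, 100])

# Trim: keep only the embedding rows for shortlisted tokens,
# and build a mapping from old token ids to new (compact) ids.
trimmed_matrix = embedding_matrix[shortlist]
old_to_new = {int(old): new for new, old in enumerate(shortlist)}

print(embedding_matrix.shape)  # (50000, 512)
print(trimmed_matrix.shape)    # (7, 512)
```

Any tokenizer output must then be passed through `old_to_new` before lookup, and text containing tokens outside the shortlist must be re-tokenized into smaller pieces, which is exactly where the sequence-length cost of trimming comes from.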
Vocabulary trimming and lexical shortlisting have been used in Neural Machine Translation (NMT) for the translation of foreign languages. Much of this research predates LLM research, with many NMT techniques using other types of AI models rather than LLMs and Transformers. The use of vocabulary trimming in LLMs remains largely unexplored and is an area warranting further research.
Related areas of LLM inference optimization include:
- Embeddings
- Tokenization
- Vocabulary expansion
- Token pruning
- Embeddings pruning
- Shortlisting
- Funnel transformer
Research on Vocabulary Trimming
Research papers on reducing the size of an LLM vocabulary:
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, Nov 2023, Large Language Model Inference with Lexical Shortlisting, https://arxiv.org/abs/2311.09709 (Shortlisting the vocabulary to common words for reduced tokens and embedding matrix size.)
- Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438 https://dl.acm.org/doi/10.1145/3552326.3587438 PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf (Dynamic routing to small or large LLMs based on the query.)
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
- J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
- Nikolay Bogoychev, Pinzhen Chen, 21 Sep 2021 (v3), The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation, https://arxiv.org/abs/2101.00421 https://aclanthology.org/2021.insights-1.12/
- Sreeram Vennam, Anish Joishy, Ponnurangam Kumaraguru, 10 Nov 2024, LLM Vocabulary Compression for Low-Compute Environments, https://arxiv.org/abs/2411.06371
- Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni, 15 Feb 2024, Fast Vocabulary Transfer for Language Model Compression, https://arxiv.org/abs/2402.09977
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end)
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about: