Aussie AI
Vocabulary Trimming
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Vocabulary trimming in LLMs is an inference optimization that reduces the size of the token vocabulary. A smaller vocabulary shrinks the embedding and unembedding matrices (which each have one row or column per token), thereby reducing both the computation cost and the memory size of the model weights.
On the downside, vocabulary size reduction generally means that the same text may need to be expressed in more tokens. The token sequence length therefore increases for some input prompts, so inference cost along the sequence-length dimension gets worse even as the vocabulary dimension improves. Hence, there are important tradeoffs in this approach.
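The core mechanics can be sketched in a few lines: keep only a shortlist of token ids, slice the corresponding rows out of the embedding matrix, and remap old token ids to new ones. This is a minimal NumPy illustration, not any particular paper's method; the sizes and the shortlist below are purely hypothetical.

```python
import numpy as np

# Hypothetical full model: 50,000-token vocabulary, embedding dimension 512.
VOCAB_SIZE = 50_000
EMBED_DIM = 512
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

# Shortlist of token ids to keep (in practice chosen by frequency statistics
# over a corpus; these particular ids are illustrative only).
shortlist = np.array([0, 1, 2, 5, 9, 42, 100])

# Trim: keep only the embedding rows for shortlisted tokens,
# and build a mapping from old token ids to new (compact) ids.
trimmed_matrix = embedding_matrix[shortlist]
old_to_new = {int(old): new for new, old in enumerate(shortlist)}

print(embedding_matrix.shape)  # (50000, 512)
print(trimmed_matrix.shape)    # (7, 512)
```

Any tokenizer output must then be passed through `old_to_new` before lookup, and text containing tokens outside the shortlist must be re-tokenized into smaller pieces, which is exactly where the sequence-length cost of trimming comes from.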
Vocabulary trimming and lexical shortlisting have been used in Neural Machine Translation (NMT) for the translation of foreign languages. Much of this research predates LLM research, with many NMT techniques using other types of AI models rather than LLMs and Transformers. The use of vocabulary trimming in LLMs remains largely unexplored and is an area warranting further research.
Related areas of LLM inference optimization include:
- Embeddings
- Tokenization
- Vocabulary expansion
- Token pruning
- Embeddings pruning
- Shortlisting
- Funnel transformer
Research on Vocabulary Trimming
Research papers on reducing the size of an LLM vocabulary:
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, Nov 2023, Large Language Model Inference with Lexical Shortlisting, https://arxiv.org/abs/2311.09709 (Shortlisting the vocabulary to common words for reduced tokens and embedding matrix size.)
- Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438 https://dl.acm.org/doi/10.1145/3552326.3587438 PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf (Dynamic routing to small or large LLMs based on the query.)
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
- J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
- Nikolay Bogoychev, Pinzhen Chen, 21 Sep 2021 (v3), The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation, https://arxiv.org/abs/2101.00421 https://aclanthology.org/2021.insights-1.12/
- Sreeram Vennam, Anish Joishy, Ponnurangam Kumaraguru, 10 Nov 2024, LLM Vocabulary Compression for Low-Compute Environments, https://arxiv.org/abs/2411.06371
- Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni, 15 Feb 2024, Fast Vocabulary Transfer for Language Model Compression, https://arxiv.org/abs/2402.09977
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end)
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about: