Aussie AI

Embeddings Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

The first step of model inference in the Transformer architecture is to convert the input text into a sequence of numbers called tokens. However, these tokens are not used directly inside the model, because the next step of Transformer inference is to immediately convert this sequence of tokens into another internal representation called an "embedding". An embedding is a vector of numbers that represents the information in the token sequence within a high-dimensional vector space.

Note that the "embeddings" terminology is unrelated to "embedded" devices such as mobile phones or IoT edge devices. It's simply a different usage of the word.

The mapping from tokens to embeddings is learned during model training. The conversion of the token sequence into a sequence of embedding vectors is a single matrix multiplication by these learned embedding weights (implemented in practice as a per-token row lookup), with an additional step that simply adds "positional embeddings" in the original Transformer architecture. The embedding matrix can be quite large, especially if the token vocabulary is large. However, this multiplication occurs infrequently compared to the other weight matrices, so it is not a latency-critical operation. Nevertheless, the storage cost of a large embedding matrix can be significant.
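
As a concrete illustration, here is a minimal Python/NumPy sketch of this embedding step: a row lookup into the embedding matrix, followed by adding sinusoidal positional embeddings. The matrix sizes, random weights, and example token IDs are illustrative assumptions only, not taken from any particular model.

    import numpy as np

    # Illustrative sizes (assumptions, not from any specific model)
    vocab_size = 50000   # number of tokens in the vocabulary
    d_model = 512        # embedding dimension (internal model dimension)

    rng = np.random.default_rng(0)

    # Learned embedding matrix: one row per vocabulary token.
    embedding_matrix = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

    def sinusoidal_positions(seq_len, dim):
        """Classic sinusoidal positional embeddings (original Transformer style)."""
        positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
        div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (dim/2,)
        pe = np.zeros((seq_len, dim), dtype=np.float32)
        pe[:, 0::2] = np.sin(positions * div)
        pe[:, 1::2] = np.cos(positions * div)
        return pe

    def embed(token_ids):
        """Convert a sequence of token IDs into embedding vectors.

        The lookup is equivalent to multiplying a one-hot token vector by the
        embedding matrix, but engines implement it as a simple row lookup.
        """
        tok = embedding_matrix[token_ids]                    # (seq_len, d_model)
        pos = sinusoidal_positions(len(token_ids), d_model)
        return tok + pos                                     # positional embeddings are simply added

    hidden = embed([101, 2054, 2003, 102])  # made-up token IDs for a short input
    print(hidden.shape)                     # (4, 512)

In a real inference engine the embedding matrix holds trained weights loaded from the model file, but the arithmetic is the same.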


Embedding Optimization Research Papers

Embeddings don't receive a huge amount of attention in the research literature, because they aren't a bottleneck in inference. Most of the research on optimizing embeddings has focused on reducing the storage size of the embedding matrices for use on smaller devices or with smaller models, using matrix compression techniques such as sparsity or hashing.

  • Siyi Liu, Chen Gao, Yihong Chen, Depeng Jin, Yong Li, Learnable Embedding Sizes for Recommender Systems, arXiv preprint arXiv:2101.07577, Mar 2021, https://arxiv.org/abs/2101.07577, Code: https://github.com/ssui-liu/learnable-embed-sizes-for-RecSys
  • Kailash A. Hambarde, Hugo Proenca, Information Retrieval: Recent Advances and Beyond, arXiv preprint arXiv:2301.08801, Jan 2023, https://arxiv.org/abs/2301.08801
  • Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant, Learned Token Pruning in Contextualized Late Interaction over BERT (ColBERT), 2022, SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2022, Pages 2232–2236, https://doi.org/10.1145/3477495.3531835, https://dl.acm.org/doi/10.1145/3477495.3531835
  • Keshav Santhanam, Omar Khattab, Christopher Potts, Matei Zaharia, PLAID: An Efficient Engine for Late Interaction Retrieval, October 2022, CIKM '22: The 31st ACM International Conference on Information and Knowledge Management, DOI:10.1145/3511808.3557325, https://arxiv.org/abs/2205.09707
  • Liang Qu, Huaisheng Zhu, Ruiqi Zheng, Yuhui Shi, and Hongzhi Yin, 2021, Imgagn: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1390–1398, https://arxiv.org/abs/2106.02817
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, and Ed H. Chi, 2020, Learning Multi-Granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, In Companion Proceedings of the Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 562–566, https://doi.org/10.1145/3366424.3383416, https://arxiv.org/abs/2002.08530
  • Manas R. Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K. Adams, Pranav Khaitan, Jiahui Liu, and Quoc V. Le. 2020. Neural Input Search for Large Scale Recommendation Models (KDD ’20). Association for Computing Machinery, New York, NY, USA, 2387–2397. https://doi.org/10.1145/3394486.3403288, https://arxiv.org/abs/1907.04471
  • Ruiqi Zheng, Liang Qu, Bin Cui, Yuhui Shi, and Hongzhi Yin. 2022. AutoML for Deep Recommender Systems: A Survey. arXiv preprint arXiv:2203.13922 (2022), https://arxiv.org/abs/2203.13922
  • Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2020. Sparse, dense, and attentional representations for text retrieval. In Proceedings of TACL, https://arxiv.org/abs/2005.00181
  • Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proc. SIGIR. 39–48, https://arxiv.org/abs/2004.12832
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and J. Weston. 2020. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In Proc. ICLR. https://arxiv.org/abs/1905.01969
  • F Lyu, X Tang, H Zhu, H Guo, Y Zhang, 2022, OptEmbed: Learning Optimal Embedding Table for Click-through Rate Prediction, https://arxiv.org/abs/2208.04482
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018). https://arxiv.org/abs/1806.09055
  • Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. 2018. Efficient Query Processing for Scalable Web Search. Foundations and Trends in Information Retrieval 12, 4–5 (2018), 319–492, https://ieeexplore.ieee.org/document/8620666
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Ting Chen, Lichan Hong, Ed H. Chi, Learning to Embed Categorical Features without Embedding Tables for Recommendation, June 2021, https://arxiv.org/abs/2010.10784v2
  • Tong Chen, Hongzhi Yin, Yujia Zheng, Zi Huang, Yang Wang, and Meng Wang. 2021. Learning elastic embeddings for customizing on-device recommenders. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 138–147. https://arxiv.org/abs/2106.02223
  • Ting Chen, Lala Li, and Yizhou Sun. 2019. Differentiable product quantization for end-to-end embedding compression. arXiv preprint arXiv:1908.09756 (2019). https://arxiv.org/abs/1908.09756
  • Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. CoRR, abs/2104.08821, 2021, https://arxiv.org/abs/2104.08821
  • Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In NAACL-HLT (1), pp. 3597–3608. Association for Computational Linguistics, 2019, https://arxiv.org/abs/1810.08854
  • Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach. In CIKM, pp. 1977–1980. ACM, 2016, https://arxiv.org/abs/1506.04488
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL-HLT (2), pp. 529–535. Association for Computational Linguistics, 2018, https://arxiv.org/abs/1804.06323
  • Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. Cross-lingual models of word embeddings: An empirical comparison. In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1604.00425
  • Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What You Can Cram into A Single $&!#∗ Vector: Probing Sentence Embeddings for Linguistic Properties. ACL. https://aclanthology.org/P18-1198/
  • Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer
  • Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. https://arxiv.org/abs/2005.14187
  • Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, Rene Bidart, Ph.D. thesis, 2023, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
  • A. Chaulwar et al., "Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices", arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
  • Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013 https://arxiv.org/abs/1301.3781 Code: https://code.google.com/p/word2vec/ (This is the word2vec algorithm.)
  • Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant, Dec 2022, Analyzing Transformers in Embedding Space, https://arxiv.org/pdf/2209.02535.pdf, Code: https://github.com/guyd1995/embedding-space (Maps backward from model parameters and computations in the "embedding space" and then projects this back to tokens.)
  • Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang, Oct 2022, Language Models are Universal Embedders, https://arxiv.org/abs/2310.08232
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
  • Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey, 30 Jan 2023 (v3), Embedding Recycling for Language Models, https://arxiv.org/abs/2207.04993
  • Hackerllama, January 7, 2024, Sentence Embeddings. Introduction to Sentence Embeddings, https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
  • Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, Dec 2023, Improving Text Embeddings with Large Language Models https://arxiv.org/abs/2401.00368
  • Ofir Press, Lior Wolf, Feb 2017, Using the Output Embedding to Improve Language Models, https://arxiv.org/abs/1608.05859
  • Yuhong Zhang, Shilai Yang, Gert Cauwenberghs, Tzyy-Ping Jung, 28 Jan 2024, From Word Embedding to Reading Embedding Using Large Language Model, EEG and Eye-tracking, https://arxiv.org/abs/2401.15681
  • Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach, 27 Jun 2024, T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings, https://arxiv.org/abs/2406.19223
  • Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
  • Emilia David, November 8, 2024, Multimodal RAG is growing, here’s the best way to get started, https://venturebeat.com/ai/multimodal-rag-is-growing-heres-the-best-way-to-get-started/

Embedding Size Optimization (NAS)

A conceptually simple way to reduce embedding size is to choose a smaller embedding dimension as a model "hyper-parameter" before training. Optimizing this number is a sub-problem of "neural architecture search" (NAS), also called "hyper-parameter optimization" (HPO), and there are some research papers on this embedding-specific NAS problem.
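
At its simplest, this search is just a sweep over candidate embedding sizes. The sketch below is a hypothetical illustration: train_and_evaluate() is a stand-in for your own training loop that returns a validation loss, and the fake scoring formula is there only to make the example runnable.

    def train_and_evaluate(embedding_dim):
        """Hypothetical: train a small model with this embedding size and
        return its validation loss. Here a fake curve stands in for real training."""
        return abs(embedding_dim - 384) / 384.0 + 0.1 * (embedding_dim / 1024.0)

    candidate_sizes = [128, 256, 384, 512, 768, 1024]
    results = {d: train_and_evaluate(d) for d in candidate_sizes}
    best_dim = min(results, key=results.get)
    print(f"Best embedding size: {best_dim} (validation loss {results[best_dim]:.3f})")

More sophisticated NAS methods replace this brute-force sweep with differentiable or evolutionary search, as in several of the papers listed above.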

Embedding Matrix Compression (Embedding Pruning)

These papers are specifically about reducing the storage cost of large embedding matrices. Techniques include hashing the vectors and pruning the embeddings to create sparsity. Vocabulary size is also closely related to embedding size (see tokenization and vocabulary research).
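
As a simple illustration of the hashing idea, the sketch below maps a large token vocabulary onto a much smaller shared embedding table by hashing token IDs into bucket indices. This is a generic "hashing trick" sketch with assumed sizes, not the method of any particular paper listed here.

    import numpy as np

    vocab_size = 50000    # large vocabulary (illustrative assumption)
    num_buckets = 4096    # much smaller shared embedding table
    d_model = 512

    rng = np.random.default_rng(0)
    compressed_table = rng.standard_normal((num_buckets, d_model)).astype(np.float32)

    def hashed_embedding(token_id):
        """Look up an embedding from a hashed (shared) table.

        Many token IDs collide into the same bucket, trading some accuracy
        for an embedding matrix roughly 12x smaller in this example."""
        bucket = (token_id * 2654435761) % num_buckets  # simple multiplicative hash
        return compressed_table[bucket]

    print(hashed_embedding(40123).shape)  # (512,)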

Embedding Low-Rank Matrix Factorization

Low-rank matrix factorization, or decomposition, can be applied to the embedding matrix as an optimization. This is a specific subtype of embedding matrix compression.
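
For example, a (vocabulary size x embedding dimension) matrix can be replaced by two much thinner matrices. The sketch below uses a truncated SVD on a random matrix with assumed, illustrative sizes; a real implementation would factorize the trained embedding weights and typically fine-tune afterwards.

    import numpy as np

    vocab_size, d_model, rank = 10000, 512, 64   # illustrative sizes
    rng = np.random.default_rng(0)
    E = rng.standard_normal((vocab_size, d_model)).astype(np.float32)  # stand-in for trained embeddings

    # Truncated SVD: E is approximated by A @ B, with A (vocab_size, rank) and B (rank, d_model).
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank, :]

    print(f"Parameters: {E.size} -> {A.size + B.size} "
          f"({(A.size + B.size) / E.size:.1%} of original)")

    # A token's approximate embedding is one row of A multiplied by B.
    approx_embedding = A[1234] @ B   # shape (d_model,)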

Research papers on low-rank factorization of the embedding matrix:

  • Luke McDermott, 23 May 2024, Embedding Compression for Efficient Re-Identification, https://arxiv.org/abs/2405.14730
  • Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)

Unembedding Matrix (Output Embeddings)

The "unembedding" phase of a Transformer is where the output of the model, still in embedding format, is converted back to tokens, as logits that give token probabilities. This means that each output embedding has to be mapped back onto the token vocabulary, which is the reverse of the initial embedding. This is more properly called the "output embedding", but I think the name "unembedding" is clearer.

The output phase uses an "unembedding matrix", which is usually either the transpose of the embedding matrix (weight tying) or a separately learned output projection. There's not a great deal of attention paid to unembeddings in the research literature.
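
Here is a minimal sketch of that output step, assuming the weight-tied case where the unembedding matrix is just the transpose of the embedding matrix; the sizes and random weights are illustrative assumptions.

    import numpy as np

    vocab_size, d_model = 50000, 512   # illustrative sizes
    rng = np.random.default_rng(0)

    embedding_matrix = rng.standard_normal((vocab_size, d_model)).astype(np.float32)
    unembedding_matrix = embedding_matrix.T   # weight tying: reuse the transpose

    def output_probs(hidden_state):
        """Map one final hidden state (an embedding vector) back to token probabilities."""
        logits = hidden_state @ unembedding_matrix   # (vocab_size,)
        logits -= logits.max()                       # numerical stability for softmax
        probs = np.exp(logits)
        return probs / probs.sum()

    hidden_state = rng.standard_normal(d_model).astype(np.float32)
    probs = output_probs(hidden_state)
    print(probs.shape, probs.argmax())   # (50000,) and the most likely token ID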

Embedding Pruning

The idea of pruning can be applied to (a) the embedding matrix, or (b) the embedding vectors themselves, which have length equal to the internal model dimension, by dynamically pruning those vectors at runtime. One particular implementation of dynamic embedding pruning is the Funnel Transformer from 2020.

It should be noted that dynamic embedding pruning, including the Funnel Transformer, has much overlap with dynamic activation sparsification, since the activations flowing through a Transformer are vectors in the embedding space. See also: activation sparsity research.
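
As a generic illustration of dynamically pruning an embedding vector (not the Funnel Transformer algorithm itself, which instead pools along the sequence dimension), the sketch below zeroes out the smallest-magnitude components of a vector at runtime, which is essentially magnitude-based activation sparsification applied in the embedding space.

    import numpy as np

    def sparsify_embedding(vec, keep_fraction=0.25):
        """Keep only the largest-magnitude components of an embedding vector,
        zeroing the rest (dynamic magnitude-based sparsification)."""
        k = max(1, int(len(vec) * keep_fraction))
        threshold = np.partition(np.abs(vec), -k)[-k]   # k-th largest magnitude
        return vec * (np.abs(vec) >= threshold)

    rng = np.random.default_rng(0)
    vec = rng.standard_normal(512).astype(np.float32)
    sparse_vec = sparsify_embedding(vec)
    print(np.count_nonzero(sparse_vec), "of", len(vec), "components kept")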

Research papers on embedding pruning:
