Aussie AI
Embeddings Research
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
The first step of model inference in the Transformer architecture is to convert the input text into a sequence of numbers called tokens. However, these tokens are not used directly inside the model, because the next step of Transformer inference is to immediately convert this sequence of tokens into another internal representation called an "embedding". An embedding is a vector of numbers, one per token, that represents the information in the token sequence in very complex ways.
Note that the "embeddings" terminology is unrelated to "embedded" devices such as mobile phones or IoT edge devices. It's simply a different usage of the word.
The mapping from tokens to embeddings is learned during model training. The conversion of a token sequence into embedding vectors is a single matrix multiplication with these learned embedding weights (equivalent to a row lookup for each token), plus an additional step that adds "positional embeddings" (which are simply added element-wise in the original Transformer architecture). The embedding matrix can be quite large, especially if the token vocabulary size is large. However, this multiplication occurs infrequently compared to the other weight matrices, so it is not a latency-critical operation. Nevertheless, the storage cost of a large embedding matrix can be significant.
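As a rough illustration, here is a minimal sketch in Python/NumPy of the token-to-embedding step; the vocabulary size, embedding dimension, and random weights are hypothetical stand-ins (a real model uses trained weights), and the positional embeddings shown are the fixed sinusoidal variant from the original Transformer paper.

```python
import numpy as np

# Hypothetical sizes; real LLMs are much larger.
vocab_size = 50000   # number of tokens in the vocabulary
embed_dim = 1024     # internal model dimension (embedding size)

# The embedding matrix is learned during training; random here for illustration.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

def positional_embeddings(seq_len: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal positional embeddings (original Transformer style)."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (dim/2,)
    pe = np.zeros((seq_len, dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)
    return pe

def embed_tokens(token_ids: list) -> np.ndarray:
    """Convert a sequence of token IDs into embedding vectors plus positions."""
    token_embeds = embedding_matrix[token_ids]   # row lookup: (seq_len, embed_dim)
    return token_embeds + positional_embeddings(len(token_ids), embed_dim)

tokens = [101, 2023, 2003, 1037, 7953]   # example token IDs from a tokenizer
x = embed_tokens(tokens)
print(x.shape)   # (5, 1024)
```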
Related areas of LLM inference optimization include:
- Tokenization
- Vocabulary expansion
- Vocabulary trimming
- Token pruning
- Embeddings pruning
- Shortlisting
- Funnel transformer
Embedding Optimization Research Papers
The embeddings don't receive a huge amount of research attention in the literature, because they aren't a bottleneck in inference. Most of the research on optimizing embeddings has been on reducing the storage of the embedding matrices for use on smaller devices or with smaller models, using matrix compression techniques such as sparsity or hashing.
- Siyi Liu, Chen Gao, Yihong Chen, Depeng Jin, Yong Li, Learnable Embedding Sizes for Recommender Systems, arXiv preprint arXiv:2101.07577, Mar 2021, https://arxiv.org/abs/2101.07577, Code: https://github.com/ssui-liu/learnable-embed-sizes-for-RecSys
- Kailash A. Hambarde, Hugo Proenca, Information Retrieval: Recent Advances and Beyond, arXiv preprint arXiv:2301.08801, Jan 2023, https://arxiv.org/abs/2301.08801
- Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant, Learned Token Pruning in Contextualized Late Interaction over BERT (ColBERT), 2022, SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2022, Pages 2232–2236, https://doi.org/10.1145/3477495.3531835, https://dl.acm.org/doi/10.1145/3477495.3531835
- Keshav Santhanam, Omar Khattab, Christopher Potts, Matei Zaharia, PLAID: An Efficient Engine for Late Interaction Retrieval, October 2022, CIKM '22: The 31st ACM International Conference on Information and Knowledge Management, DOI:10.1145/3511808.3557325, https://arxiv.org/abs/2205.09707
- Liang Qu, Huaisheng Zhu, Ruiqi Zheng, Yuhui Shi, and Hongzhi Yin, 2021, Imgagn: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1390–1398, https://arxiv.org/abs/2106.02817
- Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, and Ed H. Chi, 2020, Learning Multi-Granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, In Companion Proceedings of the Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 562–566, https://doi.org/10.1145/3366424.3383416, https://arxiv.org/abs/2002.08530
- Manas R. Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K. Adams, Pranav Khaitan, Jiahui Liu, and Quoc V. Le. 2020. Neural Input Search for Large Scale Recommendation Models (KDD ’20). Association for Computing Machinery, New York, NY, USA, 2387–2397. https://doi.org/10.1145/3394486.3403288, https://arxiv.org/abs/1907.04471
- Ruiqi Zheng, Liang Qu, Bin Cui, Yuhui Shi, and Hongzhi Yin. 2022. AutoML for Deep Recommender Systems: A Survey. arXiv preprint arXiv:2203.13922 (2022), https://arxiv.org/abs/2203.13922
- Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2020. Sparse, dense, and attentional representations for text retrieval. In Proceedings of TACL, https://arxiv.org/abs/2005.00181
- Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proc. SIGIR. 39–48, https://arxiv.org/abs/2004.12832
- Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and J. Weston. 2020. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In Proc. ICLR. https://arxiv.org/abs/1905.01969
- F Lyu, X Tang, H Zhu, H Guo, Y Zhang, 2022, OptEmbed: Learning Optimal Embedding Table for Click-through Rate Prediction, https://arxiv.org/abs/2208.04482
- Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018). https://arxiv.org/abs/1806.09055
- Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. 2018. Efficient Query Processing for Scalable Web Search. Foundations and Trends in Information Retrieval 12, 4–5 (2018), 319–492, https://ieeexplore.ieee.org/document/8620666
- Wang-Cheng Kang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Ting Chen, Lichan Hong, Ed H. Chi, Learning to Embed Categorical Features without Embedding Tables for Recommendation, June 2021, https://arxiv.org/abs/2010.10784v2
- Tong Chen, Hongzhi Yin, Yujia Zheng, Zi Huang, Yang Wang, and Meng Wang. 2021. Learning elastic embeddings for customizing on-device recommenders. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 138–147. https://arxiv.org/abs/2106.02223
- Ting Chen, Lala Li, and Yizhou Sun. 2019. Differentiable product quantization for end-to-end embedding compression. arXiv preprint arXiv:1908.09756 (2019). https://arxiv.org/abs/1908.09756
- Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. CoRR, abs/2104.08821, 2021, https://arxiv.org/abs/2104.08821
- Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In NAACL-HLT (1), pp. 3597–3608. Association for Computational Linguistics, 2019, https://arxiv.org/abs/1810.08854
- Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach. In CIKM, pp. 1977–1980. ACM, 2016, https://arxiv.org/abs/1506.04488
- Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL-HLT (2), pp. 529–535. Association for Computational Linguistics, 2018, https://arxiv.org/abs/1804.06323
- Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. Cross-lingual models of word embeddings: An empirical comparison. In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1604.00425
- Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What You Can Cram into A Single $&!#∗ Vector: Probing Sentence Embeddings for Linguistic Properties. ACL. https://aclanthology.org/P18-1198/
- Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer
- Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. https://arxiv.org/abs/2005.14187
- Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, Rene Bidart, Ph.D. thesis, 2023, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
- A. Chaulwar et al., "Extreme compression of sentence-transformer ranker models: Faster inference, longer battery life, and less storage on edge devices", arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
- Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013 https://arxiv.org/abs/1301.3781 Code: https://code.google.com/p/word2vec/ (This is the word2vec algorithm.)
- Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant, Dec 2022, Analyzing Transformers in Embedding Space, https://arxiv.org/pdf/2209.02535.pdf, Code: https://github.com/guyd1995/embedding-space (Maps backward from model parameters and computations in the "embedding space" and then projects this back to tokens.)
- Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang, Oct 2023, Language Models are Universal Embedders, https://arxiv.org/abs/2310.08232
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
- Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey, 30 Jan 2023 (v3), Embedding Recycling for Language Models, https://arxiv.org/abs/2207.04993
- Hackerllama, January 7, 2024, Sentence Embeddings. Introduction to Sentence Embeddings, https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
- Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, Dec 2023, Improving Text Embeddings with Large Language Models https://arxiv.org/abs/2401.00368
- Ofir Press, Lior Wolf, Feb 2017, Using the Output Embedding to Improve Language Models, https://arxiv.org/abs/1608.05859
- Yuhong Zhang, Shilai Yang, Gert Cauwenberghs, Tzyy-Ping Jung, 28 Jan 2024, From Word Embedding to Reading Embedding Using Large Language Model, EEG and Eye-tracking, https://arxiv.org/abs/2401.15681
- Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach, 27 Jun 2024, T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings, https://arxiv.org/abs/2406.19223
- Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
- Emilia David, November 8, 2024, Multimodal RAG is growing, here’s the best way to get started, https://venturebeat.com/ai/multimodal-rag-is-growing-heres-the-best-way-to-get-started/
Embedding Size Optimization (NAS)
A conceptually simple way to reduce embedding size is to choose a smaller embedding dimension as a model hyper-parameter. The size of the embedding is a model "hyper-parameter" that is chosen before training, and optimizing this number is a sub-problem of "neural architecture search" (NAS), also called "hyper-parameter optimization" (HPO).
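As a rough illustration of treating the embedding size as a searchable hyper-parameter, here is a minimal grid-search sketch in Python; the candidate sizes, the tolerance, and the stand-in train_and_evaluate function are illustrative assumptions rather than any specific paper's method.

```python
import math

def train_and_evaluate(embed_dim: int) -> float:
    """Hypothetical stand-in: a real search would train a model with this
    embedding size and return its validation accuracy. This dummy saturating
    curve exists only so the sketch runs end-to-end."""
    return 0.95 - 0.5 / math.sqrt(embed_dim)

def search_embedding_size(candidates=(128, 256, 512, 768, 1024), tolerance=0.01):
    """Pick the smallest embedding size whose score is within `tolerance`
    of the best candidate's score."""
    scores = {dim: train_and_evaluate(dim) for dim in candidates}
    best = max(scores.values())
    for dim in sorted(candidates):
        if scores[dim] >= best - tolerance:
            return dim, scores
    return max(candidates), scores

chosen, scores = search_embedding_size()
print(chosen)   # 512 with this dummy accuracy curve
```

Research papers on the embedding-specific NAS problem: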
- Haochen Liu, Xiangyu Zhao, Chong Wang, Xiaobing Liu, and Jiliang Tang. 2020. Automated embedding size search in deep recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2307–2316, https://dl.acm.org/doi/abs/10.1145/3397271.3401436
- L Qu, Y Ye, N Tang, L Zhang, Y Shi, H Yin, Single-shot embedding dimension search in recommender system, 2022, https://dl.acm.org/doi/abs/10.1145/3477495.3532060, https://arxiv.org/abs/2204.03281
- Xiangyu Zhao, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and Jiliang Tang, 2020, AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations, CoRR abs/2002.11252 (2020). arXiv:2002.11252, https://arxiv.org/abs/2002.11252
- Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems. 887–898, https://arxiv.org/abs/1812.04224
- Maxim Naumov. 2019. On the Dimensionality of Embeddings for Sparse Features and Data. arXiv preprint arXiv:1901.02103 (2019), https://arxiv.org/abs/1901.02103
Embedding Matrix Compression (Embedding Pruning)
These papers are specifically about reducing the storage cost of large embedding matrices. Techniques include hashing vectors and pruning embeddings to create sparsity. Vocabulary size is also closely related to embedding matrix size (see tokenization and vocabulary research).
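As a rough illustration of one such technique, here is a minimal Python/NumPy sketch of the "hashing trick" for embedding tables, where many tokens share rows of a much smaller table at the cost of some collisions; the sizes and the simple multiplicative hash are illustrative assumptions, not any particular paper's method.

```python
import numpy as np

vocab_size = 50000     # full vocabulary size
num_buckets = 8192     # compressed table has ~6x fewer rows than the vocabulary
embed_dim = 512

rng = np.random.default_rng(0)
# A full table would be (vocab_size, embed_dim); the hashed table is much smaller.
hashed_table = rng.normal(size=(num_buckets, embed_dim)).astype(np.float32)

def hashed_embedding(token_id: int) -> np.ndarray:
    """Map a token ID to a row of the smaller table via a simple hash.
    Distinct tokens may collide and share the same embedding row."""
    bucket = (token_id * 2654435761) % num_buckets   # multiplicative hash
    return hashed_table[bucket]

print(hashed_embedding(12345).shape)   # (512,)
```

Research papers on embedding matrix compression: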
- Daochen Zha, Louis Feng, Bhargav Bhushanam, Dhruv Choudhary, Jade Nie, Yuandong Tian, Jay Chae, Yinbin Ma, Arun Kejariwal, Xia Hu, 2022, AutoShard: Automated Embedding Table Sharding for Recommender Systems, https://dl.acm.org/doi/abs/10.1145/3534678.3539034, https://arxiv.org/abs/2208.06399
- A Desai, L Chou, A Shrivastava, Conference on Machine Learning and Systems, 2022, Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems, https://arxiv.org/abs/2108.02191
- Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. 2020. Memory-efficient embedding for recommendations. arXiv preprint arXiv:2006.14827 (2020), https://arxiv.org/abs/2006.14827
- Nicola Tonellotto, Craig Macdonald, Query Embedding Pruning for Dense Retrieval, CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia, https://arxiv.org/abs/2108.10341
- IamAdiSri, Pruning a model embedding matrix for memory efficiency, April 2021, Hugging Face discussion board, https://discuss.huggingface.co/t/pruning-a-model-embedding-matrix-for-memory-efficiency/5502/7
- Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. In ICLR (Poster). OpenReview.net, 2018, https://arxiv.org/abs/1711.01068
- Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. Compositional embeddings using complementary partitions for memory-efficient recommendation systems. In KDD, pp. 165-175. ACM, 2020, https://arxiv.org/abs/1909.02107
- Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized Embedding Layers for Efficient Model Compression. arXiv preprint arXiv:1901.10787 (2019), updated Feb 2020, https://arxiv.org/abs/1901.10787v1
- Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. Compressing neural language models by sparse word representations. In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1610.03950 (Sparse matrix via common and rare word embeddings)
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper with section on "Compact Embeddings".)
- Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving. In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
- Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing. In IJCAI. 2046–2052, https://dl.acm.org/doi/10.5555/3060832.3060907
- Aliakbar Panahi, Seyran Saeedi, and Tom Arodz. 2019. word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement. In ICLR. https://arxiv.org/abs/1911.04975
- Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019, https://arxiv.org/abs/1810.11921, Code: https://github.com/DeepGraphLearning/RecommenderSystems
- Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810, 2019 (preprint revised Feb 2021), https://arxiv.org/abs/1909.11810
- Xiaorui Wu, Hong Xu, Honglin Zhang, Huaming Chen, and Jian Wang. Saec: Similarity-aware embedding compression in recommendation systems. CoRR, abs/1903.00103, 2019, https://arxiv.org/abs/1903.00103
- Martin Andrews. Compressing word embeddings. CoRR, abs/1511.06397, 2015 (revised May 2016), https://arxiv.org/abs/1511.06397v2
- Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach. In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
- Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
- Maximilian Lam. Word2bits - quantized word vectors. CoRR, abs/1803.05651, 2018, https://arxiv.org/abs/1803.05651 (Quantization ideas leads to compression of word vectors and embeddings.)
- Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. Sparse overcomplete word vector representations. In ACL (1), pp. 1491–1500. The Association for Computer Linguistics, 2015. https://arxiv.org/abs/1506.02004 (Binary quantization in relation to word vector embeddings.)
- Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In ICLR, 2019, https://arxiv.org/abs/1809.10853 (Faster training with adaptive embeddings size.)
- Niketan Pansare, Jay Katukuri, Aditya Arora, Frank Cipollone, Riyaaz Shaik, Noyan Tokgozoglu, Chandru Venkataraman, 2022, Learning Compressed Embeddings for On-Device Inference, Part of Proceedings of Machine Learning and Systems 4 (MLSys 2022), https://proceedings.mlsys.org/paper_files/paper/2022/hash/72988287eb4acead9fe584bff6c488c5-Abstract.html
- Shiwei Li, Huifeng Guo, Xing Tang, Ruiming Tang, Lu Hou, Ruixuan Li, and Rui Zhang. 2024. Embedding Compression in Recommender Systems: A Survey. ACM Comput. Surv. 56, 5, Article 130 (May 2024), 21 pages. https://doi.org/10.1145/3637841 https://dl.acm.org/doi/abs/10.1145/3637841 https://www.ruizhang.info/publications/CSUR%202024%20Embedding%20Compression%20in%20Recommender%20Systems_final.pdf
- Luke McDermott, 23 May 2024, Embedding Compression for Efficient Re-Identification, https://arxiv.org/abs/2405.14730
- Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers https://arxiv.org/abs/2410.17957
- Sreeram Vennam, Anish Joishy, Ponnurangam Kumaraguru, 10 Nov 2024, LLM Vocabulary Compression for Low-Compute Environments, https://arxiv.org/abs/2411.06371
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 30 Oct 2024 (v2), Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952 https://github.com/MatteoNulli/Vocabulary_pruning
Embedding Low-Rank Matrix Factorization
Low-rank matrix factorization, or decomposition, can be applied to the embedding matrix, approximating the large matrix as the product of two much smaller matrices. This is a specific subtype of embedding matrix compression.
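As a rough sketch of this idea, the following Python/NumPy example factors an embedding matrix into two smaller matrices with a truncated SVD; the sizes and rank are illustrative assumptions, and the random matrix here only demonstrates the shapes and storage savings (a real learned embedding matrix has structure that a low-rank factorization can exploit).

```python
import numpy as np

vocab_size, embed_dim, rank = 5000, 256, 32
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)   # embedding table

# Truncated SVD: E is approximated by A @ B, with A (vocab_size, rank) and B (rank, embed_dim).
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :rank] * S[:rank]   # scale columns by the top singular values
B = Vt[:rank, :]

# Storage drops from vocab_size*embed_dim to rank*(vocab_size + embed_dim) values.
print((rank * (vocab_size + embed_dim)) / (vocab_size * embed_dim))   # ~0.13

def embed_token(token_id: int) -> np.ndarray:
    """Reconstruct one token's (approximate) embedding row from the factors."""
    return A[token_id] @ B
```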
Research papers on low-rank factorization of the embedding matrix:
- Luke McDermott, 23 May 2024, Embedding Compression for Efficient Re-Identification, https://arxiv.org/abs/2405.14730
- Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
- Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
Unembedding Matrix (Output Embeddings)
The "unembedding" phase of a Transformer is where the output of the model in embedding format is converted back to tokens, as logits with probabilities. This means that each embedding has to map back to the tokens, which is the reverse of the initial embedding. This is more properly called the "output embedding" but I think the name "unembedding" is clearer.
The output phase uses an "unembedding matrix", which is often simply the transpose of the embedding matrix (when the input and output embeddings are "tied"), or else a separately learned output projection. There's not a great deal of attention to unembeddings in the research literature.
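As a rough sketch of the unembedding step, the following Python/NumPy example projects a final hidden state back onto the vocabulary to get logits, then applies softmax to obtain token probabilities; it assumes tied weights (the transpose of the embedding matrix) and illustrative sizes.

```python
import numpy as np

vocab_size, embed_dim = 50000, 1024
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

def unembed(hidden_state: np.ndarray) -> np.ndarray:
    """Map one final hidden state (an embedding-sized vector) to a
    probability distribution over the whole vocabulary."""
    logits = embedding_matrix @ hidden_state   # tied unembedding: shape (vocab_size,)
    logits = logits - logits.max()             # numerical stability for softmax
    probs = np.exp(logits)
    return probs / probs.sum()

h = rng.normal(size=(embed_dim,)).astype(np.float32)
p = unembed(h)
print(p.shape, round(float(p.sum()), 3))   # (50000,) 1.0
```

Related research papers: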
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà, 2 May 2024 (v2), A Primer on the Inner Workings of Transformer-based Language Models, https://arxiv.org/pdf/2405.00208 (Analyzes the theory of the Transformer architecture, including an interesting separation of the effects of attention versus FFNs on logits to give attributions.)
- NickyP, 14th Feb 2023, LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space, https://www.lesswrong.com/posts/pHPmMGEMYefk9jLeh/llm-basics-embedding-spaces-transformer-token-vectors-are
- Mansi Sakarvadia, Arham Khan, Aswathy Ajith, Daniel Grzenda, Nathaniel Hudson, André Bauer, Kyle Chard, Ian Foster, 25 Oct 2023, Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism, https://arxiv.org/abs/2310.16270
- Ben Levinstein, Feb 06, 2023, Mechanics of Training LLMs: Part III of A Conceptual Guide to Transformers, https://benlevinstein.substack.com/p/a-conceptual-guide-to-transformers-024
- Rhys Gould, Euan Ong, George Ogden, Arthur Conmy, 14 Dec 2023, Successor Heads: Recurring, Interpretable Attention Heads In The Wild, https://arxiv.org/abs/2312.09230
- Mansi Sakarvadia, Aswathy Ajith, Arham Khan, Daniel Grzenda, Nathaniel Hudson, André Bauer, Kyle Chard, Ian Foster, 28 Feb 2024 (v3), Memory Injections: Correcting Multi-Hop Reasoning Failures during Inference in Transformer-Based Language Models, https://arxiv.org/abs/2309.05605
- Hanzhang Zhou, Zijian Feng, Zixiao Zhu, Junlang Qian, Kezhi Mao, 31 May 2024, UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation, https://arxiv.org/abs/2405.20612
- Fangcong Yin, Xi Ye, Greg Durrett, 3 Jun 2024, LoFiT: Localized Fine-tuning on LLM Representations, https://arxiv.org/abs/2406.01563
- Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West, 8 Jun 2024 (v4), Do Llamas Work in English? On the Latent Language of Multilingual Transformers, https://arxiv.org/abs/2402.10588
- Kiho Park, Yo Joong Choe, Victor Veitch, 7 Nov 2023, The Linear Representation Hypothesis and the Geometry of Large Language Models, https://arxiv.org/abs/2311.03658
- Jordan K. Taylor, 2 Feb 2024, An introduction to graphical tensor notation for mechanistic interpretability, https://arxiv.org/abs/2402.01790
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- David Spuler, March 2024, Untokenization, in Generative AI in C++, https://www.aussieai.com/book/ch27-untokenization
- Sreeram Vennam, Anish Joishy, Ponnurangam Kumaraguru, 10 Nov 2024, LLM Vocabulary Compression for Low-Compute Environments, https://arxiv.org/abs/2411.06371
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 30 Oct 2024 (v2), Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952 https://github.com/MatteoNulli/Vocabulary_pruning
Embedding Pruning
The idea of pruning can be applied to (a) the embedding matrix, or (b) the embedding vectors themselves, whose length is the internal model dimension, by dynamically pruning the embedding vectors. One notable implementation of dynamic embedding pruning is the Funnel Transformer (2020).
It should be noted that dynamic embedding pruning, including the Funnel Transformer, has much overlap with dynamic activation sparsification, since the embedding vectors flowing between layers are the activations. See also: activation sparsity research.
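As a rough illustration of the dynamic flavor, here is a minimal Python/NumPy sketch that sparsifies an embedding (activation) vector at inference time by zeroing its smallest-magnitude elements; the magnitude-based criterion and keep fraction are illustrative assumptions, not the method of any specific paper listed below.

```python
import numpy as np

def prune_embedding(vec: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """Zero out the smallest-magnitude elements of an embedding/activation
    vector, keeping roughly the top `keep_fraction` of elements."""
    k = max(1, int(len(vec) * keep_fraction))
    threshold = np.partition(np.abs(vec), -k)[-k]   # k-th largest magnitude
    return np.where(np.abs(vec) >= threshold, vec, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024,)).astype(np.float32)
y = prune_embedding(x)
print(np.count_nonzero(y))   # roughly 256 of 1024 elements remain nonzero
```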
Research papers on embedding pruning:
- Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html https://arxiv.org/abs/2006.03236 Code: https://github.com/laiguokun/Funnel-Transformer
- Rene Bidart, Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, 2023, Ph.D. thesis, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
- Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. Pyramid-BERT: Reducing complexity via successive core-set based token selection. arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
- Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, July 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, https://arxiv.org/abs/2208.00483
- Bowei He, Xu He, Renrui Zhang, Yingxue Zhang, Ruiming Tang, Chen Ma, Aug 2023, Dynamic Embedding Size Search with Minimum Regret for Streaming Recommender System, https://arxiv.org/abs/2308.07760
- Gyuwan Kim and Kyunghyun Cho. 2021. Length-adaptive transformer: Train once with length drop, use anytime with search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6501–6511, Online. Association for Computational Linguistics. https://arxiv.org/abs/2010.07003 Code: https://github.com/clovaai/length-adaptive-transformer
- Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNG, 2018. URL https://www.aclweb.org/anthology/W18-2715
- Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, Nov 2023, Large Language Model Inference with Lexical Shortlisting, https://arxiv.org/abs/2311.09709 (Shortlisting the vocabulary to common words for reduced tokens and embedding matrix size.)
- David Spuler, March 2024, Chapter 49. Length Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Nicola Tonellotto, Craig Macdonald, 2021, Query Embedding Pruning for Dense Retrieval, CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia, https://arxiv.org/abs/2108.10341 PDF: https://arxiv.org/pdf/2108.10341.pdf
- Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
- IamAdiSri, Pruning a model embedding matrix for memory efficiency, April 2021, Hugging Face discussion board, https://discuss.huggingface.co/t/pruning-a-model-embedding-matrix-for-memory-efficiency/5502/7
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 30 Oct 2024 (v2), Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952 https://github.com/MatteoNulli/Vocabulary_pruning
- Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 10 Dec 2024 (v2), Mixture of Hidden-Dimensions Transformer, https://arxiv.org/abs/2412.05644
More AI Research
Read more about:
- Token pruning
- Attention head pruning
- Layer pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning