Aussie AI

Embeddings Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

The first step of model inference in the Transformer architecture is to convert the input text into a sequence of numbers called tokens. However, these tokens are not used directly inside the model, because the next step of Transformer inference is to immediately convert this sequence of tokens into another internal representation called an "embedding". An embedding is a vector of numbers that represents the information in the token sequence within a high-dimensional vector space.

Note that the "embeddings" terminology is unrelated to "embedded" devices such as mobile phones or IoT edge devices. It's simply a different usage of the word.

The mapping from tokens to embeddings is learned during model training. The conversion of the token sequence into a sequence of embedding vectors is a single matrix multiplication by these learned embedding weights (implemented in practice as a per-token row lookup), with an additional step that simply adds "positional embeddings" in the original Transformer architecture. The embedding matrix can be quite large, especially if the token vocabulary is large. However, this multiplication occurs infrequently compared to the other weight matrices, so it is not a latency-critical operation. Nevertheless, the storage cost of a large embedding matrix can be significant.
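
As a concrete illustration, here is a minimal Python/NumPy sketch of this embedding step: a row lookup into the embedding matrix, followed by adding sinusoidal positional embeddings. The matrix sizes, random weights, and example token IDs are illustrative assumptions only, not taken from any particular model.

    import numpy as np

    # Illustrative sizes (assumptions, not from any specific model)
    vocab_size = 50000   # number of tokens in the vocabulary
    d_model = 512        # embedding dimension (internal model dimension)

    rng = np.random.default_rng(0)

    # Learned embedding matrix: one row per vocabulary token.
    embedding_matrix = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

    def sinusoidal_positions(seq_len, dim):
        """Classic sinusoidal positional embeddings (original Transformer style)."""
        positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
        div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (dim/2,)
        pe = np.zeros((seq_len, dim), dtype=np.float32)
        pe[:, 0::2] = np.sin(positions * div)
        pe[:, 1::2] = np.cos(positions * div)
        return pe

    def embed(token_ids):
        """Convert a sequence of token IDs into embedding vectors.

        The lookup is equivalent to multiplying a one-hot token vector by the
        embedding matrix, but engines implement it as a simple row lookup.
        """
        tok = embedding_matrix[token_ids]                    # (seq_len, d_model)
        pos = sinusoidal_positions(len(token_ids), d_model)
        return tok + pos                                     # positional embeddings are simply added

    hidden = embed([101, 2054, 2003, 102])  # made-up token IDs for a short input
    print(hidden.shape)                     # (4, 512)

In a real inference engine the embedding matrix holds trained weights loaded from the model file, but the arithmetic is the same.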


Embedding Optimization Research Papers

Embeddings don't receive a huge amount of attention in the research literature, because they aren't a bottleneck in inference. Most of the research on optimizing embeddings has focused on reducing the storage size of the embedding matrices for use on smaller devices or with smaller models, using matrix compression techniques such as sparsity or hashing.

  • Siyi Liu, Chen Gao, Yihong Chen, Depeng Jin, Yong Li, Learnable Embedding Sizes for Recommender Systems, arXiv preprint arXiv:2101.07577, Mar 2021, https://arxiv.org/abs/2101.07577, Code: https://github.com/ssui-liu/learnable-embed-sizes-for-RecSys
  • Kailash A. Hambarde, Hugo Proenca, Information Retrieval: Recent Advances and Beyond, arXiv preprint arXiv:2301.08801, Jan 2023, https://arxiv.org/abs/2301.08801
  • Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant, Learned Token Pruning in Contextualized Late Interaction over BERT (ColBERT), 2022, SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2022, Pages 2232–2236, https://doi.org/10.1145/3477495.3531835, https://dl.acm.org/doi/10.1145/3477495.3531835
  • Keshav Santhanam, Omar Khattab, Christopher Potts, Matei Zaharia, PLAID: An Efficient Engine for Late Interaction Retrieval, October 2022, CIKM '22: The 31st ACM International Conference on Information and Knowledge Management, DOI:10.1145/3511808.3557325, https://arxiv.org/abs/2205.09707
  • Liang Qu, Huaisheng Zhu, Ruiqi Zheng, Yuhui Shi, and Hongzhi Yin, 2021, Imgagn: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1390–1398, https://arxiv.org/abs/2106.02817
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, and Ed H. Chi, 2020, Learning Multi-Granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems, In Companion Proceedings of the Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 562–566, https://doi.org/10.1145/3366424.3383416, https://arxiv.org/abs/2002.08530
  • Manas R. Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K. Adams, Pranav Khaitan, Jiahui Liu, and Quoc V. Le. 2020. Neural Input Search for Large Scale Recommendation Models (KDD ’20). Association for Computing Machinery, New York, NY, USA, 2387–2397. https://doi.org/10.1145/3394486.3403288, https://arxiv.org/abs/1907.04471
  • Ruiqi Zheng, Liang Qu, Bin Cui, Yuhui Shi, and Hongzhi Yin. 2022. AutoML for Deep Recommender Systems: A Survey. arXiv preprint arXiv:2203.13922 (2022), https://arxiv.org/abs/2203.13922
  • Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2020. Sparse, dense, and attentional representations for text retrieval. In Proceedings of TACL, https://arxiv.org/abs/2005.00181
  • Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proc. SIGIR. 39–48, https://arxiv.org/abs/2004.12832
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and J. Weston. 2020. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In Proc. ICLR. https://arxiv.org/abs/1905.01969
  • F Lyu, X Tang, H Zhu, H Guo, Y Zhang, 2022, OptEmbed: Learning Optimal Embedding Table for Click-through Rate Prediction, https://arxiv.org/abs/2208.04482
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018). https://arxiv.org/abs/1806.09055
  • Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. 2018. Efficient Query Processing for Scalable Web Search. Foundations and Trends in Information Retrieval 12, 4–5 (2018), 319–492, https://ieeexplore.ieee.org/document/8620666
  • Wang-Cheng Kang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Ting Chen, Lichan Hong, Ed H. Chi, Learning to Embed Categorical Features without Embedding Tables for Recommendation, June 2021, https://arxiv.org/abs/2010.10784v2
  • Tong Chen, Hongzhi Yin, Yujia Zheng, Zi Huang, Yang Wang, and Meng Wang. 2021. Learning elastic embeddings for customizing on-device recommenders. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 138–147. https://arxiv.org/abs/2106.02223
  • Ting Chen, Lala Li, and Yizhou Sun. 2019. Differentiable product quantization for end-to-end embedding compression. arXiv preprint arXiv:1908.09756 (2019). https://arxiv.org/abs/1908.09756
  • Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. CoRR, abs/2104.08821, 2021, https://arxiv.org/abs/2104.08821
  • Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In NAACL-HLT (1), pp. 3597–3608. Association for Computational Linguistics, 2019, https://arxiv.org/abs/1810.08854
  • Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. Distilling word embeddings: An encoding approach. In CIKM, pp. 1977–1980. ACM, 2016, https://arxiv.org/abs/1506.04488
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL-HLT (2), pp. 529–535. Association for Computational Linguistics, 2018, https://arxiv.org/abs/1804.06323
  • Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. Cross-lingual models of word embeddings: An empirical comparison. In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1604.00425
  • Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What You Can Cram into A Single $&!#∗ Vector: Probing Sentence Embeddings for Linguistic Properties. ACL. https://aclanthology.org/P18-1198/
  • Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer
  • Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. https://arxiv.org/abs/2005.14187
  • Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, Rene Bidart, Ph.D. thesis, 2023, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
  • A. Chaulwar et al., "Extreme compression of sentence-transformer ranker models: Faster inference longer battery life and less storage on edge devices", arXiv:2207.12852, 2022. https://arxiv.org/abs/2207.12852v1 (Distillation from the point of view of embeddings.)
  • Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013 https://arxiv.org/abs/1301.3781 Code: https://code.google.com/p/word2vec/ (This is the word2vec algorithm.)
  • Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant, Dec 2022, Analyzing Transformers in Embedding Space, https://arxiv.org/pdf/2209.02535.pdf, Code: https://github.com/guyd1995/embedding-space (Maps backward from model parameters and computations in the "embedding space" and then projects this back to tokens.)
  • Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang, Oct 2022, Language Models are Universal Embedders, https://arxiv.org/abs/2310.08232
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used RoPE embeddings.)
  • Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey, 30 Jan 2023 (v3), Embedding Recycling for Language Models, https://arxiv.org/abs/2207.04993
  • Hackerllama, January 7, 2024, Sentence Embeddings. Introduction to Sentence Embeddings, https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
  • Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, Dec 2023, Improving Text Embeddings with Large Language Models https://arxiv.org/abs/2401.00368
  • Ofir Press, Lior Wolf, Feb 2017, Using the Output Embedding to Improve Language Models, https://arxiv.org/abs/1608.05859
  • Yuhong Zhang, Shilai Yang, Gert Cauwenberghs, Tzyy-Ping Jung, 28 Jan 2024, From Word Embedding to Reading Embedding Using Large Language Model, EEG and Eye-tracking, https://arxiv.org/abs/2401.15681
  • Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach, 27 Jun 2024, T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings, https://arxiv.org/abs/2406.19223
  • Barhoumi Mosbeh, Nov 2024, Late Chunking In Long Context Embedding Models, https://pub.towardsai.net/late-chunking-in-long-context-embedding-models-caf1c1209042
  • Emilia David, November 8, 2024, Multimodal RAG is growing, here’s the best way to get started, https://venturebeat.com/ai/multimodal-rag-is-growing-heres-the-best-way-to-get-started/

Embedding Size Optimization (NAS)

A conceptually simple way to reduce embedding size is to choose a smaller embedding dimension as a model "hyper-parameter" before training. Optimizing this number is a sub-problem of "neural architecture search" (NAS), also called "hyper-parameter optimization" (HPO), and there are some research papers on this embedding-specific NAS problem.
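
At its simplest, this search is just a sweep over candidate embedding sizes. The sketch below is a hypothetical illustration: train_and_evaluate() is a stand-in for your own training loop that returns a validation loss, and the fake scoring formula is there only to make the example runnable.

    def train_and_evaluate(embedding_dim):
        """Hypothetical: train a small model with this embedding size and
        return its validation loss. Here a fake curve stands in for real training."""
        return abs(embedding_dim - 384) / 384.0 + 0.1 * (embedding_dim / 1024.0)

    candidate_sizes = [128, 256, 384, 512, 768, 1024]
    results = {d: train_and_evaluate(d) for d in candidate_sizes}
    best_dim = min(results, key=results.get)
    print(f"Best embedding size: {best_dim} (validation loss {results[best_dim]:.3f})")

More sophisticated NAS methods replace this brute-force sweep with differentiable or evolutionary search, as in several of the papers listed above.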

Embedding Matrix Compression (Embedding Pruning)

These papers are specifically about reducing the storage cost of large embedding matrices. Techniques include hashing the vectors and pruning the embeddings to create sparsity. Vocabulary size is also closely related to embedding size (see tokenization and vocabulary research).
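
As a simple illustration of the hashing idea, the sketch below maps a large token vocabulary onto a much smaller shared embedding table by hashing token IDs into bucket indices. This is a generic "hashing trick" sketch with assumed sizes, not the method of any particular paper listed here.

    import numpy as np

    vocab_size = 50000    # large vocabulary (illustrative assumption)
    num_buckets = 4096    # much smaller shared embedding table
    d_model = 512

    rng = np.random.default_rng(0)
    compressed_table = rng.standard_normal((num_buckets, d_model)).astype(np.float32)

    def hashed_embedding(token_id):
        """Look up an embedding from a hashed (shared) table.

        Many token IDs collide into the same bucket, trading some accuracy
        for an embedding matrix roughly 12x smaller in this example."""
        bucket = (token_id * 2654435761) % num_buckets  # simple multiplicative hash
        return compressed_table[bucket]

    print(hashed_embedding(40123).shape)  # (512,)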

Embedding Low-Rank Matrix Factorization

Low-rank matrix factorization, or decomposition, can be applied to the embedding matrix as an optimization. This is a specific subtype of embedding matrix compression.
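
For example, a (vocabulary size x embedding dimension) matrix can be replaced by two much thinner matrices. The sketch below uses a truncated SVD on a random matrix with assumed, illustrative sizes; a real implementation would factorize the trained embedding weights and typically fine-tune afterwards.

    import numpy as np

    vocab_size, d_model, rank = 10000, 512, 64   # illustrative sizes
    rng = np.random.default_rng(0)
    E = rng.standard_normal((vocab_size, d_model)).astype(np.float32)  # stand-in for trained embeddings

    # Truncated SVD: E is approximated by A @ B, with A (vocab_size, rank) and B (rank, d_model).
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank, :]

    print(f"Parameters: {E.size} -> {A.size + B.size} "
          f"({(A.size + B.size) / E.size:.1%} of original)")

    # A token's approximate embedding is one row of A multiplied by B.
    approx_embedding = A[1234] @ B   # shape (d_model,)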

Research papers on low-rank factorization of the embedding matrix:

  • Luke McDermott, 23 May 2024, Embedding Compression for Efficient Re-Identification, https://arxiv.org/abs/2405.14730
  • Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)

Unembedding Matrix (Output Embeddings)

The "unembedding" phase of a Transformer is where the output of the model, still in embedding format, is converted back to tokens, as logits that give token probabilities. This means that each output embedding has to be mapped back onto the token vocabulary, which is the reverse of the initial embedding. This is more properly called the "output embedding", but I think the name "unembedding" is clearer.

The output phase uses an "unembedding matrix", which is usually either the transpose of the embedding matrix (weight tying) or a separately learned output projection. There's not a great deal of attention paid to unembeddings in the research literature.
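
Here is a minimal sketch of that output step, assuming the weight-tied case where the unembedding matrix is just the transpose of the embedding matrix; the sizes and random weights are illustrative assumptions.

    import numpy as np

    vocab_size, d_model = 50000, 512   # illustrative sizes
    rng = np.random.default_rng(0)

    embedding_matrix = rng.standard_normal((vocab_size, d_model)).astype(np.float32)
    unembedding_matrix = embedding_matrix.T   # weight tying: reuse the transpose

    def output_probs(hidden_state):
        """Map one final hidden state (an embedding vector) back to token probabilities."""
        logits = hidden_state @ unembedding_matrix   # (vocab_size,)
        logits -= logits.max()                       # numerical stability for softmax
        probs = np.exp(logits)
        return probs / probs.sum()

    hidden_state = rng.standard_normal(d_model).astype(np.float32)
    probs = output_probs(hidden_state)
    print(probs.shape, probs.argmax())   # (50000,) and the most likely token ID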

Embedding Pruning

The idea of pruning can be applied to (a) the embedding matrix, or (b) the embedding vectors themselves, which have length equal to the internal model dimension, by dynamically pruning those vectors at runtime. One particular implementation of dynamic embedding pruning is the Funnel Transformer from 2020.

It should be noted that dynamic embedding pruning, including the Funnel Transformer, has much overlap with dynamic activation sparsification, since the activations flowing through a Transformer are vectors in the embedding space. See also: activation sparsity research.
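
As a generic illustration of dynamically pruning an embedding vector (not the Funnel Transformer algorithm itself, which instead pools along the sequence dimension), the sketch below zeroes out the smallest-magnitude components of a vector at runtime, which is essentially magnitude-based activation sparsification applied in the embedding space.

    import numpy as np

    def sparsify_embedding(vec, keep_fraction=0.25):
        """Keep only the largest-magnitude components of an embedding vector,
        zeroing the rest (dynamic magnitude-based sparsification)."""
        k = max(1, int(len(vec) * keep_fraction))
        threshold = np.partition(np.abs(vec), -k)[-k]   # k-th largest magnitude
        return vec * (np.abs(vec) >= threshold)

    rng = np.random.default_rng(0)
    vec = rng.standard_normal(512).astype(np.float32)
    sparse_vec = sparsify_embedding(vec)
    print(np.count_nonzero(sparse_vec), "of", len(vec), "components kept")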

Research papers on embedding pruning:
