Aussie AI

Embeddings Research

  • Last Updated 24 February, 2025
  • by David Spuler, Ph.D.

The first step in model inference in the Transformer architecture is to convert the input text into a sequence of numbers called tokens. However, these tokens are not used directly inside the model, because the next step of Transformer inference is to immediately convert this sequence of tokens into another internal representation called an "embedding". An embedding is a vector of numbers that represents the information about the token sequence in very complex ways.

Note that the "embeddings" terminology is unrelated to "embedded" devices such as mobile phones or IoT edge devices. It's simply a different usage of the word.

The mapping from tokens to embeddings is learned during model training. The conversion of a token sequence into a set of embedding vectors is a single matrix multiplication (effectively a row lookup) using these learned embedding weights, followed by an additional step that adds "positional embeddings" (a simple element-wise addition in the original Transformer architecture). The embedding matrix can be quite large, especially if the token vocabulary is large. However, this multiplication occurs infrequently compared to the other weight matrices, so it is not a latency-critical operation. Nevertheless, the storage cost of a large embedding matrix can be significant.
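As a concrete illustration, here is a minimal Python sketch of the embedding step using NumPy; the vocabulary size, model dimension, sequence length, and token IDs are illustrative assumptions, not taken from any particular model:

```python
# Minimal sketch of the embedding step (illustrative sizes, random weights).
import numpy as np

vocab_size, d_model, max_seq_len = 50000, 512, 2048

# Learned parameters (random here; learned during training in a real model).
embedding_matrix = np.random.randn(vocab_size, d_model).astype(np.float32)
positional_embeddings = np.random.randn(max_seq_len, d_model).astype(np.float32)

def embed(token_ids: np.ndarray) -> np.ndarray:
    """Map a sequence of token IDs to embedding vectors, adding positional embeddings."""
    # Row lookup is equivalent to multiplying one-hot token vectors by the embedding matrix.
    token_embeddings = embedding_matrix[token_ids]        # (seq_len, d_model)
    positions = positional_embeddings[:len(token_ids)]    # (seq_len, d_model)
    return token_embeddings + positions

tokens = np.array([101, 2023, 2003, 102])   # hypothetical token IDs
print(embed(tokens).shape)                   # (4, 512)
```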

Related areas of LLM inference optimization include:

Embedding Optimization Research Papers

Embeddings don't receive a huge amount of attention in the research literature, because they aren't a bottleneck in inference. Most of the research on optimizing embeddings has focused on compacting the embedding matrix for use on smaller devices or in smaller models, using matrix compression techniques such as sparsity or hashing.

Embedding Size Optimization (NAS)

A conceptually simple way to reduce embedding size is to choose a smaller embedding dimension, which is a model "hyper-parameter" fixed before training. Optimizing this number is a sub-problem of "neural architecture search" (NAS), also called "hyper-parameter optimization" (HPO). The embedding-specific NAS problem has received some attention in the research literature.
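For illustration, here is a minimal sketch of treating the embedding dimension as one axis of a hyper-parameter search; the `build_model` and `evaluate` functions are hypothetical stand-ins, not a real NAS framework:

```python
# Minimal sketch of searching over candidate embedding dimensions.
candidate_dims = [128, 256, 512, 768]

def build_model(d_model: int) -> dict:
    # Stand-in for constructing a Transformer with this embedding size.
    return {"d_model": d_model}

def evaluate(model: dict) -> float:
    # Stand-in for training briefly and measuring validation quality.
    return 1.0 - 1.0 / model["d_model"]   # dummy score for illustration only

best_score, best_dim = max((evaluate(build_model(d)), d) for d in candidate_dims)
print("Best embedding size:", best_dim)
```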

Embedding Matrix Compression (Embedding Pruning)

These papers are specifically about reducing the storage cost of large embedding matrices. Techniques include hashing vectors and pruning embeddings to create sparsity. Vocabulary size is also closely related to embedding size (see tokenization and vocabulary research).
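As one example of the hashing approach, here is a minimal sketch of the "hashing trick" applied to embeddings, where many token IDs share rows of a much smaller table and collisions are simply tolerated; the table size and hash function are illustrative assumptions:

```python
# Minimal sketch of hashed embeddings: a small table shared across a large vocabulary.
import numpy as np

vocab_size, d_model = 50000, 512
compressed_rows = 4096                      # much smaller than vocab_size

hashed_table = np.random.randn(compressed_rows, d_model).astype(np.float32)

def hashed_embedding(token_id: int) -> np.ndarray:
    """Look up an embedding via a hash of the token ID into the smaller table."""
    row = (token_id * 2654435761) % compressed_rows   # simple multiplicative hash
    return hashed_table[row]

print(hashed_embedding(12345).shape)   # (512,)
```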

Embedding Low-Rank Matrix Factorization

Low-rank matrix factorization, also called decomposition, can be applied to the embedding matrix. This is a specific subtype of embedding matrix compression.
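Here is a minimal sketch of this idea, assuming a truncated SVD as the factorization method and illustrative sizes; real methods such as GroupReduce use more refined block-wise schemes:

```python
# Minimal sketch of low-rank factorization of an embedding matrix via truncated SVD.
import numpy as np

vocab_size, d_model, rank = 10000, 512, 64

E = np.random.randn(vocab_size, d_model).astype(np.float32)   # original embedding matrix

U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :rank] * S[:rank]      # (vocab_size, rank)
B = Vt[:rank, :]                # (rank, d_model)

# Storage drops from vocab_size*d_model to rank*(vocab_size + d_model) parameters.
E_approx = A @ B
print(np.linalg.norm(E - E_approx) / np.linalg.norm(E))   # relative approximation error
```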

Research papers on low-rank factorization of the embedding matrix:

  • Luke McDermott, 23 May 2024, Embedding Compression for Efficient Re-Identification, https://arxiv.org/abs/2405.14730
  • Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)

Unembedding Matrix (Output Embeddings)

The "unembedding" phase of a Transformer is where the output of the model in embedding format is converted back to tokens, as logits with probabilities. This means that each embedding has to map back to the tokens, which is the reverse of the initial embedding. This is more properly called the "output embedding" but I think the name "unembedding" is clearer.

The output phase uses an "unembedding matrix", which is often simply the transpose of the embedding matrix (when the input and output embeddings are "tied"), or else a separately learned output projection. There's not a great deal of attention paid to unembedding in the research literature.
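Here is a minimal sketch of the unembedding step under the weight-tying assumption, where the logits come from multiplying the final hidden state by the transpose of the embedding matrix; sizes and weights are illustrative:

```python
# Minimal sketch of unembedding with tied weights: hidden state -> logits -> probabilities.
import numpy as np

vocab_size, d_model = 50000, 512
embedding_matrix = np.random.randn(vocab_size, d_model).astype(np.float32)

def unembed(hidden_state: np.ndarray) -> np.ndarray:
    """Convert a final hidden state (d_model,) into a probability distribution over tokens."""
    logits = hidden_state @ embedding_matrix.T    # (vocab_size,)
    # Softmax turns logits into next-token probabilities.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = unembed(np.random.randn(d_model).astype(np.float32))
print(probs.shape, probs.sum())   # (50000,) ~1.0
```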

Embedding Pruning

The idea of pruning can be applied to (a) the embedding matrix itself, or (b) the embedding vectors whose size is the internal model dimension, by dynamically pruning these vectors during inference. One particular implementation of dynamic embedding pruning is the Funnel Transformer (2020).

It should be noted that dynamic embedding pruning, including the Funnel Transformer, has much overlap with dynamic activation sparsification, since the activations being sparsified are vectors in the embedding space. See also: activation sparsity research.
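As a rough illustration of dynamically pruning an embedding vector, here is a minimal sketch that zeroes the smallest-magnitude components (a simple top-k scheme with an arbitrary keep ratio; this is not the Funnel Transformer's actual method):

```python
# Minimal sketch of dynamic sparsification of an embedding/activation vector.
import numpy as np

def sparsify(vec: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Keep only the largest-magnitude components of the vector; zero the rest."""
    k = max(1, int(len(vec) * keep_ratio))
    threshold = np.partition(np.abs(vec), -k)[-k]   # k-th largest magnitude
    return np.where(np.abs(vec) >= threshold, vec, 0.0)

x = np.random.randn(512).astype(np.float32)
x_sparse = sparsify(x)
print(np.count_nonzero(x_sparse), "of", x.size, "components kept")
```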

Research papers on embedding pruning:

More AI Research

Read more about: