Aussie AI

Embedding Optimizations

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Several methods have been examined for optimizing the embedding component. The embedding matrix can be quite large, especially when the token vocabulary is large. However, this multiplication occurs infrequently compared to those of the other weight matrices, so it is not a latency-critical operation. Nevertheless, the cost of storing a large embedding matrix can be significant.
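At inference time the embedding "multiplication" reduces to copying one row of the matrix per input token (the token is effectively a one-hot vector), which is why it adds little latency even though the whole vocabulary-by-embedding-size matrix must still be stored. Below is a minimal C++ sketch of this row lookup; the vocabulary size, embedding dimension, and names are hypothetical values chosen only for illustration.

    #include <cstdio>
    #include <vector>

    // Minimal sketch of an embedding lookup (illustrative sizes only).
    const int kVocabSize = 50000;     // hypothetical vocabulary size
    const int kEmbeddingDim = 1024;   // hypothetical embedding dimension

    // The embedding matrix stored row-major: one row per token.
    std::vector<float> g_embedding_matrix((size_t)kVocabSize * kEmbeddingDim);

    // Looking up a token's embedding copies one row of the matrix,
    // so the runtime cost is small; the storage cost is the concern.
    void embedding_lookup(int token_id, float *out /* [kEmbeddingDim] */)
    {
        const float *row = &g_embedding_matrix[(size_t)token_id * kEmbeddingDim];
        for (int i = 0; i < kEmbeddingDim; i++) out[i] = row[i];
    }

    int main()
    {
        // Storage cost is vocab_size x embedding_dim weights.
        size_t bytes = (size_t)kVocabSize * kEmbeddingDim * sizeof(float);
        printf("Embedding matrix: %.1f MB of float32 weights\n",
               bytes / (1024.0 * 1024.0));
        std::vector<float> vec(kEmbeddingDim);
        embedding_lookup(42, vec.data());   // token id 42 is arbitrary
        return 0;
    }

For these example sizes the matrix is roughly 200MB of float32 weights, which illustrates why memory, rather than latency, is the main concern for embeddings.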

Embedding Size NAS. The size of the embeddings is a model “hyper-parameter” chosen before training, so a conceptually simple optimization is to choose a smaller embedding size. Optimizing this number is a sub-problem of “neural architecture search” (NAS), also called “hyper-parameter optimization” (HPO), and there is some research specifically on this embedding-specific NAS problem.
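As a rough illustration of what embedding-size HPO involves, the sketch below runs a simple grid search over a few candidate embedding sizes and keeps the smallest one whose validation accuracy is within a tolerance of the best. The candidate sizes, the tolerance, and the train_and_evaluate stub are all assumptions made for the example; real NAS methods are considerably more sophisticated than a grid search.

    #include <cstdio>
    #include <vector>

    // Stand-in for training a model with the given embedding size and
    // returning its validation accuracy. A real search would train and
    // evaluate the model; this fake saturating curve just keeps the
    // sketch self-contained and runnable.
    double train_and_evaluate(int embedding_size)
    {
        return 0.90 - 10.0 / embedding_size;
    }

    // Simple grid search over candidate embedding sizes (one axis of NAS/HPO).
    // Keeps the smallest size whose accuracy is within 'tolerance' of the
    // best accuracy found. The candidate sizes are illustrative.
    int search_embedding_size()
    {
        const std::vector<int> candidates = { 256, 512, 768, 1024 };
        const double tolerance = 0.01;   // acceptable accuracy loss

        std::vector<double> accuracy(candidates.size());
        double best = 0.0;
        for (size_t i = 0; i < candidates.size(); i++) {
            accuracy[i] = train_and_evaluate(candidates[i]);
            if (accuracy[i] > best) best = accuracy[i];
        }
        for (size_t i = 0; i < candidates.size(); i++) {
            if (accuracy[i] >= best - tolerance) {
                return candidates[i];   // smallest "good enough" size
            }
        }
        return candidates.back();
    }

    int main()
    {
        printf("Chosen embedding size: %d\n", search_embedding_size());
        return 0;
    }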

Embedding Matrix Compression. The memory cost of a large embedding matrix can be significant, especially on smaller platforms, and there are several research papers specifically on reducing its storage cost. Techniques include hashing embedding vectors and pruning embeddings to create sparsity. Vocabulary size is also closely related to embedding matrix size, since it is one of the two dimensions of the matrix, so a smaller vocabulary can reduce both memory usage and computation cost.
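As an illustration of the hashing approach, the sketch below maps token ids into a smaller shared table of embedding rows via a hash function, trading occasional row collisions for a much smaller matrix. The table sizes and the hash function are hypothetical choices for illustration, not the method of any particular paper.

    #include <cstdio>
    #include <vector>

    // Minimal sketch of the "hashing trick" for embedding compression.
    // All sizes and the hash function are illustrative assumptions.
    const int kVocabSize = 50000;     // logical vocabulary size
    const int kHashedRows = 8192;     // compressed table has far fewer rows
    const int kEmbeddingDim = 1024;

    // Compressed table: kHashedRows x kEmbeddingDim instead of
    // kVocabSize x kEmbeddingDim, so some distinct tokens share a row.
    std::vector<float> g_hashed_table((size_t)kHashedRows * kEmbeddingDim);

    // Map a token id onto a row of the smaller table.
    inline int hash_token(int token_id)
    {
        unsigned int h = (unsigned int)token_id * 2654435761u;  // multiplicative hash
        return (int)(h % (unsigned int)kHashedRows);
    }

    void hashed_embedding_lookup(int token_id, float *out /* [kEmbeddingDim] */)
    {
        const float *row = &g_hashed_table[(size_t)hash_token(token_id) * kEmbeddingDim];
        for (int i = 0; i < kEmbeddingDim; i++) out[i] = row[i];
    }

    int main()
    {
        double full = (double)kVocabSize * kEmbeddingDim * sizeof(float);
        double hashed = (double)kHashedRows * kEmbeddingDim * sizeof(float);
        printf("Full table:   %.1f MB\n", full / (1024.0 * 1024.0));
        printf("Hashed table: %.1f MB\n", hashed / (1024.0 * 1024.0));
        return 0;
    }

For these example sizes, the hashed table needs about 32MB instead of roughly 200MB for the full matrix. Pruning and vocabulary reduction attack the same two dimensions of the matrix in different ways.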

 
