Aussie AI
Vector Dot Product Computation Reuse
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
A lot of the low-level computation of tensor multiplications in AI inference breaks down into vector dot product computations (also called “scalar products”). A dot product multiplies all of the paired elements of two vectors and accumulates the sum (multiply-accumulate or MAC operations), producing a single number, usually a 32-bit floating-point result. Hence, it's a good candidate for caching with computation reuse: it involves a large amount of arithmetic, yet the result is only a single floating-point number, which needs very little extra space to store. The trade-off is attractive, with a small amount of extra storage used to avoid a significant amount of computation.
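As a reference point, here is a minimal C++ sketch of the computation being cached: the dot product itself is just a multiply-accumulate loop over the paired elements (the function name is illustrative, not from a specific codebase).

    // Minimal sketch of a float32 vector dot product (MAC loop).
    float vector_dot_product(const float* v1, const float* v2, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];   // multiply-accumulate (MAC)
        }
        return sum;
    }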
Another interesting feature is that one of those vectors is static during inference (e.g. the weights vector), whereas the other vector operand is dynamic (activations). Hence, the idea with vector computation reuse is to cache the computed dot product results and then detect when the second, incoming (dynamic) vector is the same as, or similar enough to, a previous incoming vector.
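As a rough sketch of the idea, the cache below reuses a result whenever the exact same activation vector is seen again for a given static weight vector; all class and function names are illustrative assumptions, not a specific implementation from the research literature. Catching vectors that are merely similar (rather than identical) requires the similarity-detection methods discussed below.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // Hypothetical exact-match dot product cache (names are illustrative).
    // One cache instance would be kept per static weight vector; the key is
    // a hash of the dynamic activation vector, since that is the only part
    // that changes between calls. Exact matching only catches identical
    // activation vectors; approximate matching of "similar" vectors needs
    // LSH or a related method (see below).
    struct DotProductCache {
        std::unordered_map<uint64_t, float> cache_;

        // FNV-1a hash over the raw bytes of the activation vector.
        static uint64_t hash_vector(const float* v, int n) {
            uint64_t h = 14695981039346656037ull;
            const unsigned char* bytes = reinterpret_cast<const unsigned char*>(v);
            for (size_t i = 0; i < sizeof(float) * (size_t)n; i++) {
                h ^= bytes[i];
                h *= 1099511628211ull;
            }
            return h;
        }

        // Reuse a previously computed result if this activation vector was
        // seen before; otherwise compute the dot product and cache it.
        // (Hash collisions are ignored in this sketch; a real implementation
        // would verify that the stored vector matches.)
        float dot_with_reuse(const float* weights, const float* activations, int n) {
            uint64_t key = hash_vector(activations, n);
            auto it = cache_.find(key);
            if (it != cache_.end()) return it->second;  // cache hit: skip the arithmetic
            float sum = 0.0f;
            for (int i = 0; i < n; i++) sum += weights[i] * activations[i];
            cache_[key] = sum;
            return sum;
        }
    };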
An alternative approach is to combine pairs of vectors and use the combination as the cached vector. The two vectors are simply concatenated and treated as a single vector of double length, with the vector caching methods computed on this longer vector.
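A minimal sketch of that concatenation step might look like the following (the function name is illustrative):

    #include <vector>

    // Sketch of the concatenation variant: the two vectors are appended
    // into one double-length vector, which then becomes the object that
    // the caching and similarity methods operate on.
    std::vector<float> concatenate_vectors(const float* v1, const float* v2, int n)
    {
        std::vector<float> combined(v1, v1 + n);        // first vector
        combined.insert(combined.end(), v2, v2 + n);    // append second vector
        return combined;                                // length is 2n
    }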
Various researchers have looked into this type of vector caching. The main methods to detect similar vectors are:
- Locality-Sensitive Hashing (LSH)
- Bit signatures
- K-means clustering
- Hyper-Cube
LSH is the most popular method; it uses cosine similarity to find reasonably “close” vectors in n-dimensional space. If the vectors are similar enough, the cached result is a reasonable approximation of the dot product computation, which is thereby avoided. If no cached vector is close enough, then the full dot product must be performed, and its result can be added to the vector cache.
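A common way to implement LSH for cosine similarity is random hyperplane hashing, which also produces the bit signatures mentioned in the list above. The sketch below is illustrative of the technique rather than any specific published implementation: each random hyperplane contributes one bit, namely the sign of the input vector's projection onto that hyperplane's normal.

    #include <cstdint>
    #include <random>
    #include <vector>

    // Sketch of random-hyperplane LSH for cosine similarity (names illustrative).
    // Vectors at a small angle to each other tend to land on the same side of
    // most hyperplanes, so their 64-bit signatures usually match, and the
    // signature can serve as an approximate key into the dot product cache.
    struct LshSignature {
        std::vector<std::vector<float>> planes_;   // 64 random hyperplane normals

        explicit LshSignature(int n, unsigned seed = 42) {
            std::mt19937 gen(seed);
            std::normal_distribution<float> dist(0.0f, 1.0f);
            planes_.assign(64, std::vector<float>(n));
            for (auto& plane : planes_)
                for (float& x : plane) x = dist(gen);
        }

        uint64_t signature(const float* v, int n) const {
            uint64_t bits = 0;
            for (int b = 0; b < 64; b++) {
                float proj = 0.0f;
                for (int i = 0; i < n; i++) proj += planes_[b][i] * v[i];
                if (proj >= 0.0f) bits |= (uint64_t(1) << b);   // which side of the hyperplane?
            }
            return bits;
        }
    };

Two vectors whose signatures match (or differ in only a few bits) can be treated as close enough, so the cached dot product result is returned instead of recomputing it; the signature lookup itself is just a hash table access, which keeps the cache overhead low.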
The cost of the cache lookup must be low for this method to be effective, since the computation happens in the busiest part of the AI algorithm: vector dot products inside matrix multiplications. Hence, this method typically uses hand-coded in-memory vector caching rather than a vector database, although a highly tuned in-memory vector database could theoretically also do the job.
Assuming similar vectors can be identified efficiently, the question is: how often does AI model inference perform vector computations on similar vectors? What is the cache hit rate? At the start, the vector cache is almost empty, and it takes a while for enough vectors to accumulate in the cache to get a high hit rate. But after this “warm-up” period, research papers seem to indicate that the hit rate is quite high, with some reporting around a 50% inference speedup once the cache is warm.