Semantic Caching and Vector Databases

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Semantic caching refers to a partial inference cache that finds cached responses not only to identical queries, but also to queries that are “semantically similar” to a cached query. For example, these queries have different token sequences, but the same meaning:

    What is a dog
    What is a dog?
    What is dog?
    Dog what is it

Immediately, we can think of various heuristics that detect very similar queries and then probe a cache. For example, the cache lookup could strip question-mark punctuation and normalize word order in a query. These ideas could improve speed a little, but the heuristics won't generalize to the semantic meaning (e.g., synonyms or other equivalent phrasings).
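
As a sketch, here is what such a heuristic cache might look like in C++, assuming an exact-match lookup on a normalized query string (the normalization rules and cache structure here are illustrative assumptions, not a full implementation):

    #include <cctype>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Normalize a query heuristically: lowercase and strip punctuation,
    // so "What is a dog?" and "What is a dog" map to the same key.
    std::string normalize_query(const std::string& query) {
        std::string out;
        for (char ch : query) {
            unsigned char c = static_cast<unsigned char>(ch);
            if (std::isalnum(c) || c == ' ') {
                out += static_cast<char>(std::tolower(c));
            }
            // else: drop punctuation such as '?' or '!'
        }
        return out;
    }

    // Exact-match lookup on the normalized query (heuristic, not semantic).
    std::optional<std::string> heuristic_lookup(
            const std::unordered_map<std::string, std::string>& cache,
            const std::string& query) {
        auto it = cache.find(normalize_query(query));
        if (it != cache.end()) return it->second;  // cache hit
        return std::nullopt;                       // cache miss
    }

Word reordering could be handled by sorting the words before the lookup, but no amount of string tweaking will catch synonyms or rephrasings.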

Vector hashing. The generalization of these heuristics is to use “vector hashing” to find semantically similar queries. The idea is that we first create a vector out of the query, which could be the token vector, although the embedding vector is likely more effective. Then we can use “vector hashing” to find the “nearest neighbor” of that vector, in N-dimensional space, among the vectors stored in our cache. Returning the cached result avoids any further computation on the query.
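
To make the idea concrete, here is a minimal brute-force version of the nearest-neighbor step in C++, using cosine similarity over cached embedding vectors. The linear scan is shown only for clarity; real vector hashing avoids examining every cached vector:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Cosine similarity of two equal-length vectors, in [-1, 1].
    float cosine_similarity(const std::vector<float>& a,
                            const std::vector<float>& b) {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
    }

    // Brute-force nearest neighbor: returns the index of the most
    // similar cached vector, or -1 if the cache is empty.
    int nearest_neighbor(const std::vector<std::vector<float>>& cached,
                         const std::vector<float>& query,
                         float* best_sim_out) {
        int best = -1;
        float best_sim = -2.0f;  // below the minimum possible similarity
        for (std::size_t i = 0; i < cached.size(); ++i) {
            float sim = cosine_similarity(cached[i], query);
            if (sim > best_sim) { best_sim = sim; best = static_cast<int>(i); }
        }
        if (best_sim_out) *best_sim_out = best_sim;
        return best;
    }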

That sounds good, but it glosses over something important: cache misses. For example, if our cache has seen only “What is a cat?” and this is returned as the nearest-neighbor vector for the query “What is a dog?”, then the answer won't be very accurate. What's missing is a discussion of the “closeness” of the two vectors, whereby a cached vector that isn't close enough can be rejected, and a full inference cycle executed instead (with its results then added to the cache).

Semantic cache lookup needs both cache hits and cache misses, like any normal caching algorithm. The semantic cache must ensure that the two vectors are similar enough (i.e., a “cache hit”), and this requires a measure of closeness between the query vector and the cached vector.
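
Putting it together, the hit/miss logic might look like the following sketch, building on the nearest_neighbor routine above. The embed_query and run_inference functions are hypothetical placeholders for the model's embedding and full inference paths, and the 0.95 threshold is an arbitrary illustrative value:

    #include <string>
    #include <vector>

    std::vector<float> embed_query(const std::string& query);  // placeholder
    std::string run_inference(const std::string& query);       // placeholder

    struct SemanticCache {
        std::vector<std::vector<float>> vectors;  // cached query embeddings
        std::vector<std::string> answers;         // cached responses
        float threshold = 0.95f;                  // required closeness

        std::string lookup(const std::string& query) {
            std::vector<float> qvec = embed_query(query);
            float sim = -2.0f;
            int idx = nearest_neighbor(vectors, qvec, &sim);
            if (idx >= 0 && sim >= threshold) {
                return answers[idx];  // cache hit: close enough
            }
            // Cache miss: run full inference, then add the result.
            std::string answer = run_inference(query);
            vectors.push_back(qvec);
            answers.push_back(answer);
            return answer;
        }
    };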

Vector databases. To implement our vector hashing capability for the semantic cache, we can use Locality Sensitive Hashing (LSH) or various other algorithms. It's just a small matter of coding. Alternatively, there are “vector databases” available that have already implemented this functionality. Vector databases have been in use for years in various semi-AI applications such as semantic document indexing and image similarity analysis. For example, open source and commercial vector databases include Pinecone, Weaviate, Milvus/Zilliz, Chroma, FAISS, Vespa, Qdrant, and Vald, to name a few.
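
For a flavor of what such an index does internally, here is a sketch of one classic LSH scheme, random hyperplane hashing, where each hyperplane contributes one sign bit to a bucket ID, so that nearby vectors tend to land in the same bucket (the plane count, seeding, and 32-bit bucket ID are illustrative choices):

    #include <cstdint>
    #include <random>
    #include <vector>

    // Generate random hyperplanes (Gaussian components); at most 32
    // planes so the bucket ID fits in a 32-bit integer.
    std::vector<std::vector<float>> make_hyperplanes(
            int num_planes, int dim, unsigned seed = 42) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> dist(0.0f, 1.0f);
        std::vector<std::vector<float>> planes(
            num_planes, std::vector<float>(dim));
        for (auto& plane : planes)
            for (auto& x : plane) x = dist(gen);
        return planes;
    }

    // Hash a vector to a bucket ID: one bit per hyperplane, set by
    // which side of that hyperplane the vector falls on.
    std::uint32_t lsh_hash(const std::vector<float>& v,
                           const std::vector<std::vector<float>>& planes) {
        std::uint32_t bits = 0;
        for (std::size_t p = 0; p < planes.size(); ++p) {
            float dot = 0.0f;
            for (std::size_t i = 0; i < v.size(); ++i)
                dot += planes[p][i] * v[i];
            if (dot >= 0.0f) bits |= (1u << p);
        }
        return bits;
    }

Cached vectors sharing the query's bucket become the candidate set, and only those few candidates then need the full closeness check.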

Note that semantic caching with a vector database is technically a type of approximation. There is a trade-off in setting how close two vectors must be for a cached result to be used. If the criterion is too lax, accepting vectors that are quite far apart, then some cached answers will be wrong for the query. If the criterion is too strict, then the cache will miss often, and there will be the expense of computing additional inference queries.
