Aussie AI

Retrieval-Augmented Generation (RAG)

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


RAG is a technique for merging external data sources with AI-based query answering. When it works well, RAG combines the speed of searching an information database with the elegance of fluent writing from an LLM.

RAG Architecture

RAG is an architecture whereby the AI is integrated with an external document search mechanism. There are three components:

  • Retriever
  • Generator
  • Datastore

The “retriever” component looks up the user's query in a datastore of documents, using either keyword search or vector search. This is effectively a search component that accesses a database and finds all the related material. Typically, it returns excerpts or snippets of relevant text, rather than full documents.
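As a minimal sketch of the keyword-search style of retriever (assuming a simple in-memory datastore of text snippets; the names here are illustrative, not from any particular library), the lookup can be as basic as a case-insensitive substring scan that returns every matching snippet:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lowercase a copy of a string for case-insensitive matching.
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Retriever (keyword search): return every snippet in the datastore
// that contains the keyword as a case-insensitive substring.
std::vector<std::string> retrieve(
        const std::vector<std::string>& datastore,
        const std::string& keyword) {
    std::vector<std::string> results;
    const std::string key = to_lower(keyword);
    for (const auto& snippet : datastore) {
        if (to_lower(snippet).find(key) != std::string::npos) {
            results.push_back(snippet);
        }
    }
    return results;
}
```

A production retriever would use an inverted index or a full-text search engine rather than a linear scan, but the interface is the same: query in, ranked snippets out.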

The role of the “generator” component in RAG is to receive the document excerpts back from the retriever, and collate that into a prompt for the AI model. The snippets of text are merged as context for the user's question, and the combined prompt is sent to the AI engine. Hence, the role of the generator is mainly one of prompt engineering and forwarding requests to the LLM, and it tends to be a relatively simple component.
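To sketch that prompt-engineering role (the template wording and function name below are illustrative assumptions, not a standard), the generator can simply concatenate the retrieved snippets as context ahead of the user's question:

```cpp
#include <string>
#include <vector>

// Generator (prompt assembly): merge retrieved snippets into a single
// prompt that instructs the LLM to answer from the supplied context.
std::string build_prompt(const std::vector<std::string>& snippets,
                         const std::string& user_query) {
    std::string prompt =
        "Answer the question using only the context below.\n\nContext:\n";
    for (const auto& s : snippets) {
        prompt += "- " + s + "\n";
    }
    prompt += "\nQuestion: " + user_query + "\nAnswer:";
    return prompt;
}
```

The resulting string is what gets sent to the LLM; the generator itself does no inference, which is why it tends to be a relatively simple component.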

The datastore could be a classic database (e.g. SQL or MongoDB) with keyword lookup or a vector database with semantic lookup. The use of semantic lookup can give more meaningful document search results, with better model answers, but requires two additional steps. Firstly, the user's query must be converted into a vector format that represents its semantic meaning (called an “embedding”). Secondly, a vector database lookup is required using that embedding vector. There are various commercial and open source vector databases available.

How does RAG work?

In the RAG approach, the model itself doesn't know about new material (such as your company's latest products), but instead, the engine knows how to:

    (a) search your company documents for the most relevant ones (retriever), and

    (b) summarize relevant parts of the documents into an answer for the user's question (generator).

Unlike fine-tuning, the RAG approach does not use your company documents as training data that you cram into an updated model. Instead, the documents are a source of input data that is integrated via a retriever search component, and sent as input to the AI engine using an unchanged model. RAG may require some “prompt engineering” that combines the document search results and a user's query, but the foundational model itself does not change.

The retrieval component typically consists of a datastore of documents and a search mechanism. A typical setup would be a vector database containing documents that are indexed according to a semantic vectorization of their contents. The search mechanism would first vectorize the incoming query into its semantic components, then find the documents with the “nearest” matching vectors, which indicates a close semantic affinity.
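A minimal sketch of the “nearest vector” step (assuming the embeddings have already been computed by some embedding model; the brute-force linear scan and cosine similarity shown are one common choice, not the only one):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Brute-force nearest-neighbor search: return the index of the stored
// document embedding with the highest cosine similarity to the query.
std::size_t nearest_document(
        const std::vector<std::vector<double>>& doc_embeddings,
        const std::vector<double>& query_embedding) {
    std::size_t best = 0;
    double best_sim = -2.0;  // below the [-1, 1] cosine range
    for (std::size_t i = 0; i < doc_embeddings.size(); ++i) {
        double sim = cosine_similarity(doc_embeddings[i], query_embedding);
        if (sim > best_sim) { best_sim = sim; best = i; }
    }
    return best;
}
```

Real vector databases replace this linear scan with approximate nearest-neighbor indexes so that search stays fast over millions of documents, but the similarity idea is the same.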

Document Snippets. Typically, the results from the “retriever” component are small sections or snippets of documents, rather than full-size documents. Small sections are desirable because:

    (a) it would be costly to make an AI engine process a large document, and

    (b) it helps the AI find the most relevant information quickly.

The retrieved snippets or portions of documents would be returned to the AI. They would be prepended to the user's search query as “context” for the AI engine. Prompt engineering would then be used to ensure that the AI engine responds to the user query using information from the context document sections.

 
