Aussie AI
RAG Project Design
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
General implementation steps in a typical RAG project are as follows:
1. Data identification. Identify the proprietary data you want the RAG system to be an expert on. This also means working out how to ingest the data and keep it up to date. For example, whether it’s a JIRA customer support database, a Confluence space, or a directory of PDF files on disk, that base of knowledge will grow over time. It is worth planning the refresh operation carefully, because purging and starting over can be expensive, especially if embeddings need to be recalculated.
2. Sizing. Determine the size of a “chunk” of data. The context size of the model matters here, because the chunks need to be substantially smaller than the context window. When a user asks a question, the system will typically give the model 3-5 chunks of knowledge excerpts that are pertinent to the question, combined with the question itself. Furthermore, if a back-and-forth dialog occurs between the user and the model, extra room must be left in the context size for follow-up questions.
3. Splitting sections. Determine how to “chunk” the data. A boundary for each chunk needs to be identified, which might be the sections after a heading if it’s a web page. Larger sections might need the heading and one or two paragraphs in each chunk. Content from a bug tracking system might use the main description as one chunk, the follow-up comments as another chunk or multiple chunks, and the identified solution as another chunk. It is often beneficial to “overlap” chunks so that information spanning a chunk boundary is not lost.
4. Text-based database upload. Each chunk needs to be organized and stored in a text-based search engine, such as Elasticsearch or Azure Cognitive Search. For documents and web pages, the organization can be minimal, but for a ticketing system with a complex structure (e.g., problem descriptions, comments, solutions), the pieces need to be linked together somehow.
5. Vector database upload. The embedding for each chunk needs to be calculated and stored in a vector database. You can think of the embedding as a “summarization” of the chunk from the perspective of the model. The vector returned is typically multi-dimensional, with 50+ dimensions, where each dimension loosely represents a concept in the contents. The idea is that, from the model's perspective, similar content produces similar vectors, so a vector database can quickly find chunks with related content using vector lookup.
6. Optimizations. The embeddings can sometimes be calculated using a lazy evaluation algorithm (avoiding the cost of embedding calculations for never-used documents), but this can also slow down inference, so full precomputation is faster at query time. The model used for calculating embeddings does not need to be the same as the model answering the questions. Hence, a cheaper model, such as GPT-Turbo, can be used for the embeddings, while GPT-4 answers the questions.