RAG Detailed Algorithm

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The RAG algorithm is not training. Instead, prompt engineering gives the model all the content it needs to answer the question directly in the prompt. You are effectively using the LLM to take your content, mix it with its own trained knowledge (to a limited extent), answer the question eloquently, and then perhaps converse about it a little.

The basic technical algorithm flow for a user request in a RAG architecture can be something like this (a C++ sketch of the overall flow appears after the list):

    a. Receive the user's question (input).

    b. Use the user's question to do a text-based (keyword) search on the index and get the top X hits (of documents or snippets).

    c. Calculate the “embedding” for the user's question (a vector that represents its semantic meaning numerically).

    d. Calculate the embeddings for the top X hits (from the text search) and add these embedding vectors to the vector database.

    e. Do a vector search on the embeddings (e.g. scoring by cosine similarity; see the sketch after this list) and get the top Y hits.

    f. Filter the top X hits (text-based) and top Y hits (vector-based) to find overlaps; the overlap represents the hits that scored well on both searches. If there is no overlap, select some from each (a sketch of this merging step follows the list).

    g. Combine the top hits with any summarization from previous questions.

    h. Get the contents from the top hits and use prompt engineering to create a prompt something like the following (see the prompt-assembly sketch after the list):

      “Given <summary>, <chunk 1>, <chunk 2>, <chunk 3>, answer <question from user>.
      Only respond with content from the provided data.
      If you do not know the answer, respond with 'I do not know.'
      Cite the content used.”

    i. Send the prompt to the LLM, and receive the answer back from the LLM.

    j. Resolve any citations in the answer back to URLs that the end user can click on, e.g. a Confluence page or a Jira ticket/comment/solution.

    k. Summarize the conversation to date using the model (i.e. context for any subsequent questions).

    l. Send back the answer plus the summarization (perhaps encoded). The idea is that the encoded summarization will not be shown with this answer, but will only be used internally by the RAG components for follow-up questions.

    m. The client or caller is responsible for context management, which means ensuring that conversations end quickly and that new topics result in new conversations. Otherwise, the context fills up quickly, the LLM forgets what it's already said, and things get confusing.
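
To make the control flow concrete, here is a minimal C++ sketch of the request handler, following steps (a) through (m) above. Everything in it is a hypothetical placeholder: the Chunk type and the helper functions (keyword_search, embed, vector_db_add, vector_search, merge_hits, build_prompt, call_llm, resolve_citations, summarize) stand in for whatever search index, vector database, and LLM API you actually use. It is an outline of the control flow, not a definitive implementation.

    #include <string>
    #include <vector>

    // Placeholder chunk type and helper declarations; the real versions come
    // from your search index, vector database, and LLM API of choice.
    struct Chunk { std::string id; std::string text; };
    std::vector<Chunk> keyword_search(const std::string& query, int top_x);
    std::vector<float> embed(const std::string& text);
    void vector_db_add(const std::string& id, const std::vector<float>& vec);
    std::vector<Chunk> vector_search(const std::vector<float>& qvec, int top_y);
    std::vector<Chunk> merge_hits(const std::vector<Chunk>& text_hits,
                                  const std::vector<Chunk>& vec_hits,
                                  std::size_t max_chunks);
    std::string build_prompt(const std::string& summary,
                             const std::vector<Chunk>& chunks,
                             const std::string& question);
    std::string call_llm(const std::string& prompt);
    std::string resolve_citations(const std::string& answer,
                                  const std::vector<Chunk>& chunks);
    std::string summarize(const std::string& prior_summary,
                          const std::string& question,
                          const std::string& answer);

    struct RagAnswer {
        std::string answer;   // text shown to the user
        std::string summary;  // encoded summary, kept for follow-up questions
    };

    RagAnswer handle_rag_request(const std::string& question,        // (a)
                                 const std::string& prior_summary)
    {
        std::vector<Chunk> text_hits = keyword_search(question, 10); // (b) top X
        std::vector<float> qvec = embed(question);                   // (c)
        for (const Chunk& c : text_hits)                             // (d)
            vector_db_add(c.id, embed(c.text));
        std::vector<Chunk> vec_hits = vector_search(qvec, 10);       // (e) top Y
        std::vector<Chunk> best = merge_hits(text_hits, vec_hits, 3);// (f)
        std::string prompt =
            build_prompt(prior_summary, best, question);             // (g)+(h)
        std::string answer = call_llm(prompt);                       // (i)
        answer = resolve_citations(answer, best);                    // (j)
        std::string summary =
            summarize(prior_summary, question, answer);              // (k)
        return RagAnswer{answer, summary};  // (l); the caller handles (m)
    }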
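
The vector search in steps (c) to (e) typically scores chunks by cosine similarity between embedding vectors. A brute-force scoring function is easy to write in C++; a real vector database would use an approximate nearest-neighbor index instead, but the scoring idea is the same.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Cosine similarity between two embedding vectors: the usual scoring
    // function behind the vector search in steps (c)-(e).
    float cosine_similarity(const std::vector<float>& a,
                            const std::vector<float>& b)
    {
        float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
        for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
            dot    += a[i] * b[i];
            norm_a += a[i] * a[i];
            norm_b += b[i] * b[i];
        }
        if (norm_a == 0.0f || norm_b == 0.0f) return 0.0f;
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }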
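
Step (f), merging the keyword and vector hit lists, can be a simple set intersection with a fallback. The sketch below works on document IDs (plain strings here, for simplicity): it takes the overlap first, then tops up alternately from each list if the overlap is too small.

    #include <cstddef>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Step (f): prefer chunks that appear in both the keyword (top X) and
    // vector (top Y) hit lists; if the overlap is too small, take the
    // highest-ranked remaining hits alternately from each list.
    std::vector<std::string> merge_hits(
        const std::vector<std::string>& text_hits,
        const std::vector<std::string>& vector_hits,
        std::size_t max_chunks)
    {
        std::unordered_set<std::string> in_vector(vector_hits.begin(),
                                                  vector_hits.end());
        std::vector<std::string> merged;
        std::unordered_set<std::string> chosen;

        // First pass: the overlap (scored well on both searches).
        for (const std::string& id : text_hits) {
            if (merged.size() >= max_chunks) return merged;
            if (in_vector.count(id) && chosen.insert(id).second)
                merged.push_back(id);
        }
        // Second pass: top up alternately from each list, skipping duplicates.
        std::size_t ti = 0, vi = 0;
        while (merged.size() < max_chunks
               && (ti < text_hits.size() || vi < vector_hits.size())) {
            if (ti < text_hits.size() && chosen.insert(text_hits[ti]).second)
                merged.push_back(text_hits[ti]);
            ++ti;
            if (merged.size() < max_chunks && vi < vector_hits.size()
                && chosen.insert(vector_hits[vi]).second)
                merged.push_back(vector_hits[vi]);
            ++vi;
        }
        return merged;
    }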
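
Step (h) is mostly string assembly. Here is a minimal sketch of building the grounded prompt from the conversation summary, the selected chunks (passed as plain text here), and the user's question; the exact wording of the instructions is up to you.

    #include <string>
    #include <vector>

    // Step (h): assemble the grounded prompt from the conversation summary,
    // the selected chunks, and the user's question.
    std::string build_prompt(const std::string& summary,
                             const std::vector<std::string>& chunks,
                             const std::string& question)
    {
        std::string prompt = "Given the following context, answer the question.\n";
        if (!summary.empty())
            prompt += "Conversation summary: " + summary + "\n";
        for (std::size_t i = 0; i < chunks.size(); ++i)
            prompt += "Chunk " + std::to_string(i + 1) + ": " + chunks[i] + "\n";
        prompt += "Question: " + question + "\n";
        prompt += "Only respond with content from the provided data.\n";
        prompt += "If you do not know the answer, respond with 'I do not know.'\n";
        prompt += "Cite the content used.\n";
        return prompt;
    }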

The above algorithm is thorough in generating two sets of hits (top X and top Y). It's not strictly necessary to do two searches (one with text keywords and one with vector embeddings), as vector embeddings alone are often good enough. Alternatively, text-based keyword searches are often cheaper, and the vector lookups could be skipped. Either way, the goal is the same: find the chunks most likely to contain an answer to the user's question.

 
