Aussie AI

Retrieval Augmented Generation (RAG) Architectures

  • Last Updated 11 December, 2024
  • by David Spuler, Ph.D.

What is RAG?

RAG is a fundamental technique in generative AI that extends the knowledge of an LLM without fine-tuning. Rather than train new knowledge in the LLM's parameters, we instead look up the extra information by searching a database. The LLM receives the user's prompt and the extra information found by the RAG lookup (called the "retriever" component). The LLM then uses its summarization and natural language capabilities to answer the user's question, based on the extra RAG text as input context.
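Here is a minimal sketch of that flow in Python. The document store and LLM call are stand-ins (a tiny in-memory list and a placeholder function), not any particular library's API:

```python
# Minimal sketch of the RAG flow: retrieve extra text, prepend it to the
# user's question, and have the LLM answer from that context.

DOCUMENTS = [
    "The Model X widget supports Bluetooth 5.0 and USB-C charging.",
    "Returns are accepted within 30 days with proof of purchase.",
    "The head office is located in Sydney, Australia.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Toy keyword retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(DOCUMENTS,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM request (e.g., a chat completion API)."""
    return "<LLM answer based on the prompt>"

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))    # extra knowledge from the retriever
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)                    # LLM summarizes the context into an answer

print(rag_answer("Does the widget have USB-C charging?"))
```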

RAG is the go-to architecture for giving an LLM a business's specialist data without fine-tuning. For example, to create a chatbot that knows about your products, one option is to fine-tune a custom LLM on your product documentation. The more efficient way is to leave your LLM unchanged, put your special documents into a RAG database (e.g., your entire website), and then have the LLM search those documents using a RAG architecture.
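Setting up that RAG database typically means splitting the documents into chunks and indexing them, often with embedding vectors for similarity search. Below is a toy indexing sketch; the embed function is a letter-frequency stand-in for a real embedding model, and the "index" is just a Python list rather than a real vector database:

```python
# Toy sketch of indexing documents into a RAG "database".

import math

def embed(text: str) -> list[float]:
    """Stand-in embedding: normalized letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(text: str, size: int = 200) -> list[str]:
    """Split a document into fixed-size chunks for retrieval."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(pages: list[str]) -> list[tuple[str, list[float]]]:
    """Store a (chunk, embedding) pair for every chunk of every page."""
    return [(c, embed(c)) for page in pages for c in chunk(page)]

def vector_search(index, query: str, top_k: int = 3) -> list[str]:
    """Return the chunks whose embeddings are closest to the query embedding."""
    qv = embed(query)
    ranked = sorted(index,
                    key=lambda item: sum(a * b for a, b in zip(qv, item[1])),
                    reverse=True)
    return [c for c, _ in ranked[:top_k]]
```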

The current AI assistant capabilities of Google and Bing are a RAG-like architecture, but more of a mega-RAG architecture, with a rather large database of documents: the entire internet. The way it works is that Google or Bing first searches the internet (however they do that), and then the LLM summarizes the handful of top documents into the final AI answer.

Beyond RAG

There are many different variations on the RAG architecture, and RAG architectures can be extended in various ways. Some of the similar capabilities that "augment" the LLM's input prompt with extra data include:

  • Retrieval Augmented Language Models (RALM) — the most general category including augmentation by basically anything; see more about RALM.
  • Tool-Augmented Language Models (TALM) — use dynamic tool execution to compute extra input data. See more about tool integrations.
  • Data source integrations ("plugins") — extended ways to search big databases, such as real estate listings or the entire internet, using a RAG-like approach.

Finally, note that RAG is an inherently "read-only" approach: it only generates answers and doesn't change anything for the user. The generalization of that idea is "agents," which can perform real-world actions (i.e., they're "read-write"). For example, RAG could maybe tell you what your symptoms might be caused by, but an LLM agent can also book your doctor's appointment for you.

RAG Optimizations

First, RAG architectures are themselves inherently an optimization. RAG was created because fine-tuning was too expensive and has various other limitations (e.g., attribution, explainability), although Parameter-Efficient Fine-Tuning (PEFT) techniques have also attacked the inefficiencies in fine-tuning, so maybe it's a tie between RAG and FT/PEFT.

But you can also optimize your RAG architecture itself. Many of the major LLM optimizations also work on the RAG LLM, so there are many ways to do this (e.g., quantization, pruning, inference optimizations, etc.)

However, there are a few techniques that are specifically applicable to RAG architectures because they optimize either (a) non-LLM RAG components, or (b) the RAG prompt structure.

Some examples of RAG non-LLM optimizations include the following (a hybrid retrieval sketch appears after the list):

  • RAG database speedups (e.g., indexing, all the usual database stuff)
  • Keyword versus vector lookups in the retriever (e.g., hybrid keyword-vector search, metadata search, etc.)
  • Caching — multiple types (e.g. caching in the retriever versus the LLM parts)
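As a quick illustration of the second item, here is a toy sketch of hybrid keyword-plus-vector retrieval. Both scoring functions are crude stand-ins; a real retriever would typically use something like BM25 for keywords and cosine similarity over real embeddings for the vector part:

```python
# Sketch of hybrid keyword + vector retrieval in the RAG retriever.

def keyword_score(query: str, doc: str) -> float:
    """Toy keyword relevance: fraction of query words appearing in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def vector_score(query: str, doc: str) -> float:
    """Toy "semantic" relevance: character bigram overlap, a stand-in for
    cosine similarity between real embedding vectors."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    q, d = grams(query.lower()), grams(doc.lower())
    return len(q & d) / (len(q | d) or 1)

def hybrid_search(query: str, docs: list[str], alpha: float = 0.5, top_k: int = 3) -> list[str]:
    """Blend the two scores; alpha controls the keyword-versus-vector weighting."""
    scored = [(alpha * keyword_score(query, d) + (1 - alpha) * vector_score(query, d), d)
              for d in docs]
    scored.sort(reverse=True)
    return [d for _, d in scored[:top_k]]
```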

Secondly, there are some RAG-specific techniques on the "length" dimension (i.e., input tokens), which apply whenever an input prompt is extended with extra prepended "context" tokens.

RAG is not the only architecture to use prepended context. For example, chatbots prepend the conversation history, so many of these approaches apply there too.
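One simple example of a length-dimension optimization is packing only as many retrieved chunks into the prepended context as fit within a token budget. Here is a minimal sketch, using word counts as a crude stand-in for the model's real tokenizer:

```python
# Sketch of packing ranked RAG chunks into a fixed context-length budget.

def pack_context(chunks: list[str], max_tokens: int = 1000) -> str:
    """Greedily add chunks (assumed already ranked by relevance) until the budget is hit."""
    packed, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())          # crude token estimate
        if used + cost > max_tokens:
            break
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```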

RAG Optimization Research Papers

Research papers on optimization of RAG architectures:

RAG Survey Papers

Survey papers on RAG architectures:

  • Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, Qing Li, 17 Jun 2024 (v3), A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2405.06211 Project: https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
  • Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
  • Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
  • Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu, 13 Feb 2022 (v2), A Survey on Retrieval-Augmented Text Generation, https://arxiv.org/abs/2202.01110
  • Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, Bin Cui, 21 Jun 2024 (v6), Retrieval-Augmented Generation for AI-Generated Content: A Survey, https://arxiv.org/abs/2402.19473
  • Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu, 3 Jul 2024 (v2), Evaluation of Retrieval-Augmented Generation: A Survey, https://arxiv.org/abs/2405.07437
  • Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 27 Mar 2024 (v5), Retrieval-Augmented Generation for Large Language Models: A Survey, https://arxiv.org/abs/2312.10997

Research Papers on RAG

There are rather a lot of research papers on RAG, as it's a fundamental underpinning technique of generative AI. Here are a few of them:

Advanced RAG

Research papers on advanced RAG architectures:

Reranker Component in RAG

The reranker component aims to select and order the most relevant chunks for the LLM to use. The basic idea is as follows (a code sketch appears after the list):

  • Retriever returns several chunks
  • Reranker orders them in priority of relevance
  • Packer merges the chunks with the user's query and other global instructions
  • One final LLM request answers the user's question
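Here is a toy sketch of the reranker and packer steps. The relevance score is a simple word-overlap measure; a production reranker would more likely be a cross-encoder model scoring (query, chunk) pairs:

```python
# Sketch of the reranker and packer steps in a RAG pipeline.

def rerank(query: str, chunks: list[str]) -> list[str]:
    """Order retrieved chunks by (toy) relevance to the query, best first."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)

def pack_prompt(query: str, chunks: list[str], instructions: str) -> str:
    """Merge the reranked chunks, global instructions, and the user's query into one prompt."""
    context = "\n\n".join(chunks)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

# Usage: retriever returns several chunks, reranker orders them, packer builds
# the single final prompt that is sent to the LLM.
chunks = ["Refunds take 5 business days.", "Shipping is free over $50."]
prompt = pack_prompt("How long do refunds take?",
                     rerank("How long do refunds take?", chunks),
                     "Answer using only the context provided.")
```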

Here are some research papers specific to the reranker component:

Long Context RAG

There is a lot of research on getting LLMs to run fast on long context inputs, and some of this is related to RAG architectures (i.e., big chunks!):

  • Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
  • Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
  • Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
  • Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
  • Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
  • Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343

RAG Knowledge Graph

A RAG Knowledge Graph architecture, or a "RAG Graph," is a combination of RAG with a Knowledge Graph. Instead of returning text chunks, the retriever returns a structured "graph" that represents additional knowledge. The advantage of a graph is that it contains concept relationships such as hierarchies.
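A toy sketch of the idea is below: the retriever matches entities in the query against a tiny store of (subject, relation, object) triples, follows one hop of related triples (which is where the hierarchies and relationships come in), and serializes the result as prepended context. The triple store and entity matching here are purely illustrative:

```python
# Sketch of a knowledge-graph retriever returning graph facts instead of text chunks.

TRIPLES = [
    ("Model X", "is_a", "widget"),
    ("Model X", "supports", "Bluetooth 5.0"),
    ("widget", "category_of", "hardware product"),
]

def graph_retrieve(query: str) -> list[tuple[str, str, str]]:
    """Match triples to entities in the query, then add one hop of related triples."""
    q = query.lower()
    hits = [t for t in TRIPLES if t[0].lower() in q or t[2].lower() in q]
    linked = {t[2] for t in hits}
    one_hop = [t for t in TRIPLES if t[0] in linked and t not in hits]
    return hits + one_hop

def triples_to_context(triples: list[tuple[str, str, str]]) -> str:
    """Serialize graph facts into text the LLM can read as extra context."""
    return "\n".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)

print(triples_to_context(graph_retrieve("Does the Model X support Bluetooth?")))
```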

Research on RAG with Knowledge Graphs:

RAG Caching

Several components in a RAG architecture can be optimized with a cache. The retrieval component can use all of the types of caching applicable to whatever database or datastore architecture it uses, irrespective of whether it's keyword or vector lookup, and whether data is stored on disk or cached in memory. All of these different retrieval options can have a cache. At the bottom level of the LLM, there are various KV caching techniques (see further below). At the topmost level, there can be an overall cache: an "inference cache" for exactly identical queries, or a "semantic cache" for similar queries.
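Here is a toy sketch of those two top-level caches: an exact-match inference cache plus an approximate semantic cache. The embedding function and the full RAG pipeline call are stand-ins:

```python
# Sketch of an exact-match "inference cache" and an approximate "semantic cache".

inference_cache: dict[str, str] = {}                   # normalized query -> answer
semantic_cache: list[tuple[list[float], str]] = []     # (query embedding, answer)

def embed(text: str) -> list[float]:
    """Toy embedding (stand-in for a real embedding model)."""
    v = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            v[ord(ch) - ord('a')] += 1.0
    n = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / n for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def run_full_rag_pipeline(query: str) -> str:
    """Placeholder for the full retrieve-then-generate pipeline."""
    return "<answer>"

def cached_answer(query: str, threshold: float = 0.95) -> str:
    key = " ".join(query.lower().split())
    if key in inference_cache:                  # exact-match hit: identical query
        return inference_cache[key]
    qv = embed(query)
    for vec, answer in semantic_cache:          # semantic hit: similar enough query
        if cosine(qv, vec) >= threshold:
            return answer
    answer = run_full_rag_pipeline(query)       # cache miss: do the full RAG query
    inference_cache[key] = answer
    semantic_cache.append((qv, answer))
    return answer
```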

Research papers on RAG cache architectures:

KV Caching Optimizations

In addition to RAG caches, such as retrieval caches, there are various LLM cache methods. Several of the many types of KV caching optimizations can optimize RAG architectures (and other LLM use cases). The main KV cache techniques involve precomputed caches for RAG chunks, such as prefix caching or session caching. More information is available:
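As a purely conceptual sketch of prefix caching for RAG chunks, the precomputed state for a frequently used chunk can be keyed by a hash of the chunk text and reused whenever that chunk is prepended again (prefix caching only helps when the cached text forms an identical leading token sequence). The KVState class below is an opaque placeholder; real KV caches hold per-layer key/value tensors inside the inference engine:

```python
# Conceptual sketch of a prefix KV cache keyed by RAG chunk text.

import hashlib

class KVState:
    """Placeholder for the transformer's precomputed key/value tensors for a prefix."""
    def __init__(self, prefix_text: str):
        self.prefix_text = prefix_text

prefix_cache: dict[str, KVState] = {}

def chunk_key(chunk: str) -> str:
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def kv_state_for_chunk(chunk: str) -> KVState:
    """Return the cached prefix state for this chunk, computing it only on a miss."""
    key = chunk_key(chunk)
    if key not in prefix_cache:
        # On a miss, the inference engine would run prefill over the chunk tokens here.
        prefix_cache[key] = KVState(chunk)
    return prefix_cache[key]
```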

More Types of Caching

Other general types of caching that apply to any LLM system, and can be used with RAG:

More AI Research

Read more about: