Aussie AI

Retrieval Augmented Generation (RAG) Architectures

  • Last Updated 21 March, 2025
  • by David Spuler, Ph.D.

What is RAG?

RAG is a fundamental technique in generative AI that extends the knowledge of an LLM without fine-tuning. Rather than training new knowledge into the LLM's parameters, we instead look up the extra information by searching a database. The LLM receives the user's prompt along with the extra information found by the RAG lookup component (called the "retriever"). The LLM then uses its summarization and natural language capabilities to answer the user's question, using the extra RAG text as input context.
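As a rough illustration, the control flow of a basic RAG pipeline looks something like the sketch below. The retrieve_chunks() and call_llm() functions are hypothetical placeholders standing in for your retriever database and your LLM inference API, not any particular product's interface.

    # Minimal RAG pipeline sketch (illustrative only).
    # retrieve_chunks() and call_llm() are hypothetical placeholders for
    # your retriever database and LLM inference API.

    def retrieve_chunks(query: str, k: int = 3) -> list[str]:
        """Placeholder retriever: return the top-k most relevant text chunks."""
        raise NotImplementedError("Back this with your vector or keyword database.")

    def call_llm(prompt: str) -> str:
        """Placeholder LLM call: send the prompt, return the generated answer."""
        raise NotImplementedError("Back this with your LLM inference API.")

    def rag_answer(user_question: str) -> str:
        # 1. Retriever: look up extra context related to the question.
        chunks = retrieve_chunks(user_question)
        # 2. Packer: prepend the retrieved chunks to the user's question.
        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {user_question}\nAnswer:"
        )
        # 3. LLM: generate the final answer from the augmented prompt.
        return call_llm(prompt)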

RAG is commonly used as the go-to alternative to fine-tuning an LLM on a business's specialist data. For example, to create a chatbot that knows about your products, you could use fine-tuning to create a custom LLM. The more efficient way is to leave your LLM unchanged, put your special documents (e.g., your entire website) into a RAG database, and then have the LLM search those documents using a RAG architecture.

The current AI assistant capabilities of Google and Bing use a RAG-like architecture, although it is more of a mega-RAG architecture, with a rather large database of documents. The way it works is that Google or Bing first searches the entire internet (however they do that), and then the LLM summarizes the handful of retrieved internet documents into the final AI answer.

Beyond RAG

There are many different variations on the RAG architecture, and RAG architectures can also be extended in various ways. Some of the related capabilities that "augment" the LLM's input prompt with extra data include:

  • Retrieval Augmented Language Models (RALM) — the most general category including augmentation by basically anything; see more about RALM.
  • Tool-Augmented Language Models (TALM) — use dynamic tool execution to compute extra input data. See more about tool integrations.
  • Data source integrations ("plugins") — extended ways to search big databases, such as real estate listings or the entire internet, using a RAG-like approach.

Finally, note that RAG is an inherently "read-only" approach that only generates answers; it doesn't change anything for the user. The generalization of that idea is "agents," which are "read-write" and can perform real-world actions. For example, RAG might tell you what could be causing your symptoms, but an LLM agent could also book your doctor's appointment for you.

RAG Optimizations

RAG optimizations are LLM efficiency improvements applied to a RAG architecture. First point: RAG architectures are inherently an optimization themselves. RAG was created because fine-tuning was too expensive and has various other limitations (e.g., attribution, explainability), although Parameter-Efficient Fine-Tuning (PEFT) techniques have also attacked the inefficiencies in fine-tuning, so maybe it's a tie between RAG and FT/PEFT.

But you can also optimize the RAG architecture itself. To begin with, many of the major LLM optimizations also work on the RAG LLM, so there are many ways to do this (e.g., quantization, pruning, and other inference optimizations).

However, there are a few techniques that are specifically applicable to RAG architectures because they optimize either (a) non-LLM RAG components, or (b) the RAG prompt structure.

Some examples of RAG non-LLM optimizations include:

  • RAG database speedups (e.g., indexing, all the usual database stuff)
  • Keyword versus vector lookups in the retriever (e.g., hybrid keyword-vector search, metadata search, etc.); see the sketch after this list
  • Caching — multiple types (e.g. caching in the retriever versus the LLM parts)
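As a minimal sketch of the hybrid keyword-vector lookup mentioned above, the retriever below blends a crude keyword-overlap score with vector cosine similarity. The embed() function is a hypothetical placeholder for an embedding model, and a real system would typically use BM25 or similar rather than raw term overlap.

    # Minimal hybrid keyword + vector retrieval sketch (illustrative only).
    # embed() is a hypothetical placeholder for an embedding model.
    import math

    def embed(text: str) -> list[float]:
        """Placeholder embedding function; back this with your embedding model."""
        raise NotImplementedError

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def keyword_score(query: str, chunk: str) -> float:
        """Crude keyword overlap score (a real system would use BM25)."""
        q_terms = set(query.lower().split())
        c_terms = set(chunk.lower().split())
        return len(q_terms & c_terms) / (len(q_terms) or 1)

    def hybrid_search(query: str, chunks: list[str], k: int = 3, alpha: float = 0.5) -> list[str]:
        """Blend vector similarity and keyword overlap; return the top-k chunks."""
        q_vec = embed(query)
        scored = []
        for chunk in chunks:
            score = alpha * cosine(q_vec, embed(chunk)) + (1 - alpha) * keyword_score(query, chunk)
            scored.append((score, chunk))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]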

Secondly, there are some RAG-specific techniques on the "length" dimension (i.e., input tokens) that are applicable to an input prompt extended with extra prepended "context" tokens. Some examples are covered in the sections below (e.g., caching and long-context handling).

RAG is not the only architecture to use prepended context. For example, chatbots prepend the conversation history, so many of these approaches apply there too.
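As a simple illustration of managing the "length" dimension, here is a sketch of packing ranked chunks (or conversation history turns) into a fixed token budget before they are prepended to the prompt. The word-count approximation of token length is an assumption for illustration; a real system would use the model's tokenizer.

    # Illustrative sketch: pack ranked chunks into a fixed token budget before
    # prepending them to the prompt. Token counting here is a crude word-count
    # approximation; a real system would use the model's tokenizer.

    def pack_context(ranked_chunks: list[str], max_context_tokens: int = 2000) -> str:
        packed, used = [], 0
        for chunk in ranked_chunks:          # highest-relevance chunks first
            cost = len(chunk.split())        # crude stand-in for a real token count
            if used + cost > max_context_tokens:
                break                        # stop once the budget is exhausted
            packed.append(chunk)
            used += cost
        return "\n\n".join(packed)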

RAG Survey Papers

Survey papers on RAG architectures:

  • Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, Qing Li, 17 Jun 2024 (v3), A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2405.06211 Project: https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
  • Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
  • Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
  • Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu, 13 Feb 2022 (v2), A Survey on Retrieval-Augmented Text Generation, https://arxiv.org/abs/2202.01110
  • Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, Bin Cui, 21 Jun 2024 (v6), Retrieval-Augmented Generation for AI-Generated Content: A Survey, https://arxiv.org/abs/2402.19473
  • Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu, 3 Jul 2024 (v2), Evaluation of Retrieval-Augmented Generation: A Survey, https://arxiv.org/abs/2405.07437
  • Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 27 Mar 2024 (v5), Retrieval-Augmented Generation for Large Language Models: A Survey, https://arxiv.org/abs/2312.10997
  • Yucheng Hu, Yuxing Lu, 30 Apr 2024, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, https://arxiv.org/abs/2404.19543
  • Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei, 15 Jan 2025, Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG, https://arxiv.org/abs/2501.09136
  • Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang, Enhong Chen, 17 Mar 2025 (v2), A Survey on Knowledge-Oriented Retrieval-Augmented Generation, https://arxiv.org/abs/2503.10677

RAG Best Practices

RAG best practices are practical guidelines on getting the most out of your RAG architecture. This can include accuracy improvements and efficiency optimizations. Research papers that examine the general state of RAG architectures in terms of their best practices include:

  • Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
  • Florian June Aug 2024, The Best Practices of RAG: Typical RAG Process, Best Practices for Each Module, and Comprehensive Evaluation, https://pub.towardsai.net/the-best-practices-of-rag-300e313322e6
  • Siran Li, Linus Stenzel, Carsten Eickhoff, Seyed Ali Bahrainian, 13 Jan 2025, Enhancing Retrieval-Augmented Generation: A Study of Best Practices, https://arxiv.org/abs/2501.07391 https://github.com/ali-bahrainian/RAG_best_practices (Examines RAG best practices such as model size, prompt wording, chunk size, knowledge base size, and more.)
  • Harvey Bower, 2024, Debugging RAG Pipelines: Best Practices for High-Performance LLMs, https://www.amazon.com/dp/B0DNWN5RB1
  • Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296

Chunking

Chunking is the splitting of documents into sections called "chunks" that are used as extra context for the LLM. Retrieving relevant chunks is very important for accurate RAG results, and the speed of a RAG system is also affected by the size of each chunk, as measured in tokens. Chunking is a complex problem that involves deciding where to split a document, such as at paragraph or section boundaries.
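A minimal chunking sketch is shown below, splitting at paragraph boundaries and merging paragraphs up to a maximum chunk size; the word-based size measure is a stand-in for a real token count.

    # Minimal chunking sketch (illustrative only): split a document at paragraph
    # boundaries, then merge paragraphs into chunks below a maximum size.
    # Chunk size is measured in words here; a real system would count tokens.

    def chunk_document(text: str, max_chunk_words: int = 200) -> list[str]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current, current_len = [], [], 0
        for para in paragraphs:
            words = len(para.split())
            if current and current_len + words > max_chunk_words:
                chunks.append("\n\n".join(current))   # close the current chunk
                current, current_len = [], 0
            current.append(para)
            current_len += words
        if current:
            chunks.append("\n\n".join(current))
        return chunks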

Research papers on chunking:

  • Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
  • Thuwarakesh Murallie, Aug 2024, How to Achieve Near Human-Level Performance in Chunking for RAGs: The costly yet powerful splitting technique for superior RAG retrieval, https://towardsdatascience.com/agentic-chunking-for-rags-091beccd94b1
  • Florian June, Sep 2024, Kotaemon Unveiled: Innovations in RAG Framework for Document QA: PDF Parsing, GraphRAG, Agent-Based Reasoning, and Insights, https://ai.gopubby.com/kotaemon-unveiled-innovations-in-rag-framework-for-document-qa-0b6d67e4b9b7
  • Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
  • Brandon Smith, Anton Troynikov, July 03, 2024, Evaluating Chunking Strategies for Retrieval, Chroma Technical Report, https://research.trychroma.com/evaluating-chunking https://github.com/brandonstarxel/chunking_evaluation
  • Siran Li, Linus Stenzel, Carsten Eickhoff, Seyed Ali Bahrainian, 13 Jan 2025, Enhancing Retrieval-Augmented Generation: A Study of Best Practices, https://arxiv.org/abs/2501.07391 https://github.com/ali-bahrainian/RAG_best_practices (Examines RAG best practices such as model size, prompt wording, chunk size, knowledge base size, and more.)
  • Sergey Filimonov, Jan 15, 2025, Ingesting Millions of PDFs and why Gemini 2.0 Changes Everything, https://www.sergey.fyi/articles/gemini-flash-2
  • Andrew Neeser, Kaylen Latimer, Aadyant Khatri, Chris Latimer, Naren Ramakrishnan, 16 Feb 2025, QuOTE: Question-Oriented Text Embeddings, https://arxiv.org/abs/2502.10976 (Augmenting RAG chunks with additional information, such as questions the chunk might answer.)

Multimodal RAG

Multimodal RAG is the use of images in the datastore for chunk retrieval, and is also sometimes called "visual RAG." A common example of multimodal RAG is ingesting PDF documents in their native format, using image-based analysis, rather than converting them to text. The retriever in multimodal RAG may return images and/or text to be passed to the Multimodal LLM (MLLM) for inference. The final output from the visual RAG system may be text or images or both, as with any other use of a multimodal LLM.

Multimodal RAG is one of the newest areas of AI research, combining the recent advances in multimodal LLMs with the older RAG architectural styles. Research papers on multimodal RAG (visual RAG):

RAG Fusion

RAG fusion is a RAG extension that analyzes multiple versions of the query to return the best context chunks. The model generates several "reformulated" versions of the original text query, each of which is sent to the retriever, and a final "Reciprocal Rank Fusion" step combines all of the returned chunks into a single ranking, similar to a "reranker" component but operating over multiple similar rankings. The main advantage is finding more relevant context for the LLM; the downside is the many additional calls to the retriever database with slightly modified queries.
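The Reciprocal Rank Fusion step itself is simple to sketch: each chunk's fused score is the sum of 1/(k + rank) over the rankings it appears in, where k = 60 is the value commonly used in the RRF literature. The chunk names below are purely illustrative.

    # Reciprocal Rank Fusion (RRF) sketch: combine several ranked lists of chunks
    # (one per reformulated query) into a single fused ranking.
    # The constant k=60 is the value commonly used in the RRF literature.
    from collections import defaultdict

    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
        scores = defaultdict(float)
        for ranked_chunks in rankings:
            for rank, chunk in enumerate(ranked_chunks, start=1):
                scores[chunk] += 1.0 / (k + rank)   # higher rank => larger contribution
        return sorted(scores, key=scores.get, reverse=True)

    # Example: three reformulated queries each returned a ranked chunk list.
    fused = reciprocal_rank_fusion([
        ["chunk_a", "chunk_b", "chunk_c"],
        ["chunk_b", "chunk_a", "chunk_d"],
        ["chunk_a", "chunk_d", "chunk_b"],
    ])
    # fused[0] is "chunk_a", the chunk ranked consistently highest across queries.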

Research on RAG fusion algorithms:

Super RAG

Super RAG is a generalization of retrieval that accepts more general types of information than naive RAG systems. Hence, a "super RAG" system is an embodiment of a more general type of RALM. Research papers on "super RAG" include:

Agentic RAG

Agentic RAG is the combination of agent and RAG technologies. Traditional RAG is a read-only use of extra context, but adding agent capabilities to the system allows a RAG-based application to perform tasks or actions.
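A minimal sketch of this read-write extension is shown below, reusing the hypothetical retrieve_chunks() and call_llm() placeholders from the earlier sketch, plus a hypothetical book_appointment() tool. Real agent frameworks use more robust tool-calling protocols than simple string matching.

    # Agentic RAG sketch (illustrative only): after retrieval, the LLM decides
    # whether to answer directly ("read-only") or to invoke a tool that performs
    # a real-world action ("read-write"). Reuses the hypothetical retrieve_chunks()
    # and call_llm() placeholders from the earlier sketch.

    def book_appointment(details: str) -> str:
        """Placeholder action tool; back this with a real booking integration."""
        raise NotImplementedError

    def agentic_rag(user_request: str) -> str:
        chunks = retrieve_chunks(user_request)
        prompt = (
            "Context:\n" + "\n\n".join(chunks) + "\n\n"
            f"Request: {user_request}\n"
            "If an appointment should be booked, reply exactly 'ACTION: book'. "
            "Otherwise, answer the request directly."
        )
        decision = call_llm(prompt)
        if decision.strip() == "ACTION: book":
            return book_appointment(user_request)   # the "read-write" agent step
        return decision                             # plain "read-only" RAG answer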

Papers on agentic RAG include:

Reranker Component in RAG

The reranker is a RAG component that aims to select and order the most relevant chunks for the LLM to use. The input is a set of chunks or documents from the retriever in a preliminary ordering, which are then "re-ranked" into a better order. The basic idea is as follows (a minimal sketch appears after the list):

  • Retriever returns several chunks
  • Reranker orders them in priority of relevance
  • Packer merges the chunks with the user's query and other global instructions
  • One final LLM request answers the user's question
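Here is a minimal reranker sketch under the assumption of a relevance_score() placeholder, which in practice would be a cross-encoder or LLM-based scoring model.

    # Reranker sketch (illustrative only): re-order retrieved chunks by a
    # relevance score against the query. relevance_score() is a hypothetical
    # placeholder for a cross-encoder or other scoring model.

    def relevance_score(query: str, chunk: str) -> float:
        """Placeholder: back this with a cross-encoder or LLM-based scorer."""
        raise NotImplementedError

    def rerank(query: str, retrieved_chunks: list[str], top_k: int = 3) -> list[str]:
        scored = [(relevance_score(query, c), c) for c in retrieved_chunks]
        scored.sort(key=lambda pair: pair[0], reverse=True)   # most relevant first
        return [chunk for _, chunk in scored[:top_k]]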

Here are some research papers specific to the reranker component:

Long Context RAG

Long context RAG, or simply "long RAG", is the use of LLM long context capabilities to improve RAG architectures. The simplest ideas include using bigger chunks or sending more chunks to the LLM, both of which give more tokens for the LLM to process as context. There is a lot of research on getting LLMs to run fast on long context inputs, and some of it is specifically related to RAG architectures.

Research papers on "long RAG" include:

  • Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
  • Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
  • Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
  • Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
  • Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
  • Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
  • Contextual AI Team, March 19, 2024 Introducing RAG 2.0, https://contextual.ai/introducing-rag2/
  • Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, Hen-Hsen Huang, 20 Dec 2024, Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks, https://arxiv.org/abs/2412.15605 (Mini-RAG architecture preloading the entire knowledge into the LLM context and then using KV caching.)
  • Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun, 27 Dec 2024, Long Context vs. RAG for LLMs: An Evaluation and Revisits, https://arxiv.org/abs/2501.01880 (Long context, summarization-based RAG, and classic chunked RAG have different strengths and weaknesses for different types of query.)
  • Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu, 15 Jan 2025, MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents, https://arxiv.org/abs/2501.08828
  • Salvatore Raieli, Jan 2025, Do Not Flip a Coin: When to Use RAG or Long Context LLMs, Understanding the Trade-offs and Best Practices for Optimizing LLMs with External Knowledge Sources, https://levelup.gitconnected.com/do-not-flip-a-coin-when-to-use-rag-or-long-context-llms-6f51a39de98c (Analysis of several papers that compare LC to RAG)
  • Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 16 May 2024 (v3), FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065
  • Isuru Lakshan Ekanayaka, Jan 2025, Retrieval-Augmented Generation (RAG) vs. Cache-Augmented Generation (CAG): A Deep Dive into Faster, Smarter Knowledge Integration, https://pub.towardsai.net/retrieval-augmented-generation-rag-vs-0b4bc63c1653
  • Dr. Ashish Bamania, Jan 10, 2025, Cache-Augmented Generation (CAG) Is Here To Replace RAG: A deep dive into how a novel technique called Cache-Augmented Generation (CAG) works and reduces/eliminates the need for Retrieval-Augmented Generation (RAG), https://levelup.gitconnected.com/cache-augmented-generation-cag-is-here-to-replace-rag-3d25c52360b2
  • Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 12 Apr 2021 (v4), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://arxiv.org/abs/2005.11401
  • Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu, 27 Jan 2025, Parametric Retrieval Augmented Generation, https://arxiv.org/abs/2501.15915 https://github.com/oneal2000/prag (Parametric RAG (PRAG) is training the RAG documents into model parameters, rather than prepending documents using long context RAG, and this means a shorter inference token length.)
  • Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang, 3 Feb 2025, VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos, https://arxiv.org/abs/2502.01549 https://github.com/HKUDS/VideoRAG
  • Cristian Leo, Feb 2025, Don’t Do RAG: Cache is the future: CAG or RAG? Let’s explore Cached Augmented Generation, its math, and trade-offs. Let’s dig into its research paper to see what it excels at, and how you could leverage it. https://levelup.gitconnected.com/dont-do-rag-cache-is-the-future-d1e995f0c76f
  • Manpreet Singh, Feb 2025, Goodbye RAG? Gemini 2.0 Flash Have Just Killed It! https://ai.gopubby.com/goodbye-rag-gemini-2-0-flash-have-just-killed-it-96301113c01f
  • Kun Luo, Zheng Liu, Peitian Zhang, Hongjin Qian, Jun Zhao, Kang Liu, 17 Feb 2025, Does RAG Really Perform Bad For Long-Context Processing? https://arxiv.org/abs/2502.11444 (Long context RAG processing based on the KV cache data is similar to fused/substring KV caching methods.)
  • Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
  • Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh, 27 Feb 2025, Long-Context Inference with Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2502.20330

Mini-RAG

Mini-RAG is single-document RAG that stores the entirety of the knowledge base in the LLM's input context. The advantage of this architecture is that there is no need for a retriever component at all, but the disadvantages include the higher token count for inference and practical limitations on the size of the document being used. These efficiency constraints have been easing lately, via "long RAG" techniques based on LLM efficiency optimizations such as prefix KV caching.
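A mini-RAG request is easy to sketch because there is no retrieval step at all; the call_llm() function below is the same hypothetical placeholder used in the earlier sketches.

    # Mini-RAG sketch (illustrative only): no retriever component at all; the
    # entire knowledge document is prepended to every query. call_llm() is the
    # same hypothetical placeholder for an LLM API used in earlier sketches.

    def mini_rag_answer(document_text: str, user_question: str) -> str:
        prompt = (
            "Use only the document below to answer the question.\n\n"
            f"Document:\n{document_text}\n\n"
            f"Question: {user_question}\nAnswer:"
        )
        # The whole document is sent as input context on every request, so the
        # token count grows with document size unless the document's prefix KV
        # cache is reused across requests.
        return call_llm(prompt)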

Research papers on single-document RAG or "mini-RAG" include:

RAG Knowledge Graph

A RAG Knowledge Graph architecture, or a "RAG Graph," is a combination of RAG with a Knowledge Graph. Instead of returning text chunks, the retriever returns a structured "graph" that represents additional knowledge. The advantage of a graph is that it contains concept relationships such as hierarchies.
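A minimal sketch of graph-based retrieval is shown below; the Triple structure and the tiny example graph are purely illustrative, and a real system would query a graph database and traverse multiple hops.

    # RAG knowledge-graph sketch (illustrative only): instead of text chunks, the
    # retriever returns a small set of (subject, relation, object) triples around
    # the entities mentioned in the query, serialized into text for the LLM.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Triple:
        subject: str
        relation: str
        obj: str

    def retrieve_subgraph(query: str, graph: list[Triple]) -> list[Triple]:
        """Return triples whose subject or object is mentioned in the query."""
        q = query.lower()
        return [t for t in graph if t.subject.lower() in q or t.obj.lower() in q]

    def serialize_subgraph(triples: list[Triple]) -> str:
        """Flatten the subgraph into text that can be prepended as LLM context."""
        return "\n".join(f"{t.subject} -[{t.relation}]-> {t.obj}" for t in triples)

    # Example: a query about "Golden Retriever" pulls in its relationships,
    # including the hierarchy link "Golden Retriever -[is_a]-> Dog".
    graph = [
        Triple("Golden Retriever", "is_a", "Dog"),
        Triple("Golden Retriever", "origin", "Scotland"),
        Triple("Dog", "is_a", "Mammal"),
    ]
    context = serialize_subgraph(retrieve_subgraph("Tell me about the Golden Retriever", graph))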

Research on RAG with Knowledge Graphs:

Ontology RAG

Ontology-based RAG is the use of a special type of Knowledge Graph, known as an "ontology" or "taxonomy" of the concept space. Extra information can be extracted from the taxonomy as a special type of retrieval for RAG-based systems. The advantage is the ability to better capture structured information and hierarchical relationships between concepts in the ontology.
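A minimal sketch of ontology-based query expansion is shown below; the toy taxonomy is purely illustrative, and a real ontology would be far larger and support richer relations than a single parent link.

    # Ontology-based query expansion sketch (illustrative only): before retrieval,
    # expand the query with broader concepts from a small taxonomy so the
    # retriever can also match documents that use related terminology.

    # Hypothetical toy taxonomy: child concept -> parent concept.
    PARENT = {
        "golden retriever": "dog",
        "dog": "animal",
        "siamese": "cat",
        "cat": "animal",
    }

    def ancestors(term: str) -> list[str]:
        """Walk up the taxonomy collecting broader concepts."""
        chain = []
        while term in PARENT:
            term = PARENT[term]
            chain.append(term)
        return chain

    def expand_query(query: str) -> str:
        """Append ancestor concepts of any taxonomy terms found in the query."""
        extra = []
        for term in PARENT:
            if term in query.lower():
                extra.extend(ancestors(term))
        return query if not extra else query + " " + " ".join(dict.fromkeys(extra))

    # expand_query("best food for a golden retriever")
    # -> "best food for a golden retriever dog animal"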

Research papers on LLMs and Ontologies include:

RAG Caching

RAG caching is the use of caching optimizations to improve the latency and speed of a RAG system. Several components in a RAG architecture can be optimized with a cache. The retrieval component can use all of the types of caching applicable to whatever database or datastore architecture it uses, irrespective of whether it's a keyword or vector lookup, and whether the data is stored on disk or cached in memory. All of these different retrieval options can have a cache. At the bottom level of the LLM, there are various KV caching techniques (see further below). At the topmost level, there can be an overall cache via an "inference cache" for exactly identical queries, or a "semantic cache" for similar queries.
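As an illustration of the topmost level, here is a sketch of an exact-match inference cache keyed on a hash of the full prompt; call_llm() is again a hypothetical placeholder, and a semantic cache would instead compare query embeddings against cached queries and reuse an answer when similarity exceeds a threshold.

    # Inference-cache sketch (illustrative only): an exact-match cache keyed on a
    # hash of the full prompt, checked before calling the LLM at all.
    import hashlib

    _answer_cache: dict[str, str] = {}

    def cached_llm_call(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in _answer_cache:
            return _answer_cache[key]      # cache hit: skip LLM inference entirely
        answer = call_llm(prompt)          # hypothetical LLM call from earlier sketches
        _answer_cache[key] = answer
        return answer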

Research papers on RAG cache architectures:

RAG KV Caching Optimizations

KV caching optimizations are the storing of key-value (KV) attention data from LLM inference for reuse in subsequent inference requests in a RAG system. In addition to RAG caches, such as retrieval caches, there are various LLM cache methods. Several of the many types of KV caching optimizations can optimize RAG architectures (and other LLM use cases). The main KV cache techniques involve precomputed caches for RAG chunks, such as prefix caching or session caching. More information is available:
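A minimal sketch of prefix KV caching for RAG chunks is shown below; prefill_kv() and decode_with_kv() are hypothetical stand-ins for an inference engine's internals, which are not normally exposed at this level in hosted LLM APIs.

    # Prefix KV cache sketch (illustrative only): precompute and store the KV
    # cache for each RAG chunk so that requests prepending the same chunk can
    # skip prefill over those tokens. prefill_kv() and decode_with_kv() are
    # hypothetical stand-ins for your inference engine's internals.
    import hashlib

    _kv_cache: dict[str, object] = {}

    def prefill_kv(text: str) -> object:
        """Placeholder: run prefill over `text` and return its KV cache state."""
        raise NotImplementedError

    def decode_with_kv(kv_state: object, suffix: str) -> str:
        """Placeholder: continue generation from a cached prefix KV state."""
        raise NotImplementedError

    def rag_infer_with_prefix_cache(chunk: str, question: str) -> str:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in _kv_cache:
            _kv_cache[key] = prefill_kv(chunk)     # pay the prefill cost once per chunk
        # Only the question tokens need prefill now; the chunk's KV data is reused.
        return decode_with_kv(_kv_cache[key], "\n\nQuestion: " + question + "\nAnswer:")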

Other general types of caching that apply to any LLM system, and can be used with RAG:

RAG Optimization Research Papers

Research papers on optimization of RAG architectures:

General Research Papers on RAG

There are rather a lot of research papers on RAG, as it's a fundamental underpinning technique of generative AI. Here are a few of them:

Advanced RAG

Research papers on advanced RAG architectures:

More AI Research

Read more about: