Aussie AI
Retrieval Augmented Generation (RAG) Architectures
Last Updated 11 December, 2024
by David Spuler, Ph.D.
What is RAG?
RAG is a fundamental technique in generative AI that extends the knowledge of an LLM without fine-tuning. Rather than training new knowledge into the LLM's parameters, we look up the extra information by searching a database. The LLM receives the user's prompt along with the extra information found by the RAG lookup (performed by the "retriever" component). The LLM then uses its summarization and natural language capabilities to answer the user's question, using the extra RAG text as input context.
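As a sketch, the retrieve-then-augment flow looks like the following. This is a toy illustration with hypothetical names: the "embedding" is just a bag-of-words vector and the retriever is a cosine-similarity ranking, standing in for a real embedding model and vector database, and the final step would pass the built prompt to an actual LLM.

```python
# Toy RAG pipeline sketch: retrieve relevant chunks, prepend them as context.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding': word -> count (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """Retriever component: rank document chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    """Prepend the retrieved chunks as extra input context for the LLM."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "The Widget 3000 supports wireless charging.",
    "Our returns policy allows refunds within 30 days.",
    "The Widget 3000 battery lasts 12 hours.",
]
prompt = build_prompt("How long does the Widget 3000 battery last?", chunks)
# 'prompt' now contains the battery chunk; an LLM call would answer from it.
```

In a production system the chunks come from your document database (e.g., your website), and the retriever is usually a vector index, but the overall shape is the same.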
RAG is commonly used as the go-to architecture for grounding an LLM in a business's specialist data without fine-tuning. For example, to create a chatbot that knows about your products, you could fine-tune a custom LLM on product information. The more efficient way is to leave your LLM unchanged, put your special documents into a RAG database (e.g., your entire website), and then have the LLM consult these documents via a RAG architecture.
The AI assistant features in Google and Bing search are RAG-like architectures, but more like a mega-RAG architecture, using a very large database of documents. The way it works is that Google or Bing first searches the entire internet (however they do this), and then the LLM summarizes the handful of retrieved documents into the final AI answer.
Beyond RAG
There are many variations on the RAG architecture, and it can be extended in various ways. Similar capabilities that "augment" the LLM's input prompt with extra data include:
- Retrieval Augmented Language Models (RALM) — the most general category including augmentation by basically anything; see more about RALM.
- Tool-Augmented Language Models (TALM) — use dynamic tool execution to compute extra input data. See more about tool integrations.
- Data source integrations ("plugins") — extended ways to search big databases, such as real estate listings or the entire internet, using a RAG-like approach.
Finally, note that RAG is an inherently "read-only" approach that only generates answers; it doesn't act on the user's behalf. The generalization of that idea is "agents," which can take real-world actions (i.e., they're "read-write" and can perform "actions"). For example, RAG could maybe tell you what your symptoms might be caused by, but an LLM agent can also book your doctor's appointment for you.
RAG Optimizations
First point: RAG architectures are themselves inherently an optimization. RAG was created because fine-tuning was too expensive and had various other limitations (e.g., attribution, explainability), although Parameter-Efficient Fine-Tuning (PEFT) techniques have also attacked the inefficiencies of fine-tuning, so maybe it's a tie between RAG and FT/PEFT.
But you can also optimize your RAG architecture itself. The first point is that many of the major LLM optimizations also work on the RAG LLM, so there are many ways to do this (e.g., quantization, pruning, inference optimizations, etc.)
However, there are a few techniques that are specifically applicable to RAG architectures because they optimize either (a) non-LLM RAG components, or (b) the RAG prompt structure.
Some examples of RAG non-LLM optimizations include:
- RAG database speedups (e.g., indexing, all the usual database stuff)
- Keyword versus vector lookups in the retriever (e.g., hybrid keyword-vector search, metadata search, etc.)
- Caching — multiple types (e.g. caching in the retriever versus the LLM parts)
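As an illustration of the retriever-level options, here is a toy sketch of hybrid keyword-vector scoring. The function names and scoring formulas are assumptions for illustration only: real systems typically blend a BM25 keyword score with a learned-embedding similarity, often via reciprocal rank fusion, rather than the simple weighted sum shown here.

```python
# Toy hybrid keyword-vector retrieval scoring (illustrative only).
import math
from collections import Counter

def keyword_score(query, doc):
    """Fraction of query terms appearing in the document (toy BM25 stand-in)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def vector_score(query, doc):
    """Toy cosine similarity over bag-of-words vectors (embedding stand-in)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[w] * d[w] for w in q if w in d)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_score(query, doc, alpha=0.5):
    """Blend keyword and vector scores; alpha weights the keyword component."""
    return alpha * keyword_score(query, doc) + (1 - alpha) * vector_score(query, doc)

docs = ["pricing for enterprise plans", "enterprise plan pricing details", "company history"]
best = max(docs, key=lambda d: hybrid_score("enterprise pricing", d))
```

The blend weight (here `alpha`) is one of the tuning knobs in a hybrid retriever: keyword matching catches exact product names and rare terms, while vector similarity catches paraphrases.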
Secondly, there are some RAG-specific techniques on the "length" dimension (i.e., input tokens) that apply to an input prompt extended with extra prepended "context" tokens. Some examples include:
- Chunk compression (e.g., chunk pre-summarization)
- Prompt compression
- Context compression
- Prompt lookup decoding (an extension of speculative decoding)
- Prefix global KV cache
- Precomputed KV cache (for each RAG chunk)
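To make the "length" dimension concrete, here is a toy illustration of context compression: shrinking retrieved chunks before they are prepended to the prompt. Real systems use trained summarizers or token-pruning methods such as prompt compression; the stopword-dropping heuristic below is purely a sketch to show why fewer context tokens means less prefill compute.

```python
# Toy context compression: drop low-information words from a retrieved chunk
# before prepending it to the LLM prompt (a stand-in for real summarization
# or prompt-compression methods).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "for", "that"}

def compress_chunk(chunk):
    """Remove low-information words from a retrieved chunk to save input tokens."""
    kept = [w for w in chunk.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

chunk = "The battery of the Widget 3000 is rated for 12 hours of continuous use"
compressed = compress_chunk(chunk)
saved = 1 - len(compressed.split()) / len(chunk.split())
# 'saved' is the fraction of input tokens removed; every token cut from the
# prepended context reduces prefill work for the RAG LLM.
```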
RAG is not the only architecture to use prepended context. For example, chatbots prepend the conversation history, so many of these approaches apply there too.
RAG Optimization Research Papers
Research papers on optimization of RAG architectures:
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Priyank Rathod, May 21, 2024, Efficient Usage of RAG Systems in the World of LLMs, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171625877.73379410/v1
- Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen, 25 May 2024, Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection, https://arxiv.org/abs/2405.16178
- Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
- Dr. Ashish Bamania, Jun 18, 2024, Google’s New Algorithms Just Made Searching Vector Databases Faster Than Ever: A Deep Dive into how Google’s ScaNN and SOAR Search algorithms supercharge the performance of Vector Databases, https://levelup.gitconnected.com/googles-new-algorithms-just-made-searching-vector-databases-faster-than-ever-36073618d078
- Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister, 11 Jul 2024, Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting, https://arxiv.org/abs/2407.08223
- Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Eric Yang, Jonathan Amar, Jong Ha Lee, Bhawesh Kumar, Yugang Jia, 25 Jul 2024, The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.18044
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024 (v2), Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation https://arxiv.org/abs/2404.06910 (Process each RAG chunk in parallel and choose a final output.)
- Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang, 6 Feb 2024 (v2), When Large Language Models Meet Vector Databases: A Survey, https://arxiv.org/abs/2402.01763
- Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
- David Spuler, September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
- Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang, 15 Oct 2024 (v2), EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations, https://arxiv.org/abs/2410.10315 https://github.com/BUAADreamer/EasyRAG
- Tolga Şakar and Hakan Emekci, 30 October 2024, Maximizing RAG efficiency: A comparative analysis of RAG methods, Natural Language Processing. doi:10.1017/nlp.2024.53, https://www.cambridge.org/core/journals/natural-language-processing/article/maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods/D7B259BCD35586E04358DF06006E0A85 https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D7B259BCD35586E04358DF06006E0A85/S2977042424000530a.pdf/div-class-title-maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods-div.pdf
- Sarayavalasaravikiran, Nov 2024, Optimizing RAG with Embedding Tuning, https://ai.plainenglish.io/optimizing-rag-with-embedding-tuning-2508af2ec049
RAG Survey Papers
Survey papers on RAG architectures:
- Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, Qing Li, 17 Jun 2024 (v3), A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2405.06211 Project: https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
- Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
- Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
- Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu, 13 Feb 2022 (v2), A Survey on Retrieval-Augmented Text Generation, https://arxiv.org/abs/2202.01110
- Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, Bin Cui, 21 Jun 2024 (v6), Retrieval-Augmented Generation for AI-Generated Content: A Survey, https://arxiv.org/abs/2402.19473
- Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu, 3 Jul 2024 (v2), Evaluation of Retrieval-Augmented Generation: A Survey, https://arxiv.org/abs/2405.07437
- Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 27 Mar 2024 (v5), Retrieval-Augmented Generation for Large Language Models: A Survey, https://arxiv.org/abs/2312.10997
Research Papers on RAG
There are rather a lot of research papers on RAG, as it's a fundamental underpinning technique of generative AI. Here are a few of them:
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Timo Lehto, June 2024, Developing LLM-powered Applications Using Modern Frameworks, Bachelor’s Thesis, Information and Communications Technology, Jamk University of Applied Sciences, Finland, June 2024, 53 pages., https://www.theseus.fi/bitstream/handle/10024/862271/Lehto_Timo.pdf?sequence=2 (Building LLM-based applications in RAG architecture using LangChain.)
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Yucheng Hu, Yuxing Lu, 30 Apr 2024, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, https://arxiv.org/abs/2404.19543 Project: https://github.com/2471023025/RALM_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Mandar Karhade, Mar 20, 2024, Why RAG Applications Fail in Production, Towards AI, https://pub.towardsai.net/why-rag-applications-fail-in-production-a-technical-deep-dive-15cc976af52c
- Priyank Rathod, May 21, 2024, Efficient Usage of RAG Systems in the World of LLMs, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171625877.73379410/v1
- June 2024 (accessed), R2R: The ultimate open-source RAG framework, https://github.com/SciPhi-AI/R2R
- Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Bin Cui, 27 Mar 2024 (v2), Retrieval-Augmented Generation for AI-Generated Content: A Survey, https://arxiv.org/abs/2402.19473 Project: https://github.com/hymie122/RAG-Survey
- Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe, 12 Jan 2024, The Unreasonable Effectiveness of Easy Training Data for Hard Tasks, https://arxiv.org/abs/2401.06751
- Bijit Ghosh, Dec 25, 2023, Advanced RAG for LLMs/SLMs, Medium, https://medium.com/@bijit211987/advanced-rag-for-llms-slms-5bcc6fbba411
- Iulia Brezeanu, Jan 5, 2024, How to Cut RAG Costs by 80% Using Prompt Compression, Towards Data Science, https://towardsdatascience.com/how-to-cut-rag-costs-by-80-using-prompt-compression-877a07c6bedb
- James Nguyen, Nov 19, 2023, Forget RAG: Embrace agent design for a more intelligent grounded ChatGPT! https://james-tn.medium.com/forget-rag-embrace-agent-design-for-a-more-intelligent-grounded-chatgpt-6c562d903c61
- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, Apr 2021, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://arxiv.org/abs/2005.11401
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444 Code: https://github.com/YaoJiayi/CacheBlend.git (Generalizes prefix KV caching to KV cache fusion with selective recomputation of some KV cache data.)
- David Spuler, March 2024, Chapter 6. Training, Fine-Tuning & RAG, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Tiernan Ray, June 3, 2024, Make room for RAG: How Gen AI's balance of power is shifting, https://www.zdnet.com/article/make-room-for-rag-how-gen-ais-balance-of-power-is-shifting/
- Shicheng Xu, Liang Pang, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou, 12 Jun 2024 (v2), Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation, https://arxiv.org/abs/2402.18150 (Analysis about how LLMs can mishandle information retrieved from a datastore and how to make LLMs better at handling RAG information using a specialized training regime.)
- Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, Qing Li, 17 Jun 2024 (v3), A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2405.06211 Project: https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
- Myeonghwa Lee, Seonho An, Min-Soo Kim, 18 Jun 2024, PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers, https://arxiv.org/abs/2406.12430 Code: https://github.com/myeon9h/PlanRAG
- Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
- Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King, 25 Jun 2024, Entropy-Based Decoding for Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2406.17519 (Enhanced decoding algorithm for multi-document RAG processing.)
- Louis-François Bouchard, Louie Peters, May 2024, Chapter 7: RAG, and Chapter 8, Advanced RAG, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
- Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
- Anirban Ghoshal, July 3, 2024, AWS approach to RAG evaluation could help enterprises reduce AI spending, https://www.infoworld.com/article/3715629/aws-new-approach-to-rag-evaluation-could-help-enterprises-reduce-ai-spending.html
- Yi Zhou, Dec 16, 2023, Optimizing GenAI: Comparing Model Training, Fine-Tuning, RAG, and Prompt Engineering, https://medium.com/generative-ai-revolution-ai-native-transformation/optimizing-genai-comparing-model-training-fine-tuning-rag-and-prompt-engineering-7a7c6c65e0f0
- Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
- Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 18 Jul 2024, Retrieval-Augmented Generation for Natural Language Processing: A Survey, https://arxiv.org/abs/2407.13193
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Akash Bajwa and Chia Jeng Yang, May 27, 2024, The RAG Stack: Featuring Knowledge Graphs: Reducing Hallucinations To Make LLMs Production-Grade With Complex RAG, https://akashbajwa.substack.com/p/the-rag-stack-featuring-knowledge
- Chips Ahoy Capital, Jul 02, 2024, Evolution of Databases in the World of AI Apps, https://chipsahoycapital.substack.com/p/evolution-of-databases-in-the-world?utm_source=substack&utm_medium=email
- Pavan Belagatti, Jul 31, 2024, Semantic Chunking for Enhanced RAG Applications! https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
- Florian June Aug 2024, The Best Practices of RAG: Typical RAG Process, Best Practices for Each Module, and Comprehensive Evaluation, https://pub.towardsai.net/the-best-practices-of-rag-300e313322e6
- Louis-François Bouchard, Aug 12, 2024, When to Use GraphRAG, https://louisbouchard.substack.com/p/when-to-use-graphrag
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- David Spuler, March 2024, Use Cases for FT vs RAG, in Generative AI in C++, https://www.aussieai.com/book/ch6-use-cases-rag-vs-ft
- Jason Perlow, Sept. 6, 2024, Understanding RAG: How to integrate generative AI LLMs with your business knowledge, https://www.zdnet.com/article/understanding-rag-how-to-integrate-generative-ai-llms-with-your-business-knowledge/
- Sau Sheong, Jun 13, 2024, Programming with AI — RAG: Using RAG in LLM Applications, https://sausheong.com/programming-with-ai-rag-27bf5c19daa7
Advanced RAG
Research papers on advanced RAG architectures:
- Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
- Akash Bajwa and Chia Jeng Yang, May 27, 2024, The RAG Stack: Featuring Knowledge Graphs: Reducing Hallucinations To Make LLMs Production-Grade With Complex RAG, https://akashbajwa.substack.com/p/the-rag-stack-featuring-knowledge
- Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz, 31 Jul 2024, Adaptive Retrieval-Augmented Generation for Conversational Systems, https://arxiv.org/abs/2407.21712 (Deciding whether or not to include a RAG external data request in the inference of a chatbot in a multi-turn conversation.)
- Igor Novikov, Jul 23, 2024, RAG Architecture: Advanced RAG, https://pub.towardsai.net/rag-architecture-advanced-rag-3fea83e0d189
- Vishal Rajput, Apr 16, 2024, RAG 2.0: Retrieval Augmented Language Models, https://medium.com/aiguys/rag-2-0-retrieval-augmented-language-models-3762f3047256
- Florian June Aug 2024, The Best Practices of RAG: Typical RAG Process, Best Practices for Each Module, and Comprehensive Evaluation, https://pub.towardsai.net/the-best-practices-of-rag-300e313322e6
- Chandini Jain, Aug 15, 2024, The magic of RAG is in the retrieval, https://www.infoworld.com/article/3484132/the-magic-of-rag-is-in-the-retrieval.html (Quality of RAG answers is more dependent on the retriever than the LLM, needing both high quality data availability and accurate retriever query lookup.)
- Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta, 9 Aug 2024, HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction, https://arxiv.org/abs/2408.04948
- Florian June, Jul 14, 2024, Three Practical Challenges of RAG and Their Mitigation Ideas: Strategies for Overcoming Obstacles in Real-World RAG Projects https://ai.gopubby.com/three-practical-challenges-of-rag-and-their-mitigation-ideas-5cc8e6dd7e30
- Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, Ali Ghodsi, Feb 18, 2024, The Shift from Models to Compound AI Systems, https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
- Dr. Ashish Bamania, Aug 2024, ‘MedGraphRAG’ Is A Complete Game Changer For AI In Medicine A deep-dive into how RAG, GraphRAG, and MedGraphRAG work and how they significantly improve the performance of LLM responses in Medicine, https://levelup.gitconnected.com/medgraphrag-is-a-complete-game-changer-for-ai-in-medicine-c6b41b0effd6
- Junde Wu, Jiayuan Zhu, Yunli Qi, 8 Aug 2024, Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2408.04187 Code: https://github.com/MedicineToken/Medical-Graph-RAG/tree/main
- Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, Liang Zhao, 26 May 2024, GRAG: Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2405.16506
- Philip Rathle, Jul 11, 2024, The GraphRAG Manifesto: Adding Knowledge to GenAI, https://neo4j.com/blog/graphrag-manifesto/
- Tomaž Bratanič, Mar 12, 2024, Implementing Advanced Retrieval RAG Strategies With Neo4j, https://neo4j.com/developer-blog/advanced-rag-strategies-neo4j/
- Microsoft, Aug 2024 (accessed), GraphRAG: A modular graph-based Retrieval-Augmented Generation (RAG) system, https://github.com/microsoft/graphrag
- Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia, July 2024, Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60626-60643, 2024, https://proceedings.mlr.press/v235/zhang24cq.html
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Ahmed Besbes, Aug 24, 2024, What Nobody Tells You About RAGs, https://towardsdatascience.com/what-nobody-tells-you-about-rags-b35f017e1570
- Ayush RoyChowdhury, Mulong Luo, Prateek Sahu, Sarbartha Banerjee, Mohit Tiwari, Aug 2024, ConfusedPilot: Confused Deputy Risks in RAG-based LLMs, https://confusedpilot.info/confused_pilot_new.pdf
- Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
- Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak, 5 Aug 2024, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545 https://github.com/IntelLabs/RAGFoundry
- Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, Zhicheng Dou, 22 May 2024, FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research, https://arxiv.org/abs/2405.13576 https://github.com/RUC-NLPIR/FlashRAG
- David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant, 1 Jul 2024, BERGEN: A Benchmarking Library for Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01102
- Ayush Thakur, Raghav Gupta, 13 Apr 2024, Introducing Super RAGs in Mistral 8x7B-v1, https://arxiv.org/abs/2404.08940
- SuperAgent, 2024, Super-Rag with SAML, https://docs.superagent.sh/overview/rag-retrieval/super-rag-with-saml
- Andrew Ditmer, May 13 2024, SuperRAG – How to achieve higher accuracy with Retrieval Augmented Generation, https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/superrag-how-to-achieve-higher-accuracy-with-retrieval-augmented/ba-p/4139004
- Chia Jeng Yang, Dec 14, 2023, A first intro to Complex RAG (Retrieval Augmented Generation), https://medium.com/enterprise-rag/a-first-intro-to-complex-rag-retrieval-augmented-generation-a8624d70090f
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Chandini Jain, August 28, 2024, The magic of RAG is in the retrieval, https://edt.infoworld.com/q/1tldUPQDxjluYqjeyhS98AV4/wv
- NirDiamant, Aug 2024, Advanced RAG Techniques: Elevating Your Retrieval-Augmented Generation Systems, https://github.com/NirDiamant/RAG_Techniques
- Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia, 27 Aug 2024, Text2SQL is Not Enough: Unifying AI and Databases with TAG, https://arxiv.org/abs/2408.14717 https://github.com/TAG-Research/TAG-Bench
- Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024 (v2), Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation https://arxiv.org/abs/2404.06910 (Process each RAG chunk in parallel and choose a final output.)
- Zheng Wang, Shu Xian Teo, Jieer Ouyang, Yongjun Xu, Wei Shi, 26 May 2024, M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions, https://arxiv.org/abs/2405.16420
- Shenggang Li, Jul 30, 2024, Mem0: Is This the Future of AI Memory Management? https://ai.gopubby.com/mem0-is-this-the-future-of-ai-memory-management-1e228dc8220a
- C Yang, S Fujita, 2024, Adaptive Control of Retrieval-Augmented Generation for LLMs Through Reflective Tags, https://www.preprints.org/manuscript/202408.2152/download/final_file
- Thuwarakesh Murallie, Aug 2024, How to Achieve Near Human-Level Performance in Chunking for RAGs: The costly yet powerful splitting technique for superior RAG retrieval, https://towardsdatascience.com/agentic-chunking-for-rags-091beccd94b1
- Dom Couldwell, Sep 03, 2024 Dealing with ‘day two’ issues in generative AI deployments, https://www.infoworld.com/article/3493255/dealing-with-day-two-issues-in-generative-ai-deployments.html
- Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela, 17 Apr 2024 (v2), Generative Representational Instruction Tuning, https://arxiv.org/abs/2402.09906
- Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
- Florian June, Feb 3, 2024, Advanced RAG 02: Unveiling PDF Parsing, https://pub.towardsai.net/advanced-rag-02-unveiling-pdf-parsing-b84ae866344e
- Lior Solomon, Sep 2024, Gen AI testing strategies and tools, https://medium.com/ai-in-grc/gen-ai-testing-strategies-and-tools-257383e5cbfb
- Vivedha Elango, Sep 2024, Search in the age of AI- Retrieval methods for Beginners, https://ai.gopubby.com/search-in-the-age-of-ai-retrieval-methods-for-beginners-557621e12ded
- Ali Forootani, Danial Esmaeili Aliabadi, Daniela Thraen, 11 Sep 2024, Bio-Eng-LMM AI Assist chatbot: A Comprehensive Tool for Research and Education, https://arxiv.org/abs/2409.07110
- Louis Bouchard, Sep 13, 2024, Top RAG Techniques You Should Know (Wang et al., 2024), https://www.louisbouchard.ai/top-rag-techniques/
- Sascha Heyer, Sep 2024, RAG API: 30 lines of code is all you need for RAG. The easiest way to get started with RAG. https://medium.com/google-cloud/google-cloud-rag-api-c7e3c9931b3e
- Florian June, Sep 2024, Kotaemon Unveiled: Innovations in RAG Framework for Document QA: PDF Parsing, GraphRAG, Agent-Based Reasoning, and Insights, https://ai.gopubby.com/kotaemon-unveiled-innovations-in-rag-framework-for-document-qa-0b6d67e4b9b7
- Michael D. Skarlinski, James D. Braza, Sam Cox, Michaela Hinks, Manvitha Ponnapati, Samuel G. Rodriques, Jon M. Laurent, Michael J. Hammerling, Andrew D. White, Sep 2024, Language Agents Achieve Superhuman Synthesis of Scientific Knowledge, https://storage.googleapis.com/fh-public/paperqa/Language_Agents_Science.pdf https://github.com/Future-House/paper-qa
- Pathway, Sep 2024, 2024 Top RAG Frameworks, https://pathway.com/rag-frameworks
- Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
- Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
- Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, Huaidong Xiong, Lin Yuan, Jun Xu, Zaoyang Wang, Zhiqiang Zhang, Wen Zhang, Huajun Chen, Wenguang Chen, Jun Zhou, 24 Sep 2024 (v2), KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation, https://arxiv.org/abs/2409.13731
- Surya Maddula, Sep 2024, Not RAG, but RAG Fusion? Understanding Next-Gen Info Retrieval. https://pub.towardsai.net/not-rag-but-rag-fusion-understanding-next-gen-info-retrieval-477788da02e2
- Adrian H. Raudaschl, Oct 6, 2023, Forget RAG, the Future is RAG-Fusion: The Next Frontier of Search: Retrieval Augmented Generation meets Reciprocal Rank Fusion and Generated Queries, https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
- Deval Shah, Jul 4, 2024, Reciprocal Rank Fusion (RRF) explained in 4 mins — How to score results from multiple retrieval methods in RAG: Unlock the power of Reciprocal Rank Fusion in Retrieval-Augmented Generation. https://medium.com/@devalshah1619/mathematical-intuition-behind-reciprocal-rank-fusion-rrf-explained-in-2-mins-002df0cc5e2a
- Vishal Rajput, Sep 27, 2024, Why Scaling RAGs For Production Is So Hard? https://medium.com/aiguys/why-scaling-rags-for-production-is-so-hard-a2f540785e97
- Chirag Agrawal, Sep 20, 2024, Unlocking the Power of Efficient Vector Search in RAG Applications, https://pub.towardsai.net/unlocking-the-power-of-efficient-vector-search-in-rag-applications-c2e3a0c551d5
- Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
- Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
- Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang, 15 Oct 2024 (v2), EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations, https://arxiv.org/abs/2410.10315 https://github.com/BUAADreamer/EasyRAG
- Barhoumi Mosbeh, Sep 29, 2024, Anthropic’s New RAG Approach, https://pub.towardsai.net/anthropics-new-rag-approach-e0c24a68893b
- Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, Jiawei Ren, 21 Oct 2024, RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance, https://arxiv.org/abs/2410.15805
- Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He, 23 Oct 2024, SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains, https://arxiv.org/abs/2410.17952
- Elias Lumer, Vamse Kumar Subbiah, James A. Burke, Pradeep Honaganahalli Basavaraju, Austin Huber, 22 Oct 2024 (v2), Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases, https://arxiv.org/abs/2410.14594
- Kibeom Lee, Oct 2024, Retrieval-Augmented Generation: Enhancing LLMs with Dynamic Information Access, https://sendbird.com/developer/tutorials/rag (Covers BM25 "Best Match 25" vector search for RAG.)
- Damian Gil, Apr 17, 2024, Advanced Retriever Techniques to Improve Your RAGs, https://towardsdatascience.com/advanced-retriever-techniques-to-improve-your-rags-1fac2b86dd61
- Vectorize, October 29, 2024, Multimodal RAG Patterns Every AI Developer Should Know, https://vectorize.io/multimodal-rag-patterns/
- Tolga Şakar and Hakan Emekci, 30 October 2024, Maximizing RAG efficiency: A comparative analysis of RAG methods, Natural Language Processing. doi:10.1017/nlp.2024.53, https://www.cambridge.org/core/journals/natural-language-processing/article/maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods/D7B259BCD35586E04358DF06006E0A85 https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D7B259BCD35586E04358DF06006E0A85/S2977042424000530a.pdf/div-class-title-maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods-div.pdf
- Sebastian Petrus, Sep 4, 2024, Top 10 RAG Frameworks Github Repos 2024, https://sebastian-petrus.medium.com/top-10-rag-frameworks-github-repos-2024-12b2a81f4a49
- Jason Perlow, Nov. 6, 2024, The best open-source AI models: All your free-to-use options explained: Here are the best open-source and free-to-use AI models for text, images, and audio, organized by type, application, and licensing considerations. https://www.zdnet.com/article/the-best-open-source-ai-models-all-your-free-to-use-options-explained/
- Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, Feifei Li, 1 Nov 2024, CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation, https://arxiv.org/abs/2411.00744
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Emilia David, November 8, 2024, Multimodal RAG is growing, here’s the best way to get started, https://venturebeat.com/ai/multimodal-rag-is-growing-heres-the-best-way-to-get-started/
- Shubham Sharma. November 12, 2024, How agentic RAG can be a game-changer for data processing and retrieval, https://venturebeat.com/ai/how-agentic-rag-can-be-a-game-changer-for-data-processing-and-retrieval/
- Alden Do Rosario, Nov 2024, Dear IT Departments, Please Stop Trying To Build Your Own RAG, https://pub.towardsai.net/dear-it-departments-please-stop-trying-to-build-your-own-rag-4546b4638273
- Cobus Greyling, Nov 2024, Four Levels of RAG — Research from Microsoft. Improving Retrieval-Augmented Generation (RAG) involves classifying queries based on user intent & focusing on context. Also utilising SLMs and fine-tuning to deliver more accurate & relevant results. https://cobusgreyling.medium.com/four-levels-of-rag-research-from-microsoft-fdc54388f0ff
- Rupali Patil, Nov 10, 2024, RAGate: Adaptive RAG for Conversational AI, https://pub.towardsai.net/ragate-adaptive-rag-for-conversational-ai-94b5ca469b7d
- Shalin Shah, Srikanth Ryali, Ramasubbu Venkatesh, 8 Nov 2024, Multi-Document Financial Question Answering using LLMs, https://arxiv.org/abs/2411.07264
- Alexandria Leto, Cecilia Aguerrebere, Ishwar Bhati, Ted Willke, Mariano Tepper, Vy Ai Vo, 11 Nov 2024, Toward Optimal Search and Retrieval for RAG, https://arxiv.org/abs/2411.07396
- Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen, 5 Nov 2024, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems, https://arxiv.org/abs/2411.02959
- Louis-François Bouchard, Nov 22, 2024, Advanced RAG Evaluation Techniques for Optimal LLM Performance. Why RAG Evaluation Matters and Techniques to Leverage, https://louisbouchard.substack.com/p/advanced-rag-evaluation-techniques
- Sonal Prabhune, Donald J. Berndt, 7 Nov 2024, Deploying Large Language Models With Retrieval Augmented Generation, https://arxiv.org/abs/2411.11895
- Mohammad Hassan Heydari, Arshia Hemmat, Erfan Naman, Afsaneh Fatemi, 25 Nov 2024, Context Awareness Gate For Retrieval Augmented Generation, https://arxiv.org/abs/2411.16133
- Shengming Zhao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma, 29 Nov 2024, Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems, https://arxiv.org/abs/2411.19463
- Matvey Arye, Avthar Sewrathan, 29 Oct 2024, Vector Databases Are the Wrong Abstraction, https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction/
Reranker Component in RAG
The reranker component selects and prioritizes the most relevant chunks for the LLM to use. The basic idea is:
- Retriever returns several chunks
- Reranker orders them in priority of relevance
- Packer merges the chunks with the user's query and other global instructions
- One final LLM request answers the user's question
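The four steps above can be sketched in code. This is a minimal illustration, not a real library API: the word-overlap scoring function is a toy stand-in for a real reranking model (such as a cross-encoder), and all names here are hypothetical.

```python
import re

def overlap_score(query, chunk):
    """Toy relevance score: count of words shared between query and chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c)

def rerank(query, chunks, score_fn=overlap_score):
    """Reranker: order retrieved chunks by relevance to the query, highest first."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)

def pack(query, chunks, instructions, top_k=2):
    """Packer: merge the top-ranked chunks with the query and global instructions."""
    context = "\n\n".join(chunks[:top_k])
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [                       # several chunks returned by the retriever
    "The company was founded in 2010.",
    "Our flagship product ships in blue and red.",
    "Shipping takes three to five business days.",
]
query = "What colors does the flagship product come in?"
ranked = rerank(query, chunks)
prompt = pack(query, ranked, "Answer using only the context below.")
# `prompt` is then sent as one final LLM request.
```

In production, the scoring step is typically a separate model call, which is why reranking adds latency but improves the relevance of what lands in the LLM's context window.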
Here are some research papers specific to the reranker component:
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Benjamin Clavié, 30 Aug 2024, rerankers: A Lightweight Python Library to Unify Ranking Methods, https://arxiv.org/abs/2408.17344 https://arxiv.org/pdf/2408.17344
- Vivedha Elango, Sep 2024, Search in the age of AI- Retrieval methods for Beginners, https://ai.gopubby.com/search-in-the-age-of-ai-retrieval-methods-for-beginners-557621e12ded
- Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang, 15 Oct 2024 (v2), EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations, https://arxiv.org/abs/2410.10315 https://github.com/BUAADreamer/EasyRAG
Long Context RAG
There is a lot of research on getting LLMs to run fast on long context inputs, and some of this is related to RAG architectures (i.e., big chunks!):
- Ziyan Jiang, Xueguang Ma, Wenhu Chen, June 2024, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs, arXiv preprint arXiv:2406.15319, https://arxiv.org/abs/2406.15319 (Improved accuracy performance of RAG methods when using a long context LLM and longer chunk sizes for the retriever.)
- Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang, 23 Oct 2024, LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering, https://arxiv.org/abs/2410.18050 https://github.com/QingFei1/LongRAG
- Tan Yu, Anbang Xu, Rama Akkiraju, 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666
- Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong, 3 Oct 2024, UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation, https://arxiv.org/abs/2410.02719
- Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik, 8 Oct 2024, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343
RAG Knowledge Graph
A RAG Knowledge Graph architecture, or a "RAG Graph," is a combination of RAG with a Knowledge Graph. Instead of returning text chunks, the retriever returns a structured "graph" that represents additional knowledge. The advantage of a graph is that it contains concept relationships such as hierarchies.
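The idea of returning a structured subgraph rather than flat text chunks can be sketched as follows. The graph data and function names here are illustrative assumptions, not any real GraphRAG implementation; triples are stored as (subject, relation, object).

```python
# Toy knowledge graph as (subject, relation, object) triples.
KNOWLEDGE_GRAPH = [
    ("aspirin", "is_a", "NSAID"),
    ("NSAID", "is_a", "analgesic"),
    ("aspirin", "treats", "headache"),
]

def retrieve_subgraph(entity, graph, depth=2):
    """Collect triples reachable from the entity, following links up to `depth` hops."""
    frontier, triples = {entity}, []
    for _ in range(depth):
        next_frontier = set()
        for s, r, o in graph:
            if s in frontier and (s, r, o) not in triples:
                triples.append((s, r, o))
                next_frontier.add(o)
        frontier = next_frontier
    return triples

def triples_to_text(triples):
    """Serialize the subgraph into text for the LLM's input context."""
    return "\n".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)

subgraph = retrieve_subgraph("aspirin", KNOWLEDGE_GRAPH)
# The 2-hop traversal captures the hierarchy: aspirin -> NSAID -> analgesic.
```

Note how the second hop picks up the "NSAID is a analgesic" relationship, which a flat chunk about aspirin alone might never mention; this is the concept-hierarchy advantage described above.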
Research on RAG with Knowledge Graphs:
- Dr. Ashish Bamania, Aug 2024, ‘MedGraphRAG’ Is A Complete Game Changer For AI In Medicine A deep-dive into how RAG, GraphRAG, and MedGraphRAG work and how they significantly improve the performance of LLM responses in Medicine, https://levelup.gitconnected.com/medgraphrag-is-a-complete-game-changer-for-ai-in-medicine-c6b41b0effd6
- Junde Wu, Jiayuan Zhu, Yunli Qi, 8 Aug 2024, Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2408.04187 Code: https://github.com/MedicineToken/Medical-Graph-RAG/tree/main
- Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, Liang Zhao, 26 May 2024, GRAG: Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2405.16506
- Philip Rathle, Jul 11, 2024, The GraphRAG Manifesto: Adding Knowledge to GenAI, https://neo4j.com/blog/graphrag-manifesto/
- Microsoft, Aug 2024 (accessed), GraphRAG: A modular graph-based Retrieval-Augmented Generation (RAG) system, https://github.com/microsoft/graphrag
- Chia Jeng Yang, Dec 14, 2023, A first intro to Complex RAG (Retrieval Augmented Generation), https://medium.com/enterprise-rag/a-first-intro-to-complex-rag-retrieval-augmented-generation-a8624d70090f
- Vahe Aslanyan, June 11, 2024, Next-Gen Large Language Models: The Retrieval-Augmented Generation (RAG) Handbook, https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/
- Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, Huaidong Xiong, Lin Yuan, Jun Xu, Zaoyang Wang, Zhiqiang Zhang, Wen Zhang, Huajun Chen, Wenguang Chen, Jun Zhou, 24 Sep 2024 (v2), KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation, https://arxiv.org/abs/2409.13731
- Xinke Jiang, Rihong Qiu, Yongxin Xu, Wentao Zhang, Yichen Zhu, Ruizhe Zhang, Yuchen Fang, Xu Chu, Junfeng Zhao, Yasha Wang, 31 Oct 2024, RAGraph: A General Retrieval-Augmented Graph Learning Framework, https://arxiv.org/abs/2410.23855
- Cristian-George Crăciun, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel, Mihaela-Claudia Cercel, 5 Dec 2024, GRAF: Graph Retrieval Augmented by Facts for Legal Question Answering, https://arxiv.org/abs/2412.04119
RAG Caching
Several components in a RAG architecture can be optimized with a cache. The retrieval component can use all of the types of caching applicable to its underlying database or datastore architecture, irrespective of whether it uses keyword or vector lookup, and whether the data is stored on disk or cached in memory. At the bottom level, the LLM can use various KV caching techniques (see further below). At the topmost level, there can be an overall cache via an "inference cache" for exactly identical queries, or a "semantic cache" for similar queries.
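The two top-level cache types can be sketched together: an exact-match inference cache backed by a similarity-based semantic cache. This is a minimal illustration under stated assumptions: the bag-of-words embedding is a toy stand-in for a real embedding model, and the class and threshold are hypothetical, not any particular caching library.

```python
import math

def toy_embed(text):
    """Toy embedding: bag-of-words counts (a real system uses an embedding model)."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.exact = {}        # inference cache: identical query -> answer
        self.entries = []      # semantic cache: (embedding, answer) pairs
        self.threshold = threshold

    def lookup(self, query):
        if query in self.exact:               # exact match first
            return self.exact[query]
        q = toy_embed(query)
        for emb, answer in self.entries:      # then similar queries
            if cosine(q, emb) >= self.threshold:
                return answer
        return None                           # cache miss: run the full pipeline

    def store(self, query, answer):
        self.exact[query] = answer
        self.entries.append((toy_embed(query), answer))

cache = SemanticCache(threshold=0.8)
cache.store("what is the return policy", "30-day returns.")
hit = cache.lookup("what is the return policy please")  # near-duplicate query hits
```

A hit at this level avoids the retriever, the reranker, and the LLM call entirely, which is why it is usually the first cache worth adding.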
Research papers on RAG cache architectures:
- Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 18 Apr 2024, RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation, https://arxiv.org/abs/2404.12457
- Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 3 Jun 2024 (v2), CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion, https://arxiv.org/abs/2405.16444
- Google, 2024, Context caching, https://ai.google.dev/gemini-api/docs/caching?lang=python (Pass in context tokens and reuse them without re-uploading, might be doing something like prefix KV caching underneath.)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Pere Martra, Aug 2024 (accessed), Implementing semantic cache to improve a RAG system with FAISS, https://huggingface.co/learn/cookbook/semantic_cache_chroma_vector_database
- Richmond Alake, Apoorva Joshi, Aug 14, 2024, Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain, MongoDB, https://www.mongodb.com/developer/products/atlas/advanced-rag-langchain-mongodb/
- Anthropic, 20 Sept 2024, Introducing Contextual Retrieval, https://www.anthropic.com/news/contextual-retrieval
- Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang, 16 Sep 2024, Do Large Language Models Need a Content Delivery Network? https://arxiv.org/abs/2409.13761 https://github.com/LMCache/LMCache (Managing the process of sharing KV cache data over a network.)
- David Spuler, September 26, 2024, RAG Optimization via Caching, https://www.aussieai.com/blog/rag-optimization-caching
- Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang, 10 Oct 2024, TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text, https://arxiv.org/abs/2410.07590 (Fusing precomputed KV caches for each RAG chunk.)
- David Spuler, October 24, 2024, Generalizing Prefix KV Caching to RAG Chunks, Aussie AI Blog, https://www.aussieai.com/blog/prefix-kv-rag
KV Caching Optimizations
In addition to RAG-specific caches, such as retrieval caches, there are various caching methods inside the LLM itself. Several of the many types of KV caching optimization can speed up RAG architectures (and other LLM use cases). The main techniques involve precomputed KV caches for RAG chunks, such as prefix caching or session caching. More information is available:
- Prefix KV cache
- Session KV cache (multi-turn KV caching)
- Substring KV cache (Lengthwise-fused KV caching)
- KV cache global (multi-query KV caching)
- KV caching (overview)
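The core idea of a per-chunk precomputed KV cache can be sketched as follows. This is only an illustration of the caching logic: `compute_kv` is a hypothetical placeholder for a real transformer prefill pass, which would produce key/value tensors rather than a string.

```python
import hashlib

KV_CACHE = {}   # chunk hash -> precomputed per-chunk KV state

def compute_kv(chunk_text):
    """Placeholder for the model's prefill pass over the chunk's tokens;
    a real system would store the resulting key/value tensors here."""
    return f"kv({len(chunk_text.split())} tokens)"

def kv_for_chunk(chunk_text):
    """Compute the KV state once per distinct chunk, then reuse it."""
    key = hashlib.sha256(chunk_text.encode()).hexdigest()
    if key not in KV_CACHE:
        KV_CACHE[key] = compute_kv(chunk_text)
    return KV_CACHE[key]

# Two requests that retrieve the same chunk reuse one prefill computation.
kv1 = kv_for_chunk("Our flagship product ships in blue and red.")
kv2 = kv_for_chunk("Our flagship product ships in blue and red.")
```

One caveat: because attention is position-dependent, naively concatenating independently precomputed per-chunk caches does not give the same result as prefilling the full prompt; fusing techniques such as those in the CacheBlend and TurboRAG papers listed above address this.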
More Types of Caching
Other general types of caching that apply to any LLM system, and can be used with RAG:
More AI Research
Read more about: