
Fine-Tuning vs RAG

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


As we've already discussed above, if you want an AI engine that can produce answers based on your company's product datasheets, there are two major options:

  • Fine-Tuning (FT): re-train your foundation model to answer new questions.
  • Retrieval-Augmented Generation (RAG) architecture: summarize excerpts of documents using an unchanged foundation model.

The two basic approaches are significantly different, with completely different architectures and a different project cost profile. Each approach has its own set of pros and cons.

Spoiler alert! RAG and fine-tuning can be combined.

Advantages of RAG. The RAG architecture typically has the following advantages:

  • Lower up-front cost.
  • Flexibility.
  • Up-to-date content.
  • Access to external data sources and/or internal proprietary documents.
  • Content-rich answers from documents.
  • Explainability (citations).
  • Personalization is easier.
  • Hallucinations are less likely (if the retriever finds documents containing the answer).
  • Scalability to as many documents as the datastore can handle.
  • RAG is regarded as “moderate” difficulty in terms of AI expertise.

The main goal of RAG is to avoid the expensive re-training and fine-tuning of a model. However, RAG also adds extra costs in terms of the RAG component and its database of documents, both at project setup and in ongoing usage, so it is not always a clear-cut win.

In addition to cost motivations, RAG may be advantageous in terms of the flexibility to keep the content of user answers up-to-date. With a RAG component, any new document can simply be added to the document datastore, rather than each new document requiring another model re-training cycle. This makes it easier for the AI application to include current, up-to-date information in its answers.
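
To make that flow concrete, here is a minimal C++ sketch of a RAG query. The Retriever class and the llm_generate() call are hypothetical stand-ins (not any real library or API), and the search step is a placeholder rather than a real keyword or embedding search. The point it illustrates is that adding a new document only touches the datastore, while the foundation model is unchanged.

    // Minimal sketch of a RAG query flow (hypothetical interfaces, not a real library).
    // The retriever searches the document datastore; the LLM answers using the
    // retrieved chunks as extra context. Adding a new document only touches the
    // datastore -- no model re-training is needed.
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Retriever {
        std::vector<std::string> chunks;   // the document datastore (chunked text)
        void add_document(const std::string& text) { chunks.push_back(text); }
        std::vector<std::string> search(const std::string& /*query*/, std::size_t k) const {
            // A real retriever uses keyword indexing and/or embedding similarity;
            // here we just return the first k chunks as a placeholder.
            std::vector<std::string> hits;
            for (std::size_t i = 0; i < chunks.size() && i < k; ++i) hits.push_back(chunks[i]);
            return hits;
        }
    };

    // Stub for the LLM call -- a real system would invoke the model here.
    std::string llm_generate(const std::string& prompt) {
        return "[LLM answer based on a prompt of " + std::to_string(prompt.size()) + " characters]";
    }

    std::string rag_answer(const Retriever& retriever, const std::string& user_query) {
        std::string prompt = "Answer using only the context below.\n";
        for (const auto& chunk : retriever.search(user_query, 3)) {
            prompt += "Context: " + chunk + "\n";   // extra context tokens = extra inference cost
        }
        prompt += "Question: " + user_query + "\n";
        return llm_generate(prompt);   // unchanged foundation model
    }

    int main() {
        Retriever retriever;
        retriever.add_document("The new X100 widget supports 240V power input.");  // new doc, no re-training
        std::cout << rag_answer(retriever, "What voltage does the X100 support?") << "\n";
        return 0;
    }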

Disadvantages of RAG. The disadvantages of RAG, where the underlying model is not fine-tuned, include:

  • Architectural change required (retriever component integrated with model inference).
  • Slower inference latency and user response time (extra step in architecture).
  • Extra tokens of context from document excerpts (slower response and increased inference cost).
  • Extra ongoing costs of retriever component and datastore (e.g. hosting, licensing).
  • Larger foundation model required (increases latency and cost).
  • Model's answer style may not match domain (e.g. wrong style, tone, jargon, terminology).

Penalty for unnecessary dumbness. There's an even more fundamental limitation of RAG systems: they're not any smarter than the LLM they use. The overall RAG implementation relies on the LLM for basic common sense, conversational ability, and simple intelligence. It extends the LLM with extra data, but not extra skills.

For example, given a RAG system built over a chunked version of a hundred C++ textbooks, I doubt you could ask it to generate a Snake game in C++. However, you could certainly ask it about the syntax of a switch statement in the C++ language.

With a RAG architecture, the model has not “learned” anything new; you have only ensured it has the correct data available to answer questions. For example, if you ask how a “switch” statement works, the hope is that the retriever finds the chunks from the “switch” sections of the C++ books, and not five random C++ code examples that happen to use a switch statement. This all depends on the text-based keyword indexing and on how well the semantic embedding vectors work: the “R” in RAG stands for “Retrieval”, and the whole approach hinges on it.
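
Here is a minimal sketch of that retrieval step, assuming each chunk already has a precomputed embedding vector: rank the chunks by cosine similarity to the query embedding and keep the top few. The Chunk struct and function names are illustrative only, and how the embeddings and chunks are produced is assumed rather than shown.

    // Sketch of the "R" in RAG: rank document chunks by cosine similarity between
    // the query embedding and each chunk's embedding vector.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct Chunk {
        std::string text;
        std::vector<float> embedding;   // precomputed embedding vector for this chunk
    };

    // Cosine similarity between two embedding vectors (assumed to be the same length).
    float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
        float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            norm_a += a[i] * a[i];
            norm_b += b[i] * b[i];
        }
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b) + 1e-8f);
    }

    // Return the k chunks whose embeddings are most similar to the query embedding.
    std::vector<Chunk> top_k_chunks(std::vector<Chunk> chunks,
                                    const std::vector<float>& query_embedding,
                                    std::size_t k) {
        std::sort(chunks.begin(), chunks.end(),
                  [&](const Chunk& x, const Chunk& y) {
                      return cosine_similarity(x.embedding, query_embedding) >
                             cosine_similarity(y.embedding, query_embedding);
                  });
        if (chunks.size() > k) chunks.resize(k);
        return chunks;
    }

If this ranking surfaces the “switch” sections rather than random code examples, the answer is good; if it doesn't, the LLM never sees the right material, no matter how capable it is.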

Advantages of fine-tuning. The main advantages of a fine-tuning architecture over a RAG setup include:

  • Style and tone of responses can be trained — e.g. positivity, politeness.
  • Use of correct industry jargon and terminology in responses.
  • Brand voice can be adjusted.
  • No change to inference architecture — just an updated model.
  • Faster inference latency — no extra retriever search step.
  • Reduced inference cost — fewer input context tokens.
  • No extra costs from retriever and datastore components.
  • Smaller model can be used — further reduced inference cost.

Fine-tuning is not an architectural change: it simply produces an updated version of a major model (e.g. GPT-3). RAG, by contrast, is a different architecture that integrates a search component, which accesses an external knowledge database or datastore and returns a set of documents, or chunks/snippets of documents.

Disadvantages of fine-tuning. The disadvantages of a fine-tuning architecture without RAG include:

  • Training cost — up-front and ongoing scheduled fine-tuning is expensive.
  • Outdated information used in responses.
  • Needs a lot more proprietary data than RAG.
  • Training data must be in a paired input-output format, whereas RAG can use unstructured text (see the sketch after this list).
  • Complexity of preparing and formatting the data for training (e.g. categorizing and labeling).
  • No access to external or internal data sources (except what it's been trained on).
  • Hallucinations more likely (if it hasn't been trained on the answer).
  • Personalization features are difficult.
  • Lack of explainability (hard to know how the model derived its answers or from what sources).
  • Poor scalability (e.g. if too many documents to re-train with).
  • Fine-tuning (training) is regarded as one of the highest difficulty-level projects in terms of AI expertise.
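
To illustrate the data-format point from the list above, here is a small sketch contrasting the paired prompt/completion records that fine-tuning expects with the raw text chunks a RAG datastore can ingest directly. The struct and field names are made up for illustration; real fine-tuning APIs each define their own schema (often JSONL with prompt/completion pairs).

    // Illustrative contrast between fine-tuning data and RAG data.
    // Field names are invented for illustration, not any vendor's required format.
    #include <string>
    #include <vector>

    // One fine-tuning record: the data must be curated into an input-output pair.
    struct FineTuneExample {
        std::string prompt;       // input side of the pair
        std::string completion;   // desired model output
    };

    int main() {
        // Fine-tuning data: every record is a paired prompt and completion.
        std::vector<FineTuneExample> training_data = {
            { "How does a C++ switch statement work?",
              "A switch statement selects one of several case labels based on the "
              "value of an integral expression..." },
        };

        // RAG data: unstructured text chunks can go straight into the datastore.
        std::vector<std::string> rag_chunks = {
            "Chapter 7: The switch statement evaluates its controlling expression "
            "once and jumps to the matching case label...",
        };

        (void)training_data;  // illustrative only
        (void)rag_chunks;
        return 0;
    }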

The main disadvantage of fine-tuning is the compute cost of GPUs for fine-tuning the model. This is at least a large up-front cost, and is also an ongoing cost to re-update the model with the latest information. The inherent disadvantage of doing scheduled fine-tuning is that the model is always out-of-date, since it only has information up to the most recent fine-tuning. This differs from RAG, where the queries can respond quickly using information in a new document, even within seconds of its addition to the document datastore.

Cost comparison

The fine-tuning approach has an up-front training cost, but a lower inference cost profile. However, fine-tuning may be required on an ongoing schedule, so this is not a once-only cost. The lower ongoing inference cost for fine-tuning is because (a) there's no extra “retrieval” component needed, and (b) a smaller model can be used.

RAG has the opposite cost profile. The RAG approach has a reduced initial cost because there is no big GPU load (i.e. there's no re-training of any models), although a RAG project still has up-front costs in setting up the new architecture (as does a fine-tuning project). In terms of ongoing costs, RAG has an increased inference cost for every user query, because the retrieved document excerpts are sent to the inference engine as extra “context” tokens for the user query, so the AI engine has more work to do. This increases the inference cost whether the engine is commercially hosted or running in-house, and the extra RAG inference cost continues for the lifetime of the application.
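
As a back-of-the-envelope illustration, here is a toy cost model comparing the two profiles. Every number below is a made-up placeholder; substitute your own vendor pricing, re-training schedule, and query volumes.

    // Back-of-the-envelope cost comparison: fine-tuning vs RAG.
    // All numbers are made-up placeholders -- plug in real pricing and volumes.
    #include <cstdio>

    int main() {
        // Fine-tuning: large up-front (and periodic) training cost, cheaper queries.
        double ft_training_cost_per_run = 5000.0;   // placeholder GPU cost per fine-tune
        double ft_runs_per_year = 4.0;              // e.g. quarterly re-training
        double ft_cost_per_query = 0.002;           // smaller model, fewer context tokens

        // RAG: no training cost, but extra context tokens plus retriever/datastore fees.
        double rag_cost_per_query = 0.006;          // larger prompt with retrieved chunks
        double rag_datastore_cost_per_year = 3000.0;

        double queries_per_year = 1000000.0;

        double ft_total = ft_training_cost_per_run * ft_runs_per_year
                          + ft_cost_per_query * queries_per_year;
        double rag_total = rag_datastore_cost_per_year
                           + rag_cost_per_query * queries_per_year;

        std::printf("Fine-tuning yearly cost: $%.0f\n", ft_total);
        std::printf("RAG yearly cost:         $%.0f\n", rag_total);
        return 0;
    }

Note that which approach comes out cheaper flips with the query volume: RAG's per-query cost is higher, while fine-tuning's cost is front-loaded, which is exactly why the pricing of both approaches needs to be compared with real numbers.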

One final point: you need to double-check if RAG is cheaper than fine-tuning. I know, I know, sorry! I wrote that it was, and now I'm walking that claim back.

But commercial realities are what they are. There are a number of commercial vendors pushing the various components of a RAG architecture, and some are hyping that it's cheaper than buying a truckload of GPUs. But fine-tuning isn't that bad, and you need to be clear-eyed and compare the pricing of both approaches.

 

