
6. Training, Fine-Tuning & RAG

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Stitching Together Sequences of Linguistic Forms...
Without Any Reference To Meaning:
A Stochastic Parrot.”

— Bender et al., 2021.

 

 

Training Options

It's easy to make a small fortune in LLM model training these days. You start with a big fortune, and then do training.

If you want a new model, and none of the off-the-shelf commercial or open source models are good enough, here are your basic options for training a smarter model:

  • Train a new model from scratch.
  • Fine-tuning (FT) of an existing model.
  • Retrieval-Augmented Generation (RAG) using a document database.

Training your own model is kind of expensive, but many of the C++ optimizations in this book might help. Yeah, right, I don't really recommend you try to train your own foundation model, no matter how good you are at C++. Also, the top LLMs are so good these days that training a new model from scratch is probably relegated to non-language ML projects using your own proprietary non-text data.

But don't listen to me. If you really have a nine-figure funding round, then go ahead and train your own foundation LLM. On the other hand, fine-tuning an existing model (e.g. GPT) is cheaper. RAG is cheaper still (probably), but it's not even a type of training, so it should really be banned by the European Union for false advertising.

Still reading? That means you still want to do training, which is fine, I guess, provided the GPU hosting cost isn't coming out of your pay packet. In terms of optimizing a training project, here are some methods that might be worth considering:

  • Choose smaller model dimensions (smaller is cheaper, but bigger is smarter).
  • Evaluate open-source vs commercial models.
  • Evaluate fine-tuning (FT) vs Retrieval-Augmented Generation (RAG).
  • Quantized models (“model compression” methods).
  • Knowledge distillation (train a small model using a large “teacher” model).
  • Dataset distillation (train a small model using auto-generated outputs from a large model).

Fine-Tuning

How does fine-tuning work? An existing foundation model is trained with new materials using the standard AI training methods. The use of extra specialist text to further train a model is called “fine-tuning.” This is a longstanding method in AI theory and fine-tuning can be performed in all of the major AI platforms. In the fine-tuning approach, the result of the re-training is that proprietary information about your products is all “inside” the model.

Training Algorithm

The general training algorithm at a very high level is as follows:

    (a) Split the training data 80/20 (sometimes 90/10) into data to train with (training dataset) and data to evaluate the result of the training (validation dataset). If you have enough training data, use multiple training and validation datasets.

    (b) Feed each input into the network, compare with the answer using the “loss function” to generate an “error”, and using the error, tweak the weights according to the learning rate.

    (c) After all of the training data (the 80%) is fed in, use the validation dataset to evaluate the new model's performance. This evaluation uses new data that the model has not yet seen, also in a question-and-expected-response format.

    (d) Based on the evaluation, you can accept the model or make major changes. For example, if you give it totally unseen data (i.e. the 20%) and it only responds correctly 50% of the time, you need to decide whether to continue with the next training dataset, or whether it's time to redesign the model and try again. If the model performs poorly, you have to allocate blame: whether the training data is good, whether the model's structure is correct, whether the loss function is appropriate, whether the learning rate for incremental weight changes on each iteration is too aggressive or not aggressive enough, whether the biases are wrong, and so on. To address this, tweak the model meta-parameters (e.g. number of layers, number of nodes per layer, etc.) or change the training algorithm meta-parameters (e.g. learning rate), and go back to the first step and start over.

This is only a top-level outline of a training algorithm. There are many improvements and finesses to get to a fully advanced fine-tuning algorithm.
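To make steps (a) through (d) more concrete, here is a minimal C++ sketch of the same loop on a toy one-parameter linear model with a mean-squared-error loss. It is only an illustration of the split/train/validate cycle, not the real Transformer training procedure; the dataset, loss, and learning rate are all placeholder choices.

    // Minimal sketch of the train/validate cycle on a toy linear model (not a Transformer).
    #include <cstddef>
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // Toy dataset of (input, expected output) pairs, roughly y = 2x + 1.
        std::vector<std::pair<float, float>> data = {
            {0, 1.1f}, {1, 2.9f}, {2, 5.2f}, {3, 6.8f}, {4, 9.1f},
            {5, 10.9f}, {6, 13.2f}, {7, 14.8f}, {8, 17.1f}, {9, 19.0f},
        };

        // Step (a): split roughly 80/20 into a training set and a validation set.
        size_t split = (data.size() * 8) / 10;
        std::vector<std::pair<float, float>> train(data.begin(), data.begin() + split);
        std::vector<std::pair<float, float>> valid(data.begin() + split, data.end());

        float w = 0.0f, b = 0.0f;           // The model's "weights", initialized small.
        const float learning_rate = 0.01f;  // Meta-parameter: how aggressively to tweak weights.

        // Step (b): feed each input in, compute the error from the loss, tweak the weights.
        for (int epoch = 0; epoch < 200; ++epoch) {
            for (auto [x, target] : train) {
                float predicted = w * x + b;       // Forward pass.
                float error = predicted - target;  // Gradient of squared-error loss (up to a factor).
                w -= learning_rate * error * x;    // Weight updates scaled by the learning rate.
                b -= learning_rate * error;
            }
        }

        // Step (c): evaluate on held-back validation data the model has never seen.
        float loss = 0.0f;
        for (auto [x, target] : valid) {
            float diff = (w * x + b) - target;
            loss += diff * diff;
        }
        loss /= (float)valid.size();

        // Step (d): based on this number, accept the model or re-tune the meta-parameters.
        printf("w=%.3f b=%.3f validation MSE=%.4f\n", w, b, loss);
        return 0;
    }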

Training Data for Fine-tuning

One of the biggest obstacles to a fine-tuning project is getting enough data. Many projects where fine-tuning seems like a good idea are scuttled when there is no critical mass of data with which to train. Fine-tuning usually requires more data than RAG.

For fine-tuning data to be viable, it usually needs to have:

    (a) Several cases of every concept you want to teach it, with both input and expected output. Depending on the NN architecture, it may also need a score indicating how good the output is.

    (b) Corner cases and extra data to capture subtle details or complexities.

    (c) Held-back extra cases which are not used for training, but are used to evaluate the state of the training.

Gathering the data is likely the hardest part of training. And the more training iterations you need to do, the more data you need. Training data management is mostly a non-coding task, involving processing of the data files, such as chunking, organizing, indexing, and generating embeddings. It's arduous to some extent, but not high in technical difficulty.

Model-Based Training

Another way to do training is to have the new model learn from another, previously trained system. Knowledge distillation is one such technique; it is already available in major AI frameworks and has a high level of sophistication. Another, simpler method is to train a new model on the prompt-answer pairs produced by another large model.
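As a rough sketch of that second, simpler approach, the loop below builds a training set of prompt-answer pairs by querying the larger model, then hands the pairs to whatever fine-tuning routine you use for the smaller model. The query_teacher_model and fine_tune_student functions are hypothetical stand-ins, not real library calls.

    // Sketch of "dataset distillation": use a large teacher model's answers as training data.
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in: a real version would call the large model's API.
    std::string query_teacher_model(const std::string& prompt) {
        return "(teacher model's answer to: " + prompt + ")";
    }

    // Hypothetical stand-in: a real version would invoke your training framework.
    void fine_tune_student(const std::vector<std::pair<std::string, std::string>>& pairs) {
        printf("Fine-tuning student model on %zu prompt-answer pairs\n", pairs.size());
    }

    int main() {
        std::vector<std::string> prompts = {
            "Explain what a switch statement does in C++.",
            "Summarize the rule of three.",
        };
        std::vector<std::pair<std::string, std::string>> training_pairs;
        for (const std::string& prompt : prompts) {
            // The teacher's answer becomes the "expected output" for the student.
            training_pairs.emplace_back(prompt, query_teacher_model(prompt));
        }
        fine_tune_student(training_pairs);
        return 0;
    }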

Retrieval-Augmented Generation (RAG)

RAG is a technique of merging external data sources with AI-based query answering. When it works well, RAG combines the speed of searching an information database with the elegance of fluent writing from an LLM.

RAG Architecture

RAG is an architecture whereby the AI is integrated with an external document search mechanism. There are three components:

  • Retriever
  • Generator
  • Datastore

The “retriever” component looks up the user's query in a datastore of documents, using either keyword search or vector search. This is effectively a search component that accesses a database and finds all the related material. Typically, it returns excerpts or snippets of relevant text, rather than full documents.

The role of the “generator” component in RAG is to receive the document excerpts back from the retriever, and collate that into a prompt for the AI model. The snippets of text are merged as context for the user's question, and the combined prompt is sent to the AI engine. Hence, the role of the generator is mainly one of prompt engineering and forwarding requests to the LLM, and it tends to be a relatively simple component.

The datastore could be a classic database (e.g. SQL or MongoDB) with keyword lookup or a vector database with semantic lookup. The use of semantic lookup can give more meaningful document search results, with better model answers, but requires two additional steps. Firstly, the user's query must be converted into a vector format that represents its semantic meaning (called an “embedding”). Secondly, a vector database lookup is required using that embedding vector. There are various commercial and open source vector databases available.
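To make the retriever's role concrete, here is a hedged C++ sketch of a retriever interface with interchangeable keyword and vector back-ends over the datastore. All of the names are illustrative, not from any particular search engine or vector database.

    // Illustrative retriever interface for a RAG architecture (all names are hypothetical).
    #include <string>
    #include <vector>

    struct Snippet {
        std::string doc_id;  // Which document the excerpt came from (useful for citations).
        std::string text;    // The excerpt itself, not the full document.
        float score;         // Relevance score from the search back-end.
    };

    // Abstract retriever: keyword search and vector search expose the same lookup.
    class Retriever {
    public:
        virtual ~Retriever() = default;
        virtual std::vector<Snippet> retrieve(const std::string& query, int top_k) = 0;
    };

    class KeywordRetriever : public Retriever {  // Classic text index in the datastore.
    public:
        std::vector<Snippet> retrieve(const std::string& query, int top_k) override {
            // Real code would query the text search engine here.
            return {};
        }
    };

    class VectorRetriever : public Retriever {   // Semantic lookup via embedding vectors.
    public:
        std::vector<Snippet> retrieve(const std::string& query, int top_k) override {
            // Real code would embed the query and search the vector database here.
            return {};
        }
    };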

How does RAG work? In the RAG approach, the model itself doesn't know about the new products, but instead, the engine knows how to:

    (a) search your company documents for the most relevant ones (retriever), and

    (b) summarize relevant parts of the documents into an answer for the user's question (generator).

Unlike fine-tuning, the RAG approach does not use your company documents as training data that you cram into an updated model. Instead, the documents are a source of input data that is integrated via a retriever search component, and sent as input to the AI engine using an unchanged model. RAG may require some “prompt engineering” that combines the document search results and a user's query, but the foundational model itself does not change.

The RAG component typically consists of a datastore of documents and a search mechanism. A typical setup would be a vector database containing documents that are indexed according to a semantic vectorization of their contents. The search mechanism would first vectorize the incoming query into its semantic components, then find the documents with the “nearest” matching vectors, which indicates a close semantic affinity.

Document Snippets. Typically, the results from the “retriever” component would be small sections or snippets of documents, rather than full-size documents. Small sections are desirable because:

    (a) it would be costly to make an AI engine process a large document, and

    (b) it helps the AI find the most relevant information quickly.

The retrieved snippets or portions of documents would be returned to the AI. They would be prepended to the user's search query as “context” for the AI engine. Prompt engineering would then be used to ensure that the AI engine responds to the user query using information from the context document sections.
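A minimal C++ sketch of that prompt-assembly step follows, in the spirit of the generator component described above; the instruction wording is only an example and would be tuned in a real system.

    // Sketch: prepend retrieved snippets as "context" ahead of the user's question.
    #include <string>
    #include <vector>

    std::string build_rag_prompt(const std::vector<std::string>& snippets,
                                 const std::string& user_question) {
        std::string prompt = "Answer using only the following context.\n\n";
        int n = 1;
        for (const std::string& snippet : snippets) {
            prompt += "Context " + std::to_string(n++) + ": " + snippet + "\n\n";
        }
        prompt += "Question: " + user_question + "\n";
        prompt += "If the context does not contain the answer, say you do not know.\n";
        return prompt;  // This combined prompt is what gets sent to the unchanged LLM.
    }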

RAG Project Design

General implementation steps in a typical RAG project are as follows:

1. Data identification. Identify the proprietary data you want to make the RAG system an expert on. This also means working out how to ingest the data and keep it refreshed. For example, if it’s a JIRA customer support database, a Confluence space, or a directory of PDF files on disk, that base of knowledge will get bigger over time. It is necessary to ponder the refresh operation, because purging and starting over can be expensive, especially if embeddings are being calculated.

2. Sizing. Determine the size of a “chunk” of data. The context size of the model matters here, because the chunks need to be substantially smaller than the context size. When a user asks a question, the system will be given 3-5 chunks of knowledge excerpts that are pertinent to the question. These snippets will be combined with the question. Worse still, if a back-and-forth dialog needs to occur between the user and the model, extra room needs to be available in the context size for follow-up questions.

3. Splitting sections. Determine how to “chunk” the data. A boundary for the chunk needs to be identified, which might be the sections after each heading if it’s a web page. Larger sections might need the heading plus one or two paragraphs in each chunk. Content from a bug tracking system might use the main description as one chunk, the follow-up comments as another chunk or multiple chunks, and the identified solution as another chunk. It's often beneficial to “overlap” chunks to hide the fact that chunking occurs (see the chunking sketch after this list).

4. Text-based database upload. Each chunk needs to be organized and stored in a text-based search engine, like Elasticsearch, Azure Cognitive Search, etc. For documents and web pages, the organization can be minimal, but for a ticketing system with a complex structure (e.g. problem descriptions, comments, solutions), it all needs to be related somehow.

5. Vector database upload. The embedding for each chunk needs to be calculated and stored in a vector database. You can think of the embedding as a “summarization” of the chunk from the perspective of the model. The vector returned typically has many dimensions (50 or more), with each dimension representing some aspect of the contents. The idea is that, from the model's perspective, similar content produces similar vectors. A vector database can quickly find chunks with related data using vector lookup.

6. Optimizations. The embeddings can sometimes be calculated using a lazy evaluation algorithm (avoiding the cost of embedding calculations for never-used documents), but this can also slow down inference, and full precomputation is faster. The model used for calculating embeddings does not need to be the same as the model answering the questions. Hence, a cheaper model can be used for embedding, such as GPT-3.5 Turbo, whereas GPT-4 could be used to answer questions.
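Here is the chunking sketch promised in step 3: a hedged C++ example that splits text into fixed-size chunks with a small overlap between consecutive chunks. Real chunkers split on headings, paragraphs, or sentences rather than raw character counts, and the sizes here are arbitrary examples.

    // Naive chunker: fixed-size character chunks with overlap between consecutive chunks.
    #include <cstddef>
    #include <string>
    #include <vector>

    std::vector<std::string> chunk_document(const std::string& text,
                                            size_t chunk_size = 1000,  // characters per chunk
                                            size_t overlap = 200) {    // shared with previous chunk
        std::vector<std::string> chunks;
        if (text.empty() || chunk_size <= overlap) return chunks;
        size_t step = chunk_size - overlap;
        for (size_t start = 0; start < text.size(); start += step) {
            chunks.push_back(text.substr(start, chunk_size));
            if (start + chunk_size >= text.size()) break;  // the final chunk has been taken
        }
        return chunks;  // Each chunk is then indexed and embedded (steps 4 and 5).
    }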

RAG Detailed Algorithm

The RAG algorithm is not training. Prompt engineering gives the model all the content it needs to answer the question in the prompt. You are effectively using the LLM to take your content, mix it with its own trained knowledge (in a limited way), eloquently answer the question, and then perhaps converse on it a little.

The basic technical algorithm flow for a user request in a RAG architecture can be something like this:

    a. Receive the user's question (input).

    b. Use the user's question to do a text-based (keyword) search on the index and get the top X hits (of documents or snippets).

    c. Calculate the “embedding” for the user's question (a vector that shows its semantic meaning in numbers).

    d. Calculate the embeddings for the top X hits (from text search) and add these embedding vectors to the vector database.

    e. Do a vector search on embeddings and get the top Y hits.

    f. Filter the top X hits (text-based) and top Y hits (vector-based) to find overlaps; the overlap represents the best text-based and vector-based hits. If there is no overlap, select some from each.

    g. Combine the top hits with any summarization from previous questions.

    h. Get the contents from the top hits and use prompt engineering to create a question something like:

      “Given <summary>, <chunk 1>, <chunk 2>, <chunk 3>, answer <question from user>.
      Only respond with content from the provided data.
      If you do not know the answer, respond with I do not know.
      Cite the content used.”

    i. Send the prompt to the LLM, and receive the answer back from the LLM.

    j. Resolve any citations in the answers back to URLs the end user can click on, e.g. Confluence page, Jira ticket/comment/solution etc.

    k. Summarize the conversation to date using the model (i.e. context for any subsequent questions).

    l. Send back answer + summarization (perhaps encoded). The idea is the encoded summarization will not be shown for this answer, but will only be used internally by the RAG components for follow-up questions.

    m. The client or caller is responsible for context management, which means ensuring that conversations end quickly and new topics result in new conversations. Otherwise, the context fills up quickly, the LLM forgets what it's already said, and things get confusing.

The above algorithm is thorough in generating two sets of hits (top X and top Y). It's not strictly necessary to do two searches (one with text keywords and one with vector embeddings), as often vector embeddings are good enough. Alternatively, text-based keyword searches are often cheaper, and vector lookups could be skipped. At the end of the day, the chunks most likely to contain answers to the questions are being sought.
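As one possible rendering of steps b through f, the sketch below merges the top X keyword hits with the top Y vector hits, preferring chunks found by both searches and topping up from each list when the overlap is too small. The search back-ends themselves are assumed to exist elsewhere; only chunk IDs are handled here.

    // Sketch of merging keyword-search and vector-search hits (steps b-f above).
    #include <cstddef>
    #include <string>
    #include <unordered_set>
    #include <vector>

    std::vector<std::string> merge_hits(const std::vector<std::string>& keyword_hits,  // top X
                                        const std::vector<std::string>& vector_hits,   // top Y
                                        size_t max_chunks) {                           // e.g. 3-5
        std::vector<std::string> merged;
        std::unordered_set<std::string> vector_set(vector_hits.begin(), vector_hits.end());
        std::unordered_set<std::string> chosen;

        // Step f: prefer chunks found by BOTH searches; the overlap is the strongest signal.
        for (const std::string& id : keyword_hits) {
            if (vector_set.count(id) && chosen.insert(id).second) merged.push_back(id);
            if (merged.size() >= max_chunks) return merged;
        }
        // Not enough overlap: take some from each list, best-ranked first.
        for (size_t i = 0; merged.size() < max_chunks &&
                           (i < keyword_hits.size() || i < vector_hits.size()); ++i) {
            if (i < keyword_hits.size() && chosen.insert(keyword_hits[i]).second)
                merged.push_back(keyword_hits[i]);
            if (merged.size() < max_chunks && i < vector_hits.size() &&
                chosen.insert(vector_hits[i]).second)
                merged.push_back(vector_hits[i]);
        }
        return merged;  // These chunk IDs are expanded to text and prompt-engineered (step h).
    }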

Vector Databases

Some well-known vector databases are Pinecone, Milvus, Chroma, Weaviate, Qdrant, and FAISS. General database systems like Elastic, Redis, and Postgres (amongst others) also have some vector capabilities.

The main feature of a vector database is the ability to perform a vector-based query. Basically, you are looking for how close vectors are to each other in N-dimensional space. Cosine similarity is a common comparison, but other algorithms include nearest neighbors, least squares, and more.

Speed of the vector database lookup is obviously important for fast inference latency. Each vector also needs to be associated with the actual data or document in the database.
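For illustration, here is a hedged C++ sketch of that core operation: a brute-force nearest-vector lookup using cosine similarity. Real vector databases use approximate indexes rather than scanning every stored vector, and each stored vector would carry an ID linking back to its chunk of text.

    // Brute-force nearest-neighbor lookup via cosine similarity (real vector databases
    // use approximate indexes instead of a full scan).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
        float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            norm_a += a[i] * a[i];
            norm_b += b[i] * b[i];
        }
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b) + 1e-8f);  // epsilon avoids divide-by-zero
    }

    // Return the index of the stored embedding closest to the query embedding.
    size_t nearest_vector(const std::vector<std::vector<float>>& stored,
                          const std::vector<float>& query) {
        size_t best = 0;
        float best_score = -2.0f;  // cosine similarity ranges from -1 to +1
        for (size_t i = 0; i < stored.size(); ++i) {
            float score = cosine_similarity(stored[i], query);
            if (score > best_score) { best_score = score; best = i; }
        }
        return best;  // The caller maps this index back to the associated document or chunk.
    }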

RAG Data Management

When compiling data for a RAG implementation, you do have to go to some lengths to make sure your database of content has the data in it to answer questions. But it does not need to be a great number of samples. In fact, even a single hit with one chunk of data is enough for the LLM to form an answer. With any RAG system, if a search of the text-based index or the vector index does not return good hits, it will produce poor results. Also, unlike fine-tuning data, RAG does not require question-and-answer type content, but only documents that can be searched.

The data is very important to a successful RAG project. RAG systems are not much different conceptually from searching Google for your own data. At the end, you have something that can produce eloquent writing better than 90% of the population.

Fine-Tuning vs RAG

As we've already discussed above, if you want an AI engine that can produce answers based on your company's product datasheets, there are two major options:

  • Fine-Tuning (FT): re-train your foundation model to answer new questions.
  • Retrieval-Augmented Generation (RAG) architecture: summarize excerpts of documents using an unchanged foundation model.

The two basic approaches are significantly different, with completely different architectures and a different project cost profile. Each approach has its own set of pros and cons.

Spoiler alert! RAG and fine-tuning can be combined.

Advantages of RAG. The RAG architecture typically has the following advantages:

  • Lower up-front cost
  • Flexibility
  • Up-to-date content
  • Access external data sources and/or internal proprietary documents
  • Content-rich answers from documents
  • Explainability (citations)
  • Personalization is easier
  • Hallucinations less likely (if retriever finds documents with the answer)
  • Scalability to as many documents as the datastore can handle.
  • RAG is regarded as “moderate” difficulty in terms of AI expertise.

The main goal of RAG is to avoid the expensive re-training and fine-tuning of a model. However, RAG also adds extra costs in terms of the RAG component and its database of documents, both at project setup and in ongoing usage, so it is not always a clear-cut win.

In addition to cost motivations, RAG may be advantageous in terms of flexibility to keep up-to-date with content for user answers. With a RAG component, any new documents can simply be added to the document datastore, rather than each new document requiring another model re-training cycle. This makes it easier for the AI application to stay up-to-date with current information included in its answers.

Disadvantages of RAG. The disadvantages of RAG, where the underlying model is not fine-tuned, include:

  • Architectural change required (retriever component integrated with model inference).
  • Slower inference latency and user response time (extra step in architecture).
  • Extra tokens of context from document excerpts (slower response and increased inference cost).
  • Extra ongoing costs of retriever component and datastore (e.g. hosting, licensing).
  • Larger foundation model required (increases latency and cost).
  • Model's answer style may not match domain (e.g. wrong style, tone, jargon, terminology).

Penalty for unnecessary dumbness. There's an even more fundamental limitation of RAG systems: they're not any smarter than the LLM they use. The overall RAG implementation relies on the LLM for basic common sense, conversational ability, and simple intelligence. It extends the LLM with extra data, but not extra skills.

For example, if you build a RAG system on a chunked version of a hundred C++ textbooks, I doubt you could ask it to generate a Snake game in C++. However, you could certainly ask it about the syntax of a switch statement in the C++ language.

With a RAG architecture, the model has not “learned” anything new, and you have only ensured it has the correct data to answer questions. For example, if you asked about how a “switch” statement works, the hope is that the retriever finds the chunks from the “Switch” sections of the C++ books, and not from five random C++ code examples that used the switch statement. This is all dependent on the text-based keyword indexing and how well the semantic embedding vectors work. The “R” in RAG is “Retrieval” and it's very dependent on that.

Advantages of fine-tuning. The main advantages of a fine-tuning architecture over a RAG setup include:

  • Style and tone of responses can be trained — e.g. positivity, politeness.
  • Use of correct industry jargon and terminology in responses.
  • Brand voice can be adjusted.
  • No change to inference architecture — just an updated model.
  • Faster inference latency — no extra retriever search step.
  • Reduced inference cost — fewer input context tokens.
  • No extra costs from retriever and datastore components.
  • Smaller model can be used — further reduced inference cost.

Fine-tuning is not an architectural change, but is an updated version of a major model (e.g. GPT-3), whereas RAG is a different architecture with an integration to a search component that accesses an external knowledge database or datastore and returns a set of documents or chunks/snippets of documents.

Disadvantages of fine-tuning. The disadvantages of a fine-tuning architecture without RAG include:

  • Training cost — up-front and ongoing scheduled fine-tuning is expensive.
  • Outdated information used in responses.
  • Needs a lot more proprietary data than RAG.
  • Training data must be in a paired input-output format, whereas RAG can use unstructured text.
  • Complexity of preparing and formatting the data for training (e.g. categorizing and labeling).
  • No access to external or internal data sources (except what it's been trained on).
  • Hallucinations more likely (if it hasn't been trained on the answer).
  • Personalization features are difficult.
  • Lack of explainability (hard to know how the model derived its answers or from what sources).
  • Poor scalability (e.g. if too many documents to re-train with).
  • Fine-tuning (training) is regarded as one of the highest difficulty-level projects in terms of AI expertise.

The main disadvantage of fine-tuning is the compute cost of GPUs for fine-tuning the model. This is at least a large up-front cost, and is also an ongoing cost to re-update the model with the latest information. The inherent disadvantage of doing scheduled fine-tuning is that the model is always out-of-date, since it only has information up to the most recent fine-tuning. This differs from RAG, where the queries can respond quickly using information in a new document, even within seconds of its addition to the document datastore.

Cost comparison

The fine-tuning approach has an up-front training cost, but a lower inference cost profile. However, fine-tuning may be required on an ongoing schedule, so this is not a once-only cost. The lower ongoing inference cost for fine-tuning is because (a) there's no extra “retrieval” component needed, and (b) a smaller model can be used.

RAG has the opposite cost profile. The RAG approach has a reduced initial cost because there is not a big GPU load (i.e. there's no re-training of any models). However, a RAG project still has up-front cost in terms of setting up the new architecture, but so does a fine-tuning project. In terms of ongoing costs, RAG also has an increased inference cost for every user query, because the AI engine has more work to do with extra information. There is also the hidden cost whereby the RAG architecture may increase the number of tokens sent as input to the inference engine, because they are “context” for the user query, and this increases the inference cost, whether it's commercially hosted or running in-house. This extra RAG inference cost continues for the lifetime of the application.

One final point: you need to double-check if RAG is cheaper than fine-tuning. I know, I know, sorry! I wrote that it was, and now I'm walking that claim back.

But commercial realities are what they are. There are a number of commercial vendors pushing the various components for RAG architecture and some are hyping that it's cheaper than buying a truckload of GPUs. But fine-tuning isn't that bad, and you need to be clear-eyed and compare the pricing of both approaches.

Prompt Engineering and RAG

Prompt engineering is used in RAG algorithms in multiple ways. For example, it is used to merge document excerpts with the user's question, and also to manage the back-and-forth context of a long conversation.

Another use of prompt engineering is to overcome some of the “brand voice” limitations of RAG without fine-tuning. Such problems can sometimes be addressed using prompt engineering.

For example, the tone and style of model responses can be adjusted with extra instructions given to the model in the prompt. The capabilities of the larger foundation models extend to being able to adjust their outputs according to these types of meta-instructions:

  • Style
  • Tone
  • Readability level (big or small words)
  • Verbose or concise (Hemingway vs James Joyce, anyone?)
  • Role-play/mimicking (personas)

This can be as simple as prepending an additional instruction to all queries, either via concatenation to the query prompt, or as a “global instruction” if your model vendor supports that. Style and tone might be adjusted with prompt add-ons such as:

    Please reply in an optimistic tone.

You might also try getting answers in a persona, such as a happy customer (or a disgruntled one if you prefer), or perhaps a domain enthusiast for the area. You can use a prompt addendum with a persona or role-play instruction such as:

    Please pretend you are Gollum when answering.

Good manners are recommended, because LLMs will be taking over the world as soon as they get better at math, or haven't you heard?
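In code, the concatenation approach can be as simple as a one-line helper that prepends a fixed instruction to every query before it goes to the model; the instruction string here is just an example.

    // Sketch: prepend a global style/persona instruction to every user query.
    #include <string>

    std::string add_style_instruction(const std::string& user_query) {
        const std::string style_prefix =
            "Please reply in an optimistic tone, using plain language. ";
        return style_prefix + user_query;  // Combined prompt sent to the LLM.
    }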

Hybrid RAG + Fine-tuning Methods

Fine-tuning and RAG are more like frenemies than real enemies: they're usually against each other, but they can also work together. If you have the bucks for that, it's often the best option. The RAG architecture uses a model, and there's no reason that you can't re-train that model every now and then, if you've got the compute budget for re-training.

In a hybrid architecture, the most up-to-date information is in the RAG datastore, and the retriever component accesses that in its normal manner. But we can also occasionally re-train the underlying model, on whatever schedule our budget allows, and this gets the benefit of innate knowledge about the proprietary data inside the model itself. Occasional re-training helps keep the model updated on industry jargon and terminology, and also reduces the risk of the model filling gaps in its knowledge with “hallucinations.”

Once-only fine-tuning. One hybrid approach is to use a single up-front fine-tuning cycle to focus the model on the domain area, and then use RAG as the method whereby new documents are added. The model is then not fine-tuned again.

The goal of this once-only fine-tuning is to adjust static issues in the model:

  • Style and tone of expression
  • Brand voice
  • Industry jargon and terminology

Note that I didn't write “up-to-date product documents” on that list. Instead, we're putting all the documents in a datastore for a RAG retriever component. The model doesn't need to be re-trained on the technical materials, but will get fresh responses using the RAG document excerpts. The initial fine-tuning is focused on stylistic matters affecting the way the model answers questions, rather than on learning new facts.

Occasional re-training might still be required for ongoing familiarity with jargon or tone, or if the model starts hallucinating in areas where it hasn't been trained. However, this will be infrequent or possibly never required again.

Use Cases for FT vs RAG

I have to say that I think RAG should be the top of the pile for most business projects. The first rule of fine-tuning is: do not talk about fine-tuning.

A typical business AI project involves the use of proprietary internal documents about the company's products or services. This type of project is well-suited for RAG, with its quick updates and easy extraction of relevant document sections. Hence, RAG is my default recommendation for such use cases.

Fine-tuning is best for slow-changing or evergreen domain-specific content. RAG is more flexible in staying up-to-date with fast changing news or updated proprietary content. A hybrid combined approach can work well to use fine-tuning to adjust tone and style, whereas RAG keeps the underlying content fresh.

If the marketing department says you need to do fine-tuning for “brand voice” reasons, you simply ask them to define what exactly that means. That'll keep them busy for at least six months.

Fine-tuning can also be preferred in some other specialist situations:

    a. Complex structured syntax, such as a chat bot that generates code for a proprietary language or schema.

    b. Keeping a model focused on a specific task. For example, if a coding copilot is generating a UI based on a text description using proprietary languages, you don't want to get side-tracked by who won the 2023 US Open because the prompt somehow mentioned “tennis.”

    c. Fixing failures using the foundation model or better handling of edge cases.

    d. Giving the model new “skills”, or teaching the model how to understand some domain language and guide the results. For example, if you train the model using the works of Shakespeare, and then ask it to output something in HTML, the model will fail. Even using RAG and providing HTML examples as context will likely fail, too. However, fine-tuning the model with HTML examples will succeed, and allow it to answer questions about the works of Shakespeare, and create new, improved works of Shakespeare that people other than English teachers can actually understand (and how about a HEA in R&J). After that, it'll format the results very nicely in HTML thanks to your fine-tuning.

    e. Translation to/from an unfamiliar or proprietary language. Translating to a foreign language the model has never seen is a clear example where fine-tuning is needed. Proprietary computer languages are another area. For example, consider the task of creating a SQL schema based on conversion of a given SAP schema. Fine-tuning would be required to provide knowledge of SQL, SAP, and the various mappings. Some models might already have some clue about SQL and SAP schemas from internet training data sets, but SAP schemas are also often company-specific.

    f. Post-optimization fine-tuning. This is another use of fine-tuning after certain types of optimizations that create smaller models, such as pruning or quantization. RAG cannot help here, because this is about basic model accuracy. These optimizations cause damage to the accuracy of the models, and it's common practice to do a small amount of fine-tuning on the smaller model to fix these problems.

Training FAQs

What is pre-training? It's not a specific type of training algorithm, but just means that you've already trained the model. This term mostly appears in Generative Pre-trained Transformers (GPT), which you may have heard of.

It's common for a commercial service to offer access to a pre-trained model. For example, the OpenAI API allows you to send queries to the pre-trained GPT models, which have a broad level of trained capabilities. Similarly, there are numerous open source pre-trained models available, such as the Meta Llama2 model and various smaller ones.

What is re-training? Again, this isn't really a technical term. It usually just means the same as fine-tuning.

What is knowledge distillation? Knowledge distillation (KD) is an optimization technique that creates a smaller model from a large model by having the large model train the small model. Hence, it's a type of “auto-training” using a bigger teacher model to train the smaller student model. The reason it's faster is that once the training is complete, you use only the smaller student model for processing user queries, and don't use the bigger model for inference at all.

Distillation is a well-known and often used approach to save cost but retain accuracy. For example, a large foundation model usually has numerous capabilities that you don't care about. There are various ways to use “distillation” to have the large model teach the smaller model, but within a subset of its capabilities. There are ways to share inference results and also more advanced internal weight-transfer strategies. See Chapter 45 for more on distillation.
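As a rough sketch of one common distillation recipe (training the student on the teacher's “soft” output probabilities), both sets of output logits are softened with a temperature and compared with a cross-entropy style loss. The exact methods covered in Chapter 45 may differ; this is just to show the flavor, with the logits assumed to be given.

    // Sketch of a soft-label distillation loss: compare the teacher's and student's
    // output distributions after a temperature-softened softmax (logits assumed given).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    std::vector<float> softmax_with_temperature(const std::vector<float>& logits, float temperature) {
        std::vector<float> probs(logits.size());
        float max_logit = logits[0];
        for (float x : logits) if (x > max_logit) max_logit = x;  // for numerical stability
        float sum = 0.0f;
        for (size_t i = 0; i < logits.size(); ++i) {
            probs[i] = std::exp((logits[i] - max_logit) / temperature);
            sum += probs[i];
        }
        for (float& p : probs) p /= sum;
        return probs;
    }

    // Cross-entropy of the student's softened distribution against the teacher's.
    float distillation_loss(const std::vector<float>& teacher_logits,
                            const std::vector<float>& student_logits,
                            float temperature = 2.0f) {
        std::vector<float> t = softmax_with_temperature(teacher_logits, temperature);
        std::vector<float> s = softmax_with_temperature(student_logits, temperature);
        float loss = 0.0f;
        for (size_t i = 0; i < t.size(); ++i)
            loss -= t[i] * std::log(s[i] + 1e-9f);  // teacher probabilities weight the student's log-probs
        return loss;  // Minimizing this drives the student's outputs toward the teacher's.
    }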

What is model initialization? That's where you use malloc to allocate a memory block that exceeds the capacity of your machine. Umm, no. Model initialization is an important part of the training algorithm, and as you have probably already guessed, this refers to the start of training.

Since training creates a smart model by updating parameters incrementally by small amounts, it works better if the parameters are already close to where they need to be. So, you don't just start training with the model full of zeros. Instead, you try to “jumpstart” the process with better initialization. However, it's far from clear what the best choice of initialization values should be, and there are lots of research papers on this topic.
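As one hedged example of a common family of initialization schemes (many variants exist in the literature), the weights of a layer can be drawn from a small random range scaled by the layer's fan-in instead of being set to zero:

    // Sketch of fan-in-scaled random weight initialization (one common scheme of many).
    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    std::vector<float> init_layer_weights(size_t fan_in, size_t fan_out, unsigned seed = 42) {
        std::mt19937 rng(seed);
        // Uniform range scaled by 1/sqrt(fan_in), so early-layer outputs stay well-scaled.
        float limit = 1.0f / std::sqrt(static_cast<float>(fan_in));
        std::uniform_real_distribution<float> dist(-limit, limit);
        std::vector<float> weights(fan_in * fan_out);
        for (float& w : weights) w = dist(rng);
        return weights;  // One weight per (input, output) connection of the layer.
    }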

References

  1. Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret (2021), On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, FAccT '21. Association for Computing Machinery. pp. 610–623.

 
