2. Transformers & LLMs
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Autobots, transform and roll out!”
— Transformers, 2007.
AI Engines & Models
An AI application is really just two components, and it's not very complicated:
- Engine — Transformer
- Model — LLM
Transformers are a type of neural network engine that calculates the answers in Generative AI. The Large Language Model (LLM) contains all of the data about the relationships between words and their relative positioning.
In terms of technology, the distinction between engines and models is also very simple:
- Engine — code
- Model — data
The runtime code is the “engine” and the grunt work is often done in C++ under a Python wrapper. The data is the “model” which is literally all numbers, and no code. So far, not so exciting.
Where it gets more interesting is in the complex meshing between engines and models. Not all engines work with all models, and vice-versa. Even Transformers are tightly interwoven with their LLM data. There are many variants of Transformer architectures, and the data won't work with an architecture that's different.
Engines and models are symbiotic and you need both to get anything done. An engine without a model means you ran out of compute budget, whereas a model without an engine cannot really occur because engines create models via training.
Engines
What's an engine? The engine is code that you have to write. All of the fast low-level code is usually written in C++, but the higher-level control code is often written in Python. Somebody has probably used Java to do AI engines, but I'm not a fan of having ten directory levels. If you're using Visual Basic or Perl, we're in trouble.
All of the action is done by the engine based on the data in the model file. The C++ engine needs to load the model file, receive a user query, crank the query through the model weights, and output the best ideas it can think of.
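To make that pipeline concrete, here's a minimal sketch in C++. Everything in it (the Model struct, load_model, run_inference) is a hypothetical placeholder, not a real engine API; the rest of this book is about what actually goes inside those functions.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical placeholder engine -- purely illustrative.
struct Model {
    std::vector<float> weights;   // billions of these in a real LLM
};

Model load_model(const std::string& path) {
    // A real loader would read billions of weights from the file at 'path'.
    Model m;
    m.weights = {0.1f, 0.2f, 0.3f};
    return m;
}

std::string run_inference(const Model& model, const std::string& query) {
    // A real engine would crank the query through model.weights here.
    return "Best answer I can think of for: " + query;
}

int main() {
    Model model = load_model("model.bin");    // 1. load the model file
    std::string query;
    std::getline(std::cin, query);            // 2. receive a user query
    std::cout << run_inference(model, query)  // 3. output the result
              << "\n";
    return 0;
}
```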
There are two main types of C++ engine, and they're so closely related that they're almost the same thing. Conceptually, there are two engines:
- Training engine
- Inference engine
The training engine computes answers to queries, compares the results to expectations, and then updates the weights in the model. The “loss function” calculates how close the results are to what's expected in the training data set. It's also sometimes called an “error function” because it computes an error metric between the computed results and the expected results. At a very high level, the basic architecture is:
Training engine = Inference engine + Loss function + Weight Updater
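As a concrete example of the loss function piece, here's a minimal mean-squared-error sketch in C++. It's just one simple choice of error metric for illustration; real LLM training usually uses a cross-entropy loss, but the idea of computing one number that measures "how wrong we were" is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean squared error: one simple example of a loss ("error") function.
// It returns a single number measuring how far the computed results are
// from the expected results in the training data set.
float mse_loss(const std::vector<float>& computed,
               const std::vector<float>& expected) {
    assert(computed.size() == expected.size() && !computed.empty());
    float sum = 0.0f;
    for (std::size_t i = 0; i < computed.size(); ++i) {
        float diff = computed[i] - expected[i];
        sum += diff * diff;
    }
    return sum / static_cast<float>(computed.size());
}
```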
The training engine is used for training (surprise!) and for mini-training tasks like “fine-tuning” the model with small amounts of extra data. The main purpose of the training engine is to create the model by continually updating the weights.
The inference engine handles user queries at runtime. It requires a model that has been built during training, which is used to answer the users' prompts according to whatever has been trained into the model.
These two types of engines have the same inference component. A “training engine” is the inference engine plus a mechanism to compare results with expectations and then update weights appropriately. The central difference is that a training engine changes the weights, because it's creating the model, whereas the weights are static during inference. The weights are not updated by user queries. If you like programming (hopefully?), here's another way to think about model weights:
- Training engine — Read/Write
- Inference engine — Read-Only
Both of these engines do the same inference computations on weights for the “Read” phase. Hence, they share a lot of components, but the training engine adds extra components (the “Write” parts). The basic hyper-parameters of the model (e.g. number of weights, number of layers) must be identical for the training and inference phases. Hence, a query computation done by the training engine is the same set of computations as the same query done by the inference engine after training is complete.
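One way to picture the Read-Only versus Read/Write split in C++ terms is const-correctness on the model weights. This is a conceptual sketch with made-up function names, not any particular engine's API:

```cpp
#include <string>
#include <vector>

struct Model {
    std::vector<float> weights;
};

// Inference: the weights are read-only, so the model is passed by const reference.
std::string infer(const Model& model, const std::string& query) {
    // ... reads model.weights, never writes them ...
    return "answer to: " + query + " (using " +
           std::to_string(model.weights.size()) + " weights)";
}

// Training: the weights get updated, so the model is passed by non-const reference.
void training_step(Model& model, float learning_rate, float gradient) {
    for (float& w : model.weights) {
        w -= learning_rate * gradient;   // the "Write" part (grossly simplified)
    }
}
```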
Transformers
What's a Transformer? It's an engine that processes LLMs, and is an advanced type of neural network. The first Transformer was open-sourced by Google Research in 2017, and then everything got out of hand. And now, I have to mention that Transformers can be of three basic types:
- Vanilla (encoder-decoder)
- Decoder-only (e.g. GPT or Gemini)
- Encoder-only (e.g. BERT)
Now you'll want definitions for those too? We'll be here all night, and it's not even a joke, because the whole book is about Transformers and LLMs. Here's a list of some of the well-known Transformer-LLM architectures:
- GPT-3 (OpenAI)
- GPT-4 (OpenAI)
- PaLM (Google)
- Gemini (Google)
- Llama (Meta)
- Claude (Anthropic)
Yes, I've missed a few! Although it all started as encoder-decoders in 2017, most of the modern Transformers are decoder-only (because it's faster). In addition, they've all tweaked the vanilla Transformer in lots of different ways and usually have also published nice research papers about it.
And I know what you're wondering: it's always about ChatGPT from OpenAI. No, ChatGPT isn't on the list, because it's not really a Transformer architecture or an LLM. Rather, it's more like an “app” (chatbot) or a “brand” or a “platform” that sits on top, using all that GPT stuff.
Models
What's a model? An AI model is literally a binary data file with mostly numbers and a few text strings. For really big models with billions of parameters, it's multiple files, but still only numbers and no code.
I really mean it: zero code. You won't find any programs or scripts, and not even any HTML markup. If you're looking for rules like “if the previous word was 'the' then output a noun” then you're out of luck, not to mention that you're about thirty years behind the times, because that's a rule-based expert system, and it's not how models work in this century.
The main thing in a model file is “weights” which are literally fractional numbers. Billions of them. They're sometimes called “parameters” when being more precise, but it's the same idea.
Weights are a multiplier of “signals” such as which word should be output next. A fractional number less than one makes a word less likely (decreasing a signal), whereas more than one increases the likelihood of outputting that word (amplifying a signal). A zero means don't output that word. A negative weight means really, really don't output the word.
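As a rough sketch of what "a weight multiplies a signal" looks like in code, here's the kind of weighted sum that an inference engine computes billions of times. This is a simplified illustration, not a real Transformer layer.

```cpp
#include <cstddef>
#include <vector>

// A weighted sum: each input signal is scaled by its weight and accumulated.
// Weights above 1.0 amplify a signal, fractions below 1.0 shrink it,
// zero switches it off, and a negative weight pushes against it.
float weighted_sum(const std::vector<float>& signals,
                   const std::vector<float>& weights) {
    float total = 0.0f;
    for (std::size_t i = 0; i < signals.size() && i < weights.size(); ++i) {
        total += signals[i] * weights[i];
    }
    return total;
}
```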
Programmers don't create model files. You won't have to edit a model file and click away on your calculator to get the right parameter numbers. The numbers inside a model file are auto-generated by the training engine.
In fact, it's hard even to look at a model file, because it's so crammed full of numbers. You can do a basic sanity check that it's not spoiled with bogus Inf (infinity) and NaN (not-a-number) floating-point values, but you can't see the intelligence by looking at the numbers, even if you squint.
However, programmers do have to decide on the meta-parameters for their model before they run the training phase.
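Here's roughly what that sanity check might look like in C++, assuming the weights have already been loaded into a simple float vector (a minimal sketch):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if every weight is a finite number, i.e. the model file
// hasn't been spoiled with bogus Inf or NaN values.
bool weights_look_sane(const std::vector<float>& weights) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (!std::isfinite(weights[i])) {   // catches both NaN and +/- infinity
            return false;
        }
    }
    return true;
}
```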
What are meta-parameters? The meta-parameters (also called hyper-parameters) of the model are counts such as how many billion parameters it has, how many layers, and how many different words it understands (typically around 50,000). These are all static, fixed numeric values. Most of the meta-parameters are fixed from training through to inference, such as the “dimensions” of the model (e.g. the number of “layers” in the model). The size of the model in terms of how many billions of parameters is mostly fixed, too, except there are some tricky ways to speed up inference by reducing or modifying parameters, called “pruning” and “quantization,” but now we're jumping ahead about twenty chapters.
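In code, the meta-parameters usually end up in a small configuration structure that training and inference must agree on. Here's a hypothetical example; the field names and the numbers are purely illustrative, not taken from any particular model.

```cpp
// Hypothetical meta-parameters (a.k.a. hyper-parameters) of a model.
// These are decided before training and must match at inference time.
// All of the numbers below are purely illustrative.
struct ModelConfig {
    int vocab_size     = 50000;  // how many different words/tokens it understands
    int num_layers     = 32;     // how many layers in the model
    int embedding_dim  = 4096;   // the "dimension" of the model
    int num_heads      = 32;     // attention heads per layer
    int context_length = 2048;   // maximum input length in tokens
};
```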
Large Language Models (LLMs)
What's an LLM? There's nothing really special about Large Language Models (LLMs) used by ChatGPT, Gemini, or Llama, compared to other types of AI model files, except that they're:
(a) large,
(b) language-focused (not images), and
(c) a model.
Well, you asked, so I answered.
More specifically, LLMs tend to be model files that are processed by Transformers, rather than other types of AI engines.
What's a Foundation Model? This is a large and general-purpose model that's already been broadly trained. Any model that has billions of parameters and gets mentioned in a press release is usually a foundation model. The biggest foundation models might support text in multiple languages along with programming language coding knowledge.
Technically, if a foundation model also has image-generating capabilities as well as text output, or can also receive an image as part of its input, then that's not a normal foundation model (i.e. it's not really an LLM). Instead, this advanced model type is often distinguished as a “multi-modal” model. And if there are two of those working together, then it's a “multi-modal multi-model” and you should try saying that ten times in a row.
Training and Fine-tuning
What's the difference between training and fine-tuning? At least three zeros.
Training is how you shove all of the brain power from the entire Wikipedia corpus into a bunch of numbers. It takes a long time and the GDP of a small country to train a big model. Training is the big cost of a lot of AI projects.
The bad news about training is that if you mess it up, you have to start all over again. Well, this isn't quite true, because training runs in batches of data. If the evaluation fails, you have to revert to the prior model candidate, since you can't “un-train” an AI model. However, a review can also suggest areas where a model needs more training, or needs to be directed towards new behavior or personality features. In addition to batched training, there is also research on “incremental learning” as a thing.
What is fine-tuning? Fine-tuning refers to smaller amounts of training that are done to a model that's already been fully trained. If you're training a new model from scratch, even a small one, then that's training, not fine-tuning.
The most common use of fine-tuning is to modify a powerful foundation model to do something more specific. Most foundation models have been broadly trained on general information. You might want to specialize the model for a particular use case or to use a new set of data. This can be done two ways:
- Fine-tuning
- Retrieval-Augmented Generation (RAG)
Proprietary business data is a common reason to fine-tune a foundation model (but there's also RAG to consider). For example, to create a support chatbot for customers using your products, you could fine-tune a foundation model on your company's internal product documents. In this way, a small amount of fine-tuning adds knowledge about the new data to the model, which it can then incorporate into its answers to users.
RAG is not training. Note that Retrieval-Augmented Generation (RAG) is not a type of training or fine-tuning. In fact, it's a way to avoid them like the plague. RAG is an architectural add-on where the Transformer can talk to a component that knows how to “retrieve” extra information or documents, such as proprietary internal business documents about your products. This extra data is used as input context during inference of the model, thereby extending the basic model to answer questions specific to this extra material. The point is that it avoids the expense of training and fine-tuning, while incurring some extra cost in implementing the RAG component.
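At a very high level, the RAG add-on looks something like this in code: retrieve some relevant text, staple it onto the user's question as extra context, and run ordinary inference. This is a conceptual sketch only; the retriever and engine functions are hypothetical stand-ins.

```cpp
#include <string>
#include <vector>

// Hypothetical placeholders for the two halves of a RAG setup.
std::vector<std::string> retrieve_documents(const std::string& query) {
    // A real retriever would run a keyword or vector search over your documents.
    return { "Product manual excerpt relevant to: " + query };
}

std::string run_inference(const std::string& prompt) {
    // The ordinary Transformer inference engine goes here.
    return "Answer based on: " + prompt;
}

std::string rag_answer(const std::string& user_question) {
    std::string prompt = "Answer using the following documents:\n";
    for (const std::string& doc : retrieve_documents(user_question)) {
        prompt += doc + "\n";          // retrieved text becomes extra input context
    }
    prompt += "Question: " + user_question;
    return run_inference(prompt);      // no training or fine-tuning required
}
```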
Data sets. High-quality training data is fundamental to both training and RAG techniques. Historically, most training data sets have been painstakingly compiled by humans. A newer technique is to use the output of one LLM as the input training data set for another model. This method and other types of “synthetic data” are being used more and more.
Inference
What is inference? The term “inference” is the AI way of saying “running” or “executing” the AI model. Inference and training are different phases. When you're training or fine-tuning, that's not inference. But when you're done and deploy your model live for a nickel a query, that's inference. When you ask ChatGPT a question, you're sending a “query” or “prompt” to its “inference engine” and when it politely refuses to do what you ask, that's the output results of its “inference” code.
What are latency and throughput? Latency is how fast your inference engine runs. It's similar to the idea of “response time” for a single user. Throughput is a measure of how many queries your engine can handle over time, which relates more to how fast your engine can handle a group of users submitting many queries.
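If you want to put rough numbers on those two terms, here's a minimal sketch using std::chrono, with a hypothetical run_inference stand-in for the real engine call. Note that in this serial loop latency and throughput are just reciprocals of each other; in a real serving system with batching and many parallel users, the two numbers diverge.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real inference engine call.
std::string run_inference(const std::string& query) {
    return "answer to: " + query;
}

// Time a batch of queries run one after the other.
void measure(const std::vector<std::string>& queries) {
    if (queries.empty()) return;
    auto start = std::chrono::steady_clock::now();
    for (const std::string& q : queries) {
        run_inference(q);
    }
    double seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "Average latency: " << seconds / queries.size() << " sec/query\n";
    std::cout << "Throughput: " << queries.size() / seconds << " queries/sec\n";
}
```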
Types of inference. There's not only one type of inference, and the exact algorithm depends on what you're trying to do. Some of the types include:
- Completions. This means extending the prompt into a longer answer. Common use cases include auto-writing text or answering questions.
- Translation. Convert your Python code comments into Klingon.
- Summarization. Taking the input prompt, such as a paragraph or document, and creating a brief summary.
- Grammatical Error Correction (GEC). Also known to non-researchers as “editing.”
- Transformation. Changing the tone or style of a text input, or changing the formatting and presentation.
- Categorization. Sorting the inputs into a set of different categories, which is effectively a subset of summarization.
Inference Settings. In addition to choosing the overarching type of inference algorithm, there are some common parameters to control an inference request.
- Temperature. A higher temperature setting gives your engine a fever, and makes it output silly words. This is known as “creativity.”
- Token limit. This is the simple idea of limiting the number of words (tokens) that the engine is allowed to output in its response.
- Formatting. Do you want the engine to output plain text, a table, or some other format?
This is only a sample list, and API providers typically have many more options. There are also usually various other parameters related to security and tracking of requests. For example, you probably have to submit your security credentials (i.e. password) along with a unique ID for the request. This helps the API validate your request and helps you keep track of which end user submitted the request so you can send the answer back to them.
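To make the temperature setting a little more concrete, here's a minimal sketch of the standard softmax-with-temperature calculation that turns the engine's raw scores into output-word probabilities. A higher temperature flattens the distribution (more “creativity”); a lower one sharpens it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Turn raw next-token scores ("logits") into probabilities, scaled by temperature.
// Higher temperature => flatter distribution => more surprising word choices.
// Lower temperature  => sharper distribution => safer, more predictable output.
std::vector<float> softmax_with_temperature(const std::vector<float>& logits,
                                            float temperature) {
    assert(!logits.empty() && temperature > 0.0f);
    // Subtract the maximum logit first for numerical stability (standard trick).
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) {
        p /= sum;
    }
    return probs;
}
```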
Context and Conversations
If you're creating a chatbot, you create a UI that accepts the user's input, sends it off to the AI engine via the network, and then outputs the answer back to the user. The user and the engine go back-and-forth with a stream of requests and responses, thereby creating a conversation.
Oh, really?
What's missing is the “context” of every request that's part of the conversation. You cannot just send the user's latest message off to the engine, because:
AI engines are stateless.
Hence, the default AI engine doesn't remember what else it's already said. Maybe it's because the GPUs have stolen all their RAM.
Instead, it's up to you as the programmer to store and re-submit the entire conversational history with every request. This is a wonderful situation when you're paying per input token.
It seems like the API vendors could handle context management for you, but I'm not aware of any that do yet. The OpenAI API provides helpful ways to structure the historical context in a request, but doesn't yet store it for you.
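In practice, that means your application keeps the conversation itself and ships the whole thing back with every request. Here's a minimal sketch of the idea; the Message layout and the send_to_engine call are hypothetical, loosely modeled on how chat APIs structure their requests.

```cpp
#include <string>
#include <vector>

// Hypothetical message record, loosely modeled on typical chat APIs.
struct Message {
    std::string role;     // "user" or "assistant"
    std::string content;
};

// Hypothetical stand-in for the network call to the stateless engine.
std::string send_to_engine(const std::vector<Message>& full_history) {
    // A real implementation would POST the whole history to the API.
    return "reply #" + std::to_string(full_history.size());
}

class Conversation {
public:
    std::string ask(const std::string& user_text) {
        history_.push_back({"user", user_text});
        // The engine remembers nothing, so the ENTIRE history is resent
        // every time (and you pay for all of those input tokens again).
        std::string reply = send_to_engine(history_);
        history_.push_back({"assistant", reply});
        return reply;
    }
private:
    std::vector<Message> history_;   // the context lives here, not in the engine
};
```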
Extended Transformers
The main type of Transformer that gets all the hype is the Generative Pre-Trained Transformer (GPT). This is the basic text processing Transformer that can process words and generate output with surprisingly human-like elegance.
Modern research has been applying Transformers to other types of inputs and use cases. The result has been various new extensions of Transformer architectures.
- Multi-modal Transformer. This refers to Transformers that can accept images (or video) as input, as well as simple text prompts.
- Vision Transformer (ViT). These apply Transformer technology to computer vision applications, such as self-driving cars.
- Bidirectional Transformer. This is an older research architecture that hasn't received as much attention lately. The idea is that it can examine its input data from both directions at the same time. The main example is “Bidirectional Encoder Representations from Transformers” (BERT) and its many variants.
- Retrieval-Augmented Generation (RAG). This is an architecture where a Transformer is combined with a separate component that “retrieves” extra data (e.g. a document search mechanism). The idea is to extend the AI engine to new data without extra training.
- Ensemble inference. An “ensemble cast” is a Hollywood term that means a film with a group of famous actors all starring together in the same story. Someone with a sense of humor (or very large ambitions) decided to use the same term for a group of AI models all working together to create the same masterpiece.
Some of the major areas of Transformer research involve addressing the resource-hungry nature of their execution. For example, a basic Transformer has quadratic cost complexity in terms of the input length. Hence, there are numerous modifications in Transformer architectures being created, both in industry and research labs. See Part VII of this book for a full literature review of the extensive body of research related to Transformers.
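To see where the quadratic cost comes from: self-attention compares every input token with every other input token, which in its simplest form is a double loop over the input length. This is a much-simplified scalar sketch to show the O(n^2) shape of the work; real attention operates on query, key, and value vectors, not single numbers.

```cpp
#include <cstddef>
#include <vector>

// Much-simplified illustration of why attention cost is quadratic in the
// input length n: every token is compared with every token, so the score
// "matrix" has n * n entries.
std::vector<std::vector<float>> attention_scores(const std::vector<float>& tokens) {
    const std::size_t n = tokens.size();
    std::vector<std::vector<float>> scores(n, std::vector<float>(n, 0.0f));
    for (std::size_t i = 0; i < n; ++i) {       // n iterations...
        for (std::size_t j = 0; j < n; ++j) {   // ...times n iterations
            scores[i][j] = tokens[i] * tokens[j];
        }
    }
    return scores;  // doubling the input length quadruples this work
}
```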
Other Types of Neural Networks
The Transformer was a breakthrough in the evolution of neural networks. One of its main advantages was its capacity to perform calculations in parallel, allowing it to increase intelligence through sheer brute-force algorithms. This led to a massive increase in the size of models into multi-billion parameter scale, which we now call Large Language Models (LLMs).
Before the Transformer, there were many different neural network architectures. Several of these designs are still being used today in areas where they are stronger than Transformers.
Recurrent Neural Networks (RNNs). An early type of neural network that worked iteratively through a sequence. An RNN processes its inputs one token at a time, creating its output response, and then re-enters its own output as an input to its next phase. Hence, it is “recurrent,” re-processing its own output, which is known as “auto-regressive” mode when the same idea appears in Transformers. Transformers have largely displaced RNNs for applications in text processing and generative AI. However, there are still research papers attempting to revive RNNs with advancements, or to create hybrid Transformer-RNN architectures.
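Here's a toy sketch of that recurrent loop, boiled down to scalars. Real RNNs use vectors, weight matrices, and learned parameters; this only shows the "output feeds back in as input" shape of the computation.

```cpp
#include <cmath>
#include <vector>

// Toy recurrence: each step mixes the current input with the previous
// step's output (the "hidden state"), and that output re-enters the loop.
std::vector<float> rnn_like_loop(const std::vector<float>& inputs,
                                 float w_input, float w_hidden) {
    std::vector<float> outputs;
    float hidden = 0.0f;                          // nothing remembered yet
    for (float x : inputs) {                      // one token at a time, in order
        hidden = std::tanh(w_input * x + w_hidden * hidden);
        outputs.push_back(hidden);                // fed back in on the next step
    }
    return outputs;
}
```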
Generative Adversarial Networks (GANs). These are an advanced type of image-generating neural network. The idea is to combine two models: one that generates candidate images (the “generator”) and another that evaluates them (the “discriminator”). By a weird kind of “fighting” between the two models, the generator gradually creates better images that please the discriminator. The results are surprisingly effective, and this technology is still in use today.
Convolutional Neural Networks (CNNs). Whereas RNNs and Transformers are focused on input sequences, CNNs are better at input data that has a structure, especially the spatial structure inherent in images. Modern image processing and computer vision technology still uses CNNs, although enhanced Transformer architectures, such as multimodal or vision transformers, can also be used. CNNs are good at splitting an image into separate input “channels” and then applying a “filter” to each channel. Hence, CNNs have been holding their own against Transformers in areas related to image processing.
There are various other types of neural network, which all have some research attention:
- Long short-term memory (LSTM). A type of RNN.
- Spiking neural networks (SNNs)
- Liquid neural networks (LNNs)
- Quantum neural networks (QNNs)
This book is mostly about Transformers, so the interested reader is referred to the research literature for these architectures. As a general rule, there are so many research papers being written about AI that there are literally exceptions to everything. But those intrepid researchers are doing a great service to programmers by giving us lots of gritty algorithms to code up.