AI Engines and Models

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

An AI application is really just two components, and it's not very complicated:

  • Engine — Transformer
  • Model — LLM

Transformers are a type of neural network engine that calculates the answers in Generative AI. The Large Language Model (LLM) contains all of the data about the relationships between words and their relative positioning.

In terms of technology, the distinction between engines and models is also very simple:

  • Engine — code
  • Model — data

The runtime code is the “engine” and the grunt work is often done in C++ under a Python wrapper. The data is the “model” which is literally all numbers, and no code. So far, not so exciting.

Where it gets more interesting is in the complex meshing between engines and models. Not all engines work with all models, and vice-versa. Even Transformers are tightly interwoven with their LLM data. There are many variants of Transformer architectures, and the data won't work with an architecture that's different.

Engines and models are symbiotic, and you need both to get anything done. An engine without a model means you ran out of compute budget, whereas a model without an engine cannot really occur, because engines create models via training.

Engines

What's an engine? The engine is code that you have to write. All of the fast low-level code is usually written in C++, but the higher-level control code is often written in Python. Somebody has probably used Java to do AI engines, but I'm not a fan of having ten directory levels. If you're using Visual Basic or Perl, we're in trouble.

All of the action is done by the engine based on the data in the model file. The C++ engine needs to load the model file, receive a user query, crank the query through the model weights, and output the best ideas it can think of.
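
Here's a minimal C++ sketch of that top-level flow, with the actual Transformer number-crunching stubbed out. All the names and the raw-floats file format are hypothetical placeholders, not from any real engine:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Load the model file: literally just numbers (weights), no code.
    std::vector<float> load_model(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::vector<float> weights;
        float w;
        while (in.read(reinterpret_cast<char*>(&w), sizeof(w)))
            weights.push_back(w);
        return weights;
    }

    // Stand-in for the real engine's computation over the weights.
    std::string run_inference(const std::vector<float>& weights,
                              const std::string& query) {
        return "(best answer from " + std::to_string(weights.size())
             + " weights for: " + query + ")";
    }

    int main() {
        std::vector<float> model = load_model("model.bin");  // 1. Load the model file.
        std::string query;
        std::getline(std::cin, query);                       // 2. Receive a user query.
        std::cout << run_inference(model, query) << "\n";    // 3. Crank and output.
    }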

There are two main types of C++ engine, and they're so closely related that they're almost the same thing. Conceptually, there are two engines:

  • Training engine
  • Inference engine

The training engine computes answers to queries, compares the results to expectations, and then updates the weights in the model. The “loss function” calculates how close the results are to what's expected in the training data set. It's also sometimes called an “error function” because it computes an error metric between the computed results and the expected results. At a very high level, the basic architecture is:

    Training engine = Inference engine + Loss function + Weight Updater

The training engine is used for training (surprise!) and for mini-training tasks like “fine-tuning” the model with small amounts of extra data. The main purpose of the training engine is to create the model by continually updating the weights.
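
To make that formula concrete, here's a toy C++ training loop for a “model” with exactly one weight: the forward pass is the inference component, the squared error is the loss function, and the gradient-descent step is the weight updater. A deliberately trivial sketch, not how a real Transformer trainer is structured:

    #include <iostream>

    int main() {
        float weight = 0.5f;              // The entire "model": one weight, starts off wrong.
        const float input = 2.0f;
        const float expected = 6.0f;      // Training data: 2 -> 6, so the ideal weight is 3.
        const float learning_rate = 0.1f;

        for (int step = 0; step < 20; ++step) {
            float output = weight * input;       // Inference engine: forward pass.
            float error = output - expected;
            float loss = error * error;          // Loss function: squared error metric.
            float gradient = 2.0f * error * input;
            weight -= learning_rate * gradient;  // Weight updater: nudge the model.
            std::cout << "step " << step << ": loss " << loss << "\n";
        }
        std::cout << "trained weight = " << weight << "\n";  // Converges toward 3.0.
    }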

The inference engine handles user queries at runtime. It requires a model that has been built during training, which is used to answer the users' prompts according to whatever has been trained into the model.

These two types of engines have the same inference component. A “training engine” is the inference engine plus a mechanism to compare results with expectations and then update weights appropriately. The central difference is that a training engine changes the weights, because it's creating the model, whereas the weights are static during inference. The weights are not updated by user queries. If you like programming (hopefully?), here's another way to think about model weights:

  • Training engine — Read/Write
  • Inference engine — Read-Only

Both of these engines do the same inference computations on weights for the “Read” phase. Hence, they share a lot of components, but the training engine adds extra components (the “Write” parts). The basic hyper-parameters of the model (e.g. number of weights, number of layers) must be identical for the training and inference phases. Hence, a query computation done by the training engine is the same set of computations as the same query done by the inference engine after training is complete.
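
In C++ terms, this read/write distinction maps naturally onto const-correctness. These signatures are hypothetical, just to illustrate the idea:

    #include <string>
    #include <vector>

    // Inference engine: the weights are read-only (const).
    std::string infer(const std::vector<float>& weights,
                      const std::string& query);

    // Training engine: the same inference math, but the weights are
    // writable, because every training step updates them.
    void train_step(std::vector<float>& weights,
                    const std::string& example,
                    const std::string& expected_result);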

Transformers

What's a Transformer? It's an engine that processes LLMs, and is an advanced type of neural network. The first Transformer was open-sourced by Google Research in 2017, and then everything got out of hand. And now, I have to mention that Transformers can be of three basic types:

  • Vanilla (encoder-decoder)
  • Decoder-only (e.g. GPT or Gemini)
  • Encoder-only (e.g. BERT)

Now you'll want definitions for those too? We'll be here all night, and it's not even a joke, because the whole book is about Transformers and LLMs. Here's a list of some of the well-known Transformer-LLM architectures:

  • GPT-3 (OpenAI)
  • GPT-4 (OpenAI)
  • PaLM (Google)
  • Gemini (Google)
  • Llama (Meta)
  • Claude (Anthropic)

Yes, I've missed a few! Although it all started with encoder-decoders in 2017, most modern Transformers are decoder-only (because it's faster). In addition, they've all tweaked the vanilla Transformer in lots of different ways, and have usually published nice research papers about it.

And I know what you're wondering: it's always about ChatGPT from OpenAI. No, ChatGPT isn't on the list, because it's not really a Transformer architecture or an LLM. Rather, it's more like an “app” (chatbot) or a “brand” or a “platform” that sits on top, using all that GPT stuff.

Models

What's a model? An AI model is literally a binary data file with mostly numbers and a few text strings. For really big models with billions of parameters, it's multiple files, but still only numbers and no code.

I really mean it: zero code. You won't find any programs or scripts, and not even any HTML markup. If you're looking for rules like “if the previous word was 'the' then output a noun” then you're out of luck, not to mention that you're about thirty years behind the times, because that's a rule-based expert system, and it's not how models work in this century.

The main thing in a model file is “weights,” which are literally fractional numbers. Billions of them. They're sometimes called “parameters” when being more precise, but it's the same idea.

Weights are multipliers of “signals,” such as how strongly a given word should be output next. A fractional weight less than one makes a word less likely (dampening the signal), whereas a weight greater than one increases the likelihood of outputting that word (amplifying the signal). A zero weight means don't output that word. A negative weight means really, really don't output the word.
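
Here's a tiny C++ illustration of those four cases, multiplying one raw signal by different weights (all the numbers are made up):

    #include <cstdio>

    int main() {
        const float signal = 1.0f;  // Raw signal for some candidate word.
        const float weights[] = { 0.5f, 1.5f, 0.0f, -2.0f };
        const char* effect[] = { "dampened", "amplified", "killed",
                                 "really, really suppressed" };
        for (int i = 0; i < 4; ++i)
            std::printf("weight %+.1f -> signal %+.1f (%s)\n",
                        weights[i], weights[i] * signal, effect[i]);
    }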

Programmers don't create model files. You won't have to edit a model file and click away on your calculator to get the right parameter numbers. The numbers inside a model file are auto-generated by the training engine.

In fact, it's hard even to look at a model file, because it's so crammed full of numbers. You can do a basic sanity check that it's not spoiled with bogus Inf (infinity) and NaN (not-a-number) floating-point values, but you can't see the intelligence by looking at the numbers, even if you squint. However, programmers do have to decide on the meta-parameters for their model before they run the training phase.
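
That sanity check is easy to code. Here's a minimal C++ scanner which assumes, purely for illustration, that the model file is a raw sequence of 32-bit floats; real model file formats have headers and other sections that you'd need to skip first:

    #include <cmath>
    #include <cstdio>
    #include <fstream>

    int main(int argc, char** argv) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s model.bin\n", argv[0]);
            return 1;
        }
        std::ifstream in(argv[1], std::ios::binary);
        float w;
        long long count = 0, bad = 0;
        while (in.read(reinterpret_cast<char*>(&w), sizeof(w))) {
            ++count;
            if (!std::isfinite(w)) ++bad;  // Catches both Inf and NaN.
        }
        std::printf("%lld weights scanned, %lld bogus (Inf/NaN)\n", count, bad);
        return bad != 0 ? 2 : 0;
    }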

What are meta-parameters? The meta-parameters of the model are counts such as how many billions of parameters it has, in how many layers, and how many different words it understands (typically around 50,000). These are all static, fixed numeric values. Most of the meta-parameters are fixed from training through to inference, such as the “dimensions” of the model (e.g. the number of “layers” in the model). The size of the model in terms of how many billions of parameters is mostly fixed, too, except that there are some tricky ways to speed up inference by reducing or modifying parameters, called “pruning” and “quantization,” but now we're jumping ahead about twenty chapters.
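
As a sketch, the meta-parameters could be collected into a plain C++ struct like this (the names and values are illustrative, not taken from any particular model):

    // Hypothetical meta-parameters; the values are typical, not from a real model.
    struct ModelConfig {
        int vocab_size = 50000;       // How many different words/tokens it understands.
        int num_layers = 32;          // How many layers (the model's "depth").
        int embedding_size = 4096;    // The model "dimension" (its "width").
        long long num_weights = 7000000000LL;  // Total parameter count (7B).
    };
    // All fixed before training, and must be identical at inference time.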

Large Language Models (LLMs)

What's an LLM? There's nothing really special about Large Language Models (LLMs) used by ChatGPT, Gemini, or Llama, compared to other types of AI model files, except that they're:

    (a) large,

    (b) language-focused (not images), and

    (c) a model.

Well, you asked, so I answered.

More specifically, LLMs tend to be model files that are processed by Transformers, rather than other types of AI engines.

What's a Foundation Model? This is a large and general-purpose model that's already been broadly trained. Any model that has billions of parameters and gets mentioned in a press release is usually a foundation model. The biggest foundation models might support text in multiple languages along with programming language coding knowledge.

Technically, if a foundation model has image-generating capabilities as well as text output, or can also receive an image as part of its input, then it's not a normal foundation model (i.e. it's not really an LLM). Instead, this advanced model type is often distinguished as a “multi-modal” model. And if there are two of those working together, then it's a “multi-modal multi-model” and you should try saying that ten times in a row.
