Aussie AI
1. Introduction to AI in C++
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“I've seen things you people wouldn't believe...
All those moments will be lost in time,
like tears in rain...”
— Blade Runner, 1982.
Everything's Bigger in AI
AI brings scale. Everything from the query rate of your C++ data structures to the size of your funding rounds will be much bigger. Running a query through an AI engine takes billions of floating-point operations in less than a second. Training an AI engine can take days or weeks and cost millions. The data sets have trillions of tokens, and the Large Language Models (LLMs) trained on them have parameter counts that are also heading up into the trillions nowadays.
With all this raw power comes fundamental capabilities. The latest AI engines and their models are amazing in terms of what they can offer. The basic levels are tasks like writing content, translating foreign languages, image generation, or code copiloting. On top of these basic capabilities, there are AI applications in almost every human endeavour, whether offered by longstanding companies or newly created startups. Every creative medium is on offer, from black-and-white sketching to auto-generated videos. And every vertical from medicine to law to finance already has specific tools to empower users.
What is AI?
AI is such a trendy and overhyped term that I hardly need to tell you what it stands for. Every single company on the planet is now calling themselves an “AI Company” and they're not incorrect. I mean, my toaster is technically an AI engine because there's silicon in there somewhere and it's “intelligent” enough not to burn bread.
And when you get your dream job as an overpaid Software Engineer doing LLMs at a major tech company, the phrase “AI Engineer” is a great term to impress your kids. Your official title, “ML Engineer”, not so much.
This cuts both ways, though. If you're haggling the price of your new car at the dealership, maybe stick to ML Engineer. Similarly, if you send your resume to a major tech company with “AI Engineer” as your career aspiration, they'll throw it in the trash and say, “Noob!” with a bemused look on their face.
AI means anything you want it to, but ML means “Machine Learning” to anyone important enough to have that title. The category of ML is specific to a piece of software that actually “learns” to be smarter (e.g. by “training”). The main ones this book is about are:
- Transformers (e.g. ChatGPT's “engine”)
- Large Language Models (LLMs) (e.g. GPT-3 or GPT-4)
- Neural Networks (NNs) (i.e., an “artificial brain” simulated in C++.)
The general category of Deep Learning (DL) is the subset of ML involving neural networks. Hence, Transformers are a subset of DL. Some of the other more specific types of ML include:
- Computer Vision (CV)
- Autonomous Vehicles (AVs) — self-driving cars
- Product Recommendation Systems — e-commerce
- Machine Translation (MT) — foreign language translation
- Content relevancy algorithms — social media feeds
Looking forward, some of the aspirations of the AI industry are capabilities such as:
- Artificial General Intelligence (AGI) — human-like reasoning
- Artificial Super-Intelligence (ASI) — who knows what?
The State of AI
There's so much going on in the AI industry that these words are out-of-date the second that I type them. Nevertheless, here are a few general thoughts on where we are:
AI is amazing. I'm still astounded by the capabilities of the latest AI apps, whether it's in creating fluent text or vibrant realistic images. There are so many advances happening in other areas such as speech, vision, animation, and video. The whole industry is evolving rapidly at such speed that I need an AI copilot to help me keep up with all the news.
AI is expensive. Remember the joke about how “boat” stands for “Bring Out Another Thousand”? That's nothing compared to AI. A single GPU costs more than your boat and a typical motherboard has eight of them. And the big companies have been buying these by the thousands. What should LLM stand for? “Lavish Leviathan Mammoth”? That was mine. “Ludicrously Large Mango” was Bing Chat with GPT-4's AI suggestion. Neither is great, which is comforting because it means there's still some work to be done.
AI is not new. The AI-related workload hosting market is many years old. Just because GenAI has blasted into consumer consciousness, and into boardroom discussions as a result, doesn't mean that AI is new. The cloud hosting companies like Amazon AWS, Microsoft Azure, and Google GCP, have been doing AI workloads for many customers, for many years. Instead of using GPUs for GenAI, they've been running workloads in other AI areas like Machine Learning (ML), machine vision (e.g. Tesla autonomous cars), product suggestion feeds, predictive modeling, auto-completion of search queries, and so on. There were already billions of dollars invested in AI long before ChatGPT set the web on fire.
AI Phones. AI is going to be on your phone, and it's going to be a big driver of new phone purchases. There are already low-end AI models that can run on your desktop PC, but the same isn't really true of phones yet. We're at the start of AI adoption inside phone apps, and there aren't many examples so far. See Chapter 3 for more about AI phones.
AI PCs. AI models and applications are set to make PCs hot again in the near-term. The next generation of laptops and desktops will likely run some AI models natively, and there will also be hybrid architectures with AI workloads offloaded into the cloud. The first generation is likely to include “AI Developer PCs” because software developers typically have high-end PCs, and various existing AI models can already run on desktop PCs. For end user applications, the model still has to run fast to give the user a decent response time, so there are still some significant obstacles for AI models on non-developer PCs, but hybrid cloud architectures will likely hide a lot of the limitations of native AI execution. It is early days in this trend, but it's surely going to be a major technology driver for years to come. See Chapter 4 for how to run a C++ AI engine on your desktop PC.
Green AI. The widespread use of AI makes it a significant contributor to energy consumption, and there is much research on the environmental impact from AI computing. On the positive side, this means that all of the research towards AI improvements is helpful for green AI, since it will also reduce its carbon footprint and environmental impacts. All of those C++ code optimizations to speed up the AI engine are also making things greener overall.
The Market for AI
Here are some future-looking thoughts about what the market for AI may look like. It seems likely that C++ programmers will be required for a little while longer.
It's a marathon, not a sprint. Consumers may continue to adopt genAI quickly, but that's not the most likely case for businesses. Whereas genAI is a hot topic in boardrooms, most businesses are still trying to find their feet in the area, with only exploratory projects launching. Small businesses and professionals (e.g. doctor's offices) will take years to adopt genAI, and larger enterprises will take even longer. There will be some early projects, sure, but the bulk of the B2B AI market will evolve more slowly. Projections for the B2B side of AI are over many years, even decades, with high CAGR. We've already seen this in the business adoption of cloud architectures, which is still ongoing, despite having been running since the early 2000s. The B2B AI market is likely to sustain very strong growth through 2030 and probably even into the 2040s and beyond.
B2B market opportunity trumps B2C. The massive ramp-up of consumer engagement with ChatGPT has made the consumer side seem hot. However, it's actually more likely to be the business side that makes more money (as usual). Predictions of the billions, maybe trillions, of dollars of benefit to economies through full AI integration into businesses, dwarf the predictions for consumer opportunities.
Training is the big B2B market? Early wisdom was that the high cost of training and fine-tuning would far exceed inference costs. This contention is somewhat in dispute, with some pundits saying that the sheer number of users will push inference ahead of training. Another factor is the trend toward using someone else's pre-trained LLM, whether it's GPT via the OpenAI API or the open source Llama models. Hence, there's definitely more inference than training in B2C projects, and it may also be taking over on the B2B side.
Fine-Tuning vs RAG. Most business AI projects will involve enhancing the model using proprietary data that the business owns. For example, a support chatbot has to learn information on the company's products, or an internal HR chatbot needs to use internal policy documents. There are two main ways to do this: fine-tuning or Retrieval-Augmented Generation (RAG). Current training and fine-tuning methods take a long time, need a lot of GPUs, and cost a great deal. However, RAG is becoming widely used to avoid the cost of fine-tuning.
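To make the distinction concrete, here is a minimal conceptual sketch of the RAG flow in C++. It is not a real framework: retrieve_documents and call_llm are hypothetical stand-ins for a vector-database lookup and a hosted LLM API call, and the document text is invented.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a vector database lookup (a real system would
    // do an embedding similarity search over the company's own documents).
    std::vector<std::string> retrieve_documents(const std::string& /*query*/) {
        return { "Policy doc: Employees accrue 20 days of annual leave per year." };
    }

    // Hypothetical stand-in for a call to a hosted LLM API.
    std::string call_llm(const std::string& prompt) {
        return "LLM response to a prompt of " + std::to_string(prompt.size()) + " characters";
    }

    // RAG in a nutshell: retrieve relevant proprietary documents at query time
    // and prepend them to the prompt, rather than baking that knowledge into
    // the model's weights via fine-tuning.
    std::string rag_answer(const std::string& question) {
        std::string prompt = "Answer using only the context below.\n\nContext:\n";
        for (const auto& doc : retrieve_documents(question)) {
            prompt += doc + "\n";
        }
        prompt += "\nQuestion: " + question + "\n";
        return call_llm(prompt);
    }

    int main() {
        std::cout << rag_answer("How much annual leave do I get?") << "\n";
        return 0;
    }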
Inference vs Training in the B2C market. Even the B2C genAI bots need continuous training and fine-tuning, to keep up with current events, so there will also be significant training cost (or RAG costs) in the B2C market. However, with millions of users for B2C apps, the cost of inference should overshadow training costs in the long run.
AI Technology Trends
Multi-model AI is here already. We're in the early stages of discovering what can be achieved by putting multiple AI models together. The formal research term for this is “ensemble” AI. For example, GPT-4 is rumored to be an eight-model architecture, and this will spur on many similar projects. As multiple-model approaches achieve greater levels of capability, this will in turn create further demand for AI models and their underlying infrastructure. This will amplify the need for optimizations in the underlying C++ engines.
Multimodal engines. Multimodality is the ability of an AI to understand inputs in both text and images, and to produce outputs in both as well. Google Gemini is a notable large multimodal model. This area of technology is only at the beginning of its journey.
Longer Contexts. The ability of AI engines to handle longer texts has been improving, both in terms of computational efficiency and better understanding and generation results (called “length generalization”). GPT-2 had a context window of 1024 tokens, GPT-3 had 2048, and GPT-4 originally had versions from 4k up to 32k, but has now advanced to 128k tokens as I write this (November, 2023). An average fictional novel starts at 50,000 words and goes up to 200,000 words, so we're getting to the neighborhood of having AI engines generate a full work from a single prompt, although, at present, the quality is rather lacking compared to professional writers.
AI Gadgets. The use of AI in the user interface has made alternative form factors viable. Some of the novel uses of AI in hardware gadgets include the Rabbit R1, Humane Ai Pin, and Rewind Pendant.
Intelligent Autonomous Agents (IAAs). These types of “smart agents” will work on a continual basis, rather than waiting for human requests. The architecture is a combination of an AI engine with a datastore and a scheduler.
Small Models. Although the mega-size foundation models still capture all the headlines, small or medium size models are becoming more common in both open source and commercial settings. They even have their own acronym now: Small Language Models (SLMs). Notably, Microsoft has been doing some work in this area with its Orca and Phi models. Apparently 7B is “small” now.
Specialized Models. High quality, focused training data can obviate the need for a monolithic model. Training a specialized model for a particular task can be effective, and at a much lower cost. Expect to see a lot more of this in medicine, finance, and other industry verticals.
Data Feed Integrations. AI engines cannot answer every question alone. They need to access data from other sources, such as the broad Internet or specific databases such as real estate listings or medical research papers. Third-party data feeds can be integrated using a RAG-style architecture.
Tool Integrations. Answering some types of questions requires integration with various “tools” that the AI Engine can use for supplemental processing in user requests. For example, answering “What time is it?” is not possible via training with the entire Wikipedia corpus, but requires integration with a clock. Implementing an AI engine so that it knows both when and how to access tools is a complex engineering issue.
The Need for Speed. The prevailing problem at the moment is that AI engines are too inefficient, requiring too much computation and too many GPU cycles. Enter C++ as the savior? Well, yes and no. C++ is already in every AI stack, so the solution will be better use of C++ combined with research into better algorithms and optimization techniques, and increasingly powerful hardware underneath the C++ code.
Why AI and C++?
As a programmer, your job is to harness the power of your AI platform and offer it up to your many users in top-level features. Whether your AI project is about writing sports content or auto-diagnosing X-ray images, your work as an AI developer is based on fundamentally the same architecture. And to do this at a scale that matches the capability of your workhorse models, you need a programming language to match its power.
I'll give you three guesses which one I recommend.
C++ is on the inside of all AI engines. Whereas Python is often on the outside wrapping around the various models, C++ is always closer to the machine and its hardware. PyTorch and Tensorflow have lots of Python code on the top layers, but the grunt work underneath runs in highly optimized C++ code.
The main advantage of C++ is that it is super-fast and has low-level capabilities that keep its operations close to the hardware instructions. This is a perfect match, because AI engines need to run blazingly fast, with hardware-acceleration integrations direct to the GPU to handle literally billions of arithmetic calculations. And yet, C++ is also a high-level programming language with support for advanced features like classes and modularity, so it's great for programmer productivity.
Why is C++ Efficient?
Before beginning our discussion of optimizing AI and C++, it is worth examining the origins of C and C++, and why these languages promote efficiency. The C++ language provides many features that make it easy for the programmer to write efficient code.
The C language was originally developed at AT&T’s Bell Laboratories by Dennis Ritchie, building on Ken Thompson’s earlier B language. It was intended to remove the burden of programming in assembler from the programmer, while at the same time retaining most of assembler’s efficiency. At times, C has been called a “typed assembly language”, and there is some truth in this description. One of the main reasons for C’s efficiency is that C programs manipulate objects that are the same as those manipulated by the machine: int variables correspond to machine words, char variables correspond to bytes, and pointer variables contain machine addresses.
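To see that correspondence concretely, here is a tiny snippet; the exact sizes are platform-dependent, and the comments assume a typical 64-bit system.

    #include <cstdio>

    int main() {
        int i = 42;      // typically one machine word (4 bytes on most platforms)
        char c = 'A';    // always one byte
        int* p = &i;     // holds a raw machine address

        printf("sizeof(int)  = %zu\n", sizeof i);    // commonly 4
        printf("sizeof(char) = %zu\n", sizeof c);    // always 1
        printf("sizeof(int*) = %zu\n", sizeof p);    // commonly 8 on 64-bit systems
        printf("address of i = %p\n", (void*)&i);    // just a number: the machine address
        return 0;
    }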
The early versions of C had no official standard, and the de facto standard was the reference manual in the first edition of Kernighan and Ritchie’s book titled The C Programming Language. In 1983 an effort was initiated to formally standardize the C language, and in 1989 the final ANSI standard appeared.
Then came C++.
C++ was designed by Bjarne Stroustrup in the early 1980s, and is almost a complete superset of C. One of the primary design objectives of C++ was to retain the efficiency of C. Most of the extra features of C++ do not affect run-time efficiency, but merely give the compiler more work to do at compile-time. Since C++ builds on C, it benefits from C’s use of data objects that are close to the machine: bytes, words and addresses.
Although it added encapsulation and modularity via classes, even the earliest versions of C++ contained many features to promote efficiency. The inline qualifier allowed programmers to request that a call to a function be replaced automatically by inline code, thus removing the overhead of a function call and introducing new opportunities for inter-function optimizations (this idea of compile-time optimization has since been expanded with the “constexpr” hint). The C++ concept of a reference type permitted large objects to be passed to functions by reference to improve efficiency, and references are safer to use than pointers.
Only a few aspects of the early C++ class features required run-time support that may reduce efficiency, such as virtual functions. However, even virtual functions were designed to be efficient from the beginnings of C++, and experienced C++ programmers find them invaluable.
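Here is a short sketch pulling those features together: an inline function, a constexpr function evaluated at compile-time, a reference parameter that avoids copying a large object, and a virtual function dispatched at run-time. The Layer and AttentionLayer names are invented for illustration, not taken from any particular engine.

    #include <iostream>
    #include <vector>

    // inline: a request to replace calls with the function body, avoiding call overhead.
    inline float scale(float x, float factor) { return x * factor; }

    // constexpr: can be evaluated entirely at compile-time.
    constexpr int square(int n) { return n * n; }

    // Reference parameter: the (possibly huge) vector is not copied.
    float sum(const std::vector<float>& v) {
        float total = 0.0f;
        for (float x : v) total += x;
        return total;
    }

    // Virtual functions: a small run-time cost (a vtable lookup) buys polymorphism.
    // These class names are invented purely for this example.
    struct Layer {
        virtual ~Layer() = default;
        virtual const char* name() const { return "generic layer"; }
    };
    struct AttentionLayer : Layer {
        const char* name() const override { return "attention layer"; }
    };

    int main() {
        static_assert(square(16) == 256, "computed at compile-time");
        std::vector<float> weights{ 0.5f, 2.0f, 1.5f };
        std::cout << sum(weights) << " " << scale(3.0f, 2.0f) << "\n";  // 4 6
        AttentionLayer attn;
        const Layer& base = attn;
        std::cout << base.name() << "\n";  // dispatches to AttentionLayer::name
        return 0;
    }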
C++ has evolved over the years into a massive language with endless standard libraries and classes available. Major features were incrementally added in C++11, C++14, C++17, C++20, and C++23 standards, and only a small number of features have been deprecated in each edition. Despite the ongoing additions, C++ has retained its overarching goal of highly efficient execution, and IMHO still remains the best choice for fast coding.
Why is AI Slow?
If C++ is so fast, then why is AI so slow? It's a fair question, since the computing power required by AI algorithms is legendary. Even with C++, the cost of training big models is prohibitive, and getting even small models to run fast on a developer's desktop PC is problematic.
But why?
The bottleneck is the humble multiplication.
All AI models use “weights”, which are numbers, often quite small fractions, that encode how likely or desirable a particular feature is. In an LLM, a weight might encode the probability of the next word being correct. For example, simplifying it considerably, a weight of “2.0” for the word “dog” would mean to make the word twice as likely to be the next word, and a weight of “0.5” for “cat” would mean to halve the probability of outputting that word. And each of these weights is multiplied against other probabilities in many of the nodes in a neural network.
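Here is a toy version of that idea; the word list, starting probabilities, and weights are invented purely for illustration.

    #include <cstdio>

    int main() {
        // Invented numbers matching the example above: a weight of 2.0 doubles
        // the likelihood of "dog", and a weight of 0.5 halves that of "cat".
        const char* words[]  = { "dog",  "cat",  "fish" };
        float probability[]  = { 0.30f,  0.40f,  0.30f };
        float weight[]       = { 2.0f,   0.5f,   1.0f  };

        float total = 0.0f;
        for (int i = 0; i < 3; ++i) {
            probability[i] *= weight[i];   // the humble multiplication
            total += probability[i];
        }
        for (int i = 0; i < 3; ++i) {
            // Re-normalize so the adjusted likelihoods sum to 1.0
            printf("%s: %.2f\n", words[i], probability[i] / total);
        }
        return 0;
    }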
How many multiplications? Lots! By which we mean billions every time it runs. A model of size 3B has 3 billion weights or “parameters” and each of these needs multiplication to work. GPT-3 as used by the first ChatGPT release had 175B weights, and GPT-4 apparently has more (it's confidential but an apparent “leak” rumored that it's a multi-model architecture with 8 models of 220B parameters each, giving a total of more than 1.7 trillion trained parameters).
Why so many weights? Short answer: because every weight is a little tiny bit of braininess.
Longer answer: because it has weights for every combination. Simplifying, a typical LLM will maintain a vector representation of words (called the model's “vocabulary”), where each number is the probability of emitting that word next. Actually, it's more complicated, with the use of “embeddings” as an indirect representation of the words, but conceptually the idea is to track word probabilities. To process these word tokens (or embeddings), the model has a set of “weights”, also sometimes called “parameters”, which are typically counted in the billions in advanced LLMs (e.g. a 3B model is considered “small” these days and OpenAI's GPT-3 had 175B).
Why is it slow on my PC? Each node of the neural network inside the LLMs is doing floating-point multiplications across its vocabulary (embeddings), using the weights, whereby multiplication by a weight either increases or decreases the likelihood of an output. And there are many nodes in a layer of an LLM that need to do these computations, and there are multiple layers in a model that each contain another set of those nodes. And all of that is just to spit out one word of a sentence in a response. Eventually, the combinatorial explosion of the sheer number of multiplication operations catches up to reality and overwhelms the poor CPU.
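To see where all those multiplications come from, here is a rough sketch of the core kernel, a matrix-vector multiply. The 4096x4096 size is illustrative only; a real LLM repeats an operation like this for several weight matrices in every layer, for every single token it generates.

    #include <cstdio>
    #include <vector>

    // Multiply a weight matrix (rows x cols, stored row-major) by an input vector.
    // Each output element costs 'cols' multiply-accumulate operations.
    void matvec(const std::vector<float>& W, const std::vector<float>& x,
                std::vector<float>& y, int rows, int cols) {
        for (int r = 0; r < rows; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < cols; ++c)
                sum += W[(size_t)r * cols + c] * x[c];   // one multiplication each
            y[r] = sum;
        }
    }

    int main() {
        const int rows = 4096, cols = 4096;   // illustrative layer dimensions
        std::vector<float> W((size_t)rows * cols, 0.01f), x(cols, 1.0f), y(rows);
        matvec(W, x, y, rows, cols);
        // One 4096x4096 matrix alone is ~16.8 million multiplications...
        printf("Multiplications for one matrix: %lld\n", (long long)rows * cols);
        // ...and a multi-billion-parameter model does this for every weight,
        // layer after layer, for every token it outputs.
        return 0;
    }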
Bigger and Smarter AI
Although the compute cost of AI is a large negative, let us not forget that this is what achieves the results. The first use of GPUs for AI was a breakthrough that heralded the oncoming age of big models. Without all that computing power, we wouldn't have discovered how eloquent an LLM could be when helping us reorganize the laundry cupboard.
Here's a list of some of the bigger models that have already been delivered in terms of raw parameter counts:
- MPT-30B (MosaicML) — 30 billion
- Llama2 (Meta) — 70 billion
- Grok-1 (XAI) — 70 billion
- GPT-3 (OpenAI) — 175 billion
- Jurassic-1 (AI21 Labs) — 178 billion
- Gopher (DeepMind/Google) — 280 billion
- PaLM-2 (Google) — 340 billion
- MT-NLG (Microsoft/NVIDIA) — 530 billion
- PaLM-1 (Google) — 540 billion
- Switch-Transformer (Google) — 1 trillion
- Gemini Ultra (Google) — (unknown)
- Claude 2 (Anthropic) — 130 billion (unconfirmed)
- GPT-4 (OpenAI) — 1.76 trillion (unconfirmed)
- BaGuaLu (Sunway, China) — 174 trillion (not a typo)
Note that not all of these parameter counts are official, with some based on rumors or estimates from third parties. Also, some counts listed here are not apples-to-apples comparisons. For example, Google's Switch Transformer is a different architecture.
The general rule of AI models still remains: bigger is better. If you're promoting your amazing new AI foundation model to investors, it'd better have a “B” after its parameter count number (e.g. 70B), and soon it'll need a “T” instead. All of the major tech companies are talking about trillion-parameter models now.
The rule that bigger is better is somewhat nuanced now. For example, note that Google's PaLM version 2 had fewer parameters (340B) than PaLM version 1 (540B), but more capabilities. It seems likely that a few hundred billion parameters is getting to be enough for most use cases, and there is more value in quality of training at that level.
Another change is the appearance of multi-model architectures. Notably, the rumored architecture of GPT-4 is almost two trillion parameters, but not in one model. Instead, the new architecture is (apparently) an eight-model architecture, each with 220 billion parameters, in a “mixture-of-experts” architecture, for a total of 1.76 trillion parameters. Again, it looks like a few hundred billion parameters is enough for quality results.
We're only at the start of the multi-model wave, which is called “ensemble architectures” in the research literature. But it seems likely that the overall count of parameters will go upwards from here, in the many trillions of parameters, whether in one big model or several smaller ones combined.
Faster AI
It'll be a few years before a trillion-parameter model runs on your laptop, but the situation is not hopeless for AI's sluggishness. After all, we've all seen amazing AI products such as ChatGPT that respond very quickly. They aren't slow, even with millions of users, but the cost to achieve that level of speed is very high. The workload sent to the GPU is immense and those electrons aren't free.
There is currently a large trade-off in AI models: go big or go fast.
The biggest models have trillions of parameters and are lumbering behemoths dependent on an IV-drip of GPU-juice. Alternatively, you can wrap a large commercial model via its provider's API (e.g. OpenAI's API, Google PaLM API, etc.); using a major API has a dollar cost, but it usually replies quickly.
Smaller models are available if you want to run fast. You can pick one of several smaller open-source models. Here's a list of some of them:
- Llama2 (Meta) — 70 billion
- MPT-30B (MosaicML) — 30 billion
- MPT-7B (MosaicML) — 7 billion
- Mistral-7B (Mistral AI) — 7 billion
The compute cost of models in the 7B range is much lower. The problem with using smaller models is that they're not quite as smart, although a 7B model's capabilities still amaze me. These can definitely be adequate for many use cases, but tend not to be for areas that require finesse in the outputs, or detailed instruction following. Given the level of intense competition in the AI industry, a sub-optimal output may not be good enough.
For more capability, there are larger open-source models, such as Meta's Llama2 models, which go up to 70 billion parameters. But that just brings us back to the high compute costs of big models. They might be free of licensing costs, but they're not free in terms of GPU hosting costs.
What about both faster and smarter? So, you want to have your cake and eat it, too? That's a little trickier to do, but I know of a book that's got hundreds of pages on exactly how to do that.
There are many ways to make an AI engine go faster. The simplest is to use more GPUs, and that's probably been the prevailing optimization used to date. However, companies can't go on with that business model forever, and anyway, we'll need even more power to run the super-advanced new architectures, such as the multi-model AI engines that are emerging.
Algorithm-level improvements to AI are required to rein in the compute cost in terms of both cash and environmental impact. An entire industry is quickly evolving and advancing to offer faster and more efficient hardware and software to cope with ever-larger models.
But you can save your money for that Galápagos vacation: code it yourself. This whole book offers a survey of the many ways to combat the workload with optimized data structures and algorithms.
Human ingenuity is also on the prowl for new solutions and there are literally thousands of research papers on how to run an AI engine faster. The continued growth of models into trillions of parameters seems like a brute-force solution to a compute problem, and many approaches are being considered to achieve the same results with fewer resources. Some of these ideas have made their way into commercial and open source engines over the years, but there are many more to be tested and explored. See Part VII of this book for an extensive literature review of state-of-the-art optimization research.