Aussie AI

Why AI and C++?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Why AI and C++?

As a programmer, your job is to harness the power of your AI platform and offer it up to your many users in top-level features. Whether your AI project is about writing sports content or auto-diagnosing X-ray images, your work as an AI developer is based on fundamentally the same architecture. And to do this at a scale that matches the capability of your workhorse models, you need a programming language to match its power.

I'll give you three guesses which one I recommend.

C++ is on the inside of all AI engines. Whereas Python is often on the outside wrapping around the various models, C++ is always closer to the machine and its hardware. PyTorch and TensorFlow have lots of Python code on the top layers, but the grunt work underneath runs in highly optimized C++ code.

The main advantage of C++ is that it is super-fast, with low-level capabilities that map its operations closely onto hardware instructions. This is a perfect match, because AI engines need to run blazingly fast, with hardware-acceleration integrations direct to the GPU to handle literally billions of arithmetic calculations. And yet, C++ is also a high-level programming language with support for advanced features like classes and modularity, so it's great for programmer productivity.

Why is C++ Efficient?

Before beginning our discussion of optimizing AI and C++, it is interesting to discuss the origins of C and C++, and examine why these languages promote efficiency. The C++ language provides many features that make it easy for the programmer to write efficient code.

The C language was originally developed at AT&T’s Bell Laboratories by Dennis Ritchie and Ken Thompson. It was intended to remove the burden of programming in assembler from the programmer, while at the same time retaining most of assembler’s efficiency. At times, C has been called a “typed assembly language”, and there is some truth in this description. One of the main reasons for C’s efficiency is that C programs manipulate objects that are the same as those manipulated by the machine: int variables correspond to machine words, char variables correspond to bytes, pointer variables contain machine addresses.
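
As a rough illustration in C++ (sizes are implementation-defined, so the values in the comments are just typical 64-bit figures, not guarantees):

    #include <cstdio>

    int main() {
        int word = 42;        // typically one machine word (e.g. 4 bytes)
        char byte = 'A';      // exactly one byte, by definition
        int* addr = &word;    // holds a raw machine address

        std::printf("sizeof(int)  = %zu\n", sizeof word);   // e.g. 4
        std::printf("sizeof(char) = %zu\n", sizeof byte);   // always 1
        std::printf("sizeof(int*) = %zu\n", sizeof addr);   // e.g. 8 on a 64-bit machine
        return 0;
    }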

The early versions of C had no official standard, and the de facto standard was the reference manual in the first edition of Kernighan and Ritchie’s book titled The C Programming Language. In 1983 an effort was initiated to formally standardize the C language, and in 1989 the final ANSI standard appeared.

Then came C++.

C++ was designed by Bjarne Stroustrup in the early 1980s, and is almost a complete superset of C. One of the primary design objectives of C++ was to retain the efficiency of C. Most of the extra features of C++ do not affect run-time efficiency, but merely give the compiler more work to do at compile-time. Since C++ builds on C, it benefits from C’s use of data objects that are close to the machine: bytes, words and addresses. Although it added encapsulation and modularity via classes, even the earliest versions of C++ contained many features to promote efficiency. The inline qualifier allowed programmers to request that a call to a function be replaced automatically by inline code, thus removing the overhead of a function call, and introducing new opportunities for inter-function optimizations (this idea of compile-time optimization has since been expanded with the “constexpr” hint). The C++ concept of a reference type permitted large objects to be passed to functions by reference to improve efficiency, and references are safer to use than pointers. Only a few aspects of the early C++ class features required run-time support that may reduce efficiency, such as virtual functions. However, even virtual functions were designed to be efficient from the beginning of C++, and experienced C++ programmers find them invaluable.
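
Here is a small sketch of those features, using hypothetical names (half, square, sum, Layer, ReluLayer) purely for illustration:

    #include <vector>

    // "inline" asks the compiler to replace a call with the function body,
    // removing call overhead; "constexpr" goes further and lets the result
    // be computed entirely at compile-time.
    inline double half(double x) { return 0.5 * x; }
    constexpr double square(double x) { return x * x; }
    constexpr double kNine = square(3.0);   // evaluated by the compiler, not at run-time

    // A reference parameter passes a large object without copying it,
    // and is safer to use than a raw pointer.
    double sum(const std::vector<double>& v) {
        double total = 0.0;
        for (double x : v) total += x;
        return total;
    }

    // Virtual functions need run-time dispatch, but the cost is one indirect call.
    struct Layer {
        virtual ~Layer() = default;
        virtual double forward(double x) const = 0;
    };
    struct ReluLayer : Layer {
        double forward(double x) const override { return x > 0.0 ? x : 0.0; }
    };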

C++ has evolved over the years into a massive language with endless standard libraries and classes available. Major features were incrementally added in C++11, C++14, C++17, C++20, and C++23 standards, and only a small number of features have been deprecated in each edition. Despite the ongoing additions, C++ has retained its overarching goal of highly efficient execution, and IMHO still remains the best choice for fast coding.

Why is AI Slow?

If C++ is so fast, then why is AI so slow? It's a fair question, since the computing power required by AI algorithms is legendary. Even with C++, the cost of training big models is prohibitive, and getting even small models to run fast on a developer's desktop PC is problematic.

But why?

The bottleneck is the humble multiplication. All AI models use “weights” which are numbers, often quite small fractions, that encode how likely or desirable a particular feature is. In an LLM, it might encode the probabilities of the next word being correct. For example, simplifying it considerably, a weight of “2.0” for the word “dog” would mean to make the word twice as likely to be the next word, and a weight of “0.5” for “cat” would mean to halve the probability of outputting that word. And each of these weights is multiplied against other probabilities in many of the nodes in a neural network.
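
As a toy C++ illustration of that idea (not how a real engine stores or applies its weights):

    #include <cstdio>

    int main() {
        double prob_dog = 0.10, prob_cat = 0.10;    // baseline likelihoods
        double weight_dog = 2.0, weight_cat = 0.5;  // learned weights

        prob_dog *= weight_dog;   // "dog" becomes twice as likely  -> 0.20
        prob_cat *= weight_cat;   // "cat" becomes half as likely   -> 0.05

        std::printf("dog=%.2f cat=%.2f\n", prob_dog, prob_cat);
        return 0;
    }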

How many multiplications? Lots! By which we mean billions every time it runs. A model of size 3B has 3 billion weights or “parameters”, each of which needs to be multiplied for the model to work. GPT-3, as used by the first ChatGPT release, had 175B weights, and GPT-4 apparently has more (it's confidential, but an apparent “leak” rumored that it's a multi-model architecture with 8 models of 220B parameters each, giving a total of more than 1.7 trillion trained parameters).
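
A rough back-of-the-envelope C++ sketch of those magnitudes (assuming about one multiplication per weight per generated token, which simplifies a real Transformer considerably):

    #include <cstdio>

    int main() {
        const double params_3b   = 3e9;     // a "small" 3B model
        const double params_gpt3 = 175e9;   // GPT-3
        const double tokens = 100;          // a short paragraph of output

        std::printf("3B model:   ~%.0e multiplications\n", params_3b * tokens);
        std::printf("GPT-3 175B: ~%.0e multiplications\n", params_gpt3 * tokens);
        return 0;
    }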

Why so many weights? Short answer: because every weight is a little tiny bit of braininess.

Longer answer: because it has weights for every combination. Simplifying, a typical LLM will maintain a vector representation of words (called the model's “vocabulary”), where each number is the probability of emitting that word next. Actually, it's more complicated, with the use of “embeddings” as an indirect representation of the words, but conceptually the idea is to track word probabilities. To process these word tokens (or embeddings), the model has a set of “weights”, also sometimes called “parameters”, which are typically counted in the billions in advanced LLMs (e.g. a 3B model is considered “small” these days and OpenAI's GPT-3 had 175B).
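
As a conceptual sketch only (a real LLM works on embeddings and logits, and usually samples rather than always taking the top entry), picking the next token can be pictured as finding the highest probability in a vocabulary-sized vector:

    #include <algorithm>
    #include <vector>

    // Greedy choice: return the index of the most probable word in the vocabulary.
    int pick_next_token(const std::vector<float>& vocab_probs) {
        auto it = std::max_element(vocab_probs.begin(), vocab_probs.end());
        return static_cast<int>(it - vocab_probs.begin());
    }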

Why is it slow on my PC? Each node of the neural network inside the LLMs is doing floating-point multiplications across its vocabulary (embeddings), using the weights, whereby multiplication by a weight either increases or decreases the likelihood of an output. And there are many nodes in a layer of an LLM that need to do these computations, and there are multiple layers in a model that each contain another set of those nodes. And all of that is just to spit out one word of a sentence in a response. Eventually, the combinatorial explosion of the sheer number of multiplication operations catches up to reality and overwhelms the poor CPU.
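
A deliberately naive C++ sketch of one such layer shows where all those multiplications come from (a real engine uses contiguous buffers and vectorized GPU kernels, but the arithmetic count is the point here):

    #include <cstddef>
    #include <vector>

    // Dense layer as a matrix-vector product: every output element needs one
    // multiply per input element, so a single 4096x4096 layer already costs
    // about 16.8 million multiplications, repeated for every layer and every
    // output token.
    std::vector<float> matvec(const std::vector<std::vector<float>>& weights,
                              const std::vector<float>& input) {
        std::vector<float> output(weights.size(), 0.0f);
        for (std::size_t i = 0; i < weights.size(); ++i) {
            for (std::size_t j = 0; j < input.size(); ++j) {
                output[i] += weights[i][j] * input[j];   // the humble multiplication
            }
        }
        return output;
    }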

Bigger and Smarter AI

Although the compute cost of AI is a large negative, let us not forget that this is what achieves the results. The first use of GPUs for AI was a breakthrough that heralded the oncoming age of big models. Without all that computing power, we wouldn't have discovered how eloquent an LLM could be when helping us reorganize the laundry cupboard.

Here's a list of some of the bigger models that have already been delivered in terms of raw parameter counts:

  • MPT-30B (MosaicML) — 30 billion
  • Llama2 (Meta) — 70 billion
  • Grok-1 (xAI) — 70 billion
  • GPT-3 (OpenAI) — 175 billion
  • Jurassic-1 (AI21 Labs) — 178 billion
  • Gopher (DeepMind/Google) — 280 billion
  • PaLM-2 (Google) — 340 billion
  • MT-NLG (Microsoft/NVIDIA) — 530 billion
  • PaLM-1 (Google) — 540 billion
  • Switch-Transformer (Google) — 1 trillion
  • Gemini Ultra (Google) — (unknown)
  • Claude 2 (Anthropic) — 130 billion (unconfirmed)
  • GPT-4 (OpenAI) — 1.76 trillion (unconfirmed)
  • BaGuaLu (Sunway, China) — 174 trillion (not a typo)

Note that not all of these parameter counts are official, with some based on rumors or estimates from third parties. Also, some counts listed here are not apples-to-apples comparisons. For example, Google's Switch Transformer is a sparse mixture-of-experts architecture, so not all of its parameters are active for any given token.

The general rule of AI models still remains: bigger is better. If you're promoting your amazing new AI foundation model to investors, it'd better have a “B” after its parameter count number (e.g. 70B), and soon it'll need a “T” instead. All of the major tech companies are talking about trillion-parameter models now.

The rule that bigger is better is somewhat nuanced now. For example, note that Google's PaLM version 2 had fewer parameters (340B) than PaLM version 1 (540B), but more capabilities. It seems likely that a few hundred billion parameters is getting to be enough for most use cases, and that there is more value in the quality of training at that level.

Another change is the appearance of multi-model architectures. Notably, the rumored architecture of GPT-4 is almost two trillion parameters, but not in one model. Instead, the new architecture is (apparently) an eight-model architecture, each with 220 billion parameters, in a “mixture-of-experts” architecture, for a total of 1.76 trillion parameters. Again, it looks like a few hundred billion parameters is enough for quality results.

We're only at the start of the multi-model wave, which is called “ensemble architectures” in the research literature. But it seems likely that the overall count of parameters will go upwards from here, in the many trillions of parameters, whether in one big model or several smaller ones combined.

Faster AI

It'll be a few years before a trillion-parameter model runs on your laptop, but the situation is not hopeless for AI's sluggishness. After all, we've all seen amazing AI products such as ChatGPT that respond very quickly. They aren't slow, even with millions of users, but the cost to achieve that level of speed is very high. The workload sent to the GPU is immense and those electrons aren't free.

There is currently a large trade-off in AI models: go big or go fast.

The biggest models have trillions of parameters and are lumbering behemoths dependent on an IV-drip of GPU-juice. Alternatively, you can call a large commercial model through its provider's API (e.g. OpenAI's API, the Google PaLM API, etc.), which probably replies quickly, but every request has a dollar cost.

If you want to run fast, you can pick one of several smaller open-source models. Here's a list of some of them:

  • Llama2 (Meta) — 70 billion
  • MPT-30B (MosaicML) — 30 billion
  • MPT-7B (MosaicML) — 7 billion
  • Mistral-7B (Mistral AI) — 7 billion

The compute cost of models in the 7B range is much lower. The problem with using smaller models is that they're not quite as smart, although a 7B model's capabilities still amaze me. They can definitely be adequate for many use cases, but tend not to be for areas that require finesse in the outputs, or detailed instruction following. Given the level of intense competition in the AI industry, a sub-optimal output may not be good enough.

For more capability, there are larger open-source models, such as Meta's Llama2 models, which have up to 70 billion parameters. But that just brings us back to the high compute costs of big models. They might be free of licensing costs, but they're not free in terms of GPU hosting costs.

What about both faster and smarter? So, you want to have your cake and eat it, too? That's a little trickier to do, but I know of a book that's got hundreds of pages on exactly how to do that.

There are many ways to make an AI engine go faster. The simplest is to use more GPUs, and that's probably been the prevailing optimization used to date. However, companies can't go on with that business model forever, and anyway, we'll need even more power to run the super-advanced new architectures, such as the multi-model AI engines that are emerging.

Algorithm-level improvements to AI are required to rein in the compute cost in terms of both cash and environmental impact. An entire industry is quickly evolving and advancing to offer faster and more efficient hardware and software to cope with ever-larger models.

But you can save your money for that Galápagos vacation: code it yourself. This whole book offers a survey of the many ways to combat the workload with optimized data structures and algorithms.

Human ingenuity is also on the prowl for new solutions and there are literally thousands of research papers on how to run an AI engine faster. The continued growth of models into trillions of parameters seems like a brute-force solution to a compute problem, and many approaches are being considered to achieve the same results with fewer resources. Some of these ideas have made their way into commercial and open source engines over the years, but there are many more to be tested and explored. See Part VII of this book for an extensive literature review of state-of-the-art optimization research.
