Aussie AI
4. AI on Your Desktop
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“Creating a toy is like planting a seed of fun.”
— Caleb Chung, creator of Furby.
Your Desktop AI Engine
This chapter is about using C++ to get AI running on your developer desktop. Native execution of an AI engine on a basic developer PC is already possible, and all of the C++ code runs locally. You can play with all of the C++ code, try to understand it all, and make changes to optimize things.
But don't get too excited: it's not very fast.
The state-of-the-art is that developers can run smaller models locally on a fairly high-spec developer box. The ability to run the better and bigger models directly on a PC is still in the future, and there aren't yet any major software applications that run the AI engine natively on the PC.
But don't get frustrated; it's exciting! We're close to that capability, and that is where the major opportunity for advancement lies. We just need to tweak the C++ code a little.
Open Source C++ Transformer Engines
You don't need to create a whole C++ Transformer engine from scratch. There are multiple fully coded AI engines, with C++ source code to download, that are available under permissive open source licenses.
Some of the useful C++ engines that are great to experiment with on your desktop PC, or even a laptop, include these options:
- GGML: https://github.com/ggerganov/ggml (MIT License)
- Llama.cpp: https://github.com/ggerganov/llama.cpp (MIT License)
- StarCoder from BigCode: https://github.com/bigcode-project/starcoder (Apache 2.0 License)
- Intel Extension for Transformers: https://github.com/intel/intel-extension-for-transformers (Apache 2.0 License)
- gemma.cpp: https://github.com/google/gemma.cpp (Apache 2.0 License and BSD-3-clause license)
You should check the current license status of these projects yourself. Many of them are not only free but also allow commercial usage; however, even permissive licenses have restrictions and obligations.
Open Source Models
There are numerous pre-trained LLMs available for free download under open source licenses. The best known such foundation model is Meta's Llama series of models, but there are many others with quite extensive capabilities. The main advantage of these models is obvious: you can avoid the expense of training your own model.
Typically, model files are uploaded to a repository website such as GitHub or Hugging Face. These are models whose weights have been trained using a variety of different data sets. In some cases, the models come with an engine platform, but in other cases you will need to use a standard engine.
Some models are licensed for research-only or other non-commercial purposes. Several model files have permissive licenses that allow any usage, including commercial purposes. For example, Meta's Llama model was first licensed for research-only, but they subsequently released Llama2 under a more permissive license.
There are also derivative models available for download, which are based on modifications made to the larger models. The most common are quantized models, where an original full-precision model with 32-bit float weights has been “quantized” down to smaller data types (e.g. 16-bit or 8-bit integers). However, there are other types of derivatives, such as smaller models trained on the outputs of larger models.
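To make the idea of quantization more concrete, here is a minimal sketch of symmetric 8-bit weight quantization. The function name and structure are purely illustrative, not from any particular engine; real quantizers add per-channel scales, careful rounding modes, and outlier handling.

```c++
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize float32 weights down to int8 using a single symmetric scale factor.
std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float& scale_out) {
    float max_abs = 0.0f;
    for (float w : weights)
        max_abs = std::max(max_abs, std::fabs(w));
    scale_out = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;  // map [-max_abs, +max_abs] to [-127, +127]

    std::vector<int8_t> quantized(weights.size());
    for (size_t i = 0; i < weights.size(); ++i)
        quantized[i] = static_cast<int8_t>(std::lround(weights[i] / scale_out));
    return quantized;
}
// Dequantization at inference time is approximately: weights[i] ~= quantized[i] * scale_out
```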
AI PCs
The main research area in relation to “AI PCs” is optimization of inference algorithms, so that the models can run fast enough. This includes execution of AI inference on CPU-only PCs and on PCs with only low-end GPUs. Training on PCs is a lower priority, because this can always be done offline in the cloud.
A desktop PC or laptop is more capable than a phone, so some of the problems with phones running AI inference are less problematic on a PC. Most obviously, a PC can have a decent GPU, which can then be used by AI engines (assuming you turn off your Minecraft server). Concerns about CPU usage, over-heating, and battery depletion are also less problematic for a PC than on a phone.
However, execution speed on a PC is still rather sluggish for large models, even on multi-thousand dollar PCs with powerful GPUs, so there is much research still to be done on optimization of inference. Large models are where the action is found in terms of AI functionality, so it may be that software developers are still using cloud server AI for some time to come. And certainly, training and fine-tuning workloads seem less likely to move down onto desktop PCs. However, “AI PCs” are already becoming available for everyday users and developers alike.
New C++ Language Features
Here's a summary of some C++ language features that are relevant to AI development. The longstanding features of C++ include:
- Bitwise operators: & (bitwise-and), | (bitwise-or), ^ (bitwise-xor), ~ (bitwise complement), since language inception.
- inline functions are fast (and it's actually true these days).
- Short-circuiting of the && and || operators is standard. Use it and abuse it.
- The ?: ternary operator also short-circuits.
- String literal concatenation of adjacent string constants (e.g. "a" "b" becomes "ab").
- Float versions of standard math functions (e.g. sqrtf, expf, logf, etc.)
- Float versions of numeric constants (e.g. 0.0f is float, whereas 0.0 is double).
- Persistent static local variables inside functions (often dangerous, but occasionally useful).
- Unsigned constants with "u" suffix (e.g. 1u is unsigned int versus 1 is int).
- __FILE__ and __LINE__ builtin preprocessor macros for source code filename and line number.
- Operator overloading of new and delete to intercept them at link-time.
- Variable-argument function definitions using va_start, va_arg, and va_end.
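Here is a small illustrative program (not from any engine) that exercises several of these longstanding features: an inline function, float math functions and constants, adjacent string literal concatenation, __FILE__/__LINE__, and a persistent static local variable.

```c++
#include <cstdio>
#include <cmath>

// inline function; expf is the float version of exp, and 0.5f is a float constant
inline float scaled_exp(float x) {
    static int call_count = 0;   // persistent static local (often dangerous, occasionally useful)
    ++call_count;
    printf("call #%d from " "file %s, line %d\n",   // adjacent string literals are concatenated
           call_count, __FILE__, __LINE__);
    return 0.5f * expf(x);
}

int main() {
    printf("scaled_exp(1.0f) = %f\n", scaled_exp(1.0f));
    printf("scaled_exp(2.0f) = %f\n", scaled_exp(2.0f));
    return 0;
}
```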
Recently Added C++ Features
The C++ language has evolved over many years, and has had a great many features added to it. Some of the newer features from C++ standardization that may be useful to know include:
- std::bitset library for bit sets and bit vectors.
- Builtin functions for MSVS and GCC compilers, including the intrinsic functions with access to machine-level instructions (e.g. the x86 instruction set).
- The "restrict" keyword for pointers (indicates non-aliasing, for better compiler auto-optimization). This is standard in C99 but not in standard C++, where it has been evolving over the years as compiler extensions with different keywords, such as __restrict.
- static_assert for compile-time assertions of constant expressions that trigger compiler errors.
- Variable-argument preprocessor macros with the ellipsis "..." and __VA_ARGS__ tokens (since C++11), such as to define your own debug macro version of printf. There's also the __VA_OPT__(args) method since C++20 for fixing some obscure syntax problems with using vararg macros.
- reinterpret_cast for fixing tricky casts that should be illegal, but you need what you need.
- The __func__ builtin identifier for the source code function name.
- std::stacktrace library for function call stack reporting (in C++23).
- GCC compiler non-standard #pragma directives such as: #pragma GCC unroll N
- Standard C++ random number generator libraries (<random>). The older rand and srand functions are discouraged.
- The register keyword is officially deprecated/removed, since C++17. Compilers don't need your help!
- Binary numeric literals with the 0b prefix (since C++14), analogous to 0x hexadecimals. The %b printf binary format specifier is a more recent addition (C23).
- The <=> three-way comparison (spaceship) operator (since C++20). My favorite one.
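The following small snippet (again illustrative, not engine code) combines several of these newer features: std::bitset, static_assert, binary literals, a variadic debug macro built on __VA_ARGS__, and the C++20 spaceship operator.

```c++
#include <bitset>
#include <compare>
#include <cstdio>
#include <string>

// Variadic debug macro: prepends the source file and line to a printf-style message.
#define DEBUG_PRINTF(fmt, ...)  printf("[%s:%d] " fmt, __FILE__, __LINE__, __VA_ARGS__)

struct Version {
    int major, minor;
    auto operator<=>(const Version&) const = default;   // three-way comparison (C++20)
};

int main() {
    static_assert(sizeof(float) == 4, "expected 32-bit floats");   // compile-time assertion

    unsigned mask = 0b1010'0001u;           // binary literal and digit separator (C++14)
    std::bitset<8> bits(mask);              // bit vector wrapper
    DEBUG_PRINTF("mask = %s\n", bits.to_string().c_str());

    Version a{1, 2}, b{1, 3};
    DEBUG_PRINTF("a < b: %d\n", (a < b) ? 1 : 0);   // '<' synthesized from operator<=>
    return 0;
}
```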
C++ Coding Strategy for AI
What's it like coding AI in C++? A short answer is “your brain will explode” because everything's different and at a scale you won't believe. But a longer answer is that although some of the optimizations get quite mind-bending, there are also many areas where it's just normal C++ coding applied to a new area of focus.
Some of the overarching characteristics of coding AI engines in C++ include:
- No if statements! Seriously, what is going on here? There is no logic like: “if this token, then do that vector, else do this vector.” All of the conditional logic is wrapped up in the weights, and rather than doing any testing, we just multiply all the weights by all the inputs. If you want conditional tests in the high-level control flow, that's only in the various “adaptive” optimization algorithms and research enhancements, because there's none in the vanilla Transformer model.
- Brute-force algorithm. Every single weight is used in every invocation of the model. The whole logic of AI inference propagates a vector of computations through a fixed sequence of N layers and a fixed number of components in each layer, without much branching logic at all. It is finite, without any unbounded looping, and the whole algorithm can be flattened and unrolled. Indeed, that's exactly what ML compilers do by creating a graph representation. (A minimal sketch of this branch-free pattern appears after this list.)
- Backend coding. The AI engine has no user interface other than accepting a text prompt as input, and outputting its response one word at a time. Image engines are a little different, but most of the execution is still batch backend work.
- Non-interactive. The way that an AI engine cranks through all the weights and layers is not interactive. From the outside, there isn't anything much to do but wait for the first token to come out from the decoder. On the inside, all of the coding is batch.
- Single platform. In a lot of AI projects, you can control the data center servers and GPU hardware, allowing you to optimize the C++ code for a single platform. This is also true if you're researching AI engines by running them locally on your own desktop.
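To illustrate the branch-free, brute-force flavor of inference described above, here is a minimal sketch of a layer-by-layer forward pass. The Layer structure and RELU activation are simplified placeholders for a real Transformer layer; note that there is no data-dependent if statement anywhere in the control flow.

```c++
#include <algorithm>
#include <utility>
#include <vector>

struct Layer {
    std::vector<std::vector<float>> weights;  // [out][in] weight matrix
    std::vector<float> bias;                  // [out] bias vector
};

std::vector<float> forward(const std::vector<Layer>& layers, std::vector<float> x) {
    for (const Layer& layer : layers) {            // fixed sequence of N layers
        std::vector<float> y(layer.bias);
        for (size_t i = 0; i < y.size(); ++i) {    // every weight is used, every invocation
            for (size_t j = 0; j < x.size(); ++j)
                y[i] += layer.weights[i][j] * x[j];
            y[i] = std::max(0.0f, y[i]);           // RELU activation, still no branching on tokens
        }
        x = std::move(y);
    }
    return x;
}
```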
Elements of AI in C++
You may need to refresh your knowledge about some of the basics of C++ coding. Some features are not as widely used in non-AI applications.
- Bitwise operations. In general C++ coding, you use the bitwise operators mostly for bit flags. However, that's on unsigned integers, whereas AI coding also uses bitwise logic on float types, which is trickier (a small sketch appears after this list). See Chapter 9.
- Floating-point numbers. If you're like me when I started, you don't really know how the floating-point stuff works. When I started with AI, I knew there was an exponent and a mantissa, but I wasn't solid on it. You need to fix that to code well for AI engines. See Chapter 9.
- Math. There's a lot more coding with “exp” and “log” type functions than some other coding jobs. For example, you need to know whether to use “exp” or “expf” variants of the math library functions (it's not that hard: float versions of math functions end with “f”).
- Statistics. Do you know your mean from your median? If I told you that probabilities cannot be negative, do you know whether I'm lying to you? AI engines work by computing the probabilities of the next output word. Hence, a lot of AI is about probability distributions, so you might want to dig out your dusty stats textbook. Oops, sorry, that's so 1900s. I meant thought-query your Neuralink.
Advanced AI in C++
There are also a lot of new things to learn when coding AI in C++. Here's a list of some of them:
- Vectorization. Many things you've learned about speeding up code in traditional sequential C++ are now wrong. Even half of your multi-threading or multi-process experience with parallel execution is irrelevant. You need to think “simple and parallel” with SIMD architectures on your mind. Modifying algorithms to parallelize properly on GPUs is called “vectorization.”
- Intrinsics. Optimizing AI engines means vectorizing code to work on hardware accelerators. The way you do that in C++ is to use the various intrinsic functions (aka “builtins”) in standard platforms or GPU integration libraries.
- Assembler. Sometimes even intrinsics aren't fast enough and you need to hard-code assembler into C++ source code using asm, __asm, or other directives.
- Data structures. The vector is the basis of AI inference, and it's really just an array. Matrices are the 2-D version with two-dimensional arrays. Tensors generalize this to three-dimensional (and higher) arrays. For optimizations, lookup tables and bit vectors are the go-to data structures. Hash tables are occasionally used, and also vector hashing. Binary trees and tries, not so much.
- Algorithms. Vector dot product and matrix multiplication are the low-level computational routines for tensors. Vectorizing algorithms for GPU parallelization is job number one. Many other algorithms are more mathematical (e.g. exp and log) or statistical (e.g. variance and standard deviation). (A sketch of a vectorized dot product using intrinsics appears after this list.)
- Memory management. Normal C++ CPU memory management is stretched to gigabyte-sized data structures. And the GPU memory management paradigm is very different to stack or heap memory. You tend to write huge blocks to the GPU, which remain in somewhat static structures with less swapping, and you don't ever access individual bytes.
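As one small example of intrinsics-based vectorization on the CPU side, here is a sketch of a dot product using x86 AVX intrinsics. It assumes an AVX-capable CPU (compile with -mavx on GCC/Clang) and that n is a multiple of 8, and it skips the remainder handling and FMA tricks a production kernel would use.

```c++
#include <immintrin.h>

float dot_product_avx(const float* a, const float* b, int n) {
    __m256 sum = _mm256_setzero_ps();                          // 8 running partial sums
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);                    // load 8 floats from each vector
        __m256 vb = _mm256_loadu_ps(b + i);
        sum = _mm256_add_ps(sum, _mm256_mul_ps(va, vb));       // multiply and accumulate
    }
    float partial[8];
    _mm256_storeu_ps(partial, sum);                            // reduce the 8 lanes horizontally
    float total = 0.0f;
    for (int i = 0; i < 8; ++i)
        total += partial[i];
    return total;
}
```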
Downsides of AI in C++
Have you ever noticed in your life that absolutely everything has a trade-off between fast and safe? I mean, think about everything you do: driving when your kid's late for underwater ballet class, jogging with a cup of coffee, eating Weetbix without milk, using a paint stripper to blow-dry your damp clothes, I could go on. I'm not going to say that C++ is like using a microwave to boil an egg, but you know there's a few problems, right?
Most of the issues with C++ are not specific to AI applications and are well-known. There are plenty of insidious traps in C++ pointers and memory allocation to keep debuggers busy for decades to come. The C++ standardization committees have gradually addressed some of the problems, but other things are so widely used and intractable that they won't be fixed. Improved tools are probably the mainstay of improvements here.
In addition, here's a list of a few other problems in AI C++ coding:
- 16-bit float flux. It's hard to have simple C++ support for any 16-bit floating-point types, such as FP16 (float16) or BF16 (brain float 16). The C++23 version has added standardization of various 16-bit float types, which means this concern will probably abate once all C++23 new features are widely available in compilers.
- Non-standardized hardware acceleration. Every hardware platform is different. It's tricky to do things in C++ like figuring out whether you have a GPU available and what hardware-acceleration features your current CPU has.
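For example, here is a minimal sketch of the C++23 16-bit float types from the <stdfloat> header. These types are optional, so the code checks the __STDCPP_FLOAT16_T__ and __STDCPP_BFLOAT16_T__ feature macros; you'll need a recent compiler in C++23 mode for this to build.

```c++
#include <cstdio>
#include <stdfloat>   // C++23 fixed-width floating-point types

int main() {
#if defined(__STDCPP_FLOAT16_T__) && defined(__STDCPP_BFLOAT16_T__)
    std::float16_t  half  = 1.5f16;    // IEEE 754 binary16 (FP16)
    std::bfloat16_t bhalf = 1.5bf16;   // brain float 16 (BF16)
    printf("fp16=%f bf16=%f\n", (double)half, (double)bhalf);
#else
    printf("No standard 16-bit float support on this compiler/target.\n");
#endif
    return 0;
}
```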
But never mind, all of these things are fixable with some elbow grease. It's just a small matter of coding, for which I estimate two weeks. And we wouldn't be together having this one-way discussion if coding it was all a piece of cake. So, hooray for AI in C++!
References on AI in C++
I didn't find any (!) books on LLMs and Transformer internals in C++; most books use Python at a higher level (if you like that kind of thing). But there are some books on neural networks and machine learning (ML) in C++ that I recommend:
- Kirill Kolodiazhnyi (2020), Hands-On Machine Learning with C++, Packt Publishing, May 2020, https://www.amazon.com/Hands-Machine-Learning-end-end/dp/1789955335 (And a second edition is forthcoming, due Nov 2024.)
- Venish Patidar (2022), Developers Guide for Building Own Neural Network Library, October 1, 2022, https://www.amazon.com/DEVELOPERS-BUILDING-NEURAL-NETWORK-LIBRARY/dp/B0BGNF1KK6/