Aussie AI

28. Deslugging AI Engines

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Time is an illusion. Lunchtime doubly so.”

— Douglas Adams, The Hitchhiker's Guide to the Galaxy, 1979.

 

 

Everything's Slower in AI

All of the modern open source AI models are named after animals. There's falcons and llamas and alpacas. Really there should only be one species: the slug. Every AI engine is a squirming beast, moving along at the slowest pace, slurring out its words a few hundred milliseconds at a time.

How do we make it faster? Well, that's kind of the 64 thousand dollar question of the moment, except we might need to apply a scaling factor to that number. The truth is that nobody really has the answer, but everyone's trying a lot of things, up and down the tech stack from silicon to software, both in the labs of the AI industry and in the research departments of universities.

An AI engine is a special type of application and in many ways it is not a “typical” C++ program. Although all of the various C++ inefficiencies could infest an AI engine with slugs, there are also several major non-typical bottlenecks that are specific to the AI architecture:

  • The multiplication operation, even when it's done on the GPU (usually inside a parallelized vector dot product, matrix multiplication, or tensor processing kernel); see the dot product sketch after this list.
  • Memory access cost of reading all the model weight data and marshaling it ready for the GPU.
  • Floating-point mathematical functions you'd happily forgotten existed (e.g. logf, expf, sqrtf).
  • High-level algorithmic inefficiencies in AI engines such as the “autoregressive” decoding algorithm.
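
To make the first of those points concrete, here's a minimal sketch of the kind of inner loop that dominates inference cost: a plain C++ vector dot product, the building block of matrix-vector and matrix-matrix multiplication (my own illustrative code, not an excerpt from any particular engine).

    // Vector dot product: the multiply-accumulate loop at the heart of inference.
    #include <cstddef>

    float vector_dot_product(const float* v1, const float* v2, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            sum += v1[i] * v2[i];   // one multiplication and one addition per weight
        }
        return sum;
    }

A model with billions of weights runs that multiply-accumulate billions of times per token, which is why so much of what follows is about doing it in parallel, doing it faster, or not doing it at all.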

Accuracy-Degrading Optimizations

The question is not only about speed, but about how to optimize an AI model and still have it be just as smart. You want to have your cake and eat it, too, I guess. But first let's look at some of the cake-eating optimizations that degrade the accuracy of the model.

Many types of AI optimization techniques result in a less capable model, with a trade-off between speedup (latency/throughput) and accuracy (perplexity). Here are some of the optimizations that will result in lower accuracy:

  • Model compression — quantization, pruning, distillation, weight sharing, weight clustering, etc.
  • Sparsity or low-rank matrix optimizations — LoRA/QLoRA.
  • Adaptive inference — early exit, dynamic pruning, layer fusion, etc.
  • Semantic caching — nearest-neighbor query caching with a vector database.
  • Approximations — of activation functions, matrix multiplication, arithmetic multiplication, etc.
  • Decoding algorithm optimizations — streamlined with approximation.
  • Linearized attention algorithms — local attention, Flash Attention.
  • Ensemble multi-model methods — mixture-of-experts, big-little, speculative decoding, cascades, wisdom of committees, swarms.
  • Stochastic optimization algorithms — intentional randomness.
  • Zero-multiplication models — e.g. bitshift, max-plus, or adder models; research-only.
  • Integer-only arithmetic models — usually quantization-related; research-only.
  • Advanced number system models — various obscure types; research-only.

The level of accuracy loss can vary greatly amongst these methods. For example, 4-bit integer quantization will be a lot worse than FP16 quantization with 16-bit “half precision” floating-point values, which will be very close in accuracy to the default “single precision” 32-bit float numbers. Where these methods are approximate, a lot can depend on the degree of precision involved in the approximation. For example, if you change the expf function in Softmax to use a 24-bit LUT, the precision will still be high and the error will only be in lower-order digits (i.e. tiny fractional errors), so the loss of model accuracy might be negligible.
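
As a sketch of what that kind of LUT might look like (my own illustration with assumed names, not the book's reference code), the table below precomputes expf for every float whose low 8 mantissa bits are truncated, giving a 2^24-entry table of roughly 64MB:

    // Approximate expf() via a 24-bit lookup table indexed by the top 24 bits
    // of the float's bit pattern (sign, exponent, and 15 mantissa bits).
    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    static std::vector<float> g_exp_lut;   // 1<<24 entries, ~64MB of floats

    void build_exp_lut() {
        g_exp_lut.resize(std::size_t(1) << 24);
        for (uint32_t i = 0; i < (1u << 24); ++i) {
            uint32_t bits = i << 8;          // low 8 mantissa bits zeroed
            float x;
            std::memcpy(&x, &bits, sizeof(x));
            g_exp_lut[i] = expf(x);
        }
    }

    inline float expf_lut(float x) {         // drop-in approximate replacement for expf
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        return g_exp_lut[bits >> 8];         // error only in the low-order mantissa bits
    }

The dropped mantissa bits only perturb the low-order digits of the result, which is why a change like this inside Softmax tends to have a negligible effect on overall model accuracy.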

Accuracy-Retaining Optimizations

Which AI engine optimizations are only about speed? We are looking for “pure speedups” that are only about the code and are therefore “smartness-retaining optimizations.” We want either shorter latency or greater throughput overall, but without any degradation in artificial braininess. Which techniques simply find faster ways to compute the exact same results?

Here's my short list of over-arching ideas:

  • Hardware optimizations (i.e. more and faster GPUs, and the CPUs, too).
  • Parallelization (multi-threading, multi-core, multi-CPU, multi-GPU, etc.).
  • Vectorization (chunking computations off to a GPU or CPU SIMD hardware intrinsic).
  • Pipelining optimizations (pipelined hardware or faster software scheduling optimizations; partitioning, marshaling, and dataflow optimizations).
  • Transformer component optimizations (non-approximate algorithm improvements or C++ speedups).
  • Memory access optimizations (e.g. contiguous memory, alignment, data locality, cache management, prefetching, offloading).

And here's some component-level methods:

  • Kernel optimizations (e.g. tiling to reorder memory accesses, kernel fission for vectorization, operator reordering, C++ coding improvements).
  • Kernel fusion (merging two sequential operations into one combined C++ kernel, which is faster).
  • MatMul/GEMM kernel optimizations (i.e. the same multiplication computations, reordered to be faster).
  • KV caching (avoiding a well-known computation redundancy).
  • Zero padding avoidance and zero skipping (don't do redundant calculations on zeros).
  • Precalculation of exact results (non-approximate).
  • Logarithmic number system models (research-only).

The AI engine isn't the only slow-down. Other ways to speed up the overall architecture include:

  • Deployment architecture optimizations (e.g. web server, application logic server, hosting boxes, etc.)
  • Inference cache (identical results for identical queries).
  • Scheduling optimizations (across multiple AI engine servers).
  • ML compilers (depending on how you use them, e.g. with no pruning applied).

General coding improvements can be used for faster AI:

  • C++ intrinsics with underlying hardware support (e.g. AVX; see the sketch after this list).
  • Assembly language (i.e. deep inside C++ kernels).
  • Floating-point calculation speedups (e.g. FTZ/DAZ CPU modes).
  • General C++ code optimizations (e.g. C++ tricks, loop optimizations, LUTs, etc.).
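
As an example of the intrinsics idea, here's the earlier dot product rewritten with AVX2/FMA intrinsics (a sketch assuming an x86-64 CPU with AVX2 and FMA support, compiled with something like -mavx2 -mfma; not a production kernel):

    // Dot product processing 8 floats per iteration using AVX2 + FMA.
    // Assumes n is a multiple of 8.
    #include <immintrin.h>
    #include <cstddef>

    float dot_product_avx2(const float* v1, const float* v2, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 a = _mm256_loadu_ps(v1 + i);
            __m256 b = _mm256_loadu_ps(v2 + i);
            acc = _mm256_fmadd_ps(a, b, acc);   // acc += a * b, 8 lanes at once
        }
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);           // horizontal sum of the 8 lanes
        float sum = 0.0f;
        for (int j = 0; j < 8; ++j) sum += lanes[j];
        return sum;
    }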

Some of the items listed above may be miscategorized, or there may be sub-variants that retain or lose accuracy. I've tried to categorize them to the best of my ability, but it's hard to know all the cases from the many research papers (literally thousands!).

Transformer Architecture Choices

There are various architectural decisions that are made in the model design phase, which aren't really optimizations of a model, but can significantly impact its efficiency. Using a more advanced engine architecture is also effectively an optimization that “retains” accuracy because these changes allow the model to be fully trained in a better engine. Some important decisions include:

  • Decoder-only versus encoder-decoder architectures
  • Alternative floating-point representations (e.g. brain float)
  • Pre-norm versus post-norm
  • Positional encoding algorithms (embeddings)
  • Context length optimizations
  • Neural Architecture Search (NAS)

Data doesn't just magically end up in the GPU. There has to be software written to send the data there, and there are a lot of possible optimizations that are used in writing such software. This software is often called the “kernel” of the AI engine. The sub-components of the engine often get called the MatMul kernel, Softmax kernel, normalization kernel, and so on. Software techniques that aim to improve parallelization, and thereby increase throughput and reduce latency, include:

  • Vectorization
  • Multi-threading
  • Kernel fusion
  • Kernel fission
  • Pipelining
  • Scheduling algorithms
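
To make the “kernel” idea concrete, here's what a toy Softmax kernel might look like in plain, unvectorized C++ (a sketch of the concept only):

    // Toy Softmax kernel: exponentiate the logits and normalize to probabilities.
    // Subtracting the maximum first is the usual trick to avoid overflow in expf.
    #include <cmath>
    #include <algorithm>
    #include <vector>

    void softmax_kernel(std::vector<float>& v) {
        if (v.empty()) return;
        float vmax = *std::max_element(v.begin(), v.end());
        float denom = 0.0f;
        for (float& x : v) {
            x = expf(x - vmax);
            denom += x;
        }
        for (float& x : v) x /= denom;
    }

The techniques in the list above are all about loops like these: vectorizing them, fusing adjacent ones, splitting them apart for better scheduling, or running them across threads.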

Memory usage optimizations: Software optimizations that aim to improve memory usage, and thereby also lower memory access overhead and increase parallelism, include:

  • Tiling (see the matrix multiplication sketch after this list)
  • Data locality optimizations
  • Dataflow optimizations
  • Memory management optimizations
  • Cache management
  • Prefetching
  • Offloading
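
As an example of the first of those, here's a sketch of tiling applied to a square matrix multiplication (illustrative C++ with an assumed tile size, not a tuned GPU kernel):

    // Tiled (blocked) matrix multiply: C += A * B for n x n row-major matrices.
    // Working on TILE x TILE blocks keeps each block of A and B in cache while
    // it is being reused, improving data locality.
    #include <cstddef>
    #include <algorithm>

    constexpr std::size_t TILE = 32;   // chosen to suit the cache size (assumption)

    void matmul_tiled(const float* A, const float* B, float* C, std::size_t n) {
        for (std::size_t ii = 0; ii < n; ii += TILE)
            for (std::size_t kk = 0; kk < n; kk += TILE)
                for (std::size_t jj = 0; jj < n; jj += TILE)
                    for (std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                        for (std::size_t k = kk; k < std::min(kk + TILE, n); ++k) {
                            float a = A[i * n + k];
                            for (std::size_t j = jj; j < std::min(jj + TILE, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

Note that the arithmetic is identical to the naive triple loop; only the order of the memory accesses changes, which is why tiling is an accuracy-retaining optimization.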

Hybrid AI Engine Optimizations

Hybrid optimization approaches are those that combine two or more optimization techniques. Many of the optimization techniques can be combined in various ways. In research-speak, we say that they are “orthogonal” to each other.

Common hybrid approaches use the major techniques of quantization and pruning together. You can use either unstructured weight pruning (magnitude pruning) or structural pruning and then quantize the model. Or you can quantize the model first, and then do dynamic pruning at runtime (e.g. early exit of layers). Another hybrid optimization is that you can use distillation to train a smaller model (from a bigger teacher model), and then apply quantization and/or pruning to the resulting distilled model.
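
Here's a rough sketch of that first hybrid, with magnitude pruning followed by symmetric INT8 post-training quantization of a weight vector (my own simplified illustration, with hypothetical names and a naive quantization scheme):

    // Hybrid model compression sketch: prune small weights, then quantize to INT8.
    #include <cstdint>
    #include <cmath>
    #include <vector>
    #include <algorithm>

    struct QuantizedWeights {
        std::vector<int8_t> q;   // quantized weights
        float scale;             // dequantize as: w ~= q * scale
    };

    QuantizedWeights prune_then_quantize(std::vector<float> w, float prune_threshold) {
        // Step 1: magnitude pruning -- zero out weights below the threshold.
        for (float& x : w)
            if (std::fabs(x) < prune_threshold) x = 0.0f;

        // Step 2: symmetric INT8 quantization of the surviving weights.
        float max_abs = 0.0f;
        for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
        QuantizedWeights out;
        out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        out.q.reserve(w.size());
        for (float x : w)
            out.q.push_back(static_cast<int8_t>(std::lround(x / out.scale)));
        return out;
    }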

As another example, it is possible to do structural pruning on three orthogonal dimensions: depth, width, and length. Depth pruning is layer-wise pruning such as early exit, layer skipping, or layer fusion. Width pruning reduces the “width” of the model, using techniques such as attention head pruning or channel/filter pruning. Length pruning refers to the “length” of the input stream of tokens, with shortening optimizations such as prompt compression, token pruning, or embedding pruning. Combining two of these is “dual pruning,” such as depth and width pruning combined. The combination of all three in “triple pruning” is not something that I've seen in a research paper yet, but there's no theoretical obstacle that blocks it from being done.

A lot of the basic C++ speedups discussed in other chapters neither depend upon nor affect other optimizations. Similarly, faster C++ code for a Transformer component (e.g. MatMul, Softmax, activation functions, etc.), where it isn't an approximation, offers genuine improvement without any other impacts on the model or the engine.

Model Compression

Model compression is the general technique of optimizing AI inference by changing the model to be smaller, usually at the cost of some degree of accuracy. This can mean using fewer weights (e.g. pruning) or using weights that are smaller-size data types (e.g. quantization). The main established techniques with a longstanding body of research include:

  • Quantization of models. This is a well-known method whereby high-precision 32-bit floating-point weights are replaced with lower-precision data, such as 16-bit floating-point numbers, or often integers to allow faster integer multiplication. Integer quantization has been researched for 8-bit integers all the way down to 4-bit, 3-bit, 2-bit and even 1-bit (binary) integer representations. Quantization of a pre-trained model for faster inference is called Post-Training Quantization (PTQ). It is also possible to use quantization during model training using Quantization-Aware Training (QAT) algorithms. Quantization not only improves speed, but also reduces the model size for storage or transmission (e.g. an 8-bit quantization of a 32-bit floating-point model reduces storage size by a factor of four).
  • Pruning (static and dynamic). This technique involves optimizing away the weighted links in LLM models, such as “magnitude pruning,” which cuts links with a very low weight value (indicating a feature that contributes little). If there are fewer model links, fewer multiplications are required. See research on model pruning and Transformer-specific pruning ideas.
  • Layer pruning / layer compaction (“pancaking”). This is conceptually a generalization of pruning, whereby whole model layers are pruned. Models typically involve multiple hidden layers of weights and nodes, which perform the same underlying inference algorithms with different sets of weights. The layers can sometimes be removed without significant loss of model accuracy, resulting in proportional speedups.
  • Knowledge distillation. Training a smaller “student” model to be similar to a larger “teacher” model. This is a method where a large pre-trained model is used to train a smaller, more efficient model. The smaller model is then used for faster inference in the application.

There are several other less-used types of model compression, such as parameter sharing, layer fusion, and weight clustering.

ML Compilers

ML compilers are a way to optimize the execution of an AI model. Conceptually, you can give your model file to an ML compiler, and it will produce an optimized inference engine for that model. This type of “AI compiler” is not a C++ compiler, but there are parallels, and there's still a boatload of C++ happening under the covers. Like a programming language compiler, the ML compiler works “offline” to optimize the model so that it is faster at “runtime” (i.e. inference for users).

How do ML compilers work? Imagine the entire execution of the whole stack of your AI engine on a large LLM has been “flattened” and “unrolled” completely. Conceptually, that's what an ML compiler does. The result is a “graph” data structure that represents the execution of the LLM with all its data. Hence, ML compilers are sometimes called “graph compilers.”

ML compilers are lumbering beasts that do a lot of work up front to make things go faster later. The steps in ML compiler execution are conceptually:

    1. Load the model data.

    2. Create an “intermediate representation” (IR) of the model (usually a graph).

    3. Optimize away on the graph IR.

    4. Create a final version ready to run on a given computer architecture.

ML compilers can have various features that change what effect they have on the model. The final output version can be exactly the same as the original input model, just faster, or it might do model-changing optimizations (e.g. pruning small magnitude weights). The ML compiler does its optimizations on the final model, trying to reduce its inference latency. Note that the “final version” is not a new model file, but is a combination of code and data that can be executed on the chosen target platform (i.e. so as to use that model for inference on a given platform). It is conceptually similar to serializing your entire model, and the inference engine kernels too, into a single massive C++ program.

Graphs. You'll have to dust off your memories of “graph algorithms” from third-year Computer Science to understand any research papers on ML compilers (e.g. depth-first search versus breadth-first search). The graph representation of an AI model has “nodes,” which represent actions being done, and “arcs,” which represent data being sent from the output of one node to the input of the next node. Simplifying considerably, a node might be “vector dot product” or “add bias vector” or “Softmax calculation.” A node is not typically a huge tensor operation, because the goal is to flatten out the big tensors into individual matrices and then down to the vector level, so as to search for optimizations in the huge graph. Conceptually, the arcs send a matrix or vector of data between nodes.
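
A sketch of what such a graph IR might look like as a C++ data structure (illustrative field and type names only, not from any particular ML compiler):

    // Simplified graph IR for an ML compiler. Each node is one operation, and
    // its inputs are arcs from the nodes whose outputs it consumes. The whole
    // graph is a directed acyclic graph (DAG) stored in topological order.
    #include <cstddef>
    #include <string>
    #include <vector>

    enum class OpType { MatVecMul, AddBias, Softmax, LayerNorm, Gelu, ConstZero };

    struct GraphNode {
        OpType op;                          // what this node computes
        std::vector<std::size_t> inputs;    // indices of predecessor nodes (incoming arcs)
        std::vector<std::size_t> shape;     // shape of the node's output matrix/vector
        std::vector<float> const_data;      // weights known at ML-compile-time (empty if dynamic)
        std::string name;                   // for diagnostics
    };

    struct Graph {
        std::vector<GraphNode> nodes;
    };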

Finiteness. The graph that represents your model is finite, albeit very large, because the number of layers in a Transformer is fixed, and each layer cranks through a finite number of tensor operations, and the amount of data in a model is finite. Admittedly, it's a weird type of “finite” that is so large as to be beyond human ability to understand the model data. Nevertheless, ML compilers create a massive but finite graph representation of your model and then crank away on it looking for optimization transformations.

Acyclic graph. In fact, the graph representation of a model is a finite “directed acyclic graph” (DAG) because all of the operations progress forward. There are no loops or cycles going backwards in the encoder or decoder architectures. The only loop is “outside” of the graph, where this entire graph of the model's encoder-decoder structure can be repeatedly executed by the autoregressive decoding algorithm (see Chapter 26 for decoding algorithms). Hence, the production of a single token from the AI engine doesn't have a loop backwards, and the graph is acyclic, which means the ML Compiler can use various graph optimization algorithms. The graph represents the sequence through all layers of an encoder and/or decoder, and each layer is a sub-graph where input data propagates through various nodes that do operations (e.g. MatMul, normalization, etc.), until a final output vector is produced.

Graph Optimizations. The ML compiler improves speed by using graph algorithms to find ways to optimize execution. Methods include:

  • Algebraic optimizations
  • Kernel fusion
  • Parallelization algorithms

Algebraic optimizations can be performed at a single node in the graph, or sometimes in a pair of nodes. As an example of a single-node algebraic optimization, since all the model's weight data is known at ML-compile-time, if there is a weight vector that is all zeros, the ML compiler can detect this redundant operation and nullify it. Some kinds of optimizations can propagate through the graph, analogous to sequential “constant propagation” algorithms in programming language compilers, but where each node is usually a vector operation.
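
Building on the toy Graph structure sketched above, that zero-weight check might look something like this (again illustrative only):

    // Algebraic graph optimization sketch: a multiply by an all-zero constant
    // weight vector always yields zero, so the node can be replaced by a
    // constant-zero node and its multiplications skipped entirely.
    #include <algorithm>

    bool is_all_zero(const GraphNode& node) {
        return !node.const_data.empty() &&
               std::all_of(node.const_data.begin(), node.const_data.end(),
                           [](float w) { return w == 0.0f; });
    }

    void nullify_zero_multiplies(Graph& g) {
        for (GraphNode& node : g.nodes) {
            if (node.op == OpType::MatVecMul && is_all_zero(node))
                node.op = OpType::ConstZero;   // downstream passes treat this as a constant zero
        }
    }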

Kernel fusion is the merging of two operation nodes into a combined single operation node. There are many combinations of operations that occur in sequence across the graph, and combining the code for two operators can be faster. For example, if there's a matrix-vector multiply node followed by a RELU node, these can be “fused” into a single node that does matrix-vector-multiply-RELU as one operation.
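
A sketch of that specific fusion in plain C++ (unvectorized, just to show the structure):

    // Kernel fusion sketch: matrix-vector multiply fused with RELU. Instead of
    // writing out the full result vector and then re-reading it in a separate
    // RELU kernel, the activation is applied as each output element is produced.
    #include <cstddef>
    #include <algorithm>

    void matvec_relu_fused(const float* W,   // m x n weight matrix, row-major
                           const float* x,   // input vector, length n
                           float* y,         // output vector, length m
                           std::size_t m, std::size_t n) {
        for (std::size_t i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < n; ++j)
                sum += W[i * n + j] * x[j];
            y[i] = std::max(sum, 0.0f);   // RELU fused into the same loop
        }
    }

The fused version saves a full pass over the output vector in memory, which matters precisely because these kernels are usually memory-bound rather than compute-bound.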

Parallelization optimizations at a low-level are inherent to the various algebraic and kernel fusion optimizations done by the ML compiler. However, the compiler can also do high-level parallelization, because it can know from the graph's ordering which operations depend on each other, in what sequence, and which ones can be done independently in parallel. Hence, an ML compiler can use advanced scheduling algorithms to efficiently parallelize node operations and marshal data into the GPU (or multiple GPUs) in a way that keeps the pipeline full (no “bubbles”).

AI Memory Reduction

Memory reduction optimizations involve using less memory during model inference. This means that inference requires fewer resources, and it can also reduce processing time, because less data is being swapped in and out of memory. Memory optimization can refer to optimizing either CPU memory (RAM) or GPU memory (VRAM), or both.

Much research reports that model inference is memory-bound rather than CPU-bound or GPU-bound, with the processors and GPU often underutilized because they are waiting to receive the data from memory (or should I say they're “weighting”?). In such cases, memory management is key to improving latency and throughput.

On the other hand, researchers have also examined increasing memory usage to save time by caching and computation reuse. Actually, you can do both: caching at the front-end wards off the common queries, while behind that the harder queries rely on memory management techniques in the engine to run faster.
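
Here's a minimal sketch of the front-end part: an exact-match inference cache keyed on the prompt text (illustrative only; the semantic nearest-neighbor variant needs a vector database, and a real cache would also bound its size and evict old entries):

    // Minimal exact-match inference cache: identical queries return the stored
    // response without re-running the AI engine.
    #include <string>
    #include <unordered_map>

    class InferenceCache {
    public:
        bool lookup(const std::string& prompt, std::string& response_out) const {
            auto it = cache_.find(prompt);
            if (it == cache_.end()) return false;
            response_out = it->second;
            return true;
        }
        void store(const std::string& prompt, const std::string& response) {
            cache_[prompt] = response;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };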

Model Compression Techniques: The main class of optimizations that reduces memory requirements by making the model smaller is called “model compression.” These methods reduce both memory size and computation time by making the model smaller. The “big three” of well-known and often used model compression methods are: quantization, pruning, and knowledge distillation. However, there are several other types of model compression that result in a light-weight model: weight sharing, layer fusion, sparsity (sparse matrices), low-rank matrices, and weight clustering.

Recomputation: Trading Time for Space: On memory-constrained devices, it is possible to reduce space requirements at the cost of extra processor time. This is called “recomputation,” or sometimes in research papers it is called “rematerialization” or “checkpointing.” When this is used to optimize training of a model that is too large to fit inside GPU memory, it is called “gradient checkpointing.” The portion of this algorithm that involves swapping tensors off the GPU back to the CPU is often called “offloading.”

Model Pre-loading: The above points about recomputation show the best way to optimize memory accesses: avoid doing them at all for the second query, by preloading the entire model into the GPU memory cache. Hence, the GPU already has the whole model available for any followup inference queries, or indeed for the next token computation in an autoregressive algorithm. Ideally, the GPU needs zero accesses to main RAM during inference.

For a large model, it isn't possible to store everything in a single GPU, but a lot can still be done to optimize a multi-GPU architecture where part of the model is stored in each GPU's cache. A lot varies with the particular CPU and GPU capabilities of the architecture. Judicious management of the various memory cache levels can help reduce the lag time from a memory-bound algorithm.

 
