ML Compilers
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
ML compilers are a way to optimize the execution of an AI model. Conceptually, you can give your model file to an ML compiler, and it will produce an optimized inference engine for that model. This type of “AI compiler” is not a C++ compiler, but there are parallels, and there's still a boatload of C++ happening under the covers. Like a programming language compiler, the ML compiler works “offline” to optimize the model so that it is faster at “runtime” (i.e. inference for users).
How do ML compilers work? Imagine that the entire execution of your AI engine's whole stack, running a large LLM, has been “flattened” and “unrolled” completely. Conceptually, that's what an ML compiler does. The result is a “graph” data structure that represents the execution of the LLM with all its data. Hence, ML compilers are sometimes called “graph compilers.”
ML compilers are lumbering beasts that do a lot of work up front to make things go faster later. The steps in ML compiler execution are conceptually:
1. Load the model data.
2. Create an “internal representation” (IR) of the model (usually a graph).
3. Optimize away on the graph IR.
4. Create a final version ready to run on a given computer architecture.
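To make these stages concrete, here is a minimal C++ sketch of the pipeline. All of the type and function names (ModelData, build_ir, generate_target, and so on) are hypothetical, invented for illustration; they are not the API of any particular ML compiler.

// Hypothetical sketch of the four ML-compiler stages (illustrative names only).
#include <iostream>
#include <string>
#include <vector>

struct ModelData { std::vector<float> weights; };  // raw data from the model file
struct Graph { int num_nodes = 0; };               // placeholder for the graph IR

ModelData load_model(const std::string& path) {    // 1. Load the model data
    std::cout << "Loading " << path << "\n";
    return ModelData{};
}
Graph build_ir(const ModelData& model) {           // 2. Create the graph IR
    return Graph{ /*num_nodes=*/1000 };
}
void optimize(Graph& g) {                          // 3. Optimize the graph IR
    // e.g. algebraic simplifications, kernel fusion, scheduling
}
void generate_target(const Graph& g, const std::string& target) {
    std::cout << "Emitting code+data for " << target << "\n";  // 4. Final version
}

int main() {
    ModelData model = load_model("model.bin");
    Graph ir = build_ir(model);
    optimize(ir);
    generate_target(ir, "gpu");
}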
ML compilers can have various features that change what effect they have on the model. The final output can behave exactly the same as the original input model, just faster, or the compiler might apply model-changing optimizations (e.g. pruning small-magnitude weights). Either way, the ML compiler optimizes the final model to reduce its inference latency. Note that the “final version” is not a new model file, but is a combination of code and data that can be executed on the chosen target platform (i.e. so as to use that model for inference on a given platform). It is conceptually similar to serializing your entire model, and the inference engine kernels too, into a single massive C++ program.
Graphs. You'll have to dust off your memories of “graph algorithms” from third-year Computer Science to understand any research papers on ML compilers (e.g. depth-first search versus breadth-first search). The graph representation of an AI model has “nodes” that represent operations being done, and “arcs” representing data being sent from the output of one node to the input of the next node. Simplifying considerably, a node might be “vector dot product” or “add bias vector” or “Softmax calculation.” A node is not typically a huge tensor operation, because the goal is to flatten the big tensors into individual matrices and then down to the vector level, so as to search for optimizations in the huge graph. Conceptually, the arcs send a matrix or vector of data between nodes.
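As a data structure, a node can be as simple as an operation type plus a list of incoming arcs. The following C++ sketch is one hypothetical representation (real ML compilers use much richer IR node types):

// Hypothetical node in a model graph: an operation plus its incoming arcs.
#include <string>
#include <vector>

enum class OpType { MatVecMul, AddBias, ReLU, Softmax };

struct Node {
    OpType op;                   // the operation this node performs
    std::vector<int> inputs;     // indices of nodes whose outputs feed in (arcs)
    std::vector<float> weights;  // constant data, e.g. a weight matrix (row-major)
};

struct ModelGraph {
    std::vector<Node> nodes;     // arcs are implied by each node's input list
};

Storing arcs as input indices makes the data-flow dependencies explicit, which is exactly what the graph optimization passes traverse.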
Finiteness. The graph that represents your model is finite, albeit very large, because the number of layers in a Transformer is fixed, and each layer cranks through a finite number of tensor operations, and the amount of data in a model is finite. Admittedly, it's a weird type of “finite” that is so large as to be beyond human ability to understand the model data. Nevertheless, ML compilers create a massive but finite graph representation of your model and then crank away on it looking for optimization transformations.
Acyclic graph. In fact, the graph representation of a model is a finite “directed acyclic graph” (DAG) because all of the operations progress forward. There are no loops or cycles going backwards in the encoder or decoder architectures. The only loop is “outside” of the graph, where this entire graph of the model's encoder-decoder structure can be repeatedly executed by the autoregressive decoding algorithm (see Chapter 26). Hence, the production of a single token from the AI engine doesn't have a loop backwards, and the graph is acyclic, which means the ML compiler can use various graph optimization algorithms. The graph represents the sequence through all layers of an encoder and/or decoder, and each layer is a sub-graph where input data propagates through various nodes that do operations (e.g. MatMul, normalization, etc.), until a final output vector is produced.
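Acyclicity is what makes the standard graph algorithms applicable. For example, a topological sort of the DAG yields a valid execution order for all the nodes. Here is a minimal sketch using Kahn's algorithm, with a slimmed-down hypothetical node representation:

// Kahn's algorithm: a topological sort is only possible because the
// graph is acyclic; the result is a valid sequential execution order.
#include <queue>
#include <vector>

struct Node { std::vector<int> inputs; };        // arcs into this node
struct ModelGraph { std::vector<Node> nodes; };

std::vector<int> topological_order(const ModelGraph& g) {
    size_t n = g.nodes.size();
    std::vector<int> indegree(n);
    std::vector<std::vector<int>> consumers(n);  // out-edges per node
    for (size_t i = 0; i < n; ++i) {
        indegree[i] = (int)g.nodes[i].inputs.size();
        for (int in : g.nodes[i].inputs)
            consumers[in].push_back((int)i);
    }
    std::queue<int> ready;
    for (size_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push((int)i);   // the graph's inputs
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.front(); ready.pop();
        order.push_back(v);
        for (int c : consumers[v])
            if (--indegree[c] == 0) ready.push(c);  // all inputs now done
    }
    return order;  // if order.size() < n, the graph had a cycle
}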
Graph Optimizations. The ML compiler improves speed by using graph algorithms to find ways to optimize execution. Methods include:
- Algebraic optimizations
- Kernel fusion
- Parallelization algorithms
Algebraic optimizations can be performed at a single node in the graph, or sometimes in a pair of nodes. As an example of a single-node algebraic optimization, since all the model's weight data is known at ML-compile-time, if there is a weight vector that is all zeros, the ML compiler can detect this redundant operation and nullify it. Some kinds of optimizations can propagate through the graph, analogous to sequential “constant propagation” algorithms in programming language compilers, but where each node is usually a vector operation.
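A sketch of that zero-weight example in C++: because the weights are constants at ML-compile-time, the compiler can simply scan each node's weight data and mark any all-zero linear node as dead. The names here are hypothetical:

// Algebraic simplification at compile time: an all-zero weight vector
// makes a linear node's output all zeros, so the node is redundant.
#include <algorithm>
#include <vector>

struct Node {
    std::vector<float> weights;
    bool dead = false;           // marked for removal by the optimizer
};

void nullify_zero_nodes(std::vector<Node>& nodes) {
    for (Node& n : nodes) {
        bool all_zero = std::all_of(n.weights.begin(), n.weights.end(),
                                    [](float w) { return w == 0.0f; });
        if (all_zero && !n.weights.empty())
            n.dead = true;       // downstream passes can propagate this
    }
}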
Kernel fusion is the merging of two operation nodes into a combined single operation node. There are many combinations of operations that occur in sequence across the graph, and combining the code for two operators can be faster. For example, if there's a matrix-vector multiply node followed by a RELU node, these can be “fused” into a single node that does matrix-vector-multiply-RELU as one operation.
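A simplified scalar version of that fused kernel in C++ (a real kernel would be vectorized and GPU-resident, but the fusion idea is the same):

// Unfused: two passes over the output vector (two kernel launches on a GPU).
// Fused: ReLU is applied as each dot product completes, so the intermediate
// vector never makes a round trip through memory.
#include <vector>

std::vector<float> matvec_relu(const std::vector<float>& mat,  // m x n, row-major
                               const std::vector<float>& vec,  // length n
                               int m, int n) {
    std::vector<float> out(m);
    for (int i = 0; i < m; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += mat[i * n + j] * vec[j];
        out[i] = sum > 0.0f ? sum : 0.0f;  // ReLU fused into the same loop
    }
    return out;
}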
Parallelization optimizations at a low-level are inherent to the various algebraic and kernel fusion optimizations done by the ML compiler. However, the compiler can also do high-level parallelization, because it can know from the graph's ordering which operations depend on each other, in what sequence, and which ones can be done independently in parallel. Hence, an ML compiler can use advanced scheduling algorithms to efficiently parallelize node operations and marshal data into the GPU (or multiple GPUs) in a way that keeps the pipeline full (no “bubbles”).
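One simple scheduling idea is to group nodes into “waves” by dependency depth: every node in a wave depends only on nodes in earlier waves, so all nodes within a wave can run in parallel. A sketch, assuming the slim hypothetical node representation from earlier and nodes already stored in topological order:

// Assign each node a wave number = 1 + the maximum wave of its inputs.
// Nodes sharing a wave have no dependencies on each other and can be
// dispatched to the GPU (or multiple GPUs) concurrently.
#include <algorithm>
#include <vector>

struct Node { std::vector<int> inputs; };  // indices of producer nodes

std::vector<int> wave_schedule(const std::vector<Node>& nodes) {
    std::vector<int> wave(nodes.size(), 0);
    for (size_t i = 0; i < nodes.size(); ++i)
        for (int in : nodes[i].inputs)     // inputs precede i (topological order)
            wave[i] = std::max(wave[i], wave[in] + 1);
    return wave;  // schedule: run all wave-0 nodes, then wave-1, and so on
}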