
Deep Learning Compiler Optimization Techniques

  • Last Updated 22 November, 2024
  • by David Spuler, Ph.D.

Deep Learning Compilers are a specialized type of software tool, used by ML software engineers, that takes an AI model and generates optimized code to run that model's inference. They go by various alternative names, such as graph compilers, ML compilers, model compilers, or just "AI compilers". These compilers are lower-level software than the common AI frameworks, and they are unrelated to the traditional programming language compilers used for everyday programming in languages such as C++ or Python.

The input to a deep learning compiler is the model, and the output is runnable code that implements that model on a particular computing platform. Typically, these compilers first convert the model into an intermediate representation (IR), which can then be compiled for multiple target platforms. These compilers are often an important part of running models on an edge device.
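For example, here is a minimal sketch of invoking one such compiler stack via PyTorch 2.x's torch.compile, which captures the model's graph and generates optimized kernels for the current platform; the toy model and shapes are illustrative assumptions, not from any particular documentation:

    import torch
    import torch.nn as nn

    # A tiny toy model standing in for a real LLM (illustrative only).
    model = nn.Sequential(
        nn.Linear(512, 2048),
        nn.ReLU(),
        nn.Linear(2048, 512),
    )

    # Ask the compiler to capture the model's computation graph and generate
    # optimized code for the current platform (CPU or GPU backend).
    compiled_model = torch.compile(model)

    # Inference now runs through the compiled code path.
    x = torch.randn(1, 512)
    with torch.no_grad():
        y = compiled_model(x)
    print(y.shape)  # torch.Size([1, 512])

Other compiler stacks (e.g., ONNX-based or vendor-specific toolchains) follow the same general pattern: model in, intermediate representation, platform-specific code out.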

The Graph Nature of AI Computation

The Transformer architectures that run the inference and training of LLMs have a few peculiar characteristics:

  • No looping.
  • No alternative pathways (i.e., no "if" statements)

Hence, when you split out all the ways that the input tokens and the weights are processed, you can note that:

  • It's finite
  • It's a fixed sequence

The input tokens enter the pathway as a vector of tokens, which is then converted to a vector-per-token representation called embeddings. These embedding values propagate through the Transformer components as "activations" (i.e., the intermediate computed values), and there is a huge amount of computation from multiplying them by all of the weights. But there are very few points where the computation can take a different pathway. Generally speaking, the fixed pathway looks like this (see the sketch after the list):

  • Text converted to token vectors
  • Each token converts to a vectorized embedding
  • Each set of embedding vectors propagated through multiple layers
  • Each layer contains standard subcomponents (i.e., attention modules, activation functions, and feed-forward networks).
  • After all layers, the computed values ("logits") are converted to a token.
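
As a rough illustration of how fixed this pathway is, here is a minimal NumPy sketch of a single forward pass; the shapes, layer count, random weights, and simplified "attention" step are all assumptions for illustration, not a real Transformer:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, d, num_layers = 1000, 64, 4               # assumed toy sizes

    embed = rng.standard_normal((vocab, d))          # embedding table
    W_attn = rng.standard_normal((num_layers, d, d)) # stand-in "attention" weights
    W_ffn = rng.standard_normal((num_layers, d, d))  # stand-in feed-forward weights
    unembed = rng.standard_normal((d, vocab))        # projection back to vocabulary logits

    tokens = np.array([1, 42, 7])     # step 1: the text, already tokenized
    x = embed[tokens]                 # step 2: each token becomes an embedding vector

    for layer in range(num_layers):   # step 3: propagate through every layer, in order
        x = x @ W_attn[layer]         #   simplified stand-in for the attention module
        x = np.maximum(x, 0.0)        #   activation function (ReLU)
        x = x @ W_ffn[layer]          #   simplified stand-in for the feed-forward network

    logits = x[-1] @ unembed          # step 4: logits used to choose the output token

Note that the whole pass is a fixed sequence of matrix operations: nothing in the data changes which operations run.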

Where's the "if" statement? Well, there's very few. The main one is at the last step, whereby the output token is chosen, which is called the "decoding algorithm" (e.g., greedy decoding always chooses the highest probability token/word to output).

Anyway, the point of all this is that every single step along the pathway is fixed. And if you write out all the pathways, with "nodes" for each subcomponent, what you end up with is, ta da, a graph.

In fact, it has these properties:

  • Finite (fixed number of layers with a fixed number of subcomponents).
  • No cycles (because there are no loops).

Hence, it's a finite, directed acyclic graph (DAG).

Yes, I know, there are exceptions. The whole engine does loop back at the end of outputting a token, restarting the layers to compute the next token (called autoregressive decoding). Also, some optimizations of this process add selection tests: early exit optimizations add a test after one or more layers, and KV caching optimizations add tests for whether we have a valid cache. But the basic point remains that the computation of a single output token is a DAG. Hence, graph compilers can "compile" that DAG into a fixed set of computations.
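To make that concrete, here is a minimal sketch of the outer autoregressive loop wrapping the single-token DAG, reusing the toy definitions from the earlier sketch; forward_pass is a hypothetical placeholder name, not a real engine API:

    def forward_pass(tokens):
        # The single-token DAG: the fixed, branch-free layer pipeline sketched above.
        # This is the part that a graph compiler can compile ahead of time.
        x = embed[np.array(tokens)]
        for layer in range(num_layers):
            x = x @ W_attn[layer]
            x = np.maximum(x, 0.0)
            x = x @ W_ffn[layer]
        return x[-1] @ unembed

    tokens = [1, 42, 7]                   # the prompt, already tokenized
    for _ in range(20):                   # autoregressive loop: sits outside the DAG
        logits = forward_pass(tokens)     # one fixed DAG evaluation per output token
        next_token = int(np.argmax(logits))
        tokens.append(next_token)         # loop back: the new token feeds the next step

The inner forward_pass is the DAG that the compiler optimizes; the outer loop, plus any early-exit or cache-validity tests, sits around that compiled graph.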

Efficiency Optimization Techniques

The main goal of deep learning compilers is optimization, and they may offer a variety of different optimization features. Some example types of optimizations that ML compilers can use include:

List of Machine Learning Compiler Platforms

Here is a brief list of some of the major ML compilers (graph compilers) that you can use:

Survey Papers on Machine Learning Compilers

Reviews of the many deep learning compilers available:

Research Papers on Machine Learning Compilers

Research on deep learning compilers and papers related to specific usage of their methods:

Kernel Fusion in ML Compilers

Kernel fusion is the optimization of merging two operations into a single operator. The combined operator is faster than running the two operations sequentially, because fusing them can reduce redundant computations or memory accesses, lowering the overall cost. Read more about kernel operator fusion techniques.
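As a toy illustration of the idea, here is a minimal Python sketch contrasting two separate elementwise passes (bias add, then ReLU) with one fused pass that avoids writing out the intermediate array; the function names and explicit loop are illustrative assumptions, not any particular compiler's output:

    import numpy as np

    def unfused(x, bias):
        # Two separate "kernels": each one is a full pass over memory.
        y = x + bias               # kernel 1: bias add (writes an intermediate array)
        return np.maximum(y, 0.0)  # kernel 2: ReLU (reads that intermediate back in)

    def fused(x, bias):
        # One fused kernel: a single pass, no intermediate array in memory.
        out = np.empty_like(x)
        for i in range(x.size):
            v = x.flat[i] + bias.flat[i]          # bias add...
            out.flat[i] = v if v > 0.0 else 0.0   # ...and ReLU in the same loop body
        return out

    x = np.random.randn(8)
    bias = np.random.randn(8)
    assert np.allclose(unfused(x, bias), fused(x, bias))

In Python the fused loop is slower, of course, but a compiler emits it as native code for the target hardware, and the saving comes from reducing memory traffic by skipping the intermediate array.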

Deep learning compilers use kernel fusion as one of the major optimizations on the execution graph. Compiler-executed kernel operator fusion research papers:

More AI Research

Read more about: