Aussie AI

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What is Kernel Fusion?

Kernel fusion refers to “fusing” two kernels together. This is the optimization of merging two distinct algorithmic components in an AI inference engine into a single component. The full technical term should probably be “kernel operator fusion,” but it's usually abbreviated to “kernel fusion” or “operator fusion” or just “fusion” in neural network research papers.
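As a concrete illustration, consider an elementwise vector add followed by a ReLU activation. A minimal C++ sketch (with hypothetical function names, purely for illustration) contrasts the two separate kernels with their fused equivalent:

    #include <vector>

    // Unfused version: two separate kernels, with a full intermediate
    // vector written to memory between them.
    std::vector<float> vector_add(const std::vector<float>& a,
                                  const std::vector<float>& b) {
        std::vector<float> out(a.size());
        for (size_t i = 0; i < a.size(); ++i)
            out[i] = a[i] + b[i];
        return out;  // intermediate result
    }

    std::vector<float> relu(const std::vector<float>& v) {
        std::vector<float> out(v.size());
        for (size_t i = 0; i < v.size(); ++i)
            out[i] = (v[i] > 0.0f) ? v[i] : 0.0f;
        return out;
    }

    // Fused version: one kernel, one pass, no intermediate vector.
    std::vector<float> fused_add_relu(const std::vector<float>& a,
                                      const std::vector<float>& b) {
        std::vector<float> out(a.size());
        for (size_t i = 0; i < a.size(); ++i) {
            float sum = a[i] + b[i];             // first operation
            out[i] = (sum > 0.0f) ? sum : 0.0f;  // second operation, fused
        }
        return out;
    }

The fused version does the same arithmetic, but traverses the data only once, saving a full pass over memory and the allocation of the intermediate vector.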

Kernel fusion is a merging of code. Note that the term “kernel fusion” is distinct from similar-sounding terms such as “layer fusion,” which refers to a merging of data (usually weight sharing across multiple layers of a model). Operator fusion merges algorithmic steps, but does not reduce the number of weights or share any parameters.

Exact (Not Approximate). Kernel fusion is not an approximation, but an exact optimization. Kernel fusion works by merging two major steps so that the intermediate data need not be written out and re-read by the subsequent computation, but both steps are still effectively performed. Also, kernel fusion does not skip any processing of model data, making it different from techniques such as pruning (e.g., layer pruning).
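Continuing the hypothetical add-and-ReLU sketch above, a short test can demonstrate the exactness: the fused and unfused paths compute identical results.

    #include <cassert>

    void test_fusion_is_exact() {
        std::vector<float> a = { 1.0f, -2.0f,  3.0f };
        std::vector<float> b = { 0.5f,  1.0f, -4.0f };
        std::vector<float> unfused = relu(vector_add(a, b));
        std::vector<float> fused   = fused_add_relu(a, b);
        for (size_t i = 0; i < fused.size(); ++i)
            assert(unfused[i] == fused[i]);  // identical, not approximate
    }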

ML Compilers. Operator fusion has been an active area of research for many years in ML compilers (also called “graph compilers” or “deep learning compilers”). These ML compiler frameworks apply many optimizations to the low-level, final execution version of the model, usually based on a graph intermediate representation of the model and the engine. In this setting, kernel fusion means finding two adjacent nodes whose operators form a pair that is known to be fusible. Over the years, fused kernels have been built for many such pairs of ML operator nodes.
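As a rough sketch of that idea, the pass below walks a linearized list of operator nodes and rewrites each adjacent MatMul-then-Add pair as one fused node. The OpNode type and operator names here are hypothetical and heavily simplified; a real compiler pass would pattern-match over a dataflow graph and verify that the Add actually consumes the MatMul's output.

    #include <string>
    #include <vector>

    struct OpNode {
        std::string op;  // e.g., "MatMul", "Add", "FusedMatMulAdd"
        // A real IR node would also carry inputs, outputs, and attributes.
    };

    // Simplistic fusion pass: rewrite every adjacent MatMul -> Add
    // pair as a single fused operator node.
    std::vector<OpNode> fuse_matmul_add(const std::vector<OpNode>& graph) {
        std::vector<OpNode> result;
        for (size_t i = 0; i < graph.size(); ++i) {
            if (i + 1 < graph.size()
                    && graph[i].op == "MatMul"
                    && graph[i + 1].op == "Add") {
                result.push_back({"FusedMatMulAdd"});  // known-fusible pair
                ++i;  // skip the Add node consumed by the fusion
            } else {
                result.push_back(graph[i]);
            }
        }
        return result;
    }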

Fused Transformer Components. The various LLM and Transformer architectures don't need an ML compiler to use kernel fusion as an optimization. Any AI engine can use kernel fusion to combine pairs of operations that occur sequentially in the training or inference algorithms. Hence, kernel fusion of vector, matrix, and tensor operations, usually with a secondary operation, can be used in the high-level pathways of Transformer inference engines. Researchers and industry practitioners have found various ways to speed up a vanilla Transformer using kernel fusion techniques. Several of the major component steps in the Transformer algorithm (e.g., attention heads, normalization, FFN, Softmax) can be merged with a prior operation using kernel fusion ideas, as in the sketch below.
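For example, one common Transformer fusion merges the residual addition with the normalization step that follows it. Below is a minimal single-threaded C++ sketch of a fused residual-add-plus-LayerNorm kernel (hypothetical code, with LayerNorm's learned scale and bias parameters omitted for brevity):

    #include <cmath>
    #include <vector>

    // Fused residual add + LayerNorm: the residual sum is computed once
    // and normalized within the same kernel, instead of running two
    // separate kernels with a full intermediate tensor between them.
    std::vector<float> fused_residual_layernorm(
            const std::vector<float>& x,
            const std::vector<float>& residual,
            float epsilon = 1e-5f) {
        const size_t n = x.size();
        std::vector<float> out(n);
        float mean = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            out[i] = x[i] + residual[i];  // residual add, fused in
            mean += out[i];
        }
        mean /= static_cast<float>(n);
        float variance = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            float diff = out[i] - mean;
            variance += diff * diff;
        }
        variance /= static_cast<float>(n);
        float inv_std = 1.0f / std::sqrt(variance + epsilon);
        for (size_t i = 0; i < n; ++i)
            out[i] = (out[i] - mean) * inv_std;  // normalize in place
        return out;
    }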

 
