Faster Together

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Combining two operations on the same data is often faster than running them sequentially. The idea is that instead of “1+1=2” in the sequential version, the fused version costs something like “1+1=1.5”. Kernel fusion is closely related to the loop transformation optimization known as loop fusion. The improvement mainly arises when two components share data, or when one component processes the output data of another. The benefits of kernel fusion are generally:

  • Avoidance of computing and storing intermediate data.
  • Reduced redundant memory loads and stores.
  • Reduced GPU-to-memory accesses and transfers.
  • Increased data locality (memory cache speedup).
  • Merged computations (less commonly).

If two operations work on the same data, the first has to store its output, and the second has to load it back again later. In other words, there are redundant stores and loads to and from memory. If we merge the two kernels, there is no need to store or re-load the same data. If the computation runs on a GPU, there is also no need for a round-trip between the GPU cache memory and the CPU RAM.
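As a rough illustration in plain sequential C++ (the function names are hypothetical, and this is a sketch rather than an optimized GPU kernel), compare an unfused pair of element-wise loops with their fused equivalent:

#include <cstddef>
#include <vector>

// Unfused version: two separate passes, with an intermediate vector
// that is written out by the first loop and re-loaded by the second.
std::vector<float> scale_then_bias_unfused(const std::vector<float>& v,
                                           float scale, float bias)
{
    std::vector<float> tmp(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        tmp[i] = v[i] * scale;          // store intermediate result
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = tmp[i] + bias;         // re-load intermediate result
    return out;
}

// Fused version: one pass, no intermediate vector, no redundant loads/stores.
std::vector<float> scale_then_bias_fused(const std::vector<float>& v,
                                         float scale, float bias)
{
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = v[i] * scale + bias;   // both operations in one step
    return out;
}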

Merging two kernels is usually mainly about reducing memory accesses, but it can also reduce computation in some cases. If two kernels both compute the same intermediate result, merging them is a form of “computation reuse.” However, more typically, kernel fusion is about merging two components that perform distinct operations on the same data.
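For instance, a standalone variance kernel would recompute the mean that a separate mean kernel has already calculated, whereas a fused kernel computes it once. A minimal C++ sketch of that idea (the function name is hypothetical, not from any particular engine):

#include <cstddef>
#include <vector>

// Fused mean-and-variance kernel: the mean is the shared intermediate
// result, computed once and reused by the variance pass.
// (Assumes v is non-empty.)
void mean_and_variance_fused(const std::vector<float>& v,
                             float& mean, float& variance)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    mean = sum / static_cast<float>(v.size());

    float sumsq = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        float diff = v[i] - mean;       // reuses the mean, not recomputed
        sumsq += diff * diff;
    }
    variance = sumsq / static_cast<float>(v.size());
}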

Note that some of these benefits apply to both sequential and parallelized (GPU) kernels. For example, data locality improves CPU RAM accesses in sequential code due to caching and hardware prefetching. However, the benefits are obviously greater for kernel fusion of vectorized loops on a GPU.

The idea of kernel fusion is not limited to two operations. The more the merrier, so long as you don't want readable C++ code any more. Merging three or more operations together can offer even greater speed improvements but can get quite complex.
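As a sketch of a three-way fusion (hypothetical function name, and std::clamp assumes C++17), here is a single pass that scales, adds a bias, and clamps, with no intermediate vectors between the three operations:

#include <algorithm>
#include <cstddef>
#include <vector>

// Three operations fused into one loop body: scale, bias, clamp.
std::vector<float> scale_bias_clamp_fused(const std::vector<float>& v,
                                          float scale, float bias,
                                          float lo, float hi)
{
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = std::clamp(v[i] * scale + bias, lo, hi);
    return out;
}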

RELU fusion. Examples of kernel fusion opportunities arise in Transformers where the same data is processed in sequence by two operations. For example, if a Vector-Matrix Multiplication (VMM) component creates a vector, which is then passed into a RELU activation component, then both of these components are acting on the same output vector. A fused VMM-RELU component can perform the RELU operation inside the VMM, just after each final result has been written to its vector element.
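A minimal C++ sketch of this idea (hypothetical function name, written as plain nested loops rather than an optimized kernel) applies RELU to each output element as soon as its dot product is finished:

#include <cstddef>
#include <vector>

// Fused VMM-RELU: RELU is applied to each output element immediately after
// its dot product is computed, instead of in a second pass over the output
// vector by a separate RELU component.
std::vector<float> vmm_relu_fused(const std::vector<std::vector<float>>& matrix,  // [rows][cols]
                                  const std::vector<float>& vec)                  // size == cols
{
    std::vector<float> out(matrix.size());
    for (std::size_t row = 0; row < matrix.size(); ++row) {
        float sum = 0.0f;
        for (std::size_t col = 0; col < vec.size(); ++col)
            sum += matrix[row][col] * vec[col];      // dot product for this row
        out[row] = (sum > 0.0f) ? sum : 0.0f;        // fused RELU on this element
    }
    return out;
}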

An important point here is that the RELU component cannot be run in parallel with the VMM's matrix-vector multiplication, because it is waiting on the output vector from the VMM component. We should not fuse two independent components that can be run in parallel, because that could be a de-optimization. Kernel fusion works best on two components that run sequentially, where the second one is waiting for the first one's output.

The primary goal of kernel fusion is often to reduce data movement and memory usage, rather than to reduce computation, but the two metrics are closely linked and often both goals are achieved. By merging two operations into one, the cost of storing intermediate results to memory is avoided. Hence, kernel fusion is also a type of memory optimization technique for AI inference engines.

 
