Aussie AI
31. Kernel Fusion
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“It is easier to smash an atom than a prejudice.”
— Albert Einstein.
What is Kernel Fusion?
Kernel fusion refers to “fusing” two kernels together. This is the optimization of merging different algorithmic components in an AI inference engine to create a single component. The full technical term probably should be “kernel operator fusion” but it's usually abbreviated to “kernel fusion” or “operator fusion” or just “fusion” in neural network research papers.
Kernel fusion is merging of code. Note that the term “kernel fusion” is distinct from similar terms such as layer fusion, which refers to merging of data (i.e., it usually refers to weight sharing across multiple layers of a model). Operator fusion merges the algorithmic steps, but does not reduce weights or share parameters.
Exact (Not Approximate). Kernel fusion is not an approximation, but an exact optimization. Kernel fusion works by merging two major steps to avoid redundancy from the use of the intermediate data in subsequent computations, but both steps are still effectively performed. Also, kernel fusion does not skip any processing of model data, making it different from techniques such as pruning (e.g. layer pruning).
ML Compilers. Operator fusion has been an active area of research for many years in ML Compilers (also called “graph compilers” or “deep learning compilers”). These ML compiler frameworks can apply many optimizations to the low-level final execution version of the model, usually based on a graph intermediate representation of the model and the engine. In this way, kernel fusion for an ML compiler means finding two adjacent nodes whose operators are a pair known to be fusable. Over the years, fused kernels have been built for many pairs of ML operator nodes.
Fused Transformer components. The various LLM and Transformer architectures don't need an ML compiler to use kernel fusion as an optimization. Any AI engine can use kernel fusion to combine any of the various pairs of operations that occur sequentially in the training or inference algorithms. Hence, kernel fusion of vector, matrix, and tensor operations, usually with another secondary operation, can be used in the high-level pathways of Transformer inference engines. Researchers and industry practitioners have found various ways to speed up a vanilla Transformer using kernel fusion techniques. Several of the major component steps in the Transformer algorithm (e.g. attention heads, normalization, FFN, Softmax, etc.) can be merged with a prior operation using kernel fusion ideas.
Faster Together
Combining two operations on the same data together is often faster than running them sequentially. The idea is that instead of “1+1=2” in the sequential version, the fused version is “1+1=1.5” in terms of cost. Kernel fusion is closely related to the loop transformation optimizations of loop fusion. The improvement mainly arises when two components share data or if one component processes the output data of another. The benefits of kernel fusion are generally:
- Avoidance of intermediate data computation and storage.
- Reduced redundant memory loads and stores.
- Reduced GPU-to-memory accesses and transfers.
- Increased data locality (memory cache speedup).
- Merging of computations (less commonly).
If there are two operations that work on the same data, one has to store its output, and the second one has to load it again later. In other words, there are redundant stores and loads to/from memory. If we merge the two kernels, there is no need to store or re-load the same data. If the computation is in a GPU, then there is no need for a round-trip from the GPU cache memory to the CPU RAM.
Merging two kernels is usually mainly about the memory access reductions, but can also reduce computations in some cases. If there are two kernels that both compute the same intermediate result, merging them is a form of “computation reuse.” However, more typically, kernel fusion is about merging two components that perform distinct operations on the same data.
Note that some of these benefits apply to both sequential and parallelized (GPU) kernels. For example, data locality improves CPU RAM accesses in sequential code due to cache pipeline prediction. However, the benefits are obviously greater for kernel fusion of vectorized loops in a GPU.
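For instance, here is a minimal sequential C++ sketch (the function names are hypothetical, not from the book's code base) of fusing two element-wise vector operations, an add-bias step followed by a scaling step, so that the intermediate vector is never stored and each element is loaded only once:
#include <vector>

// Unfused: two separate kernels, with an intermediate vector stored in between.
void add_bias_then_scale_unfused(const float v[], const float bias[],
                                 float scale, float vout[], int n)
{
    std::vector<float> tmp(n);                              // Intermediate result
    for (int i = 0; i < n; i++) tmp[i] = v[i] + bias[i];    // Kernel 1: add bias
    for (int i = 0; i < n; i++) vout[i] = tmp[i] * scale;   // Kernel 2: scale
}

// Fused: one pass over the data, no intermediate vector, fewer loads and stores.
void add_bias_then_scale_fused(const float v[], const float bias[],
                               float scale, float vout[], int n)
{
    for (int i = 0; i < n; i++) {
        vout[i] = (v[i] + bias[i]) * scale;   // Both operations on the same element
    }
}
The fused version roughly halves the memory traffic on the vector, and a vectorized or GPU version of the same fused loop gets the same benefit in parallel form.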
The idea of kernel fusion is not limited to two operations. The more the merrier, so long as you don't want readable C++ code any more. Merging three or more operations together can offer even greater speed improvements but can get quite complex.
RELU fusion. Examples of kernel fusion opportunities arise in Transformers where the same data is processed in sequence by two operations. For example, if a Vector-Matrix Multiplication (VMM) component creates a vector, which is then passed into a RELU activation component, then both of these components are acting on the same output vector. A fused VMM-RELU component can perform the RELU operation inside the VMM, just after it has written the final result to each vector element.
An important point here is that the RELU component cannot be run in parallel to the VMM's matrix-vector multiplication, because it is waiting on the output vector from the VMM component. We should not fuse two independent components which can be run in parallel, because that could be a de-optimization. Kernel fusion works best on two components that are running sequentially, where the second one is waiting for the first one's output.
The primary goal of kernel fusion is often to reduce data movement and memory usage, rather than computation reduction, but the two metrics are closely linked and often both goals are achieved. By merging two operations into one, the cost of storing intermediate results to memory is avoided. Hence, it is also a type of memory optimization technique for AI inference engines.
Kernel Fission
Kernel fission is the opposite of kernel fusion. Whereas kernel fusion merges two operators into one kernel, kernel fission splits a single kernel into two simpler kernels. Kernel fission is analogous to the loop transformation of loop fission, which involves splitting a loop in two, whereas loop fusion is merging two loops. Kernel fission is less used than kernel fusion, but can be effective.
Like kernel fusion, kernel fission involves changes to the C++ kernel code doing the inference rather than the model's data. Kernel fission can apply to inference and/or training, depending on which operation is split apart into two.
Faster Apart. The optimization goal of kernel fission is usually to create two simpler kernels that can each be more efficiently vectorized. The performance improvement from kernel fission occurs in separating two combined operations, hopefully resulting in either or both of the two smaller kernels being faster through vectorization or streamlined scheduling for pipelining. Another goal may be to run both of the two simpler kernels in parallel, rather than merged.
There can sometimes be an advantage in terms of data locality and cache access speed, but it will more often be worsened by having two kernel operator loops skimming through the same data twice. At least one of the split-out pair of simpler kernels must run much faster separately, usually from accessing hardware acceleration, or else we've simply added extra loop overhead and worsened the overall performance.
Exact (Not Approximate). Kernel fission is also an exact optimization, like kernel fusion. The same computations are performed, but in a different order, split-out into two vectorized loops. One of the split-out loops could subsequently be changed to an approximate version, but that is a separate optimization, and is not usually a goal of kernel fission.
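For example, here is a hypothetical C++ sketch of kernel fission applied to a combined loop that both exponentiates a vector and accumulates its sum (as in a Softmax denominator). Splitting it yields one element-wise loop and one reduction loop, each a standard pattern that is easier to vectorize or hand off to an accelerated routine:
#include <cmath>

// Combined kernel: exponentiate each element and accumulate the sum in one loop.
float vector_expize_sum_combined(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);   // Element-wise exponential
        sum += v[i];         // Horizontal summation (serial dependency)
    }
    return sum;
}

// After kernel fission: two simpler loops, an element-wise map and a reduction,
// each of which can be vectorized separately.
float vector_expize_sum_fission(float v[], int n)
{
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);   // Loop 1: element-wise exponential
    }
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += v[i];         // Loop 2: summation reduction
    }
    return sum;
}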
Transformer Component Fusion
The operations performed by Transformer components can be fused, which means to combine the code of two operations into one operation, reducing issues with temporary data and other overhead. Some examples of what is possible for fused operations:
- Fused multi-head attention (fused MHA)
- Fused Multiply-Add (FMA)
- Fused normalization (e.g. fused LayerNorm or fused BatchNorm)
- Fused SoftMax
- Fused activations (e.g. fused RELU, fused GELU, fused SwiGLU, etc.)
- Fused Add-Bias
- Fused matrix transpose
Note that all of these kernel fusion examples require two operations to be merged. Usually, the above fused components would be merged back into the prior component.
Example: Fused VMM-add-bias
The way that kernel fusion works is to combine or merge the code for two operations into a single merged operation. For example, a typical linear layer will do a vector-matrix multiplication (“VMM”), followed by adding the bias vector (“add-bias”).
The same operands are used as without fusion. The VMM is a matrix-vector multiply that takes two operands: a weights matrix and the vector (calculated internal data). The add-bias is vector addition that takes two operands: a bias vector and the calculated vector (from the VMM). In the non-fused method, the VMM first multiplies the weights matrix by the input vector, creating a new vector. The add-bias operation then adds the bias vector to that vector, outputting the final resulting vector. In pseudocode, it looks like:
vout = vector_matrix_multiply(weight_matrix, vinput);
vfinal = add_bias(vout, vbias);
Both VMM and add-bias are working on the same vector of propagated internal data, and these two operations can be “fused” to create a combined “VMM-add-bias” operator. The result is a “three-way” fused operator, which has three inputs (the weights matrix, the bias vector, and the incoming calculated vector), and a single vector output.
vfinal = fused_vmm_add_bias(weight_matrix, vinput, vbias);
The result of the merged VMM-add-bias combined operation is not approximated. The output is exactly the same vector that would result from sequentially doing a VMM and then an add-bias.
How is this any faster? It's true that the fused operator does the same number of calculations in terms of multiplications and additions (i.e. the same FLOPs, assuming it's floating-point data). The speedup comes from “fusing” the add-bias calculations into the same code as the VMM, which reduces some computation overhead, such as scanning through the vector twice. There is also a reduction in an entire vector of temporary data (i.e. the intermediate “vout” vector result of the VMM before it goes into the add-bias), which reduces data transfers, and improves memory usage and dataflow, also benefiting wall clock speed.
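Here is a minimal C++ sketch of such a fused VMM-add-bias operator (the function and parameter names are hypothetical, and a real kernel would also be vectorized): the bias is added immediately after each dot product, so the intermediate “vout” vector never needs to be stored or re-loaded:
// Hypothetical sketch of a fused VMM-add-bias kernel: each output element
// gets its bias added right after its dot product, so the intermediate
// VMM output vector is never stored or re-loaded.
void fused_vmm_add_bias(const float* weight_matrix,  // n x n weights, row-major
                        const float vinput[],        // incoming calculated vector
                        const float vbias[],         // bias vector
                        float vfinal[],              // final output vector
                        int n)
{
    for (int i = 0; i < n; i++) {
        const float* row = &weight_matrix[i * n];
        float sum = 0.0f;
        for (int j = 0; j < n; j++) {
            sum += row[j] * vinput[j];   // Dot product for output element i
        }
        vfinal[i] = sum + vbias[i];      // Fused add-bias (no intermediate vout)
    }
}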
Fused Activation Functions
Activation functions are often good candidates for fusing back into the prior operation. An activation function is waiting for a vector output, and then simply transforms every element of that vector. Common examples of activation function fusion include:
- Fused RELU
- Fused GELU
- Fused SwiGLU
Example: Fused VMM & RELU. It really seems to me that RELU should win the award for the most obscure name for a simple thing. I mean, it's just: make all negatives into zero. Why isn't it called “make positive” or something? Oh, but that's wrong because zero isn't positive. Okay, so let's name it: Make It Zero Unless It's Already Positive (MIZUIAP). See where I'm going with this?
The good news about this is it's easy to code RELU in C++:
float aussie_RELU_basic(float f)   // Basic RELU (inefficient)
{
    if (f <= 0.0) return 0.0;
    return f;
}
In fact, it's so simple that it doesn't even deserve to be a C++ function:
#define AUSSIE_RELU_MACRO(f) ( (f) <= 0.0 ? 0.0 : (f) ) // Macro RELU
So, let's say we want a RELU after our Vector-Matrix Multiply (VMM) component. Remember that a matrix-vector multiplication ends up being a lot of vector dot products, at least in the simple versions.
void aussie_VMM_vector_basic_vecdot(const ymatrix m, const float v[], int n, float vout[])
{
    // Basic matrix-by-vector using vector dot products..
    for (int i = 0; i < n; i++) {
        const float* rowvector = &m[i][0];
        float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
        vout[i] = sum;
    }
}
And we also use a function to apply RELU to each element of a vector:
void aussie_vector_reluize(float v[], int n)   // Apply RELU to each element
{
    for (int i = 0; i < n; i++) {
        v[i] = AUSSIE_RELU_MACRO(v[i]);
    }
}
Hence, our non-fused sequence of two Transformer components is a matrix-vector multiplication (e.g. a linear layer component) followed by an element-wise RELU on the output vector:
aussie_VMM_vector_basic_vecdot(m, v, n, vout);  // Matrix-vector multiply
aussie_vector_reluize(vout, n);                 // Apply RELU on the output vector
The above method is needlessly inefficient with two sequential components, and has a double-scan of the output vector, resulting in redundant memory accesses on the “vout” vector. And this looks like a good candidate for kernel fusion because: (a) there are two loops on the same data in the output vector, and (b) the two components are both forced to run sequentially, because the RELU component is waiting for the vector data from the VMM component.
Here's how to do a simplistic “fused-VMM-RELU”, with the RELU activation function applied to the result of each vector dot product inside the VMM:
void aussie_VMM_vector_basic_vecdot_RELU(const ymatrix m, const float v[], int n, float vout[])
{
    // Basic fused matrix-by-vector RELU using vector dot products..
    for (int i = 0; i < n; i++) {
        const float* rowvector = &m[i][0];
        float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
        vout[i] = AUSSIE_RELU_MACRO(sum);
    }
}
This is still unnecessarily inefficient, and although this is a type of “fused-VMM-RELU”, it isn't really a deeply fused vecdot-RELU method at all. So, let's “fuse” the RELU even deeper inside the vector dot product code, right at the end, creating a “fused-vecdot-RELU” that can be called by our VMM-RELU component:
float aussie_fused_vecdot_RELU_basic(float v1[], float v2[], int n)  // Basic fused dot product + RELU
{
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += v1[i] * v2[i];
    }
    return AUSSIE_RELU_MACRO(sum);
}
That example is so simplistic it almost seems like it's not an optimization. Isn't it just a flattened function call? Surely that can't be all there is to kernel operator fusion?
Well, yes, RELU is very simple, and the above example vector dot product hasn't been optimized for sequential execution or vectorized to run in parallel. Believe me, if you look at the real code for fused operators, it gets squirrelly very fast. But even though simple, the above example is actually more performant than simply removing function call overhead, because it has avoided double-scanning the output vector, thereby removing redundant memory accesses, and has gotten rid of the “vout” temporary vector entirely. So, fair enough complaint, it's a bit simplistic.
Let's add vectorization to the kernel fusion ideas. Here's the AVX optimization of RELU on a vector by doing “max(0,v)” using the AVX “max” intrinsic and a vector full of zeros to parallelize 4 float values at a time:
void aussie_vector_reluize_AVX1(float v[], int n)
{
    // Apply RELU to each element with AVX (sets negatives to zero)
    yassert(n % 4 == 0);
    const __m128 rzeros = _mm_set1_ps(0.0f);  // Set up vector full of zeros...
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]);      // Load floats into 128-bits
        __m128 dst = _mm_max_ps(r1, rzeros);  // MAX(r1,0)
        _mm_storeu_ps(&v[i], dst);            // Store back to floats (unaligned store)
    }
}
And here's the AVX version of the vector dot product that uses “_mm_dp_ps” to add 4 float product-pairs at a time:
float aussie_vecdot_unroll_AVX1(const float v1[], const float v2[], int n)
{
    // AVX-1 loop-unrolled (4 floats) vector dot product
    yassert(n % 4 == 0);
    float sum = 0.0;
    for (int i = 0; i < n; i += 4) {
        // AVX1: Vector dot product of 2 vectors
        //  ... process 4x32-bit floats in 128 bits
        __m128 r1 = _mm_loadu_ps(&v1[i]);       // Load floats into 128-bits
        __m128 r2 = _mm_loadu_ps(&v2[i]);
        __m128 dst = _mm_dp_ps(r1, r2, 0xf1);   // Dot product
        sum += _mm_cvtss_f32(dst);
    }
    return sum;
}
And here's the VMM code using the AVX1 version of the vector dot product:
void aussie_VMM_vector_vecdot_AVX1(const ymatrix m, const float v[], int n, float vout[])
{
    // VMM matrix-by-vector using AVX1 vector dot product
    for (int i = 0; i < n; i++) {
        const float* rowvector = &m[i][0];
        float sum = aussie_vecdot_unroll_AVX1(rowvector, v, n);  // Dot product
        vout[i] = sum;
    }
}
So far, these are vectorized with AVX, but not fused.
How can we merge them and reduce either the number of AVX intrinsic calls or vector memory accesses by putting them together? There's no obvious way to merge the parallel AVX RELU version inside the vector dot product or the VMM code. Obviously, we can run the VMM and then call the AVX RELU function on its output vector afterwards. Is this an example where kernel fission applies, with two loops faster when not merged?
We can use kernel fusion by unrolling the outer loop so that it computes 4 dot products at a time (i.e., vout[i], vout[i+1], vout[i+2], vout[i+3]). Then we have 4 float values in an array that the RELU AVX sequence can process in parallel (i.e., 4 RELU computations at once). But that isn't much of an improvement, because we're still writing those 4 float values to the output vector and then reading them back in again for the RELU.
There is actually a better kernel fusion solution, but it's quite involved. We need to flatten the call hierarchy so that the VMM is two nested loops, then unroll the outer loop so that it iterates over the output vector 4 float values at a time, then use tiling to make the two loops work on a 4x4 tile, then unroll the 4x4 tile completely, and then the RELU operation can be done with AVX on the 4 output vector results in parallel, as they get finalized. However, the speedup from that optimization comes more from the tiling and unrolling than from the kernel fusion.
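As a rough sketch of the direction (reusing the AVX functions above, but stopping short of the full flattening, 4x4 tiling, and unrolling just described), the outer loop can be unrolled 4-at-a-time and the RELU applied to the 4 dot product results while they are still in an AVX register, before they are ever stored to the output vector. The function name here is hypothetical:
void fused_VMM_RELU_unrolled4_AVX1(const ymatrix m, const float v[], int n, float vout[])
{
    // Sketch: fused VMM-RELU with the outer loop unrolled 4-at-a-time.
    // The RELU is applied to 4 dot product results in an AVX register
    // before they are stored to the output vector (no second pass over vout).
    yassert(n % 4 == 0);
    const __m128 rzeros = _mm_set1_ps(0.0f);   // Vector of zeros for RELU
    for (int i = 0; i < n; i += 4) {
        // Compute 4 dot products (note: _mm_set_ps takes the highest lane first)
        __m128 sums = _mm_set_ps(
            aussie_vecdot_unroll_AVX1(&m[i + 3][0], v, n),
            aussie_vecdot_unroll_AVX1(&m[i + 2][0], v, n),
            aussie_vecdot_unroll_AVX1(&m[i + 1][0], v, n),
            aussie_vecdot_unroll_AVX1(&m[i][0], v, n));
        __m128 dst = _mm_max_ps(sums, rzeros); // Fused RELU: max(0, sums)
        _mm_storeu_ps(&vout[i], dst);          // Store 4 finalized outputs once
    }
}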
Kernel Fusion of Vector Min and Max
As another example, consider the fusion of the two loops for finding the “min” and “max” of a vector. This is a horizontal reduction for both, but for AVX1 and AVX2, we need to do them piecewise (whereas AVX-512 has reduction intrinsics). The full source for the min and max functions is in Chapter 30 (Vectorization). The main loop for “min” in AVX1 is this:
__m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial 4 values
for (int i = 4 /*not 0*/; i < n; i += 4) {
    __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats into 128-bits
    sumdst = _mm_min_ps(r1, sumdst);   // dst = MIN(dst, r1)
}
And the main loop for “max” in AVX1 is almost identical:
__m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial 4 values
for (int i = 4 /*not 0*/; i < n; i += 4) {
    __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats into 128-bits
    sumdst = _mm_max_ps(r1, sumdst);   // dst = MAX(dst, r1)
}
The two main loops for min and max are easily fused, and this avoids one “load” intrinsic into the “r1” register. The fused main loop has 3 calls to intrinsics, rather than 4 calls (i.e., 2 per iteration for each of the min and max functions separately), so we have a 25% reduction in calls. There isn't much to do for fusing the last step, but this is not the critical part of the function. Here is the final fused loop for AVX1 for min and max of a vector:
float aussie_vector_max_min_fusion_AVX1b(float v[], int n, float &fminout)
{
    // Maximum and Minimum (horizontal) of a single vector
    yassert(n % 4 == 0);
    __m128 maxsumdst = _mm_loadu_ps(&v[0]);  // Initial 4 values
    __m128 minsumdst = maxsumdst;            // Initial 4 values
    for (int i = 4 /*not 0*/; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]);        // Load floats into 128-bits
        maxsumdst = _mm_max_ps(r1, maxsumdst);  // dst = MAX(dst, r1)
        minsumdst = _mm_min_ps(r1, minsumdst);  // dst = MIN(dst, r1)
    }

    // Find Min of the final 4 accumulators
#define FMIN(x,y)  ( (x) < (y) ? (x) : (y) )
    float* minarr = minsumdst.m128_f32;
    float fmin1 = FMIN(minarr[0], minarr[1]);
    float fmin2 = FMIN(minarr[2], minarr[3]);
    fminout = FMIN(fmin1, fmin2);

    // Find Max of the final 4 accumulators
#define FMAX(x,y)  ( (x) > (y) ? (x) : (y) )
    float* maxarr = maxsumdst.m128_f32;
    float fmax1 = FMAX(maxarr[0], maxarr[1]);
    float fmax2 = FMAX(maxarr[2], maxarr[3]);
    return FMAX(fmax1, fmax2);
}