
Fused Activation Functions

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


Activation functions are often good candidates for fusing back into the prior operation. An activation function waits for a vector output from the prior operation, and then simply transforms every element of that vector. Common examples of activation function fusion include:

  • Fused RELU
  • Fused GELU
  • Fused SwiGLU
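
For a taste of what this looks like beyond RELU, here is a hedged sketch of a dot product with GELU fused onto its result, using the common tanh approximation of GELU (the function name and the structure are my illustration here, not code from the book's library):

    #include <math.h>   // for tanhf

    float fused_vecdot_GELU_sketch(const float v1[], const float v2[], int n)
    {
        // Sketch: vector dot product with GELU applied to the final sum
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
                sum += v1[i] * v2[i];
        }
        // GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        float x = sum;
        float t = tanhf(0.7978845608f * (x + 0.044715f * x * x * x));
        return 0.5f * x * (1.0f + t);
    }

The rest of this section works through the simpler RELU case in detail.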

Example: Fused VMM & RELU. It really seems to me that RELU should win the award for the most obscure name for a simple thing. I mean, it's just: make all negatives into zero. Why isn't it called “make positive” or something? Oh, but that's wrong because zero isn't positive. Okay, so let's name it: Make It Zero Unless It's Already Positive (MIZUIAP). See where I'm going with this?

The good news about this is it's easy to code RELU in C++:

    float aussie_RELU_basic(float f)   // Basic RELU (inefficient)
    {
        if (f <= 0.0) return 0.0;
        return f;
    }

In fact, it's so simple that it doesn't even deserve to be a C++ function:

    #define AUSSIE_RELU_MACRO(f)  ( (f) <= 0.0 ? 0.0 : (f) )   // Macro RELU

So, let's say we want a RELU after our Vector-Matrix Multiply (VMM) component. Remember that a matrix-vector multiplication ends up being a lot of vector dot products, at least in the simple versions.

    void aussie_VMM_vector_basic_vecdot(const ymatrix m, const float v[], int n, float vout[])
    {
        // Basic matrix-by-vector using vector dot products..
        for (int i = 0; i < n; i++) {
                const float* rowvector = &m[i][0];
                float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
                vout[i] = sum;
        }
    }

And we also use a function to apply RELU to each element of a vector:

    void aussie_vector_reluize(float v[], int n)   // Apply RELU to each element
    {
        for (int i = 0; i < n; i++) {
                v[i] = AUSSIE_RELU_MACRO(v[i]);
        }
    }

Hence, our non-fused sequence of two Transformer components is a matrix-vector multiplication (e.g. a linear layer component) followed by an element-wise RELU on the output vector:

    aussie_VMM_vector_basic_vecdot(m, v, n, vout);  // Matrix-vector multiply
    aussie_vector_reluize(vout, n);  // Apply RELU on the output vector

The above method is needlessly inefficient: it runs two sequential components and scans the output vector twice, resulting in redundant memory accesses on the “vout” vector. This looks like a good candidate for kernel fusion because:

    (a) there are two loops on the same data in the output vector, and

    (b) the two components are both forced to run sequentially because the RELU component is waiting for the vector data from the VMM component.

Here's how to do a simplistic “fused-VMM-RELU” where the RELU activation function is applied to the result of each vector dot product inside the VMM:

    void aussie_VMM_vector_basic_vecdot_RELU(const ymatrix m, const float v[], int n, float vout[])
    {
        // Basic fused matrix-by-vector RELU using vector dot products..
        for (int i = 0; i < n; i++) {
                const float* rowvector = &m[i][0];
                float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
                vout[i] = AUSSIE_RELU_MACRO(sum);
        }
    }

This is still unnecessarily inefficient, and although this is a type of “fused-VMM-RELU”, it isn't really a deeply fused vecdot-RELU method at all. So, let's “fuse” the RELU even deeper inside the vector dot product code, right at the end, creating a “fused-vecdot-RELU” that can be called by our VMM-RELU component:

    float aussie_fused_vecdot_RELU_basic(const float v1[], const float v2[], int n)   // Basic fused dot product + RELU
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
                sum += v1[i] * v2[i];
        }
        return AUSSIE_RELU_MACRO(sum);
    }

That example is so simplistic it almost seems like it's not an optimization. Isn't it just a flattened function call? Surely that can't be all there is to kernel operator fusion?

Well, yes, RELU is very simple, and the above example vector dot product hasn't been optimized for sequential execution or vectorized to run in parallel. Believe me, if you look at the real code for fused operators, it gets squirrelly very fast. But even though it is simple, the above example does more than just remove function call overhead: it avoids double-scanning the output vector, which removes a redundant pass of memory accesses over the “vout” vector.

So, fair enough complaint, it's a bit simplistic. Let's add vectorization to the kernel fusion ideas. Here's an AVX optimization of RELU on a vector: it computes “max(0,v)” using the AVX “max” intrinsic and a register full of zeros, processing 4 float values at a time:

    void aussie_vector_reluize_AVX1(float v[], int n)   
    {
        // Apply RELU to each element with AVX (sets negatives to zero)
        yassert(n % 4 == 0);
        const __m128 rzeros = _mm_set1_ps(0.0f);  // Set up vector full of zeros...
        for (int i = 0; i < n; i += 4) {
                __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats into 128-bits
                __m128 dst = _mm_max_ps(r1, rzeros);   // MAX(r1,0)
                _mm_storeu_ps(&v[i], dst);  // Store back to floats (unaligned store)
        }
    }

And here's the AVX version of vector dot product that uses “_mm_dp_ps” to add 4 float product-pairs at a time:

    float aussie_vecdot_unroll_AVX1(const float v1[], const float v2[], int n)  
    {                
        // AVX-1 loop-unrolled (4 floats) Vector dot product 
        yassert(n % 4 == 0);
        float sum = 0.0;
        for (int i = 0; i < n; i += 4) {
                // AVX1: Vector dot product of 2 vectors 
                //  ... process 4x32-bit floats in 128 bits
                __m128 r1 = _mm_loadu_ps(&v1[i]);   // Load floats into 128-bits
                __m128 r2 = _mm_loadu_ps(&v2[i]);
                __m128 dst = _mm_dp_ps(r1, r2, 0xf1); // Dot product
                sum += _mm_cvtss_f32(dst);
        }
        return sum;
    }

And here's the VMM code that uses the AVX1 version of the vector dot product:

    void aussie_VMM_vector_vecdot_AVX1(const ymatrix m, const float v[], int n, float vout[])
    {
        // VMM matrix-by-vector using AVX1 vector dot product
        for (int i = 0; i < n; i++) {
                const float* rowvector = &m[i][0];
                float sum = aussie_vecdot_unroll_AVX1(rowvector, v, n);  // Dot product
                vout[i] = sum;
        }
    }

So far, these are vectorized with AVX, but not fused.

How can we merge them to reduce either the number of AVX intrinsic calls or the vector memory accesses? There's no obvious way to push the parallel AVX RELU version inside the vector dot product or the VMM code. Obviously, we can run the VMM and then call the AVX RELU function on its output vector afterwards. Is this an example where kernel fission applies, with the two loops faster when not merged?

We can use kernel fusion by unrolling the outer loop so that it computes 4 dot products at a time (i.e. vout[i], vout[i+1], vout[i+2], vout[i+3]). Then we have 4 float values that the AVX RELU sequence can process in parallel (i.e. 4 RELU computations at once), as in the sketch below. But that isn't much of an improvement, because we're still reading and writing those 4 float values in the “vout” array twice.
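
Here is a minimal sketch of that unrolled-by-4 version, reusing the AVX dot product and the zero-vector MAX trick from above (the function name is my illustration, and it assumes n is a multiple of 4):

    void aussie_VMM_vector_vecdot_RELU_unroll4(const ymatrix m, const float v[], int n, float vout[])
    {
        // Sketch: outer VMM loop unrolled by 4, RELU applied to 4 outputs at once
        yassert(n % 4 == 0);
        const __m128 rzeros = _mm_set1_ps(0.0f);  // Vector full of zeros
        for (int i = 0; i < n; i += 4) {
                vout[i]     = aussie_vecdot_unroll_AVX1(&m[i][0], v, n);      // 4 dot products
                vout[i + 1] = aussie_vecdot_unroll_AVX1(&m[i + 1][0], v, n);
                vout[i + 2] = aussie_vecdot_unroll_AVX1(&m[i + 2][0], v, n);
                vout[i + 3] = aussie_vecdot_unroll_AVX1(&m[i + 3][0], v, n);
                __m128 r1 = _mm_loadu_ps(&vout[i]);      // Re-load the 4 results (second read)
                __m128 dst = _mm_max_ps(r1, rzeros);     // RELU = MAX(x, 0) on 4 floats
                _mm_storeu_ps(&vout[i], dst);            // Store back (second write)
        }
    }

Note that the 4 results are written to “vout” by the dot product calls and then immediately re-loaded for the RELU step, which is exactly the redundant memory traffic mentioned above.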

There is actually a better kernel fusion solution, but it's quite involved. We need to flatten the call hierarchy so that the VMM becomes two nested loops, unroll the outer loop so that it iterates on the output vector 4 float values at a time, use tiling so that the two loops work on a 4x4 tile, and then unroll the 4x4 tile completely. The RELU operation can then be done with AVX on the 4 output vector results in parallel, as they get finalized. However, the speedup from that optimization comes more from the tiling and unrolling than from the kernel fusion.
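
To give a feel for the shape of that code, here is a hedged sketch (the function name is illustrative, the 4x4 tile is not fully unrolled, and n is assumed to be a multiple of 4): the VMM is flattened into two nested loops, the outer loop produces 4 output values at a time, the inner loop works on a 4x4 tile of the matrix using the “_mm_dp_ps” intrinsic on each row of the tile, and the RELU is fused onto the 4 finished outputs with a single “max” before they are stored:

    void aussie_VMM_vector_tiled_4x4_RELU(const ymatrix m, const float v[], int n, float vout[])
    {
        // Sketch: flattened VMM as two nested loops over 4x4 tiles,
        // with RELU fused onto the 4 outputs as they are finalized.
        yassert(n % 4 == 0);
        const __m128 rzeros = _mm_set1_ps(0.0f);      // Vector full of zeros for MAX
        for (int i = 0; i < n; i += 4) {              // 4 output elements per outer iteration
                __m128 sums = _mm_set1_ps(0.0f);      // 4 running dot products
                for (int j = 0; j < n; j += 4) {      // 4x4 tile of the matrix
                        __m128 vj = _mm_loadu_ps(&v[j]);  // 4 input elements, reused for 4 rows
                        __m128 d0 = _mm_dp_ps(_mm_loadu_ps(&m[i][j]), vj, 0xF1);      // result in lane 0
                        __m128 d1 = _mm_dp_ps(_mm_loadu_ps(&m[i + 1][j]), vj, 0xF2);  // result in lane 1
                        __m128 d2 = _mm_dp_ps(_mm_loadu_ps(&m[i + 2][j]), vj, 0xF4);  // result in lane 2
                        __m128 d3 = _mm_dp_ps(_mm_loadu_ps(&m[i + 3][j]), vj, 0xF8);  // result in lane 3
                        __m128 tile = _mm_or_ps(_mm_or_ps(d0, d1), _mm_or_ps(d2, d3));
                        sums = _mm_add_ps(sums, tile);    // accumulate 4 partial dot products
                }
                // Fused RELU: clamp the 4 finished outputs to zero, then store once
                _mm_storeu_ps(&vout[i], _mm_max_ps(sums, rzeros));
        }
    }

The point of interest for kernel fusion is the last step: each group of 4 output values is written to “vout” only once, already RELU-ized, so there is no second pass over the output vector.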
