Aussie AI
21. Activation Functions
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“It is by will alone I set my mind in motion.”
— Mentat Mantra, Frank Herbert, Dune, 1965.
The idea of an “activation function” is that it controls whether or not a neuron in a neural network will be “activated.” For example, early neural networks used a “threshold” or “step” activation function to decide whether a neuron would pass its result along to the next layer. However, this idea is largely obscured in modern Transformer architectures, with the neuron activation concept overshadowed by massive tensor structures. Even so, activation functions are intricately linked into Transformers and calculated billions of times.
The choice of activation function is an important decision for a model because you're stuck with it. Inference and training must use the same activation function. There are a lot of opinions as to which activation function makes the model the smartest, but there's no disagreement on this: RELU is the fastest.
Activation functions in neural networks can be optimized in various ways. Linear activation functions (e.g. RELU) are more efficient than non-linear functions (e.g. GELU). Non-linear activation functions can be optimized through precalculated lookup tables and approximations. Even linear activation functions can gain further efficiency improvement by fusing them into a MatMul using kernel operator fusion.
Common Activation Functions
Various functions have been tested as activation functions, both linear and non-linear. The main activation functions that have emerged in practical usage of Transformers for LLMs are:
- RELU (Rectified Linear Unit)
- GELU (Gaussian Error Linear Unit)
- SwiGLU (Swish-Gated Linear Unit)
- SiLU (Sigmoid Linear Unit)
RELU is one of the earliest activation functions, but has stood firm over the years, with applicability for many usages. However, the default OpenAI GPT2 and GPT3 activation function was called “GELU-New,” which is actually what is usually meant by “GELU” nowadays, but these architectures could have been trained with RELU, Swish, or GELU-Old. InstructGPT uses the sigmoid/SiLU activation function. The Llama and Llama2 models from Meta's Facebook Research use Swish/SwiGLU for their activation function. Although confidential, apparently GPT-4 uses the sigmoid activation function (i.e. SiLU) in its loss function. RELU is still often used in many open source models for lower-end architectures because of its efficiency.
There are various other ones that have been used in earlier research, or are sometimes still used:
- Step function
- tanh (hyperbolic tangent)
- Leaky RELU
- ELU (Exponential Linear Unit)
- Softplus (not to be confused with “Softmax”)
Optimization of Activation Functions
In order to optimize the speed of the various activation function computations, several techniques are available:
- Choose a fast activation function (e.g. RELU)
- Choose an activation function without trainable parameters.
- Algebraic approximations of the activation function
- Precomputed lookup tables (sequential)
- Basic vectorization (e.g. with AVX operation sequences)
- Vectorization of the precomputed lookup tables (i.e. parallel LUTs)
- Kernel fusion (e.g. fuse the activation function computations back into a MatMul kernel).
Which activation function is the fastest? Why, it's RELU, of course. Its code looks more like a typo than real programming. Does RELU even deserve to be called a “function”?
The logic of RELU is simply to convert all negatives to zero, but leave positive values unchanged. This can be as fast as a sign bit test, making RELU the fastest activation to compute.
The other functions are “non-linear,” which is a cryptic way of saying “slooow.” GELU and SwiGLU usually need to be approximated to be efficient, or, even better, pre-calculated into a lookup table, assuming you're not using 32-bit float values (or maybe you're running with a 16GB precomputed LUT for all 32 bits).
Learned Activation Parameters
Some of the activation functions have “learned activation parameters” or “trainable activation parameters.” For example, Swish/SwiGLU has “alpha” and “beta” parameters that modify activation behavior. Not all activation functions have parameters (e.g. basic RELU and GELU don't), but those that do are technically called “adaptable activation functions.” These extra parameters are stored in the model like weights, but relate to the activation functions, whereas weights apply in matrix multiplications (e.g. linear layers in FFNs).
The advantage of using an activation with trainable parameters is that there is greater opportunity for the model to contain intelligence (i.e. perplexity, accuracy, finesse). The downside is that these activation function parameters require extra computation and also hamper certain optimizations that can be applied to simpler activation functions such as RELU.
The fact that there are granular parameters for the activation functions is quite limiting in terms of optimizing the speed of these parameterized activation functions. We cannot, for example, precompute the entire range of an adaptive activation function, because doing so would bake fixed values of the parameters into the table. Instead, we can precompute parts of the activation function formula, such as the calls to expf (to exponentiate the input numbers), but then we have to apply the activation parameters as separate, extra steps. These extra parameter computations can often be vectorized themselves, but it's still an extra step that isn't required with other choices of activation functions (did I mention RELU?).
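To make this concrete, here is a minimal sketch (not from the book's code) of a Swish-style activation with a trained beta parameter: the expensive exponentiation could come from a precomputed lookup table, but the beta scaling remains a separate per-element step. The helper lookup_expf_fp16 is a hypothetical stand-in.

#include <cmath>

// Hypothetical stand-in for a precomputed expf lookup table (see the lookup
// table section later in this chapter); a real version would index a table
// rather than calling expf().
static inline float lookup_expf_fp16(float x) { return expf(x); }

float aussie_swish_with_param_sketch(float x, float beta)
{
    float scaled = beta * x;               // Extra step: apply the learned parameter
    float e = lookup_expf_fp16(-scaled);   // Exponentiation (precomputable via LUT)
    return x / (1.0f + e);                 // Swish(x) = x * sigmoid(beta * x)
}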
Inputs, Outputs and Dimensions
A basic activation function is a “scalar-to-scalar” function that simply performs a mathematical operation, such as making the number positive with RELU. However, activation function components usually operate on the output of another Transformer component, which is typically a vector (or tensor). Hence, these activation function blocks are “vector-to-vector” operations, where the activation function is an “element-wise” operation on the vector (i.e. the activation function of one element is not dependent on the values of any other elements of the vector).
The vector-to-vector architecture may not be obvious if you look at any C++ code. You can run an activation function on the elements within any structure, including a 2-D matrix or 3-D tensor, and this is often done in optimized tensor operations. However, conceptually it is still acting on each vector. Furthermore, simple activation functions such as RELU will be “fused” back into a prior kernel operation (e.g. the preceding MatMul) to the point where it'll be hard to find the C++ code for the activation function itself.
The inputs to the activation components are vectors output by the calculations of other components (e.g. a logits vector). The dimension of an activation component's output is the same as its input dimension.
Most activation functions are “element-wise” in the sense that they only depend on a single number, one at a time, within the vector. Running an activation function on a vector (or matrix/tensor) is just doing the same thing on each element, and is easy to parallelize because each operation is unaffected by the other values. Their element-wise nature also makes them a good candidate for “kernel fusion” whereby activation functions are merged (“fused”) with the operation that preceded them for a speedup.
It is possible to define non-element-wise activation functions that depend on the other numbers in a vector, but that's usually done in other Transformer components. For example, normalization layers and the Softmax component are “vector-wise” and act on all the numbers in a vector together (e.g. scaling them all by the sum of all elements).
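As a minimal sketch of this vector-to-vector view, applying any scalar activation function element-wise to a vector is just a loop over its elements (the function-pointer version here is for illustration only):

// Apply a scalar activation function element-wise to a vector, in place.
// Any of the scalar activation functions in this chapter could be passed in.
void aussie_vector_apply_elementwise(float v[], int n, float (*elemfn)(float))
{
    for (int i = 0; i < n; i++) {
        v[i] = elemfn(v[i]);   // Each element depends only on its own value
    }
}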
RELU Activation Function
In terms of code, RELU is often written in research papers using the “max” function:
RELU(x) = max(0, x)
In real code, the max function isn't needlessly called, but a simpler test for negatives is used, such as an if statement:
float aussie_RELU_if_test_slow(float f)
{
    if (f <= 0.0f) return 0.0f;
    else return f;
}
Here's a faster macro version with the C++ ternary operator:
#define AUSSIE_RELU_MACRO(f) ( (f) <= 0.0f ? 0.0f : (f) )
The assignment of x to itself when it has a positive value can be avoided with logic such as:

#define RELUIZE1(x)  ((x) = AUSSIE_RELU_MACRO(x))       // Slower version
#define RELUIZE2(x)  if ((x) < 0.0f) { (x) = 0.0f; }    // If-then version
#define RELUIZE3(x)  ( (x) < 0.0f && ( (x) = 0.0f) )    // Short-circuiting
Even these are probably not the fastest way. A full implementation would use a bitwise test of the sign bit, relying on the IEEE 754 bit format of floating-point types, rather than the “<” less-than operator.
One way to access the sign bit is to use the standard C++ “signbit” function, which returns true or false depending on the floating-point sign bit. Hopefully it's using some fast assembly code in the implementation.
#define RELU4(x)     (std::signbit(x) ? 0.0f : (x))
#define RELUIZE4(x)  (std::signbit(x) && ( (x) = 0.0f))
// Multiply version (slow)
#define RELUIZE4b(x) ((x) *= (int)!std::signbit(x))
For greater efficiency, RELU should be “fused” back into a MatMul or other prior component via kernel operator fusion, so that the clearing of negatives is done incrementally during the prior calculation, when the value is already in fast memory. An example of a “fused RELU” is given under kernel fusion in Chapter 31.
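As a rough preview of the idea (a minimal sketch only; the full treatment is in Chapter 31), the negative-clearing step can be merged into the tail of a matrix-vector multiply, so each output value is RELU-ized while it is still in a register:

// Hedged sketch of a "fused RELU" at the end of a matrix-vector multiply.
void aussie_matvec_fused_relu_sketch(const float* m, const float* v,
                                     float* vout, int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++) {
            sum += m[r * cols + c] * v[c];      // Dot product of row r with v
        }
        vout[r] = (sum <= 0.0f) ? 0.0f : sum;   // Fused RELU on the result
    }
}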
RELU AVX SIMD Vectorization
The RELU function is applied element-wise to a vector of computed values. We can compute RELU in parallel on 4 float values (AVX-1) or 8 float values (AVX-2) using a SIMD “max” computation with a register full of zeros as the other operand. This implementation is shown in Chapter 17.
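Here is a minimal sketch of the 256-bit, 8-float version of that idea (the full implementation is in Chapter 17); it assumes the vector length is a multiple of 8 and that the intrinsics below are available:

#include <immintrin.h>

// Element-wise RELU on a float array via SIMD "max" against a register of zeros.
void aussie_vector_reluize_avx_sketch(float v[], int n)   // Assumes n % 8 == 0
{
    const __m256 zeros = _mm256_setzero_ps();   // 8 zeros
    for (int i = 0; i < n; i += 8) {
        __m256 vals = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned)
        vals = _mm256_max_ps(vals, zeros);      // max(x, 0) element-wise
        _mm256_storeu_ps(&v[i], vals);          // Store 8 results back
    }
}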
I'm not aware of a single-instruction hardware RELU implementation on x86 or other CPUs, although there may well be one. There are certainly various research papers on computing RELU in hardware. It's a simple computation: if the sign bit is on, then clear every bit to zero, else do nothing. For example, bitwise-AND all 32 bits against a mask formed from the inverted sign bit.
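Here is a minimal sketch of that bitwise approach in portable C++ (assuming 32-bit IEEE 754 floats, and using memcpy for well-defined type punning):

#include <cstdint>
#include <cstring>

// Bitwise RELU: clear every bit to zero if the sign bit is set, else do nothing.
float aussie_RELU_bitwise_sketch(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    uint32_t mask = (bits >> 31) - 1u;   // 0 if negative, all-ones if non-negative
    bits &= mask;                        // Negatives (including -0.0f) become +0.0f
    memcpy(&f, &bits, sizeof(bits));
    return f;
}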
GELU Activation Function
The term “GELU” usually means the “GELU-New” function, which is the product of x times the Gaussian Phi function of x, rather than “GELU-Old,” which is simply the Gaussian Phi function.
Here is the basic mathematical version of GELU (i.e. GELU-New), in unoptimized C++ code, according to the original paper:
float aussie_GELU_basic(float x)   // Basic Gaussian GELU (inefficient)
{
    float phival = 0.5 * (1.0 + erff(x / sqrt(2.0)));
    return x * phival;
}
Note that erff is the float version of the “error function” from statistics. The basic GELU arithmetic can be optimized by precomputing the reciprocal of sqrt(2.0), so as to multiply rather than divide, and by avoiding the use of a temporary variable.
Here's a slightly improved version:
float aussie_GELU_basic2(float x)   // Basic Gaussian GELU (still inefficient)
{
    // Once-only initialization
    static float s_reciprocal_sqrt_2_0 = 1.0f / sqrtf(2.0f);
    return x * (0.5 * (1.0 + erff(x * s_reciprocal_sqrt_2_0)));
}
To further optimize GELU, there are two approximations given in the original paper, which are examined later in this chapter. And the code can then be further optimized via calculation changes and table lookups, as shown in the GELU approximations example code.
GELU AVX SIMD Vectorization
The GELU function is an element-wise activation function on an input vector, so it is a good candidate for vectorization. However, the GELU function is complicated to compute in parallel, even though AVX has SIMD support for the error function (“erf”). Raw computation of GELU would require a multiply-by-scalar, an erf computation, a scalar addition, a scalar multiplication, and then an element-wise multiplication. We could do all of these with sequential AVX intrinsics, but with so many operations, that doesn't seem like a good plan. Hence, our best option is to use precomputation into a lookup table (LUT) in combination with AVX “gather” intrinsics to vectorize the table lookups.
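A minimal sketch of that approach is below (hedged: it assumes AVX2 gather support, that the inputs have already been converted into 32-bit integer table indices, e.g. from a 16-bit representation, and that a 65,536-entry precomputed GELU table is available):

#include <immintrin.h>

// Vectorized GELU via a precomputed lookup table and AVX2 "gather" intrinsics.
// Assumes n % 8 == 0 and that idx[] holds valid indices into lut[].
void aussie_GELU_lut_gather_sketch(const int idx[], float vout[], int n,
                                   const float lut[])
{
    for (int i = 0; i < n; i += 8) {
        __m256i vindex = _mm256_loadu_si256((const __m256i*)&idx[i]);  // 8 indices
        __m256 results = _mm256_i32gather_ps(lut, vindex, 4);  // Gather 8 floats (scale = 4 bytes)
        _mm256_storeu_ps(&vout[i], results);                   // Store 8 outputs
    }
}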
SiLU/Sigmoid Activation Function
The SiLU activation function uses the product of x and the sigmoid function of x. This is also equivalent to the Swish activation function, with its parameter beta set to 1. Here is the basic SiLU activation function in C++:
float aussie_SiLU_basic(float x)   // Basic SiLU (inefficient)
{
    // Sigmoid(x) = 1 / (1 + e^(-x))
    // SiLU(x) = x * sigmoid(x)
    //         = x * 1.0 / (1.0 + expf(-x))
    return x / (1.0f + expf(-x));
}
The SiLU function is inefficient by default. Its speed can be improved via table lookups and approximations.
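For example (a hedged sketch, not from the source), one well-known cheap sigmoid approximation replaces the exponential with x / (1 + |x|), trading some accuracy for speed:

#include <cmath>

// Approximate SiLU using a fast sigmoid approximation (no expf call).
float aussie_SiLU_fast_approx(float x)
{
    float sig = 0.5f * (x / (1.0f + fabsf(x)) + 1.0f);   // Approximate sigmoid
    return x * sig;                                      // SiLU = x * sigmoid(x)
}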
SiLU AVX SIMD Vectorization
The SiLU function has a sequence of operations that could be vectorized with AVX: negation, exponentiation, scalar addition, reciprocal, and multiplication. With so many distinct AVX operations required in sequence, even if parallelized across 4 or 8 float values, it will probably be slow. Hence, our best option is probably a vectorized lookup table instead.
SwiGLU/Swish Activation Function
The SwiGLU activation function has become widely used in commercial AI models. It is based on the “Swish” function, which has become a popular activation function in its own right, notably used in the Llama models from Meta. Swish is based on the sigmoid function, and is a generalization of the SiLU activation function, using an extra parameter (beta), in this formula:

Swish(x) = x * sigmoid(beta * x)

The beta parameter can be a constant or a trained activation function parameter. If beta equals 1, then the Swish function is simply the SiLU activation function (x times the sigmoid function). Swish has independent significance for other values of beta.
Here is the basic C++ to compute a simple Swish function:
float aussie_swish(float x, float beta)
{
    // SWISH = x * sigmoid(beta * x)
    return x * aussie_sigmoid(beta * x);
}
Here is the C++ version with the sigmoid function call flattened:
float aussie_swish2(float x, float beta)
{
    // SWISH = x * sigmoid(beta * x)
    // SIGMOID = 1 / (1 + e^-x)
    return x * (1.0f / (1.0f + expf(-(beta * x))));
}
ELU Activation Function
The ELU activation function was first designed by Clevert, Unterthiner & Hochreiter (2016). The ELU function is somewhat related to the RELU function, but ELU's output can go negative for input values below zero. It also has an extra hyperparameter, alpha, to give multiple versions of ELU, where the alpha parameter controls how fast it goes negative. Here's some example C++ code of a very basic ELU implementation according to the paper:
float aussie_ELU_basic(float x, float alpha_hyperparam)
{
    // Basic ELU activation (inefficient)
    // ELU = x                       if x > 0.0
    //     = alpha * (exp(x) - 1)    if x <= 0.0
    if (x <= 0.0) return alpha_hyperparam * (expf(x) - 1.0f);
    return x;   // x if x > 0.0
}
Precomputed Lookup Tables
One of the simplest methods to speed up activation functions is to use a precomputed table lookup. If you are quantizing to 16 bits, whether FP16 or INT16, then the input to the function is 16 bits, and there are only 2^16 = 65,536 different possible input values. Hence, your precomputed table for activation function calculations is only 65,536 x 2 = 128KB for outputs in 16-bit precision (i.e. FP16/INT16 outputs are 2 bytes each), or 256KB for 32-bit precision outputs (FP32/INT32). However, it's not quite so good for non-quantized 32-bit values (i.e. FP32 or INT32), since 2^32 is about 4 billion possible values, each ideally storing a 4-byte result, so that precomputed table needs about 4 billion x 4 = 16 Gigabytes of RAM.
The simplest way to build the pre-computed table for an activation function is to do it dynamically at initialization-time. Simply call your non-optimized activation function code, get the results, and store it in the precomputed table. The idea is:
for (int i = 0; i < 65536; i++) {   // INT16 only
    g_global_GELU_table[i] = aussie_GELU_slow(i);
}
When you access the pre-computed table, it's simply an indexed access for a global array. So, the fast, precomputed version of GELU is a table lookup like this:
// INT16 only, doesn't work for FP16!
gelu_value = g_global_GELU_table[x];
It all sounds very easy, but there are some weird wrinkles.
Firstly, C++ doesn't have a portable, standardized native 16-bit float type, so you have to mess around with either declaring your own type (as a class), trying to use platform-specific extensions (e.g. “__fp16”, “_Float16” or C++23's “std::float16_t”), or hacking it by using the “short int” 16-bit integer type.
Secondly, the above code examples require integer inputs and don't really work in practice for FP16. Making it work requires a lot of bit-hacking to convert back-and-forth between unsigned integers and floating-point numbers. Here's the idea for 32-bit float variables (i.e. non-quantized FP32):
float g_global_GELU_table_FP32[1ULL << 32];   // ~4 billion entries
// ...
void aussie_GELU_setup_table_FP32()   // Initialize GELU precomputed table
{
    unsigned long long i64 = 0;   // Has to be 64-bit!
    yassert(sizeof(i64) > 4);
    for (; i64 < (1ULL << 32); i64++) {
        unsigned int i32 = (unsigned int)i64;   // 64-bit to 32-bit
        float f32 = *(float*)&i32;              // Type-cast trick
        g_global_GELU_table_FP32[i32] = aussie_GELU_basic2(f32);   // FP32
    }
}
And the fast GELU lookup table version for FP32 becomes:
float gelu_fast_FP32(float f)   // Table lookup GELU
{
    unsigned int i32 = *(unsigned int*)&f;   // Reinterpret the float's bits as an index
    return g_global_GELU_table_FP32[i32];    // FP32 version
}
In this FP32 case, the loop iterator has to be a 64-bit integer type, otherwise the loop will be infinite because a 32-bit counter will overflow and wrap around to zero without ever failing the loop test. In reality, the above code doesn't actually work on a low-end laptop with the MSVS C++ compiler, because it requires an array of size 16GB, so it needs a more powerful box.
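For comparison, here is a minimal sketch of the 16-bit case, assuming the activations are stored as raw FP16 bit patterns in a uint16_t (building the table still requires an FP16-to-FP32 conversion, not shown here):

#include <cstdint>

static float g_gelu_table_fp16[1 << 16];   // 65,536 precomputed outputs (256KB)

float gelu_fast_fp16_bits(uint16_t fp16bits)   // Table lookup GELU for FP16
{
    return g_gelu_table_fp16[fp16bits];   // The raw bits are the table index
}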
Load-Time Precompilation
If you're really fancy, you can write code to use once-only (forever!), which similarly repeatedly calls the inefficient exact version of your activation function in a big loop, but then spits out a C++ source file with all of those result numbers in it. I don't mean a binary data file. I really do mean a full C++ source file, with a declaration of your global array type variable at the top (and a nice comment!), and then a lot of numbers in plain text with commas between them. You then compile and link this C++ source file with your other code. After that's checked into source control, you can go tell your boss that you've just added a few hundred million SLOC to the project, and ask for a bonus.
Once this is compiled in, you don't even need to call anything during runtime initialization. The numbers are already pre-compiled into a global array variable which is simply loaded from the linked executable file at load-time.
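Here is a minimal sketch of such a generator for an INT16-style table (hedged: the filename handling and formatting details are illustrative only, and the signed 16-bit reinterpretation assumes the usual two's-complement wraparound):

#include <cstdio>

float aussie_GELU_basic2(float x);   // Slow exact version (from earlier)

// Once-only generator: emit a C++ source file containing the precomputed table.
void aussie_generate_gelu_table_source(const char* filename)
{
    FILE* fp = fopen(filename, "w");
    if (!fp) return;
    fprintf(fp, "// Precomputed GELU table (auto-generated, do not edit)\n");
    fprintf(fp, "float g_global_GELU_table[65536] = {\n");
    for (int i = 0; i < 65536; i++) {
        float result = aussie_GELU_basic2((float)(short int)i);   // INT16 semantics
        fprintf(fp, "%.9gf,%s", result, (i % 8 == 7) ? "\n" : " ");
    }
    fprintf(fp, "};\n");
    fclose(fp);
}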
Approximating Activation Functions
Some non-linear activation functions have a non-trivial computation cost, such as sigmoid, GELU, and tanh. Improvements have included using simpler functions (e.g. RELU), mathematical approximations for calculating the non-linear functions, or optimization techniques such as table lookups.
Example: GELU Approximations #1: The original GELU paper proposed two alternative mathematical approximations. Here's a naive C++ implementation of the first one:
float aussie_GELU_approx1(float f)   // Approximated Gaussian GELU
{
    // GELU paper approx #1 = 0.5 * x *
    //    ( 1 + tanh( sqrt(2/PI) * (x + 0.044715 * x^3) ) )
    return 0.5f * f *
        (1.0f + tanhf(sqrtf(2.0f / AUSSIE_PI) *
            (f + (0.044715f * (f * f * f)))));
}
The first obvious improvement is to avoid re-calculating constant values.
float aussie_GELU_approx1_optimized(float f)
{
    // Approximated Gaussian GELU #1 (minor optimizations)
    static float s_sqrt_2_div_pi = sqrtf(2.0f / AUSSIE_PI);
    return 0.5f * f *
        (1.0f + tanhf(s_sqrt_2_div_pi *
            (f + (0.044715f * (f * f * f)))));
}
This code can be further improved with minor changes to the arithmetic computations.
In particular, some algebraic manipulation allows us to avoid the “x*x*x” term and reduce the number of multiplications:
float aussie_GELU_approx1_optimized2(float f)
{
    // Approximated Gaussian GELU #1 (minor optimizations)
    // Optimize by factoring out f
    // Reduces x*x*x to x*x
    static float s_sqrt_2_div_pi = sqrt(2.0 / AUSSIE_PI);
    return 0.5f * f *
        (1.0f + tanhf(s_sqrt_2_div_pi * f *
            (1.0f + (0.044715f * (f * f)))));
}
Example: GELU Approximations #2: The second approximation suggested in the original GELU paper is much simpler, based on the sigmoid function, but a little less accurate. Here's a C++ version without any optimizations:
float aussie_sigmoid(float x)
{
    // SIGMOID = 1 / (1 + e^-x)
    return 1.0f / (1.0f + expf(-x));
}

float aussie_GELU_approx2(float x)   // Approx GELU #2
{
    // GELU paper approx #2 = x * sigmoid(1.702 * x)
    return x * aussie_sigmoid(1.702f * x);
}
But the use of two functions is needlessly expensive (although declaring them as “inline” might help), and can be optimized by flattening the call hierarchy. By merging the code of the sigmoid function into the GELU code, there are also opportunities to reduce the number of arithmetic operations.
float aussie_GELU_approx2b(float x)   // Approximated Gaussian GELU #2b
{
    // GELU paper approx #2 = x * sigmoid(1.702 * x)
    // return x * 1.0 / (1.0 + expf(-(1.702 * x)));
    return x / (1.0f + expf(-1.702f * x));
}
All of these GELU approximations could be easily beaten by changing to a table lookup method. And there's not much point in using a faster approximation when building the precomputed LUT, since those calculations only run once at startup (use the exact version instead).
Activation Function Research
There are several avenues for speedups in the use of activation functions that have received research attention:
- Approximations
- Integer-only activation functions
- Pruning activation function components
- Reordering activation components
Approximations. Some approximations have been examined above in this chapter. Various research exists that examines approximations for each of the different types of activation function.
Integer-only Activation Functions. One particular type of approximation is the change to integer-only arithmetic. This is trivial for RELU, but problematic for the other non-linear activation functions. This is also required to achieve end-to-end integer arithmetic in Transformers, which is an ongoing area of research.
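For instance, the integer-only RELU is a one-liner (a trivial sketch for INT8 quantized values); the real research problem is integer-only versions of the non-linear functions:

#include <cstdint>

// Integer-only RELU for INT8 quantized activations.
inline int8_t aussie_RELU_int8(int8_t x)
{
    return (x < 0) ? 0 : x;
}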
Pruning Activation Functions. In some cases, the activation function component can be removed, or “pruned,” so that the original calculated values are used without modification. For example, in the vanilla FFN component, removing the interleaved activation function from between the two linear layers creates what is termed a “bilinear layer” component (i.e. two matrix multiplications without any activation function). Research has examined the situations when this is possible.
Activation Function Reordering. The standard Transformer has an activation function in between the two linear layers of the FFN. Early research has examined “pre-activation” versus “post-activation” inside the FFN for different effects.