Aussie AI

Activation Function Optimization

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Activation functions in neural networks can be optimized in various ways. Linear activation functions (e.g. RELU) are more efficient than non-linear functions (e.g. GELU). Non-linear activation functions can be optimized through approximations. Even linear activation functions can be made faster by fusing them into a MatMul (see kernel operator fusion).

The RELU backlash. Although RELU has long been known as the most efficient activation function (speed-wise), the big models seem not to use it any more. For example, the Meta Llama models and the Mistral models use more complex non-linear activation functions (e.g. Swish/SwiGLU) rather than RELU. This makes speed optimizations more difficult, because RELU is not only fast to compute, but also induces sparsity in the activations at runtime. Optimizations such as negative skipping that rely on RELU are also unavailable.

Activation Function Alternatives

Various functions have been tested as activations, both linear and non-linear. Common examples include:

  • sigmoid/SiLU
  • tanh (hyperbolic tangent)
  • RELU (Rectified Linear Unit)
  • Leaky RELU
  • ELU (Exponential Linear Unit)
  • Gaussian/GELU
  • Swish/SwiGLU
  • Softplus (not to be confused with Softmax)

Efficiency of Activation Functions

Which activation function is the fastest? Why, it's RELU, of course. I mean, it's more like a typo than some real coding. Does RELU even deserve to be called a "function"?

The logic of RELU is simply to convert all negatives to zero, but leave positive values unchanged. This can be as fast as a sign bit test, making RELU the fastest activation to compute.

The other functions are "non-linear" which is a cryptic way of saying "slooow". GELU and SwiGLU usually need to be approximated to be efficient, or ideally pre-calculated in a lookup table if you're not using 32-bit floats.

Example: RELU Activation Function

In terms of code, RELU is often written in research papers using the "max" function:

    RELU = max(0,x)

In real code, there's no need to actually call the max function; a simpler test for negatives is used instead, such as an if statement:

    float yapi_RELU_if_test_slow(float f)
    {
        if (f <= 0.0f) return 0.0f;
        else return f;
    }

Here's a faster macro version with the C++ ternary operator:

    #define YAPI_RELU_MACRO(f)  ( (f) <= 0.0 ? 0.0 : (f) )

The assignment of x to itself when it has a positive value can be avoided with logic such as:

    #define RELUIZE1(x) ( (x) = YAPI_RELU_MACRO(x))     // Slower version
    #define RELUIZE2(x) if ( (x) < 0.0) { (x) = 0.0; }  // If-then version
    #define RELUIZE3(x) ( (x) < 0.0 && ( (x) = 0.0) )   // Short-circuited operator version

Even this is not the fastest way. A full implementation would use a sign bit test on the IEEE 754 bit format of floating point types, rather than the "<" less-than operator. And then even this would be "fused" back into a MatMul via kernel operator fusion, so that the clearing of negatives is done incrementally during the prior calculation, when the value is already in fast memory.
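
For illustration, here's a rough sketch of the sign-bit idea for 32-bit floats (the function name and the memcpy-based bit reinterpretation are illustrative choices, not the only way to do it):

    #include <cstring>   // memcpy

    float yapi_RELU_signbit(float f)   // RELU via an IEEE 754 sign bit test
    {
        unsigned int bits = 0;
        memcpy(&bits, &f, sizeof(bits));       // Reinterpret the float's bits
        if (bits & 0x80000000u) return 0.0f;   // Sign bit set: negative (or -0.0), so clear it
        return f;                              // Positive values pass through unchanged
    }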

Example: GELU Activation Function

Here is the basic mathematical version of GELU, in unoptimized C++ code, according to the original paper:

    float yapi_GELU_basic(float x)   // Basic Gaussian GELU (inefficient)
    {
	float phival = 0.5 * (1.0 + erff(x / sqrt(2.0)));   // NOTE: erff() is float version of erf() "error function"
	return x * phival;
    }

The basic GELU arithmetic can be optimized by precomputing "sqrt(2.0)", using its reciprocal so as to multiply rather than divide, and avoiding the use of a temporary variable. Here's a slightly improved version:

    float yapi_GELU_basic2(float x)   // Basic Gaussian GELU (still inefficient)
    {
	static float s_reciprocal_sqrt_2_0 = 1.0f / sqrtf(2.0f);  // Once-only initializations
	return x * ( 0.5 * (1.0 + erff(x * s_reciprocal_sqrt_2_0)));
    }

To further optimize GELU, there are two approximations given in the original paper. And the code can then be further optimized via calculation changes and table lookups, as shown in the GELU approximations example code.

Example: SiLU Activation Function

Here is the basic SiLU activation function in C++:

    float yapi_SiLU_basic(float x)   // Basic SiLU (inefficient)
    {
        // Sigmoid(x) = 1 / (1 + e^(-x))
        // SiLU(x) = x * Sigmoid(x)
        //         = x * 1.0 / (1.0 + expf(-x))
        return x / (1.0f + expf(-x));
    }

The SiLU function is inefficient by default. Its speed can be improved via table lookups and approximations.

Example: ELU Activation Function

The ELU activation function was first designed by Clevert, Unterthiner & Hochreiter (2016). The ELU function is somewhat related to the RELU function, but ELU's output can go negative for input values below zero. It also has an extra hyperparameter, alpha, giving multiple versions of ELU, where alpha controls how deeply the output goes negative (it saturates at -alpha for large negative inputs). Here's some example C++ code of a very basic ELU implementation according to the paper:

    float yapi_ELU_basic(float x, float alpha_hyperparam)   // Basic ELU activation (inefficient)
    {
        // ELU = x                          if x > 0.0
        //     = alpha * (exp(x) - 1.0)     if x <= 0.0
        if (x <= 0.0f) return alpha_hyperparam * (expf(x) - 1.0f);
        return x;  // x if x > 0.0
    }

Precomputed Table Lookup for Activation Functions

One of the simplest methods to speed up activation functions is to use a pre-computed table lookup. If you are quantizing to 16 bits, whether FP16 or INT16, then the input to the function is 16 bits, and there are only 2^16=65,536 different possible input values. Hence, your precomputed table for activation function calculations is only 65,536x2 bytes = 128KB for outputs in 16-bit precision (i.e. FP16/INT16 outputs are 2 bytes each), or 256KB for 32-bit precision outputs (FP32/INT32). However, it's not nearly so good for non-quantized 32-bit values (i.e. FP32 or INT32), since 2^32 is about 4 billion possible values, each ideally storing a 4-byte result, so it's 4 billion x 4 bytes = 16 Gigabytes of RAM for that precomputed table.

The simplest way to build the pre-computed table for an activation function is to do it dynamically at initialization-time. Simply call your non-optimized activation function code, get the results, and store it in the precomputed table. The idea is:

    for (int i = 0; i < 65536; i++) {
        g_global_GELU_table[i] = yapi_GELU_slow((short)i);  // INT16 only: the table index is the raw 16-bit pattern
    }

When you access the pre-computed table, it's simply an indexed access for a global array. So the fast, precomputed version of GELU is a table lookup like this:

    gelu_value = g_global_GELU_table[(unsigned short)x];   // INT16 only, doesn't work for FP16!

It all sounds very easy, but there's some weird wrinkles. Firstly, C++ doesn't have a portable standardized native 16-bit float type ("float" is 32-bit, and "double" is 64-bit), so you have to mess around with either declaring your own type (as a class), trying to use platform-specific extensions (e.g. "__fp16", "_Float16" or C++23's "std::float16_t") or hacking it by using the "short int" 16-bit integer type.
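
As an illustration only, one way to paper over the type issue is a typedef guarded by feature-test macros. Treat the macro names below as assumptions to verify against your particular compiler's documentation:

    // Sketch: pick a 16-bit float type, falling back to raw 16-bit storage.
    #if defined(__STDCPP_FLOAT16_T__)
        #include <stdfloat>
        typedef std::float16_t yapi_fp16;   // C++23 standard type
    #elif defined(__FLT16_MANT_DIG__)
        typedef _Float16 yapi_fp16;         // GCC/Clang extension on supported targets
    #else
        typedef unsigned short yapi_fp16;   // Fallback: 16-bit storage only, no arithmetic
    #endif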

Secondly, the table-building code examples above don't really work in practice for FP16, and making them work requires a lot of bit-hacking to convert back and forth between unsigned integers and floating point numbers. Here's the idea for 32-bit float variables (i.e. non-quantized FP32):

    float g_global_GELU_table_FP32[1ULL<<32 /*~4 billion*/];  // Is this really a good idea?
    ...
    void yapi_GELU_setup_table_FP32() // Initialize GELU precomputed table
    {
        unsigned long int i64 = 0;  // Has to be "long"! (and 64 bits)
        yassert(sizeof(i64) > 4);
        for (; i64 < (1ULL << 32); i64++) {
            unsigned int i32 = (unsigned int)i64;  // Switch down from 64-bit to 32-bit (unsigned, so the index never goes negative)
            float f32 = *(float*)&i32;  // Type-cast bit trick to get the float
            g_global_GELU_table_FP32[i32] = yapi_GELU_basic2(f32);  // FP32
        }
    }

And the fast GELU lookup table version for FP32 becomes:

    float gelu_fast_FP32(float f)    // Table lookup GELU
    {
        unsigned int u32 = *(unsigned int*)&f;  // The float's bit pattern becomes the table index
        return g_global_GELU_table_FP32[u32];   // FP32 version
    }

In this FP32 case, the loop iterator has to be a 64-bit integer ("long" on most platforms), otherwise the loop will be infinite, because a 32-bit counter will overflow and wrap around to zero without ever failing the loop test. Note also that the 32-bit table index needs to be unsigned, or else half of the float bit patterns would produce a negative array index.

In reality, the above code doesn't actually work on a standard Windows box (where "long" is only 32 bits, and a 16 Gigabyte global array is asking a lot). And we probably don't want the FP32 version anyway, but doing this for FP16 is even more of a mess!
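
For what it's worth, here's a rough sketch of what the FP16 version might look like, assuming a compiler that supports the non-standard "_Float16" type (the names and the memcpy-based bit conversions here are illustrative, not from the original):

    #include <cstring>   // memcpy

    _Float16 g_global_GELU_table_FP16[1 << 16];   // 65,536 entries (128KB)

    void yapi_GELU_setup_table_FP16()   // Initialize the FP16 GELU precomputed table
    {
        for (unsigned int i = 0; i < (1u << 16); i++) {
            unsigned short u16 = (unsigned short)i;   // Every possible 16-bit pattern
            _Float16 f16;
            memcpy(&f16, &u16, sizeof(f16));          // Reinterpret the bits as FP16
            g_global_GELU_table_FP16[i] = (_Float16)yapi_GELU_basic2((float)f16);
        }
    }

    _Float16 yapi_GELU_fast_FP16(_Float16 f)   // Table lookup GELU (FP16)
    {
        unsigned short u16 = 0;
        memcpy(&u16, &f, sizeof(u16));   // The input's bit pattern is the table index
        return g_global_GELU_table_FP16[u16];
    }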

Load-Time Precompilation

If you're really fancy, you can write once-only code (run it once, keep the output forever!) that similarly calls the inefficient exact version of your activation function in a big loop, but then spits out a C++ source file with all of those result numbers in it. I don't mean a binary data file. I really do mean a full C++ source file, with a declaration of your global array variable at the top (and a nice comment!), and then a lot of numbers in plain text with commas between them. You then compile and link this C++ source file with your other code. After that's checked into source control, you go tell your boss that you just added a few hundred million SLOC to the project, and ask for a raise.
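
Here's a rough sketch of what such a generator might look like for the 16-bit integer case (the output file name and function name are illustrative assumptions; it reuses yapi_GELU_basic2 from above):

    #include <cstdio>    // fopen, fprintf, fclose

    // Once-only generator: writes a C++ source file containing the precomputed
    // GELU table as a global array (INT16 inputs, for brevity).
    void yapi_generate_GELU_table_source()
    {
        FILE* fp = fopen("gelu_table_int16.cpp", "w");
        if (!fp) return;   // Could not create the output file
        fprintf(fp, "// Auto-generated precomputed GELU table -- do not edit\n");
        fprintf(fp, "float g_global_GELU_table[65536] = {\n");
        for (int i = 0; i < 65536; i++) {
            float result = yapi_GELU_basic2((float)(short)i);   // Index is the raw 16-bit pattern
            fprintf(fp, "%.9ef,%s", result, (i % 8 == 7) ? "\n" : " ");   // 8 values per line
        }
        fprintf(fp, "};\n");
        fclose(fp);
    }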

Once this is compiled in, you don't even need to call anything during runtime initialization. The numbers are already pre-compiled into a global array variable which is simply loaded from the linked executable file at load-time. There's currently not a good FP16 type name built into C++, i.e. no "short float" or "fp16" type. C++23 defines "std::float16_t" so this will probably work in the near future. The IEEE 754 standard has long ago defined a "float16" as having a 5-bit exponent and 11-bit mantissa (adding a bit for the sign bit, but subtracting an implied bit in the mantissa, gets us to 16 bits total). For comparison, the standard 32-bit float has 1 sign bit, 8 exponent bits, and 24 mantissa bits (only 23 are stored).
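
As a quick illustration of that bit layout, here's a small sketch that pulls apart the three stored FP16 fields (the function name is illustrative; the masks follow the 1/5/10 stored-bit layout just described):

    // Decompose a 16-bit IEEE 754 half-float bit pattern into its fields.
    // Stored layout: 1 sign bit, 5 exponent bits, 10 mantissa bits (the 11th is implied).
    void yapi_fp16_fields(unsigned short u16, int* sign, int* exponent, int* mantissa)
    {
        *sign     = (u16 >> 15) & 0x1;     // Top bit
        *exponent = (u16 >> 10) & 0x1F;    // Next 5 bits (biased by 15)
        *mantissa = u16 & 0x3FF;           // Low 10 bits (implicit leading 1 for normals)
    }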

Approximate Activation Functions

Some non-linear activation functions have a non-trivial computation cost, such as sigmoid, GELU, and tanh. General ways to use approximations include:

  • Mathematically-close approximations (e.g., for GELU)
  • Integer-only activation functions
  • Low-bit integer activation functions (i.e., with quantized kernels)

Other improvements have also included using simpler functions (e.g. RELU), mathematical approximations for calculating the non-linear functions, or optimization techniques such as precomputed table lookups.

Example: GELU Approximations #1

The original GELU paper proposed two alternative mathematical approximations. Here's a naive C++ implementation of the first one:

    float yapi_GELU_approx1(float f)   // Approximated Gaussian GELU
    {
        // GELU paper approx #1 = 0.5 * x * ( 1 + tanh ( sqrt(2/PI) * (x + 0.044715 * x^3) ) )

        return 0.5f * f * (1.0f + tanhf(sqrtf(2.0f / YAPI_PI) * (f + (0.044715f * (f * f * f)))));
    }

The first obvious improvement is to avoid re-calculating constant values.

    float yapi_GELU_approx1_optimized(float f)   // Approximated Gaussian GELU (with minor optimizations)
    {
        // GELU paper approx #1 = 0.5 * x * ( 1 + tanh ( sqrt(2/PI) * (x + 0.044715 * x^3) ) )
        static float s_sqrt_2_div_pi = sqrtf(2.0f / YAPI_PI);
        return 0.5f * f *
                (1.0f + tanhf(s_sqrt_2_div_pi *
                        (f + (0.044715f * (f * f * f))))
                );
    }

This code can be further improved with minor changes to the arithmetic computations. In particular, some algebraic manipulation allows us to avoid the "x*x*x" term, and reduce multiplications:

    float yapi_GELU_approx1_optimized2(float f)   // Approximated Gaussian GELU (with 2nd minor optimizations)
    {
        // GELU paper approx #1 = 0.5 * x * ( 1 + tanh ( sqrt(2/PI) * (x + 0.044715 * x^3) ) )
        // Optimize by factoring out one multiplication by f (reducing x*x*x to x*x)
        static float s_sqrt_2_div_pi = sqrtf(2.0f / YAPI_PI);
        return 0.5f * f *
                (1.0f + tanhf(s_sqrt_2_div_pi * f *
                        (1.0f + (0.044715f * (f * f))))
                );
    }

Example: GELU Approximations #2

The second approximation suggested in the original GELU paper is much simpler, based on the sigmoid function, but a little less accurate. Here's a C++ version without any optimizations:

    float yapi_sigmoid(float x)
    {
	// SIGMOID = 1 / ( 1 + e^-x)
	return 1.0f / (1.0f + expf(-x));
    }

    float yapi_GELU_approx2(float x)   // Approximated Gaussian GELU #2
    {
	// GELU paper approx #2 = x * sigmoid (1.702 * x) 
	return x * yapi_sigmoid(1.702f * x);
    }

But the use of two functions is needlessly expensive (although declaring them as "inline" might help), and can be optimized by flattening the call hierarchy. By merging the code of the sigmoid function into the GELU code, there are also opportunities to reduce the number of arithmetic operations.

    float yapi_GELU_approx2b(float x)   // Approximated Gaussian GELU #2b
    {
	// GELU paper approx #2 = x * sigmoid (1.702 * x) 
	// return x * 1.0 / (1.0 + expf(-(1.702 * x)));
	return x / (1.0f + expf(-1.702f * x));
    }

All of these GELU approximations could be easily beaten by changing to a table lookup method.

Research on Activation Approximations

Various research papers on activation function approximation are listed in the sections below.

Integer-only Activation Functions

One particular type of approximation is the change to integer-only arithmetic. This has become quite standard in quantized kernels, especially activation quantization, since the activation functions are working on an integer with a smaller number of bits.
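
As a simple illustration of the idea, an 8-bit quantized kernel can reduce any activation function to a 256-entry integer lookup table, built once from the quantization scale. The names and the symmetric-quantization assumption below are illustrative, and the table reuses yapi_GELU_basic2 from above:

    #include <cstdint>

    // Sketch: integer-only GELU for INT8 activations via a 256-entry lookup table.
    // Assumes symmetric quantization: real value = scale * int8 value.
    int8_t g_gelu_int8_table[256];

    void yapi_build_gelu_int8_table(float scale)   // Built once, offline or at load time
    {
        for (int i = -128; i <= 127; i++) {
            float result = yapi_GELU_basic2((float)i * scale);   // Dequantize, apply exact GELU
            int q = (int)(result / scale + (result >= 0.0f ? 0.5f : -0.5f));   // Requantize with rounding
            if (q > 127) q = 127;      // Clamp to the INT8 range
            if (q < -128) q = -128;
            g_gelu_int8_table[(uint8_t)i] = (int8_t)q;   // Index by the raw byte pattern
        }
    }

    int8_t yapi_gelu_int8(int8_t x)   // At inference time: a pure integer table lookup
    {
        return g_gelu_int8_table[(uint8_t)x];
    }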

Some of the earlier research papers with integer-based activation functions are below. See also the overview of integer arithmetic in neural networks.

Pruning Activation Functions

The activation function can be removed, or "pruned", so that the original calculated values are used. For example, removing the interleaved activation function from between the vanilla FFN's two linear layers creates what is termed a "bilinear layer".
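
As a tiny sketch of what this means in code, here is a single-vector FFN with the activation pruned (illustrative names, row-major weights, no biases or batching):

    // Bilinear FFN layer: out = W2 * (W1 * x), with no activation in between.
    void yapi_FFN_bilinear(const float* W1, const float* W2, const float* x,
                           float* hidden, float* out, int d_model, int d_ff)
    {
        for (int i = 0; i < d_ff; i++) {        // First linear layer: hidden = W1 * x
            float sum = 0.0f;
            for (int j = 0; j < d_model; j++) sum += W1[i * d_model + j] * x[j];
            hidden[i] = sum;   // A vanilla FFN would apply GELU/RELU here; bilinear skips it
        }
        for (int i = 0; i < d_model; i++) {     // Second linear layer: out = W2 * hidden
            float sum = 0.0f;
            for (int j = 0; j < d_ff; j++) sum += W2[i * d_ff + j] * hidden[j];
            out[i] = sum;
        }
    }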

Research on removing activation functions or bilinear layers:

  • Noam Shazeer, Feb 2020, GLU Variants Improve Transformer, https://arxiv.org/pdf/2002.05202.pdf (Examines GLUs including bilinear layers.)
  • Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, 2016, Language modeling with gated convolutional networks. CoRR, abs/1612.08083, http://arxiv.org/abs/1612.08083 (Also suggests bilinear layers without an activation function.)
  • Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th international conference on Machine learning, pages 641–648, 2007, PDF: https://www.cs.toronto.edu/~amnih/papers/threenew.pdf (Prunes the activation function in the FFN to create a "bilinear layer".)

Activation Function Reordering

The standard Transformer has an activation function in between the two linear layers of the FFN. Earlier research examined "pre-activation" versus "post-activation" for different effects. Note that this is a different issue to the placement of normalization blocks, which is the "pre-norm" versus "post-norm" choice.

Here are some research papers on where to place the activation function.

Research on Activation Function Optimizations

Research papers on activation functions in neural networks:

RELU

Research on RELU:

GELU

Research on GELU:

  • Z Zou, C Zhang, S Chen, H Kou, B Liu, March 2024, Integer Arithmetic-Based and Activation-Aware GELU Optimization for Vision Transformer, 2024 Conference of Science and Technology for Integrated Circuits (CSTIC), 17-18 March 2024, https://ieeexplore.ieee.org/abstract/document/10531966/
  • David Spuler, March 2024, Chapter 21. Activation Functions, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • Andrea Belano, Yvan Tortorella, Angelo Garofalo, Luca Benini, Davide Rossi, Francesco Conti, 9 Dec 2024, A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU, https://arxiv.org/abs/2412.06321

GELU Approximation

Research on GELU approximation methods:

  • Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
  • M Huang, J Luo, C Ding, Z Wei, S Huang, H Yu, Oct 2023, An Integer-Only and Group-Vector Systolic Accelerator for Efficiently Mapping Vision Transformer on Edge, IEEE Transactions on Circuits and Systems I: Regular Papers ( Early Access ), https://ieeexplore.ieee.org/abstract/document/10288182/
  • Y Wu, Z Wang, WD Lu, Oct 2023, PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers https://arxiv.org/pdf/2310.09385.pdf
  • Y Dong, W Lu, Y Zheng, H Wu, D Zhao, J Tan, July 2023, PUMA: Secure Inference of LLaMA-7B in Five Minutes, https://arxiv.org/abs/2307.12533
  • Mohammadreza Tayaranian, Seyyed Hasan Mozafari, James J. Clark, Brett Meyer, Warren Gross, 2 Feb 2024, Faster Inference of Integer SWIN Transformer by Removing the GELU Activation, https://arxiv.org/abs/2402.01169 (Replace GELU with RELU.)
  • W. Wang, W. Sun and Y. Liu, "Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3488572. https://ieeexplore.ieee.org/abstract/document/10738457
