Aussie AI
Activation Function Optimization
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Activation functions in neural networks can be optimized in various ways. Linear activation functions (e.g. RELU) are more efficient than non-linear functions (e.g. GELU). Non-linear activation functions can be optimized through approximations. Even linear activation functions can be made more efficient by fusing them into a MatMul (see kernel operator fusion).
The RELU backlash. Although RELU has long been known as the most efficient activation function in terms of speed, the big models seem not to use it any more. For example, the Meta Llama models and Mistral models use more complex non-linear activation functions (e.g. Swish) rather than RELU. This makes speed optimizations more difficult, because RELU is not only fast to compute, but also induces sparsity in the activations at runtime. Optimizations such as negative skipping that rely on RELU are also unavailable.
Activation Function Alternatives
Various functions have been tested as activations, both linear and non-linear. Common examples include:
- sigmoid/SiLU
- tanh (hyperbolic tangent)
- RELU (Rectified Linear Unit)
- Leaky RELU
- ELU (Exponential Linear Unit)
- Gaussian/GELU
- Swish/SwiGLU
- Exponential function/ELU
- Softplus (not to be confused with Softmax)
Efficiency of Activation Functions
Which activation function is the fastest? Why, it's RELU, of course. It's so simple that it's more like a typo than real coding. Does RELU even deserve to be called a "function"?
The logic of RELU is simply to convert all negatives to zero, but leave positive values unchanged. This can be as fast as a sign bit test, making RELU the fastest activation to compute.
The other functions are "non-linear" which is a cryptic way of saying "slooow". GELU and SwiGLU usually need to be approximated to be efficient, or ideally pre-calculated in a lookup table if you're not using 32-bit floats.
Example: RELU Activation Function
In terms of code, RELU is often written in research papers using the "max" function:
RELU = max(0,x)
In real code, the max function isn't needlessly called, but a simpler test for negatives is used, such as an if statement:
float yapi_RELU_if_test_slow(float f)
{
    if (f <= 0.0) return 0.0;
    else return f;
}
Here's a faster macro version with the C++ ternary operator:
#define YAPI_RELU_MACRO(f) ( (f) <= 0.0 ? 0.0 : (f) )
The assignment of x to itself when it has a positive value can be avoided with logic such as:
#define RELUIZE1(x)  ( (x) = YAPI_RELU_MACRO(x) )      // Slower version
#define RELUIZE2(x)  if ( (x) < 0.0) { (x) = 0.0; }    // If-then version
#define RELUIZE3(x)  ( (x) < 0.0 && ( (x) = 0.0) )     // Short-circuited operator version
Even this is not the fastest way. A full implementation would use a sign bit test on the IEEE 754 bit format of floating point types, rather than the "<" less-than operator. And then even this would be "fused" back into a MatMul via kernel operator fusion, so that the clearing of negatives is done incrementally during the prior calculation, when the value is already in fast memory.
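For illustration, here is a minimal sketch of a sign-bit RELU, assuming 32-bit IEEE 754 floats. The function name is hypothetical, and it uses the same type-cast bit trick as the table-lookup code later in this article (std::bit_cast or memcpy is the strictly portable alternative):

inline float yapi_RELU_signbit(float f)   // Hypothetical sign-bit RELU sketch
{
    unsigned int bits = *(unsigned int*)&f;   // Reinterpret the float's 32 bits
    if (bits & 0x80000000u) return 0.0f;      // Sign bit set: negative (or -0.0) becomes zero
    return f;                                 // Sign bit clear: value passes through unchanged
}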
Example: GELU Activation Function
Here is the basic mathematical version of GELU, in unoptimized C++ code, according to the original paper:
float yapi_GELU_basic(float x)   // Basic Gaussian GELU (inefficient)
{
    float phival = 0.5 * (1.0 + erff(x / sqrt(2.0)));   // NOTE: erff() is the float version of erf(), the "error function"
    return x * phival;
}
The basic GELU arithmetic can be optimized by precomputing "sqrt(2.0)", using its reciprocal so as to multiply rather than divide, and avoiding the use of a temporary variable. Here's a slightly improved version:
float yapi_GELU_basic2(float x)   // Basic Gaussian GELU (still inefficient)
{
    static float s_reciprocal_sqrt_2_0 = 1.0f / sqrtf(2.0f);   // Once-only initialization
    return x * (0.5 * (1.0 + erff(x * s_reciprocal_sqrt_2_0)));
}
To further optimize GELU, there are two approximations given in the original paper. And the code can then be further optimized via calculation changes and table lookups, as shown in the GELU approximations example code.
Example: SiLU Activation Function
Here is the basic SiLU activation function in C++:
float yapi_SiLU_basic(float x)   // Basic SiLU (inefficient)
{
    // Sigmoid = 1 / (1 + e^(-x))
    // SiLU = x * Sigmoid(x)
    //      = x * 1.0 / (1.0 + expf(-x))
    return x / (1.0f + expf(-x));
}
The SiLU function is inefficient by default. Its speed can be improved via table lookups and approximations.
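As an illustrative sketch (not from the original SiLU paper), one cheap approximation is to skip the expf() call entirely for large-magnitude inputs, where SiLU saturates to zero or to the identity. The cutoff of 10.0 and the function name are arbitrary choices; the approximation error at the cutoff is below 0.0005:

float yapi_SiLU_clamped(float x)   // Hypothetical approximate SiLU with saturation fast paths
{
    if (x <= -10.0f) return 0.0f;       // Sigmoid is nearly 0 here, so x * sigmoid(x) is nearly 0
    if (x >= 10.0f) return x;           // Sigmoid is nearly 1 here, so x * sigmoid(x) is nearly x
    return x / (1.0f + expf(-x));       // Exact formula in the middle range
}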
Example: ELU Activation Function
The ELU activation function was introduced by Clevert, Unterthiner & Hochreiter (2016). ELU is somewhat related to RELU, but ELU's output can go negative for input values below zero. It also has an extra hyperparameter, alpha, which controls how quickly the output goes negative, giving a family of ELU variants. Here's some example C++ code of a very basic ELU implementation according to the paper:
float yapi_ELU_basic(float x, float alpha_hyperparam)   // Basic ELU activation (inefficient)
{
    // ELU = x                      if x > 0.0
    //     = alpha * (exp(x) - 1)   if x <= 0.0
    if (x <= 0.0) return alpha_hyperparam * (expf(x) - 1.0f);
    return x;   // x if x > 0.0
}
Precomputed Table Lookup for Activation Functions
One of the simplest methods to speed up activation functions is to use a precomputed table lookup. If you are quantizing to 16 bits, whether FP16 or INT16, then the input to the function is 16 bits, and there are only 2^16=65,536 different possible input values. Hence, your precomputed table for activation function calculations is only 65,536 x 2 = 128KB for outputs in 16-bit precision (i.e. FP16/INT16 outputs are 2 bytes each), or 256KB for 32-bit precision outputs (FP32/INT32). However, it's not quite so good for non-quantized 32-bit values (i.e. FP32 or INT32), since 2^32 is about 4 billion possible inputs, each ideally storing a 4-byte result, so that precomputed table needs about 4 billion x 4 bytes = 16 Gigabytes of RAM.
The simplest way to build the precomputed table for an activation function is to do it dynamically at initialization time. Simply call your non-optimized activation function code, get the results, and store them in the precomputed table. The idea is:
for (int i = 0; i < 65536; i++) {
    g_global_GELU_table[i] = yapi_GELU_slow(i);   // INT16 only
}
When you access the pre-computed table, it's simply an indexed access for a global array. So the fast, precomputed version of GELU is a table lookup like this:
gelu_value = g_global_GELU_table[x]; // INT16 only, doesn't work for FP16!
It all sounds very easy, but there are some weird wrinkles. Firstly, C++ doesn't have a portable standardized native 16-bit float type ("float" is 32-bit, and "double" is 64-bit), so you have to mess around with either declaring your own type (as a class), trying to use platform-specific extensions (e.g. "__fp16", "_Float16" or C++23's "std::float16_t"), or hacking it by using the "short int" 16-bit integer type.
Secondly, the above code examples don't really work in practice for FP16, and making them work requires a lot of bit-hacking to convert back and forth between unsigned integers and floating-point numbers. Here's the idea for 32-bit float variables (i.e. non-quantized FP32):
float g_global_GELU_table_FP32[1ull << 32 /*~4 billion*/];   // Is this really a good idea?
...
void yapi_GELU_setup_table_FP32()   // Initialize GELU precomputed table
{
    unsigned long long i64 = 0;   // Has to be 64-bit!
    yassert(sizeof(i64) > 4);
    for (; i64 < (1ull << 32); i64++) {
        unsigned int u32 = (unsigned int)i64;   // Switch down from 64-bit to 32-bit
        float f32 = *(float*)&u32;              // Type-cast bit trick to get the float
        g_global_GELU_table_FP32[u32] = yapi_GELU_basic2(f32);   // FP32
    }
}
And the fast GELU lookup table version for FP32 becomes:
float gelu_fast_FP32(float f)   // Table lookup GELU
{
    unsigned int u32 = *(unsigned int*)&f;   // Type-cast bit trick to get the table index
    return g_global_GELU_table_FP32[u32];    // FP32 version
}
In this FP32 case, the loop iterator has to be a 64-bit integer type; otherwise the loop will be infinite, because a 32-bit counter will overflow and wrap around to zero without ever failing the loop test.
In reality, the above code doesn't actually work on a standard Windows box. And we probably don't want the FP32 version anyway, but doing this for FP16 is even more of a mess!
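To give the flavor, here is a minimal sketch of the FP16 version, assuming a compiler that supports the "_Float16" extension (or C++23's "std::float16_t"); all the names are hypothetical, and memcpy is used for the bit conversions:

#include <cstring>   // memcpy

static float g_global_GELU_table_FP16[1 << 16];   // 65,536 entries with FP32 outputs (256KB)

void yapi_GELU_setup_table_FP16()   // Initialize the FP16-input GELU table
{
    for (unsigned int u16 = 0; u16 < (1u << 16); u16++) {
        unsigned short bits = (unsigned short)u16;
        _Float16 h;
        memcpy(&h, &bits, sizeof h);   // Reinterpret the 16 bits as an FP16 value
        g_global_GELU_table_FP16[u16] = yapi_GELU_basic2((float)h);
    }
}

float yapi_GELU_fast_FP16(_Float16 h)   // Table-lookup GELU for FP16 inputs
{
    unsigned short bits;
    memcpy(&bits, &h, sizeof bits);    // Get the 16-bit pattern to use as the table index
    return g_global_GELU_table_FP16[bits];
}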
Load-Time Precompilation
If you're really fancy, you can write a tool that you use once only (forever!), which similarly calls the inefficient exact version of your activation function in a big loop, but then spits out a C++ source file with all of those result numbers in it. I don't mean a binary data file. I really do mean a full C++ source file, with a declaration of your global array variable at the top (and a nice comment!), and then a lot of numbers in plain text with commas between them. You then compile and link this C++ source file with your other code. After that's checked into source control, you go tell your boss that you just added a few hundred million SLOC to the project, and ask for a raise.
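Here is a hedged sketch of such a one-off generator, reusing the hypothetical FP16 names from the sketch above (the output format choices, such as the exponent-form literals and zeroing any non-finite entries, are just illustrative):

#include <cstdio>
#include <cstring>
#include <cmath>

void yapi_generate_GELU_table_source(const char* filename)   // Run once, offline
{
    FILE* fp = fopen(filename, "w");
    if (!fp) return;   // Could not open the output file
    fprintf(fp, "// Auto-generated precomputed GELU table (do not edit by hand)\n");
    fprintf(fp, "float g_global_GELU_table_FP16[65536] = {\n");
    for (unsigned int u16 = 0; u16 < 65536u; u16++) {
        unsigned short bits = (unsigned short)u16;
        _Float16 h;
        memcpy(&h, &bits, sizeof h);                  // Reinterpret the bits as an FP16 input
        float result = yapi_GELU_basic2((float)h);    // Call the slow exact version
        if (!std::isfinite(result)) result = 0.0f;    // Inf/NaN inputs: emit a compilable placeholder
        fprintf(fp, "  %.8ef,\n", (double)result);    // 9 significant digits round-trips a 32-bit float
    }
    fprintf(fp, "};\n");
    fclose(fp);
}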
Once this is compiled in, you don't even need to call anything during runtime initialization. The numbers are already pre-compiled into a global array variable which is simply loaded from the linked executable file at load-time. There's currently no good FP16 type name built into standard C++, i.e. no "short float" or "fp16" type, but C++23 defines "std::float16_t", so this will probably improve in the near future. The IEEE 754 standard long ago defined a "float16" (half-precision) format with a 5-bit exponent and an 11-bit mantissa (adding a bit for the sign, but subtracting the implied mantissa bit, gets us to 16 bits total). For comparison, the standard 32-bit float has 1 sign bit, 8 exponent bits, and 24 mantissa bits (only 23 are stored).
Approximate Activation Functions
Some non-linear activation functions have a non-trivial computation cost, such as sigmoid, GELU, and tanh. General ways to use approximations include:
- Mathematically-close approximations (e.g., for GELU)
- Integer-only activation functions
- Low-bit integer activation functions (i.e., with quantized kernels)
Other improvements have also included using simpler functions (e.g. RELU), mathematical approximations for calculating the non-linear functions, or optimization techniques such as precomputed table lookups.
Example: GELU Approximations #1
The original GELU paper proposed two alternative mathematical approximations. Here's a naive C++ implementation of the first one:
float yapi_GELU_approx1(float f)   // Approximated Gaussian GELU
{
    // GELU paper approx #1 = 0.5 * x * ( 1 + tanh( sqrt(2/PI) * (x + 0.044715 * x^3) ) )
    return 0.5f * f * (1.0f + tanhf(sqrtf(2.0f / YAPI_PI) * (f + (0.044715f * (f * f * f)))));
}
The first obvious improvement is to avoid re-calculating constant values.
float yapi_GELU_approx1_optimized(float f)   // Approximated Gaussian GELU (with minor optimizations)
{
    // GELU paper approx #1 = 0.5 * x * ( 1 + tanh( sqrt(2/PI) * (x + 0.044715 * x^3) ) )
    static float s_sqrt_2_div_pi = sqrtf(2.0f / YAPI_PI);
    return 0.5f * f * (1.0f + tanhf(s_sqrt_2_div_pi * (f + (0.044715f * (f * f * f)))));
}
This code can be further improved with minor changes to the arithmetic computations. In particular, some algebraic manipulation allows us to avoid the "x*x*x" term, and reduce multiplications:
float yapi_GELU_approx1_optimized2(float f)   // Approximated Gaussian GELU (with 2nd minor optimizations)
{
    // GELU paper approx #1 = 0.5 * x * ( 1 + tanh( sqrt(2/PI) * (x + 0.044715 * x^3) ) )
    // Optimize by factoring out one multiplication by f (reducing x*x*x to x*x)
    static float s_sqrt_2_div_pi = sqrtf(2.0f / YAPI_PI);
    return 0.5f * f * (1.0f + tanhf(s_sqrt_2_div_pi * f * (1.0f + (0.044715f * (f * f)))));
}
Example: GELU Approximations #2
The second approximation suggested in the original GELU paper is much simpler, based on the sigmoid function, but a little less accurate. Here's a C++ version without any optimizations:
float yapi_sigmoid(float x)
{
    // SIGMOID = 1 / (1 + e^-x)
    return 1.0f / (1.0f + expf(-x));
}

float yapi_GELU_approx2(float x)   // Approximated Gaussian GELU #2
{
    // GELU paper approx #2 = x * sigmoid(1.702 * x)
    return x * yapi_sigmoid(1.702f * x);
}
But the use of two functions is needlessly expensive (although declaring them as "inline" might help), and can be optimized by flattening the call hierarchy. By merging the code of the sigmoid function into the GELU code, there are also opportunities to reduce the number of arithmetic operations.
float yapi_GELU_approx2b(float x)   // Approximated Gaussian GELU #2b
{
    // GELU paper approx #2 = x * sigmoid(1.702 * x)
    // return x * 1.0 / (1.0 + expf(-(1.702 * x)));
    return x / (1.0f + expf(-1.702f * x));
}
All of these GELU approximations could be easily beaten by changing to a table lookup method.
Research on Activation Approximations
Various research papers on activation approximation are listed below:
- W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (General review contains a section on approximating activation functions.)
- Amin, H.; Curtis, K.M.; Hayes-Gill, B.R., 1997, Piecewise linear approximation applied to nonlinear function of a neural network. IEEE Proc. Circuits Devices Syst. 1997, 144, 313–317. http://dx.doi.org/10.1049/ip-cds:19971587 https://www.academia.edu/50788608/Piecewise_linear_approximation_applied_to_nonlinear_function_of_a_neural_network (Early paper on activation approximation.)
- Hu, Z.; Zhang, J.; Ge, Y., 2021, Handling Vanishing Gradient Problem Using Artificial Derivative. IEEE Access 2021, 9, 22371–22377. http://dx.doi.org/10.1109/ACCESS.2021.3054915, https://ieeexplore.ieee.org/document/9336631 (Examines RELU as an improved less expensive activation function.)
- J Zhong, Z Liu, X Chen, Apr 2023, Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, https://arxiv.org/abs/2304.10891 (Includes a section on GELU approximation.)
- A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, A survey on modern trainable activation functions, Neural Networks, vol. 138, pp.14–32, 2021, https://arxiv.org/abs/2005.00817 (Extensive survey all about activation functions, e.g. RELU, Swish, Maxout, leaky RELU.)
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314. https://link.springer.com/article/10.1007/BF02551274, PDF: https://cognitivemedium.com/magic_paper/assets/Cybenko.pdf
- DasGupta, B. and Schnitger, G. (1993). The power of approximating: a comparison of activation functions. In Advances in neural information processing systems, pages 615–622. https://dl.acm.org/doi/10.5555/2987061.2987137, PDF: https://proceedings.neurips.cc/paper/1992/file/e555ebe0ce426f7f9b2bef0706315e0c-Paper.pdf
- Wenxuan Zeng, Meng Li, Wenjie Xiong, Tong Tong, Wen-jie Lu, Jin Tan, Runsheng Wang, Ru Huang, Aug 2023, MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention, https://arxiv.org/abs/2211.13955, PDF: https://openaccess.thecvf.com/content/ICCV2023/papers/Zeng_MPCViT_Searching_for_Accurate_and_Efficient_MPC-Friendly_Vision_Transformer_with_ICCV_2023_paper.pdf Code: https://github.com/PKU-SEC-Lab/mpcvit (Uses an approximated linear GELU variant.)
- Payman Mohassel and Yupeng Zhang, 2017, SecureML: A system for scalable privacy-preserving machine learning. IEEE symposium on security and privacy (SP), pages 19–38. https://ieeexplore.ieee.org/document/7958569, PDF: https://eprint.iacr.org/2017/396.pdf (Includes approximations for Softmax and sigmoid.)
- Liangzhen Lai, Naveen Suda, Vikas Chandra, 2018, CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs, arXiv preprint arXiv:1801.06601, https://arxiv.org/abs/1801.06601 PDF: https://arxiv.org/pdf/1801.06601 (Approximations of sigmoid and tanh.)
Integer-only Activation Functions
One particular type of approximation is the change to integer-only arithmetic. This has become quite standard in quantized kernels, especially with activation quantization, since the activation functions then operate on integers with a small number of bits.
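As a hedged illustrative sketch (not taken from any specific paper below), with 8-bit activation quantization an integer-only GELU can reduce to a 256-entry lookup table, built once from the quantization scale and zero-point; all names and parameters here are hypothetical:

#include <cstdint>
#include <cmath>

static int8_t g_gelu_int8_table[256];

void build_gelu_int8_table(float scale, int zero_point)   // Offline/init-time: float math is fine here
{
    for (int q = -128; q <= 127; q++) {
        float x = scale * (float)(q - zero_point);              // Dequantize the 8-bit code
        float g = 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));    // Exact GELU in float
        int out = (int)lroundf(g / scale) + zero_point;         // Requantize the result
        if (out < -128) out = -128;                             // Clamp to the INT8 range
        if (out > 127) out = 127;
        g_gelu_int8_table[q + 128] = (int8_t)out;
    }
}

inline int8_t gelu_int8(int8_t q)   // Integer-only at inference time: a single table load
{
    return g_gelu_int8_table[(int)q + 128];
}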
Some of the earlier research papers with integer-based activation functions are below. See also the overview of integer arithmetic in neural networks.
- Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda, Dec 2022, Training Integer-Only Deep Recurrent Neural Networks https://arxiv.org/abs/2212.11791 (Integer-only version of RNNs called iRNN, with integer-only layer normalization, integer-only attention, and piecewise linear approximation for integer-only activation functions such as tanh and sigmoid.)
- Z Zhang, B He, Z Zhang, 2023, Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization, Proceedings of Machine Learning and Systems 5 pre-proceedings (MLSys 2023) mlsys2023, https://proceedings.mlsys.org/paper_files/paper/2023/hash/023560744aae353c03f7ae787f2998dd-Abstract-mlsys2023.html, PDF: https://proceedings.mlsys.org/paper_files/paper/2023/file/023560744aae353c03f7ae787f2998dd-Paper-mlsys2023.pdf (Integer-only-arithmetic quantization with integer-only versions of Softmax, LayerNorm, and GELU.)
- Y. Lin, Y. Li, T. Liu et al., “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI, 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
- A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf (Integer versions of GELU, Softmax, and normalization.)
Pruning Activation Functions
The activation function can be removed, or "pruned", so that the values computed by the first linear layer are passed through unchanged. For example, removing the interleaved activation function from between the vanilla FFN's two linear layers creates what is termed a "bilinear layer".
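Here is a hedged, simplified sketch of the idea (the function names and the naive matrix-vector helper are purely illustrative): the vanilla FFN computes linear2(activation(linear1(x))), and pruning the activation leaves linear2(linear1(x)).

#include <vector>
#include <cmath>

// Naive dense matrix-vector product: W is row-major [out_dim][in_dim].
static std::vector<float> matvec(const std::vector<std::vector<float>>& W,
                                 const std::vector<float>& x)
{
    std::vector<float> y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); i++)
        for (size_t j = 0; j < x.size(); j++)
            y[i] += W[i][j] * x[j];
    return y;
}

std::vector<float> ffn_forward(const std::vector<float>& x,
                               const std::vector<std::vector<float>>& W1,
                               const std::vector<std::vector<float>>& W2,
                               bool use_activation)
{
    std::vector<float> hidden = matvec(W1, x);                  // First linear layer
    if (use_activation) {
        for (float& h : hidden)
            h = 0.5f * h * (1.0f + erff(h / sqrtf(2.0f)));      // Interleaved GELU activation
    }
    // With use_activation == false, this is the "bilinear" variant: linear2(linear1(x))
    return matvec(W2, hidden);                                  // Second linear layer
}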
Research on removing activation functions or bilinear layers:
- Noam Shazeer, Feb 2020, GLU Variants Improve Transformer, https://arxiv.org/pdf/2002.05202.pdf (Examines GLUs including bilinear layers.)
- Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, 2016, Language modeling with gated convolutional networks. CoRR, abs/1612.08083, http://arxiv.org/abs/1612.08083 (Also suggests bilinear layers without an activation function.)
- Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th international conference on Machine learning, pages 641–648, 2007, PDF: https://www.cs.toronto.edu/~amnih/papers/threenew.pdf (Prunes the activation function in the FFN to create a "bilinear layer".)
Activation Function Reordering
The standard Transformer has an activation function in between the two linear layers of the FFN. Earlier research examined "pre-activation" versus "post-activation" for different effects. Note that this is a different issue to the placement of normalization blocks, which is the "pre-norm" versus "post-norm" choice.
Here are some research papers on where to place the activation function.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, July 2016, Identity Mappings in Deep Residual Networks, In European conference on computer vision, pp. 630–645. Springer, 2016. https://arxiv.org/abs/1603.05027, Code: https://github.com/KaimingHe/resnet-1k-layers
Research on Activation Function Optimizations
Research papers on activation functions in neural networks:
- Noam Shazeer, Feb 2020, GLU Variants Improve Transformer, https://arxiv.org/pdf/2002.05202.pdf (Examines various activation functions such as RELU, Swish/SwiGLU, GELU, etc., with a focus on perplexity, not computation speed.)
- M. Trimmel, M. Zanfir, R. Hartley, and C. Sminchisescu, ERA: Enhanced rational activations, in European Conference on Computer Vision. Springer, 2022, pp. 722–738. https://link.springer.com/chapter/10.1007/978-3-031-20044-1_41, Code: https://github.com/martrim/ERA (Activation function theory.)
- Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, 2016, Language modeling with gated convolutional networks. CoRR, abs/1612.08083, http://arxiv.org/abs/1612.08083 (Early 2016 paper on Gated Linear Unit (GLU) with two Matmul linear layers and an intervening sigmoid activation function.)
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In Advances in Neural Information Processing Systems, 2016. https://arxiv.org/abs/1607.06450 (LayerNorm paper.)
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015. https://arxiv.org/abs/1502.03167 (BatchNorm paper.)
- Prajit Ramachandran, Barret Zoph, and Quoc V Le, 2017, Searching for activation functions, arXiv preprint arXiv:1710.05941, https://arxiv.org/abs/1710.05941 (Introduced Swish activation function as better than RELU in terms of model accuracy, although slower speed.)
- Dan Hendrycks and Kevin Gimpel, 2016, Bridging nonlinearities and stochastic regularizers with gaussian error linear units, CoRR, abs/1606.08415, http://arxiv.org/abs/1606.08415 (GELU original paper.)
- Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011. http://proceedings.mlr.press/v15/glorot11a (RELU original paper from 2011.)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with different activation functions, e.g. Swish, GELU.)
- Wenxuan Zeng, Meng Li, Wenjie Xiong, Tong Tong, Wen-jie Lu, Jin Tan, Runsheng Wang, Ru Huang, Aug 2023, MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention, https://arxiv.org/abs/2211.13955, PDF: https://openaccess.thecvf.com/content/ICCV2023/papers/Zeng_MPCViT_Searching_for_Accurate_and_Efficient_MPC-Friendly_Vision_Transformer_with_ICCV_2023_paper.pdf Code: https://github.com/PKU-SEC-Lab/mpcvit (Optimizes Softmax, GELU, and MatMul.)
- Djork-Arne Clevert, Thomas Unterthiner & Sepp Hochreiter, Feb 2016, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), https://arxiv.org/abs/1511.07289 (Original ELU paper.)
- Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, Oct 2023 ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564 (Recommends reinstating the simpler RELU rather than GELU or SiLU, with a focus on inference efficiency.)
- PyTorch, 2023 (accessed), Activation Functions, TORCH.NN (documentation), https://pytorch.org/docs/stable/nn.html#non-linear-activations-other
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (SwiGLU activation function used.)
- PyTorch, 2023 (accessed), SOFTPLUS, https://pytorch.org/docs/stable/generated/torch.nn.Softplus.html
- PyTorch, 2023 (accessed), SIGMOID, https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html
- PyTorch, 2023 (accessed), SOFTMIN, https://pytorch.org/docs/stable/generated/torch.nn.Softmin.html
- PyTorch, 2023 (accessed), MISH, https://pytorch.org/docs/stable/generated/torch.nn.Mish.html
- PyTorch, 2023 (accessed), SOFTMAX, https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- Diganta Misra, Aug 2020, Mish: A Self Regularized Non-Monotonic Activation Function, https://arxiv.org/abs/1908.08681, Code: https://github.com/digantamisra98/Mish (Introduces the Mish activation function.)
- Stefan Elfwing, Eiji Uchibe, Kenji Doya, 2017, Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning, https://arxiv.org/abs/1702.03118
- Prajit Ramachandran, Barret Zoph, Quoc V. Le, 2017, Swish: a Self-Gated Activation Function, https://arxiv.org/abs/1710.05941 (Introduces the Swish activation function.)
- Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia, 2000, Incorporating second-order functional knowledge for better option pricing, NIPS'00: Proceedings of the 13th International Conference on Neural Information Processing Systems, January 2000, Pages 451–457, https://dl.acm.org/doi/10.5555/3008751.3008817 PDF: https://papers.nips.cc/paper/2000/file/44968aece94f667e4095002d140b5896-Paper.pdf (Introduces the Softplus activation function based on sigmoid.)
- Wikipedia, 2023 (accessed), Heaviside step function, https://en.wikipedia.org/wiki/Heaviside_step_function
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used SwiGLU activation.)
- Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini, 12 Apr 2024, CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models, https://arxiv.org/abs/2404.08763 (Sparsity with dynamic control over the thresholds with an effect that is similar to intra-model MoE. Also discusses the RELU backlash: although RELU causes activation sparsity, it's not being used by large models.)
- Z Zou, C Zhang, S Chen, H Kou, B Liu, March 2024, Integer Arithmetic-Based and Activation-Aware GELU Optimization for Vision Transformer, 2024 Conference of Science and Technology for Integrated Circuits (CSTIC), 17-18 March 2024, https://ieeexplore.ieee.org/abstract/document/10531966/
- Y Fu, C Zhou, T Huang, E Han, Y He, H Jiao, 2024, SoftAct: A High-Precision Softmax Architecture for Transformers Supporting Nonlinear Functions, https://ieeexplore.ieee.org/abstract/document/10495359/ (Hardware-optimized Softmax and non-linear activation functions.)
- David Spuler, March 2024, Chapter 21. Activation Functions, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer El Showk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, Christopher Olah, June 2022, Softmax Linear Units, https://transformer-circuits.pub/2022/solu/index.html (Softmax Linear Units (SoLU) activation paper.)
- Tom Hubrecht, Orégane Desrentes, Florent de Dinechin, Nov 2024, Activations in Low Precision with High Accuracy, https://inria.hal.science/hal-04776745/document
- Francesco Franco, Dec 2024, Activation Functions: ReLU, Sigmoid, Tanh and Softmax, https://medium.com/@francescofranco_39234/four-key-activation-functions-relu-sigmoid-tanh-and-softmax-6d2525eb55a4
RELU
Research on RELU:
- David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
- David Spuler, March 2024, Chapter 21. Activation Functions, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Nandan Kumar Jha, Brandon Reagen, 12 Oct 2024, ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models, https://arxiv.org/abs/2410.09637
- Francesco Franco, Dec 2024, Activation Functions: ReLU, Sigmoid, Tanh and Softmax, https://medium.com/@francescofranco_39234/four-key-activation-functions-relu-sigmoid-tanh-and-softmax-6d2525eb55a4
GELU
Research on GELU:
- Z Zou, C Zhang, S Chen, H Kou, B Liu, March 2024, Integer Arithmetic-Based and Activation-Aware GELU Optimization for Vision Transformer, 2024 Conference of Science and Technology for Integrated Circuits (CSTIC), 17-18 March 2024, https://ieeexplore.ieee.org/abstract/document/10531966/
- David Spuler, March 2024, Chapter 21. Activation Functions, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Andrea Belano, Yvan Tortorella, Angelo Garofalo, Luca Benini, Davide Rossi, Francesco Conti, 9 Dec 2024, A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU, https://arxiv.org/abs/2412.06321
GELU Approximation
Research on GELU approximation methods:
- Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
- M Huang, J Luo, C Ding, Z Wei, S Huang, H Yu, Oct 2023, An Integer-Only and Group-Vector Systolic Accelerator for Efficiently Mapping Vision Transformer on Edge, IEEE Transactions on Circuits and Systems I: Regular Papers ( Early Access ), https://ieeexplore.ieee.org/abstract/document/10288182/
- Y Wu, Z Wang, WD Lu, Oct 2023, PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers https://arxiv.org/pdf/2310.09385.pdf
- Y Dong, W Lu, Y Zheng, H Wu, D Zhao, J Tan, July 2023, PUMA: Secure Inference of LLaMA-7B in Five Minutes, https://arxiv.org/abs/2307.12533
- Mohammadreza Tayaranian, Seyyed Hasan Mozafari, James J. Clark, Brett Meyer, Warren Gross, 2 Feb 2024, Faster Inference of Integer SWIN Transformer by Removing the GELU Activation, https://arxiv.org/abs/2402.01169 (Replace GELU with RELU.)
- W. Wang, W. Sun and Y. Liu, "Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3488572. https://ieeexplore.ieee.org/abstract/document/10738457
More AI Research
Read more about:
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning