Approximating Activation Functions
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Some non-linear activation functions have a non-trivial computation cost, such as sigmoid, GELU, and tanh. Improvements have included using simpler functions (e.g. RELU), mathematical approximations for calculating the non-linear functions, or optimization techniques such as table lookups.
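For reference, RELU itself is trivially cheap compared to these functions. Here is a minimal sketch (the function name follows the naming style of the examples below, but is illustrative only):

float aussie_RELU_basic(float x)   // RELU = max(0, x)
{
    return x > 0.0f ? x : 0.0f;
}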
Example: GELU Approximations #1: The original GELU paper proposed two alternative mathematical approximations. Here's a naive C++ implementation of the first one:
float aussie_GELU_approx1(float f)   // Approximated Gaussian GELU
{
    // GELU paper approx #1 = 0.5 * x *
    //   ( 1 + tanh ( sqrt(2/PI) *
    //     (x + 0.044715 * x^3) ) )
    return 0.5f * f * (1.0f + tanhf(sqrtf(2.0f / AUSSIE_PI)
        * (f + (0.044715f * (f * f * f)))));
}
The first obvious improvement is to avoid re-calculating constant values.
float aussie_GELU_approx1_optimized(float f)
{
    // Approximated Gaussian GELU #1 (minor optimizations)
    static float s_sqrt_2_div_pi = sqrtf(2.0f / AUSSIE_PI);
    return 0.5f * f * (1.0f + tanhf(s_sqrt_2_div_pi
        * (f + (0.044715f * (f * f * f)))));
}
This code can be further improved with minor changes to the arithmetic computations. In particular, some algebraic manipulation allows us to avoid the “x*x*x” term and reduce the number of multiplications:
float aussie_GELU_approx1_optimized2(float f)
{
    // Approximated Gaussian GELU #1 (minor optimizations)
    // Optimize by factoring out f
    // Reduces x*x*x to x*x
    static float s_sqrt_2_div_pi = sqrtf(2.0f / AUSSIE_PI);
    return 0.5f * f * (1.0f + tanhf(s_sqrt_2_div_pi * f
        * (1.0f + (0.044715f * (f * f)))));
}
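To sanity-check these variants, they can be compared against the exact GELU defined via the error function, 0.5 * x * (1 + erf(x / sqrt(2))). Here is a minimal test sketch, assuming <cmath> and <cstdio> are included; the helper names are illustrative, not from the book's library:

float aussie_GELU_exact(float x)   // Exact GELU via erff()
{
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}

void aussie_GELU_compare()   // Print exact vs. approx #1 over a small range
{
    for (float x = -3.0f; x <= 3.0f; x += 0.5f) {
        printf("x=%5.2f exact=%.6f approx1=%.6f\n",
            x, aussie_GELU_exact(x), aussie_GELU_approx1_optimized2(x));
    }
}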
Example: GELU Approximations #2: The second approximation suggested in the original GELU paper is much simpler, based on the sigmoid function, but a little less accurate. Here's a C++ version without any optimizations:
float aussie_sigmoid(float x)   // SIGMOID = 1 / (1 + e^-x)
{
    return 1.0f / (1.0f + expf(-x));
}

float aussie_GELU_approx2(float x)   // Approx GELU #2
{
    // GELU paper approx #2 = x * sigmoid(1.702 * x)
    return x * aussie_sigmoid(1.702f * x);
}
But the use of two functions is needlessly expensive (although declaring them as “inline” might help), and can be optimized by flattening the call hierarchy. By merging the code of the sigmoid function into the GELU code, there are also opportunities to reduce the number of arithmetic operations.
float aussie_GELU_approx2b(float x)   // Approximated Gaussian GELU #2b
{
    // GELU paper approx #2 = x * sigmoid(1.702 * x)
    //   return x * 1.0 / (1.0 + expf(-(1.702 * x)));
    return x / (1.0f + expf(-1.702f * x));
}
All of these GELU approximations could easily be beaten by switching to a table lookup method. And there is little point in using a faster approximation to fill the precomputed lookup table, since that computation runs only once.
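As a rough sketch of the table lookup idea (the table size, input range, and nearest-entry indexing below are illustrative assumptions, not the book's implementation), GELU can be precomputed once at startup, using the exact erf-based formula since the setup cost is paid only once:

#define AUSSIE_GELU_LUT_SIZE 4096
static float s_gelu_lut[AUSSIE_GELU_LUT_SIZE];
static const float s_gelu_min = -8.0f;
static const float s_gelu_max = +8.0f;

void aussie_GELU_lut_setup()   // Precompute GELU over [-8, +8] (runs once)
{
    float step = (s_gelu_max - s_gelu_min) / (AUSSIE_GELU_LUT_SIZE - 1);
    for (int i = 0; i < AUSSIE_GELU_LUT_SIZE; i++) {
        float x = s_gelu_min + i * step;
        s_gelu_lut[i] = 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
    }
}

float aussie_GELU_lut(float x)   // Nearest-entry lookup with clamping
{
    if (x <= s_gelu_min) return 0.0f;   // GELU(x) ~= 0 for very negative x
    if (x >= s_gelu_max) return x;      // GELU(x) ~= x for large positive x
    float pos = (x - s_gelu_min) * (AUSSIE_GELU_LUT_SIZE - 1)
        / (s_gelu_max - s_gelu_min);
    return s_gelu_lut[(int)(pos + 0.5f)];
}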