Aussie AI
Book Excerpt from "Generative AI in C++" by David Spuler, Ph.D.
Learned Activation Parameters
Some activation functions have “learned activation parameters” or “trainable activation parameters.” For example, Swish/SwiGLU has “alpha” and “beta” parameters that modify the activation behavior. Not all activation functions have parameters (e.g. basic RELU and GELU don't), but those that do are technically called “adaptable activation functions.” These extra parameters are stored in the model like weights, but they relate to the activation functions, whereas weights are used in matrix multiplications (e.g. the linear layers in FFNs).
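As a minimal sketch in C++ (assuming a single scalar learned beta per layer; the function names and storage details here are illustrative, not taken from any particular model format), a parameterized Swish compares to parameter-free RELU like this:

#include <cmath>

// Swish with a learned "beta" parameter: swish(x) = x * sigmoid(beta * x).
// The beta value is loaded from the model alongside the weights.
float swish_learned(float x, float beta)
{
    return x / (1.0f + expf(-beta * x));  // x * sigmoid(beta*x)
}

// Basic RELU needs no stored activation parameters at all.
float relu(float x)
{
    return x > 0.0f ? x : 0.0f;
}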
The advantage of using an activation function with trainable parameters is that there is greater opportunity for the model to encode intelligence (i.e. better perplexity, accuracy, and finesse). The downside is that these activation parameters require extra computation and also hamper certain optimizations that can be applied to simpler activation functions such as RELU.
Having granular parameters attached to the activation functions is quite limiting when it comes to optimizing the speed of these parameterized activation functions. We cannot, for example, precompute the entire range of an adaptive activation function into a lookup table, because such a precomputation is only valid if the parameters are fixed, and trainable parameters are not.
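For contrast, here is a sketch of the kind of lookup-table precomputation that works for a parameter-free activation such as GELU, and why the same trick breaks down once a learned beta is involved (the table size and input range are illustrative assumptions):

#include <cmath>

// Table-driven precomputation for a parameter-free activation (GELU).
// The table can be built once, offline, because the output depends only on x.
const int GELU_TABLE_SIZE = 65536;
static float g_gelu_table[GELU_TABLE_SIZE];

void build_gelu_table(float xmin, float xmax)
{
    for (int i = 0; i < GELU_TABLE_SIZE; i++) {
        float x = xmin + (xmax - xmin) * (float)i / (float)(GELU_TABLE_SIZE - 1);
        // GELU tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
        g_gelu_table[i] = 0.5f * x
            * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
    }
}

// A learned-beta Swish, swish(x) = x * sigmoid(beta*x), cannot use one table
// indexed only by x: every distinct beta value would need its own table, and
// beta is a trained parameter whose value is not known when the table is built.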
Instead, we can precompute parts of the activation function formula, such as the calls to expf (to exponentiate the input numbers), but then we have to apply the activation parameters as separate, extra steps. These extra parameter computations can often be vectorized themselves, but they are still extra steps that simply aren't required for other choices of activation function (did I mention RELU?).
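Here is one hedged sketch of that split for a learned-beta Swish (the three-pass structure and function names are illustrative): the beta scaling becomes its own vectorizable pass, the exp/sigmoid part stays parameter-free (and lookup-table friendly), and a final elementwise multiply finishes the computation. RELU would need only a single pass.

#include <cmath>

void vector_scale(float v[], int n, float beta)   // extra pass caused by beta
{
    for (int i = 0; i < n; i++) v[i] *= beta;
}

void vector_sigmoid(float v[], int n)             // parameter-free; table-friendly
{
    for (int i = 0; i < n; i++) v[i] = 1.0f / (1.0f + expf(-v[i]));
}

void vector_multiply(float v[], const float x[], int n)  // elementwise product
{
    for (int i = 0; i < n; i++) v[i] *= x[i];
}

// swish(x) = x * sigmoid(beta * x), computed as three separate passes.
void swish_learned_vector(const float x[], float out[], int n, float beta)
{
    for (int i = 0; i < n; i++) out[i] = x[i];
    vector_scale(out, n, beta);       // step 1: apply the learned parameter
    vector_sigmoid(out, n);           // step 2: exponentiation/sigmoid part
    vector_multiply(out, x, n);       // step 3: multiply by the original inputs
}

Each loop here would typically be vectorized (e.g. with SIMD intrinsics), but the beta pass remains an additional sweep over the data that a non-parameterized activation avoids.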