Aussie AI
Common Activation Functions
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Common Activation Functions
Many functions, both linear and non-linear, have been tried as activation functions. The main ones that have emerged in practical usage of Transformers for LLMs are listed below (with a brief C++ sketch after the list):
- RELU (Rectified Linear Unit)
- GELU (Gaussian Error Linear Unit)
- SwiGLU (Swish Gated Linear Unit)
- SiLU (Sigmoid Linear Unit)
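As a concrete illustration, here is a minimal C++ sketch of each of these activation functions applied elementwise to a single float. The function names, the scalar interface, and the use of the tanh approximation for GELU (the "GELU-New" variant) are illustrative assumptions for this sketch, not code from the models mentioned below.

#include <cmath>

// RELU: max(0, x)
float relu(float x) {
    return x > 0.0f ? x : 0.0f;
}

// GELU, tanh approximation (the "GELU-New" variant)
float gelu(float x) {
    const float c = 0.7978845608f;  // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}

// Logistic sigmoid, the building block for SiLU and Swish
float sigmoid(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// SiLU: x * sigmoid(x), which is Swish with beta = 1
float silu(float x) {
    return x * sigmoid(x);
}

// SwiGLU acts on a pair of pre-activations from two linear projections:
// SwiGLU(a, b) = SiLU(a) * b, applied elementwise
float swiglu(float a, float b) {
    return silu(a) * b;
}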
RELU is one of the earliest activation functions, but it has stood the test of time and remains widely applicable. However, the default activation function in OpenAI's GPT-2 and GPT-3 was called "GELU-New," which is what is usually meant by "GELU" nowadays, although these architectures could also be trained with RELU, Swish, or GELU-Old. InstructGPT uses the sigmoid/SiLU activation function. The Llama and Llama 2 models from Meta's Facebook Research use Swish/SwiGLU as their activation function. Although the details are confidential, GPT-4 reportedly uses the sigmoid activation function (i.e., SiLU) in its loss function. RELU is still common in many open-source models targeting lower-end architectures because of its efficiency.
Various other activation functions have been used in earlier research, or are sometimes still used (a similar C++ sketch follows this list):
- Step function
- tanh (hyperbolic tangent)
- Leaky RELU
- ELU (Exponential Linear Unit)
- Softplus (not to be confused with “Softmax”)
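For reference, here is a scalar C++ sketch of these older activation functions in the same style; the parameter defaults (e.g. alpha = 0.01 for Leaky RELU) are common choices, not values prescribed by this text.

#include <cmath>

// Binary step function: 1 for positive inputs, 0 otherwise
float step_activation(float x) {
    return x > 0.0f ? 1.0f : 0.0f;
}

// Hyperbolic tangent, squashing inputs into (-1, 1)
float tanh_activation(float x) {
    return std::tanh(x);
}

// Leaky RELU: small non-zero slope (alpha) for negative inputs
float leaky_relu(float x, float alpha = 0.01f) {
    return x > 0.0f ? x : alpha * x;
}

// ELU: exponential curve for negative inputs, linear for positive
float elu(float x, float alpha = 1.0f) {
    return x > 0.0f ? x : alpha * (std::exp(x) - 1.0f);
}

// Softplus: log(1 + exp(x)), a smooth approximation of RELU
float softplus(float x) {
    return std::log(1.0f + std::exp(x));
}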