Aussie AI
Inputs, Outputs and Dimensions
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
A basic activation function is a “scalar-to-scalar” function that simply performs a mathematical operation, such as zeroing out negative values with RELU. However, activation function components usually operate on the output of another Transformer component, which is typically a vector (or tensor). Hence, these activation function blocks are “vector-to-vector” operations, where the activation function is an “element-wise” operation on the vector (i.e. the activation of one element does not depend on the values of any other elements of the vector).
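As a minimal sketch in C++, the scalar and vector-to-vector versions might look like this (the function and parameter names are illustrative only, not taken from any particular codebase):

// Scalar-to-scalar RELU: clamp negative values to zero.
float relu_scalar(float x)
{
    return (x < 0.0f) ? 0.0f : x;
}

// Vector-to-vector activation: apply RELU element-wise, in place.
void relu_vector(float v[], int n)
{
    for (int i = 0; i < n; i++) {
        v[i] = relu_scalar(v[i]);   // Each element is independent of the others
    }
}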
The vector-to-vector architecture may not be obvious from the C++ code. You can run an activation function on the elements within any structure, including a 2-D matrix or 3-D tensor, and this is often done in optimized tensor operations. However, conceptually it is still acting on each vector. Furthermore, simple activation functions such as RELU will often be “fused” back into a prior kernel operation (e.g. the preceding MatMul), to the point where it is hard to find the C++ code for the activation function itself.
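For example, a sketch of running RELU over a whole 2-D matrix can simply flatten the loop, because the operation is per-element regardless of the shape of the container (the row-major contiguous storage and the names here are assumptions for illustration):

// Element-wise activation over a 2-D matrix stored contiguously (row-major).
// Conceptually it still acts on each row vector; the flat loop is an optimization.
void relu_matrix_inplace(float* m, int rows, int cols)
{
    int n = rows * cols;
    for (int i = 0; i < n; i++) {
        m[i] = (m[i] < 0.0f) ? 0.0f : m[i];
    }
}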
The vectors that activation components receive as inputs are the outputs of calculations in other components (e.g. a logits vector), so their dimension is determined by those components. The output of an activation component has the same dimension as its input.
Most activation functions are “element-wise” in the sense that they depend on only a single number at a time within the vector. Running an activation function on a vector (or matrix/tensor) is just doing the same thing to each element, and is easy to parallelize because each operation is unaffected by the other values. Their element-wise nature also makes them good candidates for “kernel fusion,” whereby activation functions are merged (“fused”) with the operation that preceded them for a speedup.
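Here is a rough sketch of what such a fusion might look like in C++, with RELU folded into a naive matrix-vector multiply so the result vector never needs a second pass (this is an illustrative example, not a production kernel; names and the row-major layout are assumptions):

// Naive matrix-vector multiply with RELU fused into the output loop.
void matvec_relu_fused(const float* m, const float* v, float* out,
                       int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++) {
            sum += m[r * cols + c] * v[c];
        }
        out[r] = (sum < 0.0f) ? 0.0f : sum;  // Fused RELU on each output element
    }
}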
It is possible to define non-element-wise activation functions that depend on the other numbers in a vector, but that's usually done in other Transformer components. For example, normalization layers and the Softmax component are “vector-wise” and act on all the numbers in a vector together (e.g. scaling them all by the sum of all elements).
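For contrast, here is a minimal, non-optimized sketch of Softmax, where every output element depends on the sum over all elements of the vector (the max-subtraction step is the standard trick for numerical stability):

#include <cmath>

// Vector-wise Softmax: each output is scaled by the sum over all elements.
void softmax_vector(float v[], int n)
{
    float maxval = v[0];
    for (int i = 1; i < n; i++) {
        if (v[i] > maxval) maxval = v[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        v[i] = std::exp(v[i] - maxval);  // Exponentiate (shifted for stability)
        sum += v[i];
    }
    for (int i = 0; i < n; i++) {
        v[i] /= sum;   // Scale by the sum of all elements
    }
}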