
Inputs, Outputs and Dimensions

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


A basic activation function is a “scalar-to-scalar” function that simply performs a mathematical operation, such as zeroing out negative values with RELU. However, activation function components usually operate on the output of another Transformer component, which is typically a vector (or tensor). Hence, these activation function blocks are “vector-to-vector” operations, where the activation function is applied “element-wise” to the vector (i.e. the activation of one element does not depend on the values of any other elements of the vector).
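For concreteness, here is a minimal C++ sketch of a scalar-to-scalar RELU and its element-wise, vector-to-vector application. The function names are illustrative, not from any particular library.

#include <vector>

// Scalar-to-scalar activation: RELU zeroes out negative values.
float relu_scalar(float x) {
    return x > 0.0f ? x : 0.0f;
}

// Vector-to-vector activation: apply RELU element-wise, in place.
// The output has the same dimension as the input.
void relu_vector(std::vector<float>& v) {
    for (float& x : v) x = relu_scalar(x);
}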

The vector-to-vector architecture may not be obvious from the C++ code. You can run an activation function on the elements within any structure, including a 2-D matrix or 3-D tensor, and this is often done in optimized tensor operations. However, conceptually it is still acting element-wise on each vector. Furthermore, simple activation functions such as RELU are often “fused” back into a prior kernel operation (e.g. the preceding MatMul), to the point where it can be hard to find any C++ code for the activation function itself.
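As a sketch, the element-wise idea extends to a matrix or tensor simply by walking its underlying storage as one flat array (assuming contiguous storage; the function name is again illustrative):

#include <cstddef>

// Element-wise RELU over a contiguous block of n floats, regardless of
// whether the block is logically a vector, a matrix, or a tensor.
void relu_elementwise(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < 0.0f) data[i] = 0.0f;
    }
}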

The dimensions of the vectors that activation components receive as inputs are determined by the outputs of other components' calculations (e.g. a logits vector). The dimension of an activation component's output is the same as its input dimension.

Most activation functions are “element-wise” in the sense that each output depends on only a single number within the vector. Running an activation function on a vector (or matrix/tensor) simply applies the same operation to each element, and this is easy to parallelize because each operation is unaffected by the other values. Their element-wise nature also makes them good candidates for “kernel fusion”, whereby an activation function is merged (“fused”) with the operation that precedes it for a speedup.
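Here is an illustrative sketch of that fusion idea, not a production kernel: a matrix-vector multiply with RELU applied to each output element as it is computed, so no separate pass over the output vector is needed. The function name and layout assumptions (row-major storage) are mine, not from the text.

#include <cstddef>

// Fused matrix-vector multiply with RELU: the activation is applied to each
// output element as it is computed, avoiding a second pass over the output.
// W is rows x cols in row-major order; x has cols elements; out has rows elements.
void matvec_relu_fused(const float* W, const float* x, float* out,
                       std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            sum += W[r * cols + c] * x[c];
        }
        out[r] = (sum > 0.0f) ? sum : 0.0f;  // fused RELU
    }
}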

It is possible to define non-element-wise activation functions that depend on the other numbers in a vector, but that's usually done in other Transformer components. For example, normalization layers and the Softmax component are “vector-wise” and act on all the numbers in a vector together (e.g. scaling them all by the sum of all elements).
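For contrast, here is a minimal Softmax sketch (with the usual subtract-the-maximum trick for numerical stability) showing the vector-wise dependence: every output element depends on the sum of exponentials over the whole vector. The function name is illustrative.

#include <vector>
#include <cmath>
#include <algorithm>

// Softmax is vector-wise: every output depends on all inputs, via the maximum
// (subtracted for numerical stability) and the sum of exponentials.
void softmax_vector(std::vector<float>& v) {
    if (v.empty()) return;
    float maxval = *std::max_element(v.begin(), v.end());
    float sum = 0.0f;
    for (float& x : v) {
        x = std::exp(x - maxval);
        sum += x;
    }
    for (float& x : v) x /= sum;
}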

 

