Aussie AI
Why is Normalization Needed?
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Why is Normalization Needed?
To simplify the issue considerably, we don't want the interim “activations” (numbers similar to logits) getting too large or too negative, because overflowing into Inf or NaN is a bit like gimbal lock: hard to reverse and continue onwards from.
When a set of probabilities runs through a Transformer layer, the outputs are modified by the weights, which can be positive to increase the probability of outputting a token, negative to decrease it, or zero if there's nothing to say. The results of the dot products can therefore be positive or negative numbers, depending on how the various allowed weight values combine with the incoming probability vector.
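As a rough illustration, here is a minimal C++ sketch (not taken from the book) of a dot product where mixed-sign weights can push an activation value either positive or negative; the function name and example values are purely hypothetical:

// Minimal sketch: a dot product where positive, negative, and zero weights
// combine with an incoming probability-like vector.
#include <cstdio>
#include <vector>

// Hypothetical helper: dot product of an input vector with one weight row.
float dot_product(const std::vector<float>& input, const std::vector<float>& weights)
{
    float sum = 0.0f;
    for (size_t i = 0; i < input.size(); i++) {
        sum += input[i] * weights[i];  // Positive weights boost, negative weights suppress
    }
    return sum;
}

int main()
{
    std::vector<float> input   = { 0.2f, 0.7f, 0.1f };   // Incoming probability-like values
    std::vector<float> weights = { 1.5f, -2.0f, 0.0f };  // Positive, negative, and zero weights
    printf("Activation: %f\n", dot_product(input, weights));  // Negative result in this case
    return 0;
}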
If the input vectors contain any large positive or negative values, then these can be amplified further by the weights in the dot product computation. Hence, if we allow this to happen repeatedly, across multiple Transformer layers, the magnitude of the numbers can increase exponentially. This hampers the calculation of gradients during training, and also increases the risk of some type of overflow (e.g. to +Inf, -Inf, or NaN).
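To see how quickly this can go wrong, here is a minimal C++ sketch (not from the book) where an activation is repeatedly amplified by a weight greater than 1 at each “layer” with no normalization in between, until the float overflows to +Inf; the weight value and layer count are illustrative assumptions:

// Minimal sketch: repeated amplification across layers grows a value
// exponentially until it overflows to +Inf.
#include <cstdio>
#include <cmath>

int main()
{
    float activation = 2.0f;
    const float weight = 10.0f;            // Amplifying weight, purely illustrative
    for (int layer = 1; layer <= 50; layer++) {
        activation *= weight;              // No normalization between layers
        if (std::isinf(activation)) {
            printf("Overflow to +Inf at layer %d\n", layer);
            break;
        }
    }
    return 0;
}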
Normalization is therefore used at each layer to “re-normalize” the numbers to a more reasonable range, thereby avoiding problems with overflow at the positive or negative ends. Overall, everything works a lot better if each component and each layer is guaranteed that its inputs will be “reasonable” and in a “normalized” range of values (e.g. 0..1). Hence, Transformer layers typically have a normalization component that acts on the inputs prior to each layer, and at other points between the layer components.
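As a simple sketch of the idea (not the book's actual method), the following C++ code rescales a vector into the 0..1 range using min-max normalization; production Transformers typically use LayerNorm or RMSNorm instead, but the goal of keeping the values in a reasonable range is the same:

// Minimal sketch: rescale a vector of activations into the 0..1 range.
#include <algorithm>
#include <cstdio>
#include <vector>

void normalize_0_to_1(std::vector<float>& vec)
{
    auto [min_it, max_it] = std::minmax_element(vec.begin(), vec.end());
    float lo = *min_it, hi = *max_it;
    float range = hi - lo;
    if (range == 0.0f) return;             // Avoid division by zero for constant vectors
    for (float& x : vec) {
        x = (x - lo) / range;              // Every value now lies in 0..1
    }
}

int main()
{
    std::vector<float> activations = { -350.0f, 5.0f, 1200.0f };
    normalize_0_to_1(activations);
    for (float x : activations) printf("%f ", x);  // Prints values between 0 and 1
    printf("\n");
    return 0;
}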