Aussie AI
24. Normalization
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“We will be restoring normality just as soon as
we are sure what is normal anyway.”
— Douglas Adams, The Hitchhiker's Guide to the Galaxy, 1979.
What is Normalization?
Normalization is the process of controlling the generated numbers in AI inference.
We want the numbers to be like probabilities, so that the weights modify those probabilities.
Although probabilities should be in the range 0..1 (inclusive), we can allow intermediate computations (e.g. inside a component) to
create bigger or smaller numbers, provided that we
scale them back whenever we need them to be
real probabilities.
Hence, the numbers are “normalized” in Transformer computations
at various points between components and between layers.
There are several types of normalization:
- Basic min-max scaled normalization
- Z-score normalization
- Root Mean Square normalization (RMSNorm)
- Batch normalization (BatchNorm)
- Layer normalization (LayerNorm)
Note that Softmax is a type of normalization of a vector of numbers that's treated specially in Transformers. Softmax is examined in detail in Chapter 25.
Inputs, Outputs and Dimensions
The input to a normalization component is a vector of logits (probability-like values).
The output is the same vector, but with all values changed
into a normalized scale, one way or another (e.g. in the range [0..1], inclusive).
Hence, the operation is a many-to-many vector operation,
which can require multiple scans of every element of the vector.
The dimension of the vector, both input and output,
is the model's dimension size.
Typically, a model's hidden dimension will be chosen as a multiple of a power-of-two that is large enough to allow easy vectorization. Hence, the size of the normalization input and output vectors should be amenable to vectorization without any extra cases.
Why is Normalization Needed?
To simplify the issue considerably,
we don't want the interim “activations” (logit-like numbers) getting too big or too negative,
because overflowing into Inf or NaN
would be like gimbal lock: hard to reverse,
and hard to continue onwards from.
When a set of probabilities runs through a Transformer layer, the outputs are modified by the weights, which can be positive to increase the probability of outputting a token, or negative to decrease the probability (or zero if there's nothing to say). The results of dot products can therefore be numbers that are positive or negative, because of the various allowed weight values combined with the incoming probability vector.
If the input vectors contain any large positive or negative values, then these
can get further amplified by the weights in the dot product computations.
Hence, if we allow this to happen repeatedly, across multiple Transformer layers,
the magnitude of numbers can increase exponentially.
This hampers the calculation of gradients in training,
and also increases the risk of some type of overflow (e.g. to +Inf, -Inf or NaN).
Normalization is therefore used at each layer to “re-normalize” the numbers
to a more reasonable range, thereby avoiding problems with overflow
at the positive or negative ends.
Overall, it works a lot better if each component and each layer is guaranteed that its inputs
will be “reasonable” and in a “normalized” range of values (i.e. 0..1).
Hence, Transformer layers typically have a normalization component
that acts on the inputs prior to each layer, and at other points between the layer components.
Optimizing Normalization
Although not as much of a bottleneck as MatMul or VecDot operations, the normalization algorithms are quite expensive. They are many-to-many vector operations that apply on a vector of probabilities (activations or logits). There are different normalization algorithms, but they will usually require computation of the min, max, and average, which all require scanning every vector element. And then every element of the vector needs to be scaled, which processes each vector element a second time.
Research has suggested several ways to speed up the normalization component. They used to ask statisticians how to double the average function in C++, but that was too mean. Examples of normalization speed improvements include:
- Vectorization (hardware-acceleration with GPU or CPU intrinsics)
- Code optimizations (i.e. just a small matter of coding)
- Normalization alternatives (e.g. MinMax is faster than BatchNorm)
- Normalization approximations
- Integer-only normalization
- Removing normalization (“norm pruning”)
- Placement of normalization blocks (i.e. “pre-norm” vs “post-norm”)
- Fused normalization (i.e., kernel fusion with a prior operation)
Vectorization is top of the list and can be key to performance improvement of normalization.
Normalization functions are not usually as significant as MatMul in terms of time cost,
but they can still be worth optimizing.
A typical normalization requires multiple scans over all of the elements of the vector.
And this is done multiple times per token throughout each inference phase.
Norm Pruning
Some research has suggested that the normalization layer in Transformers can be removed
without a major loss in model accuracy,
which has been called “norm pruning.”
If possible, nothing could be faster than this!
The concern with this approach is two-fold.
Firstly, whether non-normalization will give rise to outliers that distort the accuracy of the output.
Secondly, the practical matter of ensuring the computations don't cause a floating-point
overflow into +Inf, -Inf or NaN.
We could add some operations that ensure we don't have outliers and avoid those nasty floating-point
oddities, but, umm, that's actually normalization, so we're just adding it back in again!
Maybe it's faster to do a type of “mini-normalization” that fixes up some of these issues without
fully normalizing every value.
It's a little unclear, since these “norm pruning” approaches are only seen in research papers so far.
Even if we cannot remove them all, it is an important design
decision how often we need to normalize.
It's not a cheap operation, so we shouldn't re-normalize
after every Transformer component.
However, the typical Transformer architectures tend to use the normalization blocks
quite heavily, in one way or another.
Pre-Norm vs Post-Norm
The original 2017 vanilla Transformer architecture (Vaswani et al., 2017) had a “post-norm” architecture
where LayerNorm was done on the output logits from a layer.
Subsequently, researchers found that switching to a “pre-norm” architecture, instead of post-norm, could
fix one of the major problems with the original Transformer,
namely that it was initially unstable in training, requiring a “warmup” phase.
Pre-norm was found to stabilize early training and remove the need for any special handling in the warmup.
Since then, several researchers have explored where to place the layer normalization submodule.
The general consensus seems to be that placing them before computations (“pre-norm”)
is better than after calculations (“post-norm”).
However, there are still research papers going either way, so there's still room for definitive research.
The pre-norm versus post-norm architectural decision is more of an issue for model accuracy
and training warmup than a computation cost decision.
There's not a lot of execution-speed benefit from simply placing the normalization blocks at different points.
Instead, we would prefer to use fewer normalization blocks overall,
rather than just rearranging them.
In fact, pre-norm may be slower than post-norm,
because it moves the post-norm normalization components from the outputs to the inputs,
but then also requires an extra normalization at the end.
Note that we cannot mix these approaches: using pre-norm for faster training
and then switching to post-norm for faster inference just won't work.
Like many other Transformer architectural decisions, the same normalization architecture must be
retained across training and inference, unless you'd rather have gibberish for output.
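To make the placement distinction concrete, here is a minimal sketch of the two orderings around a single sublayer with a residual connection. This is illustrative code written for this discussion, with simplified stand-in layer_norm_stub and sublayer_stub functions rather than a real attention or feed-forward sublayer:
#include <vector>
#include <cmath>

// Simplified stand-ins for illustration only (not a real Transformer component).
static std::vector<float> layer_norm_stub(const std::vector<float>& x)
{
    // Basic z-score style normalization of one vector.
    float mean = 0.0f;
    for (float f : x) mean += f;
    mean /= (float)x.size();
    float var = 0.0f;
    for (float f : x) var += (f - mean) * (f - mean);
    var /= (float)x.size();
    float denom = std::sqrt(var + 1e-5f);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); i++) out[i] = (x[i] - mean) / denom;
    return out;
}

static std::vector<float> sublayer_stub(const std::vector<float>& x)
{
    // Stand-in for an attention or FFN sublayer.
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); i++) out[i] = 0.5f * x[i] + 0.1f;
    return out;
}

// Post-norm (original 2017 architecture): normalize AFTER the residual addition.
std::vector<float> post_norm_block(const std::vector<float>& x)
{
    std::vector<float> y = sublayer_stub(x);
    for (std::size_t i = 0; i < y.size(); i++) y[i] += x[i];  // residual connection
    return layer_norm_stub(y);  // normalization applied to the outputs
}

// Pre-norm: normalize the INPUTS to the sublayer; the residual path is untouched.
std::vector<float> pre_norm_block(const std::vector<float>& x)
{
    std::vector<float> y = sublayer_stub(layer_norm_stub(x));  // normalization on the inputs
    for (std::size_t i = 0; i < y.size(); i++) y[i] += x[i];   // residual connection
    return y;  // an extra final normalization is still needed after the last layer
}
Note how the pre-norm version still needs one extra normalization applied to the output of the final layer, which is part of why it can end up slightly slower overall.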
Basic Min-Max Scaled Normalization
Min-max normalization is the simplest incarnation of normalization,
scaling numbers so that they are in the range [0..1], inclusive.
This transformation changes the vector so that the smallest value (minimum) is 0,
whereas the largest value (maximum) becomes 1.
If all the numbers are positive, this effectively moves them back to zero, and then scales
them with a multiplier so that the biggest is 1.
If any of the numbers are negative, this scaling also works with a minimum value that is negative,
effectively moving any negatives
over to the positive side (i.e. with the most negative value starting at zero),
and then scaling the magnitudes into the lower part of the range [0..1] inclusive,
with any positives scaled closer to 1.
In code, the simplest method is based on the minimum and maximum values:
    range = xmax - xmin;
    x[i] = ( x[i] - xmin ) / range;
Using this formula, the minimum will become 0, and the maximum will become 1.
All of the other numbers will be inside the range [0..1].
As a practical matter, note that there is a divide-by-zero problem
if the minimum and maximum are the same value.
Here is the C++ code for a naive unoptimized version:
void aussie_vector_normalize_min_max_basic(float v[], int n)
{
    float fmin = aussie_vector_min(v, n);  // Minimum
    float fmax = aussie_vector_max(v, n);  // Maximum
    float frange = fmax - fmin;
    if (frange == 0.0f) {
        yassert(frange != 0.0f);
        return;  // fail
    }
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmin) / frange;
    }
}
Reciprocal Multiplication:
The division can be optimized to a multiplication by its reciprocal,
as follows:
void aussie_vector_normalize_min_max_reciprocal(float v[], int n)
{
    float fmin = aussie_vector_min(v, n);  // Minimum
    float fmax = aussie_vector_max(v, n);  // Maximum
    float frange = fmax - fmin;
    if (frange == 0.0f) {
        yassert(frange != 0.0f);
        return;  // fail
    }
    float frecip = 1.0f / frange;
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmin) * frecip;  // Reciprocal multiplication
    }
}
Pointer Arithmetic.
The loop can be optimized with pointer arithmetic to remove the index variable.
The new version looks like:
void aussie_vector_normalize_min_max_pointer_arith(float v[], int n)
{
    float fmin = aussie_vector_min(v, n);  // Minimum
    float fmax = aussie_vector_max(v, n);  // Maximum
    float frange = fmax - fmin;
    if (frange == 0.0f) {
        yassert(frange != 0.0f);
        return;  // fail
    }
    float frecip = 1.0f / frange;
    float* vend = &v[n];
    for (; v != vend; v++) {
        *v = (*v - fmin) * frecip;  // Reciprocal multiplication
    }
}
Loop Fusion:
We have two loops in a row computing min and max,
which can obviously be merged into one scan of the vector.
This also removes the overhead of one function call.
Here is the fused-loop version of computing minimum and maximum of a vector:
float aussie_vector_min_max_fused(float v[], int n, float& vmax)
{
    // Minimum returned, maximum in parameter
    vmax = v[0];
    float vmin = v[0];
    for (int i = 1 /*not 0*/; i < n; i++) {
        if (v[i] < vmin) vmin = v[i];
        if (v[i] > vmax) vmax = v[i];
    }
    return vmin;
}
And here is the min-max normalization routine that uses the fused helper:
void aussie_vector_normalize_min_max_fusion(float v[], int n)
{
    float fmax = 0.0f;  // Maximum
    float fmin = aussie_vector_min_max_fused(v, n, fmax);
    float frange = fmax - fmin;
    if (frange == 0.0f) {
        yassert(frange != 0.0f);
        return;  // fail
    }
    float frecip = 1.0f / frange;
    float* vend = &v[n];
    for (; v != vend; v++) {
        *v = (*v - fmin) * frecip;  // Reciprocal
    }
}
Benchmarking Results of MinMax Normalization.
The MinMax normalization method is simple and relatively fast,
when compared to Z-score normalization or BatchNorm (see further below),
which do more arithmetic operations.
Running the various versions of MinMax gave the following benchmark results:
MinMax benchmarks (N=2048, ITER=100000)
MinMax basic: 999 ticks (1.00 seconds)
MinMax reciprocal: 842 ticks (0.84 seconds)
MinMax ptr arith: 710 ticks (0.71 seconds)
MinMax loop fusion: 653 ticks (0.65 seconds)
Root Mean Square Normalization
The Root Mean Square (RMS) is a well-known measure of accuracy
in statistical regressions.
It is computed as the square root of the average of the squares of every element.
When the RMS computation is applied to a vector,
there is a relationship with the vector's magnitude:
the RMS value is the vector's magnitude divided by the square root of N.
Hence, the RMS is significantly smaller than the vector magnitude.
In code, the RMS factor of a vector looks like:
    float sum_squares = aussie_vector_sum_squared(v, n);  // Sum of squares
    float avg_squares = sum_squares / n;  // Average of the squares
    float rms = sqrtf(avg_squares);  // RMS factor
Here is a basic unoptimized C++ version of RMSNorm.
Note that it uses an epsilon factor to avoid division-by-zero and subnormal fraction problems.
void aussie_vector_rms_normalize_basic(float v[], int n)  // Basic RMSNorm
{
    const float epsilon = 0.00005;  // Smoothing term -- 1e-5 (0.00005)
    float sum_squares = aussie_vector_sum_squared(v, n);  // Sum of squares
    float avg_squares = sum_squares / n;  // Average of the squares...
    float denom = sqrtf(avg_squares + epsilon);  // RMS factor
    aussie_vector_divide_scalar(v, n, denom);  // Divide by the RMS scale factor
}
RMSNorm uses the RMS factor to scale the elements of the logits vector.
Every element is divided by the RMS factor,
which should make all the elements smaller if most are large.
However, if enough of the vector elements are small fractions,
then the RMS can be a fraction less than 1,
and its division will increase the size of all the vector elements.
The RMSNorm does not ensure all elements are scaled into the range [0..1],
nor does it re-center the elements to any particular mean,
and it also does not ensure they add up to 1.
This can make it faster to calculate than some other normalizations.
On the other hand, all those multiplications for the squares can slow it down.
Z-score Normalization
A z-score is a statistical calculation
that has special meaning for a “normal distribution” of probabilities,
although we have no guarantee that our probabilities will have a distribution
that's anything like normal.
In statistics, the “mean” is the average,
and is the “center” of the normal distribution.
The idea of a “z-score” is to give each number a measure of how many standard deviations
a number is away from the mean.
A number larger than the mean (average) will have a positive z-score,
but a number less than the mean will get a negative z-score.
In code, the z-score is given by:
    zscore = (x[i] - mean) / stddev;
Note that z-scores are not limited to the range 0..1.
It is possible to get +2 if a number is two standard deviations greater than the mean,
or -2 if the number is less than the mean by two standard deviations.
Because the numbers are not fully normalized to 0..1, it is possible to get
pathological cases that result in large positive or negative values, which is what we wanted to avoid
by normalization.
For example, if there is one large value and thousands of tiny fractions, then the
larger value can get amplified in magnitude.
Hence, we may wish to remove large magnitude outliers from the z-score normalized values,
which adds another element-wise vector computation.
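For example, the z-scores could be clamped to a maximum number of standard deviations after they are computed. This is a hypothetical illustration written for this discussion (the routine name and the choice of threshold are not from the book):
// Hypothetical example: clamp z-scores to the range [-maxz, +maxz],
// e.g. maxz = 3.0f for three standard deviations.
void clip_zscore_outliers(float v[], int n, float maxz)
{
    for (int i = 0; i < n; i++) {
        if (v[i] > maxz) v[i] = maxz;
        else if (v[i] < -maxz) v[i] = -maxz;
    }
}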
Here is what the basic z-score normalization C++ code looks like without any optimization:
void aussie_vector_normalize_zscore(float v[], int n)  // Change to z-scores from normal distribution
{
    float vmean = aussie_vector_mean(v, n);
    float stddev = aussie_vector_standard_deviation(v, n);
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - vmean) / stddev;  // z-score = How many standard deviations away from the mean?
    }
}
And here is the naive C++ computation of the statistical mean (i.e. average):
float aussie_vector_mean(float v[], int n)  // Mean (same as average)
{
    if (n == 0) {
        yassert(n != 0);
        return 0.0;  // fail internal error
    }
    float sum = aussie_vector_sum(v, n);
    return sum / (float)n;
}
And here is the naive C++ computation of the standard deviation and variance measures:
float aussie_vector_standard_deviation(float v[], int n)  // Std. dev (sqrt of variance)
{
    float vvariance = aussie_vector_variance(v, n);  // Get the Variance
    return sqrtf(vvariance);  // Std. dev. is just the sqrt of variance...
}

float aussie_vector_variance(float v[], int n)  // Variance (square of std. dev.)
{
    float vmean = aussie_vector_mean(v, n);  // Get the mean/average
    float sumsquares = aussie_vector_sum_diff_squared(v, n, vmean);  // Sum of squared-diffs from mean
    return sumsquares / (float)n;  // Divide the sum-of-squares by N...
}

float aussie_vector_sum_diff_squared(float v[], int n, float meanval)
{
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        float fdiff = v[i] - meanval;
        sum += (fdiff * fdiff);  // add the squared differences..
    }
    return sum;
}
No Precomputation?
The first point to note is that there's almost nothing to precompute here.
There's only one call to the sqrtf function, which is not even done element-wise.
The other computations are simple multiplications and subtractions
that are not based on any complicated mathematical function
that would benefit from precomputation into a lookup table.
Computation reuse.
The first inefficiency to note is that the mean is actually calculated twice here:
once from the z-score function directly, and once indirectly in the variance routine.
This is a blatant algorithmic error that needs to be fixed.
We need one combined routine that returns both the mean and the standard deviation.
The correction is simply to modify the code to pass back the mean as a reference parameter.
void aussie_vector_normalize_zscore_fix_mean(float v[], int n)
{
    float vmean = 0.0f;
    float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - vmean) / stddev;  // z-score = How many standard deviations away from the mean?
    }
}
Reciprocal Multiplication:
There is a floating-point division on every element of the vector.
The division operation is very slow, and can be replaced by a floating-point multiplication
by its reciprocal:
void aussie_vector_normalize_zscore_reciprocal(float v[], int n)
{
    float vmean = 0.0f;
    float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
    float frecip = 1.0f / stddev;  // Before the loop
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - vmean) * frecip;  // Multiply by reciprocal
    }
}
Note that the reciprocal is calculated before the loop, only once.
Otherwise, we would be needlessly putting “loop-invariant code” into the loop body.
Moving such code out of the loop body is called “hoisting” or “loop code motion.”
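Both of the versions above rely on the combined helper aussie_vector_mean_and_stddev, whose body is not shown in this excerpt. A minimal sketch of such a routine, assuming it simply merges the earlier mean and standard-deviation helpers, might be:
// Sketch of the combined helper used above; the book's actual version is not
// shown in this excerpt, so this is an assumed implementation that merges the
// earlier mean and standard-deviation routines into one.
float aussie_vector_mean_and_stddev(float v[], int n, float& vmean)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += v[i];
    vmean = sum / (float)n;  // Mean passed back via the reference parameter
    float sumsquares = 0.0f;
    for (int i = 0; i < n; i++) {
        float fdiff = v[i] - vmean;
        sumsquares += fdiff * fdiff;  // Sum of squared differences from the mean
    }
    return sqrtf(sumsquares / (float)n);  // Return the standard deviation
}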
Loop Fusion with Computation Reuse:
A careful look at the code shows that the expression “v[i]-mean” is computed
twice for each element, once as part of the variance calculation,
and then again in the scaling loop at the end.
We can optimize this by computing it once only, and storing it in the vector itself
as part of the variance algorithm,
allowing computation reuse for the second time.
Here is the “fused” version of sum-of-squared-differences that also stores the result:
float aussie_vector_sum_diff_squared_fused(float v[], int n, float meanval)
{
    // Fused version of "sum diff squared" that leaves the DIFF in the vector...
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        float fdiff = v[i] - meanval;
        sum += (fdiff * fdiff);  // add the squared differences..
        v[i] = fdiff;  // Fusion: store the diff there (for another loop to use later)
    }
    return sum;
}
The main loop can then avoid the subtraction:
void aussie_vector_normalize_zscore_fused(float v[], int n)
{
    float vmean = 0.0f;
    float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
    float frecip = 1.0f / stddev;
    for (int i = 0; i < n; i++) {
        v[i] *= frecip;  // No subtraction
    }
}
Interestingly, benchmarking showed that this code performed only a few percent better.
Hence, adding the assignment to v[i] is almost as expensive as removing the “v[i]-mean” subtraction.
However, the new loop structure is also more amenable to vectorization.
Vectorization:
There are multiple loops that can be vectorized in this z-score normalization algorithm:
Vectorized Summation.
We can use vectorization routines in AVX1 and AVX2 to parallelize the calculation
of the sum for the mean.
This is a horizontal reduction of the vector to its sum,
which is shown in Chapter 17.
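Chapter 17 covers the actual intrinsics; as a rough sketch of the idea, a horizontal summation using 128-bit x86 SIMD intrinsics might look like the following (illustrative code written for this discussion, assuming n is a multiple of 4; it is not the Chapter 17 version):
#include <immintrin.h>

// Illustrative sketch: horizontal sum of a float vector with 128-bit SIMD,
// assuming n is a multiple of 4 (not the book's Chapter 17 code).
float vector_sum_simd_sketch(const float v[], int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(&v[i]));  // 4 partial sums in parallel
    }
    // Horizontal reduction of the 4 partial sums down to one float.
    __m128 shuf = _mm_movehdup_ps(acc);   // duplicate the odd lanes
    __m128 sums = _mm_add_ps(acc, shuf);  // pairwise sums
    shuf = _mm_movehl_ps(shuf, sums);     // move the high pair down
    sums = _mm_add_ss(sums, shuf);        // final scalar sum in lane 0
    return _mm_cvtss_f32(sums);
}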
Vectorized Summation and Multiply-by-Scalar.
In addition to the vector summation,
the last step of multiplying each element by a single scalar, the reciprocal,
can also be easily vectorized.
This element-wise multiplication by a scalar in AVX1 and AVX2 is shown in Chapter 17.
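Again, Chapter 17 has the full versions; an illustrative 128-bit sketch of multiply-by-scalar (written for this discussion, assuming n is a multiple of 4) is:
#include <immintrin.h>

// Illustrative sketch: multiply every element by one scalar with 128-bit SIMD,
// assuming n is a multiple of 4 (not the book's Chapter 17 code).
void vector_multiply_scalar_simd_sketch(float v[], int n, float c)
{
    __m128 cvec = _mm_set1_ps(c);  // broadcast the scalar into all 4 lanes
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(&v[i]);
        _mm_storeu_ps(&v[i], _mm_mul_ps(x, cvec));  // 4 multiplications at once
    }
}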
Loop Fission for Further Vectorization.
We have now vectorized the first summation (for the mean) and the final multiplication
by a scalar (the reciprocal of the standard deviation).
However, we haven't vectorized the two middle algorithm steps: subtraction of the mean (to get the differences),
and the summation of the square differences.
Let us consider again this “fused” code that
computes the sum-of-squared-differences and also stores the differences back in the vector:
float aussie_vector_sum_diff_squared_fused(float v[], int n, float meanval)
{
    // Fused version of "sum diff squared" leaving the DIFF in the vector...
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        float fdiff = v[i] - meanval;
        sum += (fdiff * fdiff);  // add the squared differences..
        v[i] = fdiff;  // Fusion: store the diff
    }
    return sum;
}
This is conceptually easier for vectorization if we do loop fission to split out the loops.
Here is the version with loop fission splitting the main loop into two separate loops,
which is inefficient in sequential code, but helps with vectorization for hardware acceleration:
float aussie_vector_sum_diff_squared_fission(float v[], int n, float meanval)
{
    // FISSION version of "sum diff squares"
    // Loop 1. Calculate DIFF in the vector...
    for (int i = 0; i < n; i++) {
        v[i] = v[i] - meanval;
    }
    // Loop 2. Calculate the sum-of-squares...
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        float f = v[i];
        sum += (f * f);  // add the squares..
    }
    return sum;
}
A simpler way to write this, with the vectorizable primitives made clearer, is:
float aussie_vector_sum_diff_squared_fissionB(float v[], int n, float meanval)
{
    // FISSION version of "sum diff squared"
    aussie_vector_add_scalar(v, n, -meanval);  // Loop 1. DIFFs
    float sum = aussie_vector_sum_squares(v, n);  // Loop 2. Sum-of-squares...
    return sum;
}
AVX Vectorized Versions of Z-score Normalization.
AVX intrinsics are SIMD instructions for x86 CPUs that allow vectorized operations on 4 float values (AVX1)
or 8 float values (AVX2) in parallel without a GPU.
I coded up three versions vectorized with AVX1 and AVX2,
using the C++ intrinsic calls (see the details on AVX vectorization in Chapter 17).
The first z-score version had just the summation in AVX,
with the second adding multiply-by-scalar at the end vectorized with AVX.
The third version had all four main steps vectorized with AVX1 or AVX2.
Z-score normalization benchmarking results.
Here are the benchmark timing results for the various z-score normalization versions.
Z-score normalization benchmarks (N=2048, ITER=100000)
Z-score norm basic: 2081 ticks (2.08 seconds)
Z-score norm fix mean: 1413 ticks (1.41 seconds)
Z-score norm reciprocal: 1372 ticks (1.37 seconds)
Z-score norm fused: 1406 ticks (1.41 seconds)
Z-score norm AVX1 sum: 1094 ticks (1.09 seconds)
Z-score norm AVX1 sum+multiply: 940 ticks (0.94 seconds)
Z-score norm AVX2 sum: 994 ticks (0.99 seconds)
Z-score norm AVX2 sum+multiply: 781 ticks (0.78 seconds)
Z-score norm AVX1 all: 754 ticks (0.75 seconds)
Z-score norm AVX2 all: 429 ticks (0.43 seconds)
An extension would be to add a vectorized version using AVX-512 (16 float numbers) or the upcoming AVX-10.
Another extension would be to test whether it was a good idea to do “loop fission” above,
by trying to create AVX1 and AVX2 versions of the original sum-difference-squared fused loop.
The loop fission version made things easier to code, but might have been the slower option.
Batch Normalization
Batch normalization or “BatchNorm” was introduced
to generalize simple normalization with some extra parameters
called the “gain” and “bias” factors.
The gain factor is a multiplicative scaling factor, often called “lambda” or “alpha.”
The bias factor is an additive factor, often called “beta.”
There is also a third factor, usually called “epsilon,” which has a small magnitude
and whose purpose is more about practicality than intelligence.
It is used mainly to avoid pathological cases
such as division-by-zero or a floating-point overflow that could result
from division by a tiny divisor.
The batch normalization operation involves scanning the full vector,
modifying each element so that it is re-centered to a zero mean,
and re-scaled to a normal magnitude.
Example: BatchNorm Optimization
A naive, non-optimized C++ version of BatchNorm looks like this:
void aussie_vector_batch_normalize_basic(  // Basic normalization (BatchNorm)
    float v[], int n,
    float epsilon,  // Smoothing term -- usually 1e-5 (0.00005)
    float lambda,   // Scaling term hyper-parameter (multiplication)
    float beta      // Bias/shift term hyper-parameter (addition)
)
{
    float fmean = aussie_vector_mean(v, n);  // Calculate "mean" (aka average)
    float variance = aussie_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
    float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmean) / denom;  // Normalize all elements to re-center and scale
    }
    aussie_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
    aussie_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values
}
Loop Fusion.
This version is very inefficient with literally five scans of the entire vector.
Loop fusion can obviously improve this, with the loops doing multiplication by lambda
and the addition of beta merged into the prior “for” loop.
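As an illustration of that fused structure, here is a sketch written for this discussion (not the benchmarked code below), using the same helper functions as the basic version:
// Sketch only: lambda multiplication and beta addition fused into the main
// normalization loop, so the elements are rewritten in one scan instead of three.
void batch_normalize_fused_sketch(float v[], int n,
    float epsilon, float lambda, float beta)
{
    float fmean = aussie_vector_mean(v, n);  // Mean (one scan)
    float variance = aussie_vector_variance_of_mean(v, n, fmean);  // Variance (one scan)
    float denom = sqrtf(variance + epsilon);
    for (int i = 0; i < n; i++) {
        v[i] = ((v[i] - fmean) / denom) * lambda + beta;  // Re-center, scale, gain, bias
    }
}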
Loop Fission.
Further optimizations become clear once we notice that each element of the vector has four operations being performed on it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta.
We can use a loop fission optimization to split out the first two operations into separate loops, where simpler operations are probably faster with hardware acceleration.
And then we notice that division and multiplication are two versions of the same operation,
so we can then use the loop fusion technique to merge the division-by-denom
and multiply-by-lambda
into a single multiplication by a combined scaling factor.
These changes result in faster C++ code that has one less loop
and also calls atomic vector operations (which are easier to hardware-accelerate):
    aussie_vector_add_scalar(v, n, -mean);  // Subtract the mean (re-centering)
    float scalef = lambda / denom;  // Combined scale factor
    aussie_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda
    aussie_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values
Fusion and Fission.
A little thought shows us that we are still
subtracting the mean from every element twice: once inside the variance computation,
and once in the explicit re-centering call.
Hence, we can use another more complicated type of “loop fusion” to combine those two loops, by having the variance
loop leave the re-centered value (“diff”) stored inside the vector,
so that we can fully remove the re-centering loop that subtracts the mean.
    float mean = aussie_vector_mean(v, n);
    float variance = aussie_vector_variance_of_mean_fused(v, n, mean);  // FUSION: leaves DIFF from MEAN in vector...
    // NOT NEEDED! ... aussie_vector_add_scalar(v, n, -mean);  // Subtract the mean
    float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
    // ... etc
AVX Vectorization.
I used AVX1 and AVX2 SIMD hardware-acceleration for the x86 CPU to vectorize
some of these BatchNorm versions.
What I did was use the AVX vectorized versions for summation, multiply-by-scalar,
and the fused variance loop.
Benchmarking results of these optimizations are shown below.
AVX1 can vectorize 4 float values at a time (128 bits total),
whereas AVX2 does 8 float values.
AVX-512 does 16 float numbers, but it isn't always available.
Details of how to use AVX1, AVX2 and AVX-512 for vectorization are shown in Chapter 17.
Fewer Normalization Parameters.
Another algorithmic way to optimize this code is simply to remove the lambda and beta parameters.
Choosing lambda=1 and beta=0
means that the last two code loops, for scalar multiplication and scalar addition, can be avoided.
However, there's now little benefit to removing lambda in the merged code above,
although the add-beta loop can still be removed.
Anyway, whether we can remove these parameters is not a speed decision,
but depends
on whether these two learned parameters are important to the overall model's capability.
Note that there is also little value in trying to remove epsilon, as it is only used once in total.
Benchmarking BatchNorm:
Here are the results when I benchmarked this code for various basic C++ versions (i.e. sequential execution)
and hardware-accelerated versions that are vectorized with x86 CPU AVX1/AVX2 intrinsics:
BatchNorm benchmarks (N=2048, ITER=100000)
BatchNorm basic: 2070 ticks (2.07 seconds)
BatchNorm reciprocal: 2135 ticks (2.13 seconds)
BatchNorm fission: 1879 ticks (1.88 seconds)
BatchNorm fusion/fission: 1633 ticks (1.63 seconds)
BatchNorm no params: 1375 ticks (1.38 seconds)
BatchNorm fission AVX1: 1200 ticks (1.20 seconds)
BatchNorm fission AVX2: 931 ticks (0.93 seconds)
BatchNorm fusion/fission AVX1: 788 ticks (0.79 seconds)
BatchNorm fusion/fission AVX2: 460 ticks (0.46 seconds)
This generally shows what we expect.
Hardware-accelerated versions with AVX1 or AVX2 beat the sequential versions,
and AVX2 (8 float numbers in parallel) does better than AVX1 (4 float numbers).
Interestingly, although theoretically loop fission can improve vectorization,
it was the hand-fused version that did better overall.
Although the fused AVX versions had the same sequence of AVX arithmetic operations,
presumably they also avoided some AVX loads and stores.
Layer Normalization
Layer normalization or “LayerNorm” was introduced in 2016 as an improvement on BatchNorm (Ba et al., 2016).
The idea was to standardize the normalization across an entire layer of the AI engine.
The effect is to standardize the scaling parameters for each “batch” in a layer,
rather than having different parameters for different batches in a layer.
Since then, LayerNorm has largely superseded BatchNorm, and is widely used in Transformers.
The default LayerNorm algorithm had two parameters of “bias” and “gain,”
which are the same as for BatchNorm.
However, it was found that LayerNorm was better without these parameters,
a variant originally called “LayerNorm-simple” (Xu et al., 2019).
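In code terms, the LayerNorm calculation is essentially the z-score normalization shown earlier, applied to a layer's activation vector, with the optional gain and bias applied afterwards. Here is a minimal sketch in the style of the earlier examples (an illustrative version written for this chapter, using scalar gain and bias for simplicity, whereas full implementations typically learn per-element parameters):
// Illustrative LayerNorm-style sketch (not a production implementation).
// Set gamma = 1.0f and beta = 0.0f for the "LayerNorm-simple" variant.
void layer_normalize_sketch(float v[], int n,
    float epsilon,  // e.g. 1e-5
    float gamma,    // gain parameter
    float beta)     // bias parameter
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += v[i];
    float mean = sum / (float)n;
    float sumsquares = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = v[i] - mean;
        sumsquares += diff * diff;
    }
    float denom = sqrtf(sumsquares / (float)n + epsilon);
    for (int i = 0; i < n; i++) {
        v[i] = ((v[i] - mean) / denom) * gamma + beta;  // Re-center, scale, gain, bias
    }
}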
LayerNorm has been an innate part of most Transformer architectures since it was discovered,
and has been continually shown to be important.
For example, it helps ensure that attention is spread equally across all keys in the forward pass,
and it smooths the gradients in the backward pass for faster training.
However, researchers are still struggling to understand why it works so well,
and there are still research papers being written about LayerNorm.