Aussie AI

24. Normalization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“We will be restoring normality just as soon as
we are sure what is normal anyway.”

— Douglas Adams, The Hitchhiker's Guide to the Galaxy, 1979.

 

 

What is Normalization?

Normalization is the process of controlling the scale of the numbers generated during AI inference. We want the numbers to be like probabilities, so that the weights modify those probabilities. Although probabilities should be in the range 0..1 (inclusive), we can allow intermediate computations to create bigger or smaller numbers (e.g. inside a component), provided that we scale them back whenever we need them to be real probabilities. Hence, the numbers are “normalized” in Transformer computations at various points between components and between layers.

There are several types of normalization:

  • Basic min-max scaled normalization
  • Z-score normalization
  • Root Mean Square normalization (RMSNorm)
  • Batch normalization (BatchNorm)
  • Layer normalization (LayerNorm)

Note that Softmax is a type of normalization of a vector of numbers that's treated specially in Transformers. Softmax is examined in detail in Chapter 25.

Inputs, Outputs and Dimensions

The input to a normalization component is a vector of logits (probability-like values). The output is the same vector, but with all values changed into a normalized scale, one way or another (e.g. in the range [0..1], inclusive). Hence, the operation is a many-to-many vector operation, which can require multiple scans of every element of the vector. The dimension of the vector, both input and output, is the model's dimension size.

Typically, a model's hidden dimension will be chosen as a multiple of a power-of-two that is large enough to allow easy vectorization. Hence, the size of the normalization input and output vectors should be amenable to vectorization without any special cases for leftover elements.

Why is Normalization Needed?

To simplify the issue considerably: we don't want the interim “activations” (logit-like numbers) getting too large in either the positive or negative direction, because overflowing into Inf or NaN would be like gimbal lock: hard to reverse, and it propagates onwards through the rest of the calculations.

When a set of probabilities runs through a Transformer layer, the outputs are modified by the weights, which can be positive to increase the probability of outputting a token, or negative to decrease the probability (or zero if there's nothing to say). The results of dot products can therefore be numbers that are positive or negative, because of the various allowed weight values combined with the incoming probability vector.

If the input vectors contain any large positive or negative values, then these can get amplified by more weights in the dot product computation. Hence, if we allow this to happen repeatedly, across multiple Transformer layers, the magnitude of the numbers can increase exponentially. This hampers the gradient calculations in training, and also increases the risk of some type of overflow (e.g. to +Inf, -Inf or NaN). Normalization is therefore used at each layer to “re-normalize” the numbers to a more reasonable range, thereby avoiding problems with overflow at the positive or negative ends.

Overall, it works a lot better if each component and each layer is guaranteed that its inputs will be “reasonable” and in a “normalized” range of values (i.e. 0..1). Hence, Transformer layers typically have a normalization component that acts on the inputs prior to each layer, and at other points between the layer components.

Optimizing Normalization

Although not as much of a bottleneck as MatMul or VecDot operations, the normalization algorithms are quite expensive. They are many-to-many vector operations that apply to a vector of probabilities (activations or logits). The different normalization algorithms usually require computing statistics such as the min, max, mean, or variance, each of which requires scanning every vector element. And then every element of the vector needs to be scaled, which processes each vector element a second time.

Research has suggested various ways to speed up the normalization component. They used to ask statisticians how to double the average function in C++, but that was too mean. Examples of normalization speed improvements include:

  • Vectorization (hardware-acceleration with GPU or CPU intrinsics)
  • Code optimizations (i.e. just a small matter of coding)
  • Normalization alternatives (e.g. MinMax is faster than BatchNorm)
  • Normalization approximations
  • Integer-only normalization
  • Removing normalization (“norm pruning”)
  • Placement of normalization blocks (i.e. “pre-norm” vs “post-norm”)
  • Fused normalization (i.e., kernel fusion with a prior operation)

Vectorization is top of the list and can be key to performance improvement of normalization. Normalization functions are not usually as significant as MatMul in terms of time cost, but they can still be worth optimizing. A typical normalization requires multiple scans of all the elements of its vector, and this is done multiple times per token throughout each inference phase.

Norm Pruning

Some research has suggested that the normalization layer in Transformers can be removed without a major loss in model accuracy, which has been called “norm pruning.” If possible, nothing could be faster than this!

The concern with this approach is two-fold. Firstly, whether skipping normalization will give rise to outliers that distort the accuracy of the output. Secondly, the practical matter of ensuring the computations don't cause a floating-point overflow into +Inf, -Inf or NaN. We could add some operations that ensure we don't have outliers and avoid those nasty floating-point oddities, but, umm, that's actually normalization, so we're just adding it back in again! Maybe it's faster to do a type of “mini-normalization” that fixes up some of these issues without fully normalizing every value. It's a little unclear, since these “norm pruning” approaches have only been seen in research papers so far.

Even if we cannot remove them all, it is an important design decision how often we need to normalize. It's not a cheap operation, so we shouldn't re-normalize after every Transformer component. However, the typical Transformer architectures tend to use the normalization blocks quite heavily, in one way or another.

Pre-Norm vs Post-Norm

The original 2017 vanilla Transformer architecture (Vaswani et al, 2017) had a “post-norm” architecture where LayerNorm was done on the output logits from a layer. Subsequently, researchers found that switching to a “pre-norm” architecture, instead of post-norm, could fix one of the major problems with the original Transformer, namely that it was initially unstable in training, requiring a “warmup” phase. Pre-norm was found to stabilize early training and remove the need for any special handling in the warmup.

Since then, several researchers have explored where to place the layer normalization submodule. The general consensus seems to be that placing it before the computations (“pre-norm”) is better than after them (“post-norm”). However, there are research papers going either way, so there is still room for definitive research.
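In pseudocode sketch form, with “Sublayer” standing for a Transformer sub-component (attention or FFN) and “x” the residual input, the two placements are:

    // Post-norm (original 2017 Transformer): normalize the sublayer's output
    y = LayerNorm(x + Sublayer(x));

    // Pre-norm: normalize the sublayer's input; the residual path stays un-normalized
    y = x + Sublayer(LayerNorm(x));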

The pre-norm versus post-norm architectural decision is more of an issue for model accuracy and training warmup than a computation cost decision. There is not much execution speed benefit from simply moving the normalization blocks to a different position. Rather, we would prefer to use fewer normalization blocks overall, rather than just rearrange them. In fact, pre-norm may be slower than post-norm, because it moves the post-norm normalization components from the outputs to the inputs, but then also requires an extra normalization at the end.

Note that we cannot mix these approaches: using pre-norm for faster training and then switching to post-norm for faster inference just won't work. Like many other Transformer architectural decisions, the same normalization architecture must be retained across training and inference, unless you'd rather have gibberish for output.

Basic Min-Max Scaled Normalization

Min-max normalization is the simplest incarnation of normalization, scaling numbers so that they are in the range [0..1], inclusive. The transformation changes the vector so that the smallest value (minimum) becomes 0 and the largest value (maximum) becomes 1. If all the numbers are positive, this effectively shifts them back to zero, and then scales them with a multiplier so that the biggest is 1. If any of the numbers are negative, the same scaling works with a negative minimum, effectively shifting the negatives over to the positive side (i.e. with the most negative value mapped to zero), and then scaling the magnitudes into the lower part of the range [0..1], with any positives scaled closer to 1. In code, the simplest method is based on the minimum and maximum values:

    range = xmax - xmin;
    x[i] = ( x[i] - xmin ) / range;

Using this formula, the minimum will become 0, and the maximum will become 1. All of the other numbers will be inside the range [0..1]. As a practical matter, note that there is a divide-by-zero problem if the minimum and maximum are the same value.

Here is the C++ code for a naive unoptimized version:

    void aussie_vector_normalize_min_max_basic(float v[], int n)
    {
        float fmin = aussie_vector_min(v, n);   // Minimum
        float fmax = aussie_vector_max(v, n);   // Maximum
        float frange = fmax - fmin;
        if (frange == 0.0f) {
            yassert(frange != 0.0f);
            return;  // fail
        }
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - fmin) / frange;
        }
    }

Reciprocal Multiplication: The division can be optimized to a multiplication by its reciprocal, as follows:

    void aussie_vector_normalize_min_max_reciprocal(float v[], int n)
    {
        float fmin = aussie_vector_min(v, n);   // Minimum
        float fmax = aussie_vector_max(v, n);   // Maximum
        float frange = fmax - fmin;
        if (frange == 0.0f) {
            yassert(frange != 0.0f);
            return;  // fail
        }
        float frecip = 1.0f / frange;
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - fmin) * frecip;  // Reciprocal
        }
    }

Pointer Arithmetic. The loop can be optimized with pointer arithmetic to remove the index variable. The new version looks like:

    void aussie_vector_normalize_min_max_pointer_arith(float v[], int n)
    {
        float fmin = aussie_vector_min(v, n);   // Minimum
        float fmax = aussie_vector_max(v, n);   // Maximum
        float frange = fmax - fmin;
        if (frange == 0.0f) {
            yassert(frange != 0.0f);
            return;  // fail
        }
        float frecip = 1.0f / frange;
        float* vend = &v[n];
        for (; v != vend; v++) {
            *v = (*v - fmin) * frecip;  // Reciprocal multiplication
        }
    }

Loop Fusion: We have two loops in a row computing min and max, which can obviously be merged into one scan of the vector. This also removes the overhead of one function call. Here is the fused-loop version of computing minimum and maximum of a vector:

    float aussie_vector_min_max_fused(float v[], int n, float& vmax)  
    {
        // Minimum returned, maximum in parameter
        vmax = v[0];
        float vmin = v[0];
        for (int i = 1 /*not 0*/; i < n; i++) {
            if (v[i] < vmin) vmin = v[i];
            if (v[i] > vmax) vmax = v[i];
        }
        return vmin;
    }

    void aussie_vector_normalize_min_max_fusion(float v[], int n)
    {
        float fmax = 0.0f;   // Maximum
        float fmin = aussie_vector_min_max_fused(v, n, fmax);
        float frange = fmax - fmin;
        if (frange == 0.0f) {
            yassert(frange != 0.0f);
            return;  // fail
        }
        float frecip = 1.0f / frange;
        float* vend = &v[n];
        for (; v != vend; v++) {
            *v = (*v - fmin) * frecip;  // Reciprocal
        }
    }

Benchmarking Results of MinMax Normalization. The MinMax normalization method is simple and relatively fast, when compared to Z-score normalization or BatchNorm (see further below), which do more arithmetic operations. Running the various versions of MinMax gave the following benchmark results:

    MinMax benchmarks (N=2048, ITER=100000)
    MinMax basic: 999 ticks (1.00 seconds)
    MinMax reciprocal: 842 ticks (0.84 seconds)
    MinMax ptr arith: 710 ticks (0.71 seconds)
    MinMax loop fusion: 653 ticks (0.65 seconds)

Root Mean Square Normalization

The Root Mean Square (RMS) is a well-known measure of accuracy in statistical regressions. It is computed as the square root of the average of the squares of every element.

    float sum_squares = aussie_vector_sum_squared(v, n); // Sum of squares
    float avg_squares = sum_squares / n;  // Average of the squares
    float rms = sqrtf(avg_squares);  // RMS factor

When the RMS computation is applied to a vector, there is a relationship with the vector magnitude. The RMS value is the vector's magnitude divided by the square root of N. Hence, for the large vector dimensions in LLMs, the RMS is significantly smaller than the vector's magnitude.
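Continuing the fragment above, the relationship follows directly from the definitions, since dividing by N inside the square root is the same as dividing the magnitude by the square root of N:

    float magnitude = sqrtf(sum_squares);        // Euclidean length (norm) of the vector
    float rms2 = sqrtf(sum_squares / (float)n);  // Same RMS value as above
    // rms2 == sqrtf(sum_squares) / sqrtf((float)n) == magnitude / sqrtf(n)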

Here is a basic unoptimized C++ version of RMSNorm. Note that it uses an epsilon factor to avoid division-by-zero and subnormal fraction problems.

    void aussie_vector_rms_normalize_basic(float v[], int n)  // Basic RMSNorm
    {
        const float epsilon = 0.00005f; // Smoothing term -- 5e-5 (0.00005)
        float sum_squares = aussie_vector_sum_squared(v, n);  // Sum of squares
        float avg_squares = sum_squares / n;         // Average of the squares...
        float denom = sqrtf(avg_squares + epsilon);  // RMS factor
        aussie_vector_divide_scalar(v, n, denom);  // Divide by the RMS scale factor
    }
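The helper routines called here are defined elsewhere in the book's code; minimal sketches of what they presumably look like (the exact Aussie AI implementations may differ) are:

    float aussie_vector_sum_squared(float v[], int n)   // Sum of squares (sketch)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += v[i] * v[i];   // Add the square of each element
        }
        return sum;
    }

    void aussie_vector_divide_scalar(float v[], int n, float fdenom)  // Divide all elements (sketch)
    {
        for (int i = 0; i < n; i++) {
            v[i] /= fdenom;
        }
    }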

RMSNorm uses the RMS factor to scale the elements of the logits vector. Every element is divided by the RMS factor, which makes the elements smaller when most of them are large. However, if enough of the vector elements are small fractions, then the RMS can be a fraction less than 1, and dividing by it will increase the magnitude of all the vector elements.

The RMSNorm does not ensure all elements are scaled into the range [0..1], nor does it re-center the elements to any particular mean, and it also does not ensure they add up to 1. This can make it faster to calculate than some other normalizations. On the other hand, all those multiplications for the squares can slow it down.

Z-score Normalization

Z-scores are a statistical calculation that have special meaning for a “normal distribution” of probabilities, although we have no guarantee that our probabilities will have a distribution that's anything like normal. In statistics, the “mean” is the average, and is the “center” of the normal distribution. The idea of a “z-score” is to measure how many standard deviations each number is away from the mean. A number larger than the mean (average) will have a positive z-score, whereas a number less than the mean will get a negative z-score.

In code, the z-score is given by:

    zscore = (x[i] - mean) / stddev;

Note that z-scores are not limited to the range 0..1. For example, a z-score of +2 means the number is two standard deviations greater than the mean, and a z-score of -2 means it is two standard deviations less than the mean.

Because the numbers are not fully normalized to 0..1, it is possible to get pathological cases that still result in large positive or negative values, which is what we wanted to avoid with normalization. For example, if there is one large value among thousands of tiny fractions, then that large value ends up with a z-score many standard deviations from the mean. Hence, we may wish to remove large-magnitude outliers from the z-score normalized values, which adds another element-wise vector computation, as sketched below.
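As an illustration of such a clean-up pass, here is a sketch (not from the book's codebase, with the clipping threshold chosen arbitrarily):

    void aussie_vector_clip_range(float v[], int n, float fmin, float fmax)
    {
        // Sketch: clamp each element into [fmin, fmax] to cap extreme outliers
        for (int i = 0; i < n; i++) {
            if (v[i] < fmin) v[i] = fmin;
            else if (v[i] > fmax) v[i] = fmax;
        }
    }

    // Example usage after z-score normalization: clip to 3 standard deviations
    // aussie_vector_clip_range(v, n, -3.0f, +3.0f);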

Here is what the basic z-score normalization C++ code looks like without any optimization:

    void aussie_vector_normalize_zscore(float v[], int n)   // Change to z-scores from normal distribution
    {
        float vmean = aussie_vector_mean(v, n);
        float stddev = aussie_vector_standard_deviation(v, n);
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - vmean) / stddev; // z-score = How many standard deviations away from the mean?
        }
    }

And here is the naive C++ computation of the statistical mean (i.e. average):

    float aussie_vector_mean(float v[], int n)  // Mean (same as average)
    {
        if (n == 0) {
            yassert(n != 0);
            return 0.0;  // fail internal error
        }
        float sum = aussie_vector_sum(v, n);
        return sum / (float)n;
    }

And here is the naive C++ computation of the standard deviation and variance measures:

    float aussie_vector_standard_deviation(float v[], int n)  // Std. dev (sqrt of variance)
    {
        float vvariance = aussie_vector_variance(v, n);  // Get the Variance
        return sqrtf(vvariance);   // Std.dev is just the sqrt of variance...
    }

    float aussie_vector_variance(float v[], int n)  // Variance (square of std. dev.)
    {
        float vmean = aussie_vector_mean(v, n);  // Get the mean/average
        float sumsquares = aussie_vector_sum_diff_squared(v, n, vmean);  // Sum of squared-diffs from mean
        return sumsquares / (float)n;  // Divide the sum-of-squares by N...
    }

    float aussie_vector_sum_diff_squared(float v[], int n, float meanval)
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float fdiff = v[i] - meanval;
            sum += (fdiff * fdiff);   // add the squared differences..
        }
        return sum;
    }

No Precomputation? The first point to note is that there's almost nothing to precompute here. There's only one call to the sqrtf function, which is not even done element-wise. The other computations are simple multiplications and subtractions that are not based on any complicated mathematical function that would benefit from precomputation into a lookup table.

Computation reuse. The first inefficiency to note is that the mean is actually calculated twice here: once from the z-score function directly, and once indirectly in the variance routine. This is a blatant algorithmic error that needs to be fixed. We need one combined routine that returns both the mean and the standard deviation. The correction is simply to modify the code to pass back the mean as a reference parameter.

    void aussie_vector_normalize_zscore_fix_mean(float v[], int n)
    {
        float vmean = 0.0f;
        float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - vmean) / stddev; // z-score = How many standard deviations away from the mean?
        }
    }
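The combined routine aussie_vector_mean_and_stddev is not shown in this excerpt; a minimal sketch of what it could look like (the book's actual implementation may differ), computing the mean once and passing it back by reference, is:

    float aussie_vector_mean_and_stddev(float v[], int n, float& vmean)
    {
        // Sketch: compute the mean once, reuse it for the variance, return the std. dev.
        vmean = aussie_vector_mean(v, n);   // Mean (computed only once)
        float sumsquares = aussie_vector_sum_diff_squared(v, n, vmean);
        float vvariance = sumsquares / (float)n;   // Variance
        return sqrtf(vvariance);   // Standard deviation
    }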

Reciprocal Multiplication: There is a floating-point division on every element of the vector. The division operation is very slow, and can be replaced by a floating-point multiplication by its reciprocal:

    void aussie_vector_normalize_zscore_reciprocal(float v[], int n)
    {
        float vmean = 0.0f;
        float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
        float frecip = 1.0f / stddev;  // Before the loop
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - vmean) * frecip; // Multiply by reciprocal
        }
    }

Note that the reciprocal is calculated before the loop, only once. Otherwise, we would be needlessly putting “loop-invariant code” into the loop body. Moving such code out of the loop body is called “hoisting” or “loop code motion.”

Loop Fusion with Computation Reuse: A careful look at the code shows that the expression “v[i]-mean” is computed twice for each element, once as part of the variance calculation, and then again in the scaling loop at the end. We can optimize this by computing it once only, and storing it in the vector itself as part of the variance algorithm, allowing computation reuse for the second time. Here is the “fused” version of sum-of-squared-differences that also stores the result:

    float aussie_vector_sum_diff_squared_fused(float v[], int n, float meanval)
    {
        // Fused version of "sum diff squares" that leaves the DIFF in the vector...
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float fdiff = v[i] - meanval;
            sum += (fdiff * fdiff);   // add the squared differences..
            v[i] = fdiff;  // Fusion: store the diff there (for another loop to use later)
        }
        return sum;
    }

The main loop can then avoid the subtraction:

    void aussie_vector_normalize_zscore_fused(float v[], int n)
    {
        float vmean = 0.0f;
        float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
        float frecip = 1.0f / stddev;
        for (int i = 0; i < n; i++) {
            v[i] *= frecip; // No subtraction
        }
    }

Interestingly, benchmarking showed that this code performed only a few percent better. Evidently, adding the assignment to v[i] inside the fused loop costs almost as much as is saved by removing the “v[i]-mean” subtraction from the final loop. However, the new loop structure is also more amenable to vectorization.

Vectorization: There are multiple loops that can be vectorized in this z-score normalization algorithm:

  • Sum of the vector (for mean/average computation)
  • Subtraction of the mean from each element (scalar subtraction/addition)
  • Calculation of the difference-squared for each element (multiply by itself, then sum)
  • Multiply vector by scalar (element-wise scaling by reciprocal)

Vectorized Summation. We can use vectorization routines in AVX1 and AVX2 to parallelize the calculation of the sum for the mean. This is a horizontal reduction of the vector to its sum, which is shown in Chapter 17.

Vectorized Summation and Multiply-by-Scalar. In addition to the vector summation, the last step of multiplying each element by a single scalar, the reciprocal, can also be easily vectorized. This element-wise multiplication by a scalar in AVX1 and AVX2 is shown in Chapter 17.
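Chapter 17 has the book's own AVX routines; as a rough illustration here, a simplified AVX2 sketch of these two steps might look like the code below (function names are mine, and it assumes n is a multiple of 8, AVX2 hardware support, and the <immintrin.h> intrinsics header; the AVX1 analogues use __m128 and the _mm_ prefix on 4 floats at a time):

    #include <immintrin.h>   // AVX/AVX2 intrinsics

    float sum_avx2_sketch(const float v[], int n)   // Horizontal sum reduction (sketch)
    {
        __m256 vsum = _mm256_setzero_ps();           // 8 partial sums
        for (int i = 0; i < n; i += 8) {
            __m256 chunk = _mm256_loadu_ps(&v[i]);   // Load 8 floats
            vsum = _mm256_add_ps(vsum, chunk);       // Accumulate 8 partial sums
        }
        float partial[8];
        _mm256_storeu_ps(partial, vsum);             // Reduce the 8 partial sums
        float sum = 0.0f;
        for (int i = 0; i < 8; i++) sum += partial[i];
        return sum;
    }

    void multiply_scalar_avx2_sketch(float v[], int n, float c)  // v[i] *= c (sketch)
    {
        __m256 vc = _mm256_set1_ps(c);               // Broadcast the scalar to 8 lanes
        for (int i = 0; i < n; i += 8) {
            __m256 chunk = _mm256_loadu_ps(&v[i]);
            _mm256_storeu_ps(&v[i], _mm256_mul_ps(chunk, vc));
        }
    }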

Loop Fission for Further Vectorization. We have now vectorized the first summation (for the mean) and the final multiplication by a scalar (the reciprocal of the standard deviation). However, we haven't vectorized the two middle algorithm steps: subtraction of the mean (to get the differences), and the summation of the squared differences. Let us consider again the “fused” code that computes the sum-of-squared-differences and also stores the differences:

    float aussie_vector_sum_diff_squared_fused(float v[], int n, float meanval)
    {
        // Fused version of "sum diff squared" leaving the DIFF in the vector...
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float fdiff = v[i] - meanval;
            sum += (fdiff * fdiff);   // add the squared differences..
            v[i] = fdiff;  // Fusion: store the diff
        }
        return sum;
    }

This is conceptually easier for vectorization if we do loop fission to split out the loops. Here is the version with loop fission splitting the main loop into two separate loops, which is inefficient in sequential code, but helps with vectorization for hardware acceleration:

    float aussie_vector_sum_diff_squared_fission(float v[], int n, float meanval)
    {
        // FISSION version of "sum diff squares" 
        // Loop 1. Calculate DIFF in the vector...
        for (int i = 0; i < n; i++) {
            v[i] = v[i] - meanval;
        }
        // Loop 2. Calculate the sum-of-squares...
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float f = v[i];
            sum += (f * f);   // add the squares..
        }
        return sum;
    }

A simpler way to write this, with the vectorizable primitives made clearer, is:

    float aussie_vector_sum_diff_squared_fissionB(float v[], int n, float meanval)
    {
        // FISSION version of "sum diff squared" 
        aussie_vector_add_scalar(v, n, -meanval);  // Loop 1. DIFFs
        float sum = aussie_vector_sum_squares(v, n);  // Loop 2. Sum-of-squares...
        return sum;
    }
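The aussie_vector_add_scalar primitive is not shown in this excerpt, but it is presumably just a single-operation loop like the sketch below (and aussie_vector_sum_squares is presumably the sum-of-squares loop sketched earlier). Simple element-wise loops like this are exactly what maps well onto SIMD instructions:

    void aussie_vector_add_scalar(float v[], int n, float c)  // v[i] += c (sketch)
    {
        for (int i = 0; i < n; i++) {
            v[i] += c;   // Add the scalar to every element
        }
    }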

AVX Vectorized Versions of Z-score Normalization. AVX intrinsics are SIMD instructions for x86 CPUs that allow vectorized operations on 4 float values (AVX1) or 8 float values (AVX2) in parallel without a GPU. I coded up three versions vectorized with AVX1 and AVX2, using the C++ intrinsic calls (see the details on AVX vectorization in Chapter 17). The first z-score version had just the summation in AVX, with the second adding multiply-by-scalar at the end vectorized with AVX. The third version had all four main steps vectorized with AVX1 or AVX2.

Z-score normalization benchmarking results. Here are the benchmark timing results for the various z-score normalization versions.

    Z-score normalization benchmarks (N=2048, ITER=100000)
    Z-score norm basic: 2081 ticks (2.08 seconds)
    Z-score norm fix mean: 1413 ticks (1.41 seconds)
    Z-score norm reciprocal: 1372 ticks (1.37 seconds)
    Z-score norm fused: 1406 ticks (1.41 seconds)
    Z-score norm AVX1 sum: 1094 ticks (1.09 seconds)
    Z-score norm AVX1 sum+multiply: 940 ticks (0.94 seconds)
    Z-score norm AVX2 sum: 994 ticks (0.99 seconds)
    Z-score norm AVX2 sum+multiply: 781 ticks (0.78 seconds)
    Z-score norm AVX1 all: 754 ticks (0.75 seconds)
    Z-score norm AVX2 all: 429 ticks (0.43 seconds)

An extension would be to add a vectorized version using AVX-512 (16 float numbers) or the upcoming AVX-10. Another extension would be to test whether it was a good idea to do “loop fission” above, by trying to create AVX1 and AVX2 versions of the original sum-difference-squared fused loop. The loop fission version made things easier to code, but might have been the slower option.
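For reference, here is a sketch of how that fused loop might look in AVX2 (assuming n is a multiple of 8 and using the same <immintrin.h> intrinsics as above; this version is not benchmarked here):

    float aussie_vector_sum_diff_squared_fused_avx2_sketch(float v[], int n, float meanval)
    {
        __m256 vmean = _mm256_set1_ps(meanval);      // Broadcast the mean to 8 lanes
        __m256 vsum = _mm256_setzero_ps();           // 8 partial sums of squares
        for (int i = 0; i < n; i += 8) {
            __m256 chunk = _mm256_loadu_ps(&v[i]);
            __m256 diff = _mm256_sub_ps(chunk, vmean);               // v[i] - mean
            vsum = _mm256_add_ps(vsum, _mm256_mul_ps(diff, diff));   // Accumulate squares
            _mm256_storeu_ps(&v[i], diff);           // Fusion: store the diff back
        }
        float partial[8];
        _mm256_storeu_ps(partial, vsum);             // Reduce the 8 partial sums
        float sum = 0.0f;
        for (int i = 0; i < 8; i++) sum += partial[i];
        return sum;
    }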

Batch Normalization

Batch normalization or “BatchNorm” was introduced to generalize simple normalization with some extra parameters called the “gain” and “bias” factors. The gain factor is a multiplicative scaling factor, often called “lambda” or “alpha.” The bias factor is an additive factor, often called “beta.”

There is also a third factor, usually called “epsilon,” which has a small magnitude and whose purpose is more about practicality than intelligence. It is used mainly to avoid pathological cases such as division-by-zero or a floating-point overflow that could result from division by a tiny divisor.

Example: BatchNorm Optimization

The batch normalization operation involves scanning the full vector, modifying each element so that it is re-centered to a zero mean, and re-scaled to a normal magnitude. A naive non-optimized C++ version of BatchNorm looks like this:

    void aussie_vector_batch_normalize_basic(    // Basic normalization (BatchNorm)
        float v[], int n, 
        float epsilon, // Smoothing term -- usually small, e.g. 5e-5 (0.00005)
        float lambda, // Scaling term hyper-parameter (multiplication)
        float beta    // Bias/shift term hyper-parameter (addition)
    ) 
    {
        float fmean = aussie_vector_mean(v, n);  // Calculate "mean" (aka average)
        float variance = aussie_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
        
        float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - fmean) / denom; // Normalize all elements to re-center and scale
        }
        aussie_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
        aussie_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 
    }

Loop Fusion. This version is very inefficient with literally five scans of the entire vector. Loop fusion can obviously improve this, with the loops doing multiplication by lambda and the addition of beta merged into the prior “for” loop.

Loop Fission. Further optimizations become clear once we notice that each element of the vector has four operations being performed on it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta. We can use a loop fission optimization to split the first two operations into separate loops, where simpler operations are probably faster with hardware acceleration. And then we notice that division and multiplication are two versions of the same operation, so we can use the loop fusion technique to merge the division-by-denom and multiply-by-lambda into a single multiplication by a combined scaling factor. These changes result in faster C++ code that has one less loop and calls atomic vector operations (which are easier to hardware accelerate):

    aussie_vector_add_scalar(v, n, -mean);  // Subtract the mean (re-centering)
    float scalef = lambda / denom;  // Combined scale factor
    aussie_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda 
    aussie_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 

Fusion and Fission. A little thought shows us that we are still subtracting the mean from every element twice: once inside the variance computation, and once in the explicit re-centering call. Hence, we can use another more complicated type of “loop fusion” to combine those two loops, by having the variance loop leave the re-centered value (“diff”) stored inside the vector, so that we can fully remove the re-centering loop that subtracts the mean.

    float mean = aussie_vector_mean(v, n);
    float variance = aussie_vector_variance_of_mean_fused(v, n, mean);  // FUSION: leaves DIFF from MEAN in vector...
    // NOT NEEDED! ... aussie_vector_add_scalar(v, n, -mean);  // Subtract the mean
    float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
    // ... etc
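The fused variance routine is not shown in this excerpt; presumably it parallels the fused sum-of-squared-differences function from the z-score section, as in this sketch (naming assumed from the call above):

    float aussie_vector_variance_of_mean_fused(float v[], int n, float fmean)
    {
        // Sketch: variance given the mean, leaving each DIFF from the mean in the vector
        float sumsquares = aussie_vector_sum_diff_squared_fused(v, n, fmean);
        return sumsquares / (float)n;   // Variance is the average of the squared diffs
    }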

AVX Vectorization. I used AVX1 and AVX2 SIMD hardware-acceleration for the x86 CPU to vectorize some of these BatchNorm versions, using the AVX vectorized versions of summation, multiply-by-scalar, and the fused variance loop. Benchmarking results of these optimizations are shown below.

AVX1 can vectorize 4 float values at a time (128 bits total), whereas AVX2 does 8 float values. AVX-512 does 16 float numbers, but it isn't always available. Details of how to use AVX1, AVX2 and AVX-512 for vectorization are shown in Chapter 17.

Fewer Normalization Parameters. Another algorithmic way to optimize this code is simply to remove the lambda and beta parameters. Choosing lambda=1 and beta=0 means that the last two code loops for scalar multiplication and scalar addition loops can be avoided. However, there's now little benefit to removing lambda in the merged code above, although the add-beta loop can still be removed. Anyway, whether we can remove these parameters is not a speed decision, but depends on whether these two learned parameters are important to the overall model's capability. Note that there is also little value in trying to remove epsilon, as it is only used once in total.

Benchmarking BatchNorm: Here are the results when I benchmarked this code for various basic C++ versions (i.e. sequential execution) and hardware-accelerated versions that are vectorized with x86 CPU AVX1/AVX2 intrinsics:

    BatchNorm benchmarks (N=2048, ITER=100000)
    BatchNorm basic: 2070 ticks (2.07 seconds)
    BatchNorm reciprocal: 2135 ticks (2.13 seconds)
    BatchNorm fission: 1879 ticks (1.88 seconds)
    BatchNorm fusion/fission: 1633 ticks (1.63 seconds)
    BatchNorm no params: 1375 ticks (1.38 seconds)
    BatchNorm fission AVX1: 1200 ticks (1.20 seconds)
    BatchNorm fission AVX2: 931 ticks (0.93 seconds)
    BatchNorm fusion/fission AVX1: 788 ticks (0.79 seconds)
    BatchNorm fusion/fission AVX2: 460 ticks (0.46 seconds)

This generally shows what we expect. Hardware-accelerated versions with AVX1 or AVX2 beat the sequential versions, and AVX2 (8 float numbers in parallel) does better than AVX1 (4 float numbers). Interestingly, although theoretically loop fission can improve vectorization, it was the hand-fused version that did better overall. Although the fused AVX versions had the same sequence of AVX arithmetic operations, presumably they also avoided some AVX loads and stores.

Layer Normalization

Layer normalization or “LayerNorm” was introduced in 2016 as an improvement on BatchNorm (Ba et al., 2016). The idea was to standardize the normalization across an entire layer of the AI engine: the normalization statistics are computed over the layer's activations for each input, rather than across the examples in a batch, so the behavior no longer depends on the batch. Since then, LayerNorm has largely superseded BatchNorm, and is widely used in Transformers.

The default LayerNorm algorithm has two parameters, “bias” and “gain,” which are the same as for BatchNorm. However, it was found that LayerNorm can work better without these parameters, a variant originally called “LayerNorm-simple” (Xu et al., 2019).

LayerNorm has been an innate part of most Transformer architectures since it was introduced, and has been continually shown to be important. For example, it helps the attention mechanism attend more evenly across all keys in the forward pass, and it smooths the gradients in the backward pass for faster training. However, researchers are still struggling to understand exactly why it works so well, and there are still research papers being written about LayerNorm.
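No LayerNorm code is shown in this chapter, but computationally it is close to the z-score normalization above, with the gain and bias applied element-wise at the end. Here is a sketch built from the earlier routines, using scalar gain and bias for simplicity (real implementations typically learn per-element gain and bias vectors); this is an illustration, not the book's exact implementation:

    void aussie_vector_layer_normalize_sketch(
        float v[], int n,
        float epsilon,  // Smoothing term, e.g. 1e-5
        float gain,     // Gain parameter (multiplicative, scalar here for simplicity)
        float bias)     // Bias parameter (additive)
    {
        float vmean = aussie_vector_mean(v, n);   // Mean of the vector
        float variance = aussie_vector_sum_diff_squared(v, n, vmean) / (float)n;
        float frecip = 1.0f / sqrtf(variance + epsilon);  // Reciprocal of smoothed std. dev.
        for (int i = 0; i < n; i++) {
            v[i] = gain * ((v[i] - vmean) * frecip) + bias;  // Re-center, scale, shift
        }
    }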

 
