Z-score Normalization

  • Book Excerpt from “Generative AI in C++”
  • by David Spuler, Ph.D.

Z-scores are a statistical measure with special meaning for a “normal distribution” of probabilities, although we have no guarantee that our probabilities will have a distribution that's anything like normal. In statistics, the “mean” is the average, and is the “center” of the normal distribution. The idea of a “z-score” is to measure how many standard deviations a number lies away from the mean. A number larger than the mean (average) will have a positive z-score, whereas a number less than the mean will have a negative z-score.

In code, the z-score is given by:

   zscore = (x[i] - mean) / stddev;

Note that z-scores are not limited to the range 0..1. A number that is two standard deviations greater than the mean gets a z-score of +2, and a number two standard deviations less than the mean gets -2. For example, with a mean of 10 and a standard deviation of 2, the value 14 has a z-score of +2, and the value 6 has a z-score of -2.

Because the numbers are not fully normalized to 0..1, pathological cases can produce large positive or negative values, which is exactly what we wanted to avoid by normalizing. For example, if there is one large value among thousands of tiny fractions, the large value's magnitude can be amplified. Hence, we may wish to remove large-magnitude outliers from the z-score normalized values, which adds another element-wise vector computation.
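
For example, one simple way to handle such outliers is to clamp every normalized value into a fixed range, such as ±3 standard deviations. This helper is a hypothetical sketch, not part of the book's code library:

    void aussie_vector_clamp_range(float v[], int n, float minval, float maxval)
    {
        // Hypothetical sketch: clamp each element into [minval, maxval]
        for (int i = 0; i < n; i++) {
            if (v[i] < minval) v[i] = minval;
            else if (v[i] > maxval) v[i] = maxval;
        }
    }

Calling aussie_vector_clamp_range(v, n, -3.0f, 3.0f) after normalization would cap any z-score beyond three standard deviations in either direction.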

Here is what the basic z-score normalization C++ code looks like without any optimization:

    void aussie_vector_normalize_zscore(float v[], int n)   // Change to z-scores from normal distribution
    {
        float vmean = aussie_vector_mean(v, n);
        float stddev = aussie_vector_standard_deviation(v, n);
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - vmean) / stddev; // z-score = How many standard deviations away from the mean?
        }
    }

And here is the naive C++ computation of the statistical mean (i.e. average):

    float aussie_vector_mean(float v[], int n)  // Mean (same as average)
    {
        if (n == 0) {
            yassert(n != 0);
            return 0.0;  // fail internal error
        }
        float sum = aussie_vector_sum(v, n);
        return sum / (float)n;
    }

And here is the naive C++ computation of the standard deviation and variance measures:

    float aussie_vector_standard_deviation(float v[], int n)  // Std. dev (sqrt of variance)
    {
        float vvariance = aussie_vector_variance(v, n);  // Get the Variance
        return sqrtf(vvariance);   // Std.dev is just the sqrt of variance...
    }

    float aussie_vector_variance(float v[], int n)  // Variance (square of std. dev.)
    {
        float vmean = aussie_vector_mean(v, n);  // Get the mean/average
        float sumsquares = aussie_vector_sum_diff_squared(v, n, vmean);  // Sum of squared-diffs from mean
        return sumsquares / (float)n;  // Divide the sum-of-squares by N...
    }

    float aussie_vector_sum_diff_squared(float v[], int n, float meanval)
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float fdiff = v[i] - meanval;
            sum += (fdiff * fdiff);   // add the squared differences..
        }
        return sum;
    }

No Precomputation? The first point to note is that there's almost nothing to precompute here. There's only one call to the sqrtf function, which is not even done element-wise. The other computations are simple multiplications and subtractions that are not based on any complicated mathematical function that would benefit from precomputation into a lookup table.

Computation reuse. The first inefficiency to note is that the mean is actually calculated twice: once directly from the z-score function, and once indirectly in the variance routine. This is a blatant algorithmic error that needs to be fixed. We need a single combined routine that returns both the mean and the standard deviation. The correction is simply to modify the code to pass back the mean via a reference parameter.

    void aussie_vector_normalize_zscore_fix_mean(float v[], int n)
    {
        float vmean = 0.0f;
        float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - vmean) / stddev; // z-score = How many standard deviations away from the mean?
        }
    }
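
The combined routine itself is not shown in this excerpt, but a minimal sketch consistent with the naive helpers above might look like this (the real implementation may differ):

    float aussie_vector_mean_and_stddev(float v[], int n, float& vmean)
    {
        // Sketch: compute the mean once, reuse it for the variance, return stddev
        vmean = aussie_vector_mean(v, n);  // One pass: the mean
        float sumsquares = aussie_vector_sum_diff_squared(v, n, vmean);  // Second pass
        return sqrtf(sumsquares / (float)n);  // Std. dev. is sqrt of variance
    }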

Reciprocal Multiplication: There is a floating-point division on every element of the vector. The division operation is very slow, and can be replaced by a floating-point multiplication by its reciprocal:

    void aussie_vector_normalize_zscore_reciprocal(float v[], int n)
    {
        float vmean = 0.0f;
        float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
        float frecip = 1.0f / stddev;  // Before the loop
        for (int i = 0; i < n; i++) {
            v[i] = (v[i] - vmean) * frecip; // Multiply by reciprocal
        }
    }

Note that the reciprocal is calculated before the loop, only once. Otherwise, we would be needlessly putting “loop-invariant code” into the loop body. Moving such code out of the loop body is called “hoisting” or “loop code motion.”

Loop Fusion with Computation Reuse: A careful look at the code shows that the expression “v[i]-mean” is computed twice for each element, once as part of the variance calculation, and then again in the scaling loop at the end. We can optimize this by computing it once only, and storing it in the vector itself as part of the variance algorithm, allowing computation reuse for the second time. Here is the “fused” version of sum-of-squared-differences that also stores the result:

    float aussie_vector_sum_diff_squared_fused(float v[], int n, float meanval)
    {
        // Fused version of "sum diff squares" that leaves the DIFF in the vector...
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float fdiff = v[i] - meanval;
            sum += (fdiff * fdiff);   // add the squared differences..
            v[i] = fdiff;  // Fusion: store the diff there (for another loop to use later)
        }
        return sum;
    }

The main loop can then avoid the subtraction:

    void aussie_vector_normalize_zscore_fused(float v[], int n)
    {
        float vmean = 0.0f;
        float stddev = aussie_vector_mean_and_stddev(v, n, vmean);
        float frecip = 1.0f / stddev;
        for (int i = 0; i < n; i++) {
            v[i] *= frecip; // No subtraction
        }
    }

Interestingly, benchmarking showed that this code performed only a few percent better: evidently the added assignment to v[i] costs almost as much as the removed “v[i]-mean” subtraction saved. However, the new loop structure is also more amenable to vectorization.

Vectorization: There are multiple loops that can be vectorized in this z-score normalization algorithm:

  • Sum of the vector (for mean/average computation)
  • Subtraction of the mean from each element (scalar subtraction/addition)
  • Calculation of the difference-squared for each element (multiply by itself, then sum)
  • Multiply vector by scalar (element-wise scaling by reciprocal)

Vectorized Summation. We can use vectorization routines in AVX1 and AVX2 to parallelize the calculation of the sum for the mean. This is a horizontal reduction of the vector to its sum, which is shown in Chapter 17.
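
The Chapter 17 details are not reproduced in this excerpt, but as a rough sketch, an AVX1-style summation over 128-bit registers (4 floats at a time) might look like the following; it assumes n is a multiple of 4 and that the <immintrin.h> header is available:

    #include <immintrin.h>  // x86 SIMD intrinsics

    float aussie_vector_sum_avx1(const float v[], int n)
    {
        // Sketch: 4 parallel running sums, then a horizontal reduction
        __m128 sums = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 chunk = _mm_loadu_ps(&v[i]);  // Load 4 floats (unaligned)
            sums = _mm_add_ps(sums, chunk);      // Add 4 elements in parallel
        }
        sums = _mm_hadd_ps(sums, sums);  // Horizontal add: [a+b, c+d, a+b, c+d]
        sums = _mm_hadd_ps(sums, sums);  // [a+b+c+d, ...]
        return _mm_cvtss_f32(sums);      // Extract the lowest lane
    }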

Vectorized Summation and Multiply-by-Scalar. In addition to the vector summation, the last step of multiplying each element by a single scalar, the reciprocal, can also be easily vectorized. This element-wise multiplication by a scalar in AVX1 and AVX2 is shown in Chapter 17.
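
Again deferring to Chapter 17 for the full version, a minimal AVX1-style sketch of element-wise multiply-by-scalar (same assumptions as above) could be:

    void aussie_vector_multiply_scalar_avx1(float v[], int n, float c)
    {
        // Sketch: broadcast the scalar, then multiply 4 elements at a time
        __m128 vc = _mm_set1_ps(c);  // All 4 lanes = c
        for (int i = 0; i < n; i += 4) {
            __m128 chunk = _mm_loadu_ps(&v[i]);
            chunk = _mm_mul_ps(chunk, vc);  // 4 multiplications in parallel
            _mm_storeu_ps(&v[i], chunk);    // Store 4 results back
        }
    }

In the z-score code, the scalar c would be the reciprocal 1.0f/stddev.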

Loop Fission for Further Vectorization. We have now vectorized the first summation (for the mean) and the final multiplication by a scalar (the reciprocal of the standard deviation). However, we haven't vectorized the two middle steps of the algorithm: subtraction of the mean (to get the differences), and the summation of the squared differences. Let us consider again the “fused” code that computes the sum-of-squared-differences and also stores the result:

    float aussie_vector_sum_diff_squared_fused(float v[], int n, float meanval)
    {
        // Fused version of "sum diff squared" leaving the DIFF in the vector...
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float fdiff = v[i] - meanval;
            sum += (fdiff * fdiff);   // add the squared differences..
            v[i] = fdiff;  // Fusion: store the diff
        }
        return sum;
    }

This code becomes conceptually easier to vectorize if we apply loop fission to split it apart. Here is the version with loop fission separating the main loop into two loops, which is inefficient in sequential code, but helps with vectorization for hardware acceleration:

    float aussie_vector_sum_diff_squared_fission(float v[], int n, float meanval)
    {
        // FISSION version of "sum diff squares" 
        // Loop 1. Calculate DIFF in the vector...
        for (int i = 0; i < n; i++) {
            v[i] = v[i] - meanval;
        }
        // Loop 2. Calculate the sum-of-squares...
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float f = v[i];
            sum += (f * f);   // add the squares..
        }
        return sum;
    }

Here is a simpler way to write this, making the vectorizable primitives clearer:

    float aussie_vector_sum_diff_squared_fissionB(float v[], int n, float meanval)
    {
        // FISSION version of "sum diff squared" 
        aussie_vector_add_scalar(v, n, -meanval);  // Loop 1. DIFFs
        float sum = aussie_vector_sum_squares(v, n);  // Loop 2. Sum-of-squares...
        return sum;
    }
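
These two primitives are not shown in this excerpt; their naive scalar forms would presumably be something like:

    void aussie_vector_add_scalar(float v[], int n, float c)
    {
        for (int i = 0; i < n; i++) {
            v[i] += c;  // Element-wise add of a scalar (vectorizable)
        }
    }

    float aussie_vector_sum_squares(const float v[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += v[i] * v[i];  // Sum of squares (vectorizable reduction)
        }
        return sum;
    }

Both loops map directly onto the AVX patterns sketched earlier: a broadcast-and-add for the first, and a parallel multiply with a horizontal reduction for the second.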

AVX Vectorized Versions of Z-score Normalization. AVX intrinsics are SIMD instructions for x86 CPUs that allow vectorized operations on 4 float values (AVX1) or 8 float values (AVX2) in parallel without a GPU. I coded up three versions vectorized with AVX1 and AVX2, using the C++ intrinsic calls (see the details on AVX vectorization in Chapter 17). The first version vectorized just the summation; the second also vectorized the final multiply-by-scalar; the third had all four main steps vectorized with AVX1 or AVX2.

Z-score normalization benchmarking results. Here are the benchmark timing results for the various z-score normalization versions.

    Z-score normalization benchmarks (N=2048, ITER=100000)
    Z-score norm basic: 2081 ticks (2.08 seconds)
    Z-score norm fix mean: 1413 ticks (1.41 seconds)
    Z-score norm reciprocal: 1372 ticks (1.37 seconds)
    Z-score norm fused: 1406 ticks (1.41 seconds)
    Z-score norm AVX1 sum: 1094 ticks (1.09 seconds)
    Z-score norm AVX1 sum+multiply: 940 ticks (0.94 seconds)
    Z-score norm AVX2 sum: 994 ticks (0.99 seconds)
    Z-score norm AVX2 sum+multiply: 781 ticks (0.78 seconds)
    Z-score norm AVX1 all: 754 ticks (0.75 seconds)
    Z-score norm AVX2 all: 429 ticks (0.43 seconds)

An extension would be to add a vectorized version using AVX-512 (16 float numbers) or the upcoming AVX-10. Another extension would be to test whether it was a good idea to do “loop fission” above, by trying to create AVX1 and AVX2 versions of the original sum-difference-squared fused loop. The loop fission version made things easier to code, but might have been the slower option.
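
As a starting point for that experiment, here is a hypothetical sketch of the fused loop with AVX1-style 128-bit intrinsics (same assumptions as the earlier sketches); whether it actually beats the fissioned version would need benchmarking:

    float aussie_vector_sum_diff_squared_fused_avx1(float v[], int n, float meanval)
    {
        // Sketch: subtract mean, square-and-accumulate, and store the diff, all fused
        __m128 vmean = _mm_set1_ps(meanval);  // Broadcast the mean
        __m128 sums = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 fdiff = _mm_sub_ps(_mm_loadu_ps(&v[i]), vmean);  // v[i] - mean
            sums = _mm_add_ps(sums, _mm_mul_ps(fdiff, fdiff));  // Accumulate squared diffs
            _mm_storeu_ps(&v[i], fdiff);  // Fusion: store the diff for the later scaling loop
        }
        sums = _mm_hadd_ps(sums, sums);  // Horizontal reduction to one float
        sums = _mm_hadd_ps(sums, sums);
        return _mm_cvtss_f32(sums);
    }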

 
