Example: BatchNorm Optimization

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.
The batch normalization operation involves scanning the full vector, modifying each element so that it is re-centered to a zero mean and re-scaled to a normal magnitude. A naive, non-optimized C++ version of BatchNorm looks like this:
void aussie_vector_batch_normalize_basic(  // Basic normalization (BatchNorm)
    float v[], int n,
    float epsilon,  // Smoothing term -- usually 1e-5 (0.00001)
    float lambda,   // Scaling term hyper-parameter (multiplication)
    float beta      // Bias/shift term hyper-parameter (addition)
) {
    float fmean = aussie_vector_mean(v, n);  // Calculate "mean" (aka average)
    float variance = aussie_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
    float denom = sqrtf(variance + epsilon);  // Like std. deviation, but smoothed by epsilon
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmean) / denom;  // Normalize all elements to re-center and scale
    }
    aussie_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
    aussie_vector_add_scalar(v, n, beta);         // Add beta hyper-param to all values
}
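The vector helper routines called here are defined elsewhere in the book and are not shown in this excerpt. For reference, plausible sequential sketches (assumed here, not necessarily the book's exact implementations) are:

float aussie_vector_mean(float v[], int n)  // Average of all elements
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += v[i];
    return sum / (float)n;
}

float aussie_vector_variance_of_mean(float v[], int n, float fmean)  // Mean of squared differences
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = v[i] - fmean;
        sum += diff * diff;
    }
    return sum / (float)n;
}

void aussie_vector_multiply_scalar(float v[], int n, float c)  // v[i] *= c for all elements
{
    for (int i = 0; i < n; i++) v[i] *= c;
}

void aussie_vector_add_scalar(float v[], int n, float c)  // v[i] += c for all elements
{
    for (int i = 0; i < n; i++) v[i] += c;
}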
Loop Fusion.
This version is very inefficient, with literally five scans of the entire vector. Loop fusion can obviously improve this, with the loops doing the multiplication by lambda and the addition of beta merged into the prior “for” loop.
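For illustration, a minimal sketch of the fused version (assuming the same function signature and helper routines as the basic version above) reduces the five scans to three:

void aussie_vector_batch_normalize_fused(  // Loop fusion sketch (assumed, not the book's exact code)
    float v[], int n, float epsilon, float lambda, float beta)
{
    float fmean = aussie_vector_mean(v, n);                        // Scan 1: mean
    float variance = aussie_vector_variance_of_mean(v, n, fmean);  // Scan 2: variance
    float denom = sqrtf(variance + epsilon);
    for (int i = 0; i < n; i++) {
        v[i] = ((v[i] - fmean) / denom) * lambda + beta;  // Scan 3: fused normalize, scale, and shift
    }
}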
Loop Fission.
Further optimizations become clear once we notice that each element of the vector has four operations being performed on it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta.
We can use a loop fission optimization to split the first two operations out into separate loops, since these simpler operations are probably faster with hardware acceleration. We then notice that division and multiplication are two versions of the same operation, so we can use the loop fusion technique to merge the division-by-denom and the multiplication-by-lambda into a single multiplication by a combined scaling factor.
These changes result in faster C++ code that has one less loop and also calls atomic vector operations (which are easier to hardware accelerate):
    aussie_vector_add_scalar(v, n, -mean);        // Subtract the mean (re-centering)
    float scalef = lambda / denom;                // Combined scale factor
    aussie_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda
    aussie_vector_add_scalar(v, n, beta);         // Add beta hyper-param to all values
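For context, here is a full-function sketch of this fission/fusion variant (an assumed reconstruction built around the snippet above, reusing the earlier signature and helper routines):

void aussie_vector_batch_normalize_fission(  // Loop fission sketch (assumed code)
    float v[], int n, float epsilon, float lambda, float beta)
{
    float mean = aussie_vector_mean(v, n);
    float variance = aussie_vector_variance_of_mean(v, n, mean);
    float denom = sqrtf(variance + epsilon);
    aussie_vector_add_scalar(v, n, -mean);        // Subtract the mean (re-centering)
    float scalef = lambda / denom;                // Combined scale factor
    aussie_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda
    aussie_vector_add_scalar(v, n, beta);         // Add beta hyper-param to all values
}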
Fusion and Fission.
A little thought shows us that we are still subtracting the mean from every element twice: once inside the variance computation, and once in the explicit re-centering call. Hence, we can use another, more complicated type of “loop fusion” to combine those two loops, by having the variance loop leave the re-centered value (“diff”) stored inside the vector, so that we can fully remove the re-centering loop that subtracts the mean.
    float mean = aussie_vector_mean(v, n);
    float variance = aussie_vector_variance_of_mean_fused(v, n, mean);  // FUSION: leaves DIFF from MEAN in vector...
    // NOT NEEDED! ... aussie_vector_add_scalar(v, n, -mean);  // Subtract the mean
    float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
    // ... etc.
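The fused variance helper itself is not shown in this excerpt. One plausible sketch (the body here is an assumption matching the call above) is:

// Computes the variance and also leaves (v[i] - fmean) stored back in the vector,
// so the caller no longer needs a separate re-centering pass.
float aussie_vector_variance_of_mean_fused(float v[], int n, float fmean)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = v[i] - fmean;
        v[i] = diff;         // FUSION: store the re-centered value back into the vector
        sum += diff * diff;
    }
    return sum / (float)n;
}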
AVX Vectorization.
I used AVX1 and AVX2 SIMD hardware acceleration on the x86 CPU to vectorize some of these BatchNorm versions, using the AVX vectorized routines for summation, multiply-by-scalar, and the fused variance loop. Benchmarking results of these optimizations are shown below.
AVX1 can vectorize 4 float values at a time (128 bits total), whereas AVX2 does 8 float values. AVX-512 does 16 float numbers, but it isn't always available.
Details of how to use AVX1, AVX2 and AVX-512 for vectorization are shown in Chapter 17.
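As a flavor of what such vectorization looks like, here is a minimal sketch (an assumption, not the book's Chapter 17 code) of a multiply-by-scalar kernel using 256-bit _mm256_ intrinsics, processing 8 float values per step (the "AVX2" configuration in the benchmarks below), with a scalar loop handling any leftover tail elements:

#include <immintrin.h>

void aussie_vector_multiply_scalar_avx(float v[], int n, float c)
{
    __m256 cvec = _mm256_set1_ps(c);        // Broadcast c into all 8 lanes
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(&v[i]);  // Load 8 floats (unaligned-safe)
        x = _mm256_mul_ps(x, cvec);         // Multiply 8 floats in parallel
        _mm256_storeu_ps(&v[i], x);         // Store 8 floats back
    }
    for (; i < n; i++) v[i] *= c;           // Scalar tail for leftover elements
}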
Fewer Normalization Parameters.
Another algorithmic way to optimize this code is simply to remove the lambda and beta parameters. Choosing lambda=1 and beta=0 means that the last two loops, for scalar multiplication and scalar addition, can be avoided.
However, there's now little benefit to removing lambda in the merged code above, although the add-beta loop can still be removed. In any case, whether we can remove these parameters is not a speed decision, but depends on whether these two learned parameters are important to the overall model's capability.
Note that there is also little value in trying to remove epsilon, as it is only used once in total.
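Combined with the fused variance helper above, a sketch of such a no-params variant (assuming lambda=1 and beta=0, so that both trailing loops disappear) is:

void aussie_vector_batch_normalize_no_params(  // Sketch with lambda=1, beta=0 (assumed code)
    float v[], int n, float epsilon)
{
    float mean = aussie_vector_mean(v, n);
    float variance = aussie_vector_variance_of_mean_fused(v, n, mean);  // Leaves v[i] - mean in vector
    float denom = sqrtf(variance + epsilon);
    aussie_vector_multiply_scalar(v, n, 1.0f / denom);  // Single scaling pass; no lambda or beta
}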
Benchmarking BatchNorm.
Here are the results when I benchmarked the various versions of this code: basic C++ versions (i.e., sequential execution) and hardware-accelerated versions vectorized with x86 CPU AVX1/AVX2 intrinsics:
BatchNorm benchmarks (N=2048, ITER=100000)
BatchNorm basic: 2070 ticks (2.07 seconds)
BatchNorm reciprocal: 2135 ticks (2.13 seconds)
BatchNorm fission: 1879 ticks (1.88 seconds)
BatchNorm fusion/fission: 1633 ticks (1.63 seconds)
BatchNorm no params: 1375 ticks (1.38 seconds)
BatchNorm fission AVX1: 1200 ticks (1.20 seconds)
BatchNorm fission AVX2: 931 ticks (0.93 seconds)
BatchNorm fusion/fission AVX1: 788 ticks (0.79 seconds)
BatchNorm fusion/fission AVX2: 460 ticks (0.46 seconds)
This generally shows what we expect.
Hardware-accelerated versions with AVX1 or AVX2 beat the sequential versions, and AVX2 (8 float numbers in parallel) does better than AVX1 (4 float numbers).
Interestingly, although loop fission theoretically improves vectorization, it was the hand-fused version that did better overall. Although the fused AVX versions had the same sequence of AVX arithmetic operations, presumably they also avoided some AVX loads and stores.