Aussie AI
Vectorized Sum-of-Squares Reduction
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
The sum of the squares of the elements of a vector has various applications in our AI Engine. Firstly, it can be used to compute the magnitude of a vector. Secondly, the sum-of-squares appears in various normalization functions, as part of computing the variance from the sum-of-squares of the differences between the values and the mean. The RMS factor in RMSNorm is also based on the sum-of-squares: it is the square root of the mean of the squares.
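As a concrete illustration of the RMSNorm relationship, here is a scalar sketch (with hypothetical helper names, not code from the book) computing the RMS factor from the sum-of-squares:

```cpp
#include <cassert>
#include <cmath>

// Scalar reference: sum of squares of all elements (hypothetical helper name)
float vector_sum_squares_basic(const float v[], int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += v[i] * v[i];  // Accumulate V*V
    return sum;
}

// RMS factor as used by RMSNorm: square root of the mean of the squares
float vector_rms_factor(const float v[], int n) {
    return sqrtf(vector_sum_squares_basic(v, n) / (float)n);
}
```

This scalar version is also a useful correctness reference when testing the vectorized variants below.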
The method to reduce a vector to a single float containing the sum-of-squares is very similar to a simple summation reduction. The idea for AVX1 and AVX2 is to keep 4 or 8 running-sum accumulators, and then add them up in a final step.
Here is the AVX1 version of sum-of-squares of a vector:
float aussie_vector_sum_squares_AVX1(float v[], int n)
{
    // Summation of squares of all elements
    if (n % 4 != 0) {  // Safety check (no extra cases)
        yassert(n % 4 == 0);
        return 0.0; // fail
    }
    __m128 sumdst = _mm_setzero_ps();  // Zero accumulators
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats
        __m128 sqr = _mm_mul_ps(r1, r1);   // Square (V*V)
        sumdst = _mm_add_ps(sqr, sumdst);  // SUM = SUM + V*V
    }
    // Add the final 4 accumulators manually
    float* farr = sumdst.m128_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3];
    return sum;
}
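Note that the `sumdst.m128_f32` member access is a Microsoft compiler extension. A portable sketch of the same function (an assumption about the reader's toolchain, not the book's code) extracts the four lanes by storing the register into a local array with `_mm_storeu_ps`:

```cpp
#include <cassert>
#include <immintrin.h>

// Portable variant: same algorithm, but the final extraction stores the
// SSE register into a plain array instead of the MSVC-only .m128_f32 member.
// Assumes n is a multiple of 4.
float vector_sum_squares_sse_portable(const float v[], int n) {
    __m128 sumdst = _mm_setzero_ps();      // Zero the 4 accumulators
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats
        __m128 sqr = _mm_mul_ps(r1, r1);   // Square (V*V)
        sumdst = _mm_add_ps(sqr, sumdst);  // SUM = SUM + V*V
    }
    float farr[4];
    _mm_storeu_ps(farr, sumdst);           // Portable lane extraction
    return farr[0] + farr[1] + farr[2] + farr[3];
}
```

The same trick works for the AVX2 version below with `_mm256_storeu_ps` and an 8-element array.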
And here is the AVX2 version of sum-of-squares:
float aussie_vector_sum_squares_AVX2(float v[], int n)
{
    // Summation of squares of all elements
    if (n % 8 != 0) {  // Safety check (no extra cases)
        yassert(n % 8 == 0);
        return 0.0; // fail
    }
    __m256 sumdst = _mm256_setzero_ps();  // Zero accumulators
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats
        __m256 sqr = _mm256_mul_ps(r1, r1);   // Square (V*V)
        sumdst = _mm256_add_ps(sqr, sumdst);  // SUM = SUM + V*V
    }
    // Add the final 8 accumulators manually
    float* farr = sumdst.m256_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3]
              + farr[4] + farr[5] + farr[6] + farr[7];
    return sum;
}
Various further optimizations can be applied to these versions. As with the summation reduction, these loops needlessly add zero on the first iteration, and loop peeling can be used to split out the first iteration separately.
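A minimal sketch of the peeled loop (assuming n is a nonzero multiple of 4; the function name is hypothetical): the accumulators are initialized directly with the squares of the first four elements, so the redundant add-to-zero is removed.

```cpp
#include <cassert>
#include <immintrin.h>

// Loop peeling: the first iteration initializes the accumulators directly,
// avoiding the add of the first V*V to an all-zero register.
float vector_sum_squares_sse_peeled(const float v[], int n) {
    __m128 r1 = _mm_loadu_ps(&v[0]);     // Peeled first iteration...
    __m128 sumdst = _mm_mul_ps(r1, r1);  // ...accumulators start at V*V
    for (int i = 4; i < n; i += 4) {     // Main loop starts at i = 4
        r1 = _mm_loadu_ps(&v[i]);
        __m128 sqr = _mm_mul_ps(r1, r1);
        sumdst = _mm_add_ps(sqr, sumdst);
    }
    float farr[4];
    _mm_storeu_ps(farr, sumdst);         // Portable lane extraction
    return farr[0] + farr[1] + farr[2] + farr[3];
}
```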
The final horizontal addition of the 4 or 8 float values should also be optimized.
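One common approach (a sketch using basic SSE shuffle intrinsics, not code from the book) keeps the horizontal addition in registers, replacing the four scalar array reads and three scalar adds:

```cpp
#include <cassert>
#include <immintrin.h>

// Horizontal add of the 4 lanes of an SSE register using in-register
// shuffles, avoiding the round trip through a scalar float array.
float sse_horizontal_sum(__m128 x) {
    __m128 hi = _mm_movehl_ps(x, x);                  // lanes [2,3] into [0,1]
    __m128 sum2 = _mm_add_ps(x, hi);                  // [0+2, 1+3, ...]
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, 0x55);  // broadcast lane 1
    __m128 sum1 = _mm_add_ss(sum2, lane1);            // lane 0 = total
    return _mm_cvtss_f32(sum1);                       // extract lane 0
}
```

For the AVX2 version, the upper 128-bit half can first be folded onto the lower half with `_mm256_extractf128_ps` and an `_mm_add_ps`, and then this same 4-lane reduction applies.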
AVX-512 could be used for greater parallelism, processing 16 float numbers at a time.
Finally, basic loop optimizations such as pointer arithmetic and loop unrolling could be applied.
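A sketch combining these last two ideas (assuming n is a multiple of 8; the function name is hypothetical): the loop index is replaced with pointer arithmetic, and a 2x unrolling with two independent accumulators shortens the dependency chain on the running sum.

```cpp
#include <cassert>
#include <immintrin.h>

// Pointer arithmetic plus 2x unrolling: two independent accumulators
// let the two adds per iteration proceed without waiting on each other.
float vector_sum_squares_sse_unrolled(const float v[], int n) {
    __m128 sum0 = _mm_setzero_ps();
    __m128 sum1 = _mm_setzero_ps();
    const float* p = v;
    const float* endp = v + n;
    for (; p != endp; p += 8) {                   // Pointer-based loop, 8 floats/iteration
        __m128 a = _mm_loadu_ps(p);               // First 4 floats
        __m128 b = _mm_loadu_ps(p + 4);           // Next 4 floats
        sum0 = _mm_add_ps(_mm_mul_ps(a, a), sum0);
        sum1 = _mm_add_ps(_mm_mul_ps(b, b), sum1);
    }
    float farr[4];
    _mm_storeu_ps(farr, _mm_add_ps(sum0, sum1));  // Merge the two accumulators
    return farr[0] + farr[1] + farr[2] + farr[3];
}
```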
The new AI programming book by Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++