Aussie AI

Vectorized Sum-of-Squares Reduction

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The sum of the squares of the elements of a vector has several applications in our AI Engine. Firstly, it can be used to compute the magnitude (Euclidean norm) of a vector, which is the square root of the sum-of-squares. Secondly, the sum-of-squares is used in various normalization functions, as part of computing the variance from the sum-of-squares of the differences between the values and the mean. The RMS factor in RMSNorm is also derived from it, as the square root of the mean of the squares.
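As a scalar reference point, a minimal non-vectorized sum-of-squares, and the vector magnitude derived from it, might look like this (the function names here are illustrative, not taken from the book's engine):

```cpp
#include <cmath>

// Scalar sum-of-squares: the baseline that the AVX versions vectorize
float vector_sum_squares_basic(const float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += v[i] * v[i];   // SUM = SUM + V*V
    return sum;
}

// Magnitude (Euclidean norm) is the square root of the sum-of-squares
float vector_magnitude_basic(const float v[], int n)
{
    return sqrtf(vector_sum_squares_basic(v, n));
}
```

For example, the vector (3, 4) has sum-of-squares 25 and magnitude 5. The vectorized versions below compute the same result, four or eight elements at a time.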

The method of reducing a vector to the single float that is its sum-of-squares is very similar to a simple summation reduction. The idea for AVX1 and AVX2 is to keep 4 or 8 running sum accumulators in parallel, and then add them together in a final step.

Here is the AVX1 version of sum-of-squares of a vector:

    #include <immintrin.h>  // AVX/SSE intrinsics

    float aussie_vector_sum_squares_AVX1(float v[], int n)
    {
        // Summation of squares of all elements
        if (n % 4 != 0) { // Safety check (no extra cases)
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_setzero_ps();   // Zero accumulators
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats
            __m128 sqr = _mm_mul_ps(r1, r1);   // Square (V*V)
            sumdst = _mm_add_ps(sqr, sumdst); // SUM = SUM + V*V
        }
        // Add the final 4 partial sums manually
        float farr[4];
        _mm_storeu_ps(farr, sumdst);  // Portable (avoids MSVC-only sumdst.m128_f32)
        float sum = farr[0] + farr[1] + farr[2] + farr[3];
        return sum;
    }

And here is the AVX2 version of sum-of-squares:

    float aussie_vector_sum_squares_AVX2(float v[], int n)  
    {
        // Summation of squares of all elements
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }

        __m256 sumdst = _mm256_setzero_ps();   // Zero accumulators
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats
            __m256 sqr = _mm256_mul_ps(r1, r1);   // Square (V*V)
            sumdst = _mm256_add_ps(sqr, sumdst); // SUM = SUM + V*V
        }

        // Add the final 8 partial sums manually
        float farr[8];
        _mm256_storeu_ps(farr, sumdst);  // Portable (avoids MSVC-only sumdst.m256_f32)
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7];
        return sum;
    }

Various further optimizations can be applied to these versions. As with the summation reduction, these loops needlessly add zero on the first iteration, and loop peeling could be used to split out the first iteration separately, initializing the accumulators with the first 4 or 8 elements. The final horizontal addition of 4 or 8 float values can also be optimized. AVX-512 could be used for greater parallelism, processing 16 float numbers at a time. Finally, basic loop optimizations such as pointer arithmetic and loop unrolling could be applied.
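Two of those scalar-level ideas, loop peeling and unrolling with pointer arithmetic, can be sketched in plain C++ (a minimal illustration of the general techniques, not the book's optimized AVX code; it assumes n is a positive multiple of 4, matching the AVX1 safety check):

```cpp
// Scalar sketch of loop peeling plus 4-way unrolling with pointer arithmetic.
// Assumes n > 0 and n % 4 == 0 (as enforced by the AVX1 version's check).
float vector_sum_squares_unrolled(const float v[], int n)
{
    const float* p = v;
    const float* pend = v + n;
    // Loop peeling: seed the 4 accumulators from the first 4 elements,
    // instead of adding them to zero on the first iteration.
    float sum0 = p[0] * p[0];
    float sum1 = p[1] * p[1];
    float sum2 = p[2] * p[2];
    float sum3 = p[3] * p[3];
    for (p += 4; p != pend; p += 4) {   // 4-way unrolled main loop
        sum0 += p[0] * p[0];
        sum1 += p[1] * p[1];
        sum2 += p[2] * p[2];
        sum3 += p[3] * p[3];
    }
    return sum0 + sum1 + sum2 + sum3;   // Final horizontal addition
}
```

The four independent accumulators mirror the four lanes of the AVX1 register, which also gives the compiler and CPU more instruction-level parallelism to exploit.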

 
