Aussie AI

Example: AVX Vector Sum Reduction

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Let us suppose we need to calculate the sum of all the elements of a vector. This is a “reduction” that has dimensions “vector-to-scalar.” Here is a basic naive C++ version without any optimizations:

    float aussie_vector_sum(float v[], int n)  // Summation
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v[i];
        }
        return sum;
    }

Vector sum reductions were awkward in the early AVX releases. Although AVX has SIMD instructions to add two vectors in parallel, it struggles with a “reduction” operation like this. AVX and AVX-2 do have “horizontal add” (“hadd”) intrinsics, but these only do pairwise additions within a single vector, rather than adding all the elements. AVX-512 has a “reduce add” intrinsic (“_mm512_reduce_add_ps”) that horizontally adds 16 float numbers, which works a lot better.

For AVX and AVX-2, are we stuck with doing multiple calls to the pairwise “hadd” intrinsics? No, there's a non-obvious way to use the “vertical add” intrinsics in parallel. We can do “in parallel” squared. It's almost like we're doing math inside a computer.

The trick is to use the AVX registers as a set of 4 parallel accumulators (AVX 128 bits) or 8 parallel accumulators (AVX-2's 256 bits). In this way, we can defer the “hadd” until the very end, and since it's not in the critical loop, its performance hardly matters. Here's the code for AVX-1 with 128-bit registers:

    float aussie_vector_sum_AVX1(float v[], int n)   // Summation (horizontal) of a single vector
    {
        if (n % 4 != 0) {  // Safety
            yassert(n % 4 == 0);
            return 0.0; // fail
        }

        __m128 sumdst = _mm_setzero_ps();   // Set accumulators to zero
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_add_ps(r1, sumdst); // SUM = SUM + V
        }

        // Add the final 4 accumulators manually
        float* farr = sumdst.m128_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3];
        return sum;
    }
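A portability note: the “m128_f32” array member used above is specific to Microsoft's compiler; GCC and Clang name the internals of __m128 differently (or hide them entirely). A compiler-neutral sketch of the same final step is to store the register out to a small float array first (the helper function name here is mine, not from the library):

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_storeu_ps, etc.

// Portable final step: copy the 4 accumulator lanes to memory, then add them.
float horizontal_sum_128(__m128 acc)
{
    float lanes[4];
    _mm_storeu_ps(lanes, acc);   // unaligned store of all 4 float lanes
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The extra store looks wasteful, but it runs once per whole vector sum, outside the critical loop, so the cost is negligible.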

The AVX-2 version is faster, because it processes 8 float values at a time. This uses the same strategy of 8 parallel accumulators and a loop unrolling factor of 8 (i.e. the loop incrementer is now “i+=8”). Here's the C++ code:

    float aussie_vector_sum_AVX2(float v[], int n)   // Summation (horizontal) of a single vector
    {
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }

        __m256 sumdst = _mm256_setzero_ps();   // Set 8 accumulators to zero
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats into 256-bits
            sumdst = _mm256_add_ps(r1, sumdst); // SUM = SUM + V
        }

        // Add the final 8 accumulators manually
        float* farr = sumdst.m256_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7]; 
        return sum;
    }

I've been lazy in not bothering to optimize the final horizontal addition. A small extra speedup is probably available by using the “hadd” intrinsics 3 times in a row to drop it down from 8 accumulators to a single float. If this were AVX-512, we could use the horizontal reduction “_mm512_reduce_add_ps” intrinsic for the summation at the end (for adding 16 partial sums of type float).

Loop Peeling Optimization: Another inefficiency in these AVX addition routines is that they needlessly perform an addition with zero in the first iteration. Effectively, we need to do “loop peeling” to handle the first loop iteration differently. This is the slow first iteration of the AVX-2 vector sum:

    __m256 sumdst = _mm256_setzero_ps();   // Set 8 accumulators to zero
    for (int i = 0; i < n; i += 8) {
        // ... 
    }

Loop peeling says to replace the zero initialization with a load of the first 8 values from the vector. The loop then starts its first iteration at i=8 instead of i=0, skipping what had been the first addition:

    __m256 sumdst = _mm256_loadu_ps(&v[0]);  // Get first 8 values
    for (int i = 8 /*not 0!*/; i < n; i += 8) {
        // ... same
    }

 
