Aussie AI

Kernel Fusion of Vector Min and Max

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Kernel Fusion of Vector Min and Max

As another example, consider the fusion of the two loops for finding the “min” and “max” of a vector. This is a horizontal reduction for both, but for AVX1 and AVX2, we need to do them piecewise (whereas AVX-512 has reduction intrinsics).

The full source for the min and max functions is in Chapter 30 (Vectorization). The main loop for “min” in AVX1 is this:

    __m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial 4 values
    for (int i = 4 /*not 0*/; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
        sumdst = _mm_min_ps(r1, sumdst); // dst = MIN(dst, r1)
    }

And the main loop for “max” in AVX1 is almost identical:

    __m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial 4 values
    for (int i = 4 /*not 0*/; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
        sumdst = _mm_max_ps(r1, sumdst); // dst = MAX(dst, r1)
    }

The two main loops for min and max are easily fused, and this avoids one “load” intrinsic into the “r1” register. The fused main loop has 3 calls to intrinsics, rather than 4 calls (i.e. 2 per iteration for each min and max function separately), so we have a 25% reduction in calls. There isn't much to do for fusing the last step, but this is not the critical part of the function. Here is the final fused loop for AVX1 for min and max of a vector:

    float aussie_vector_max_min_fusion_AVX1b(float v[], int n, float &fminout)   
    {
        // Maximum and Minimum (horizontal) of a single vector
        yassert(n % 4 == 0);
        __m128 maxsumdst = _mm_loadu_ps(&v[0]);   // Initial 4 values
        __m128 minsumdst = maxsumdst;   // Initial 4 values
        for (int i = 4 /*not 0*/; i < n; i += 4) {
                __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
                maxsumdst = _mm_max_ps(r1, maxsumdst); // dst = MAX(dst, r1)
                minsumdst = _mm_min_ps(r1, minsumdst); // dst = MIN(dst, r1)
        }
        // Find Min of the final 4 accumulators
    #define FMIN(x,y)  ( (x) < (y) ? (x) : (y) )
        float* farr = minsumdst.m128_f32;
        float fmin1 = FMIN(farr[0], farr[1]);
        float fmin2 = FMIN(farr[2], farr[3]);
        fminout = FMIN(fmin1, fmin2);
        // Find Max of the final 4 accumulators
    #define FMAX(x,y)  ( (x) > (y) ? (x) : (y) )
        float* farr = maxsumdst.m128_f32;
        float fmax1 = FMAX(farr[0], farr[1]);
        float fmax2 = FMAX(farr[2], farr[3]);
        return FMAX(fmax1, fmax2); 
    }

 

Next:

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs

Generative AI in C++ The new AI programming book by Aussie AI co-founders:
  • AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++