Aussie AI
Kernel Fusion of Vector Min and Max
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Kernel Fusion of Vector Min and Max
As another example, consider the fusion of the two loops
for finding the “min
” and “max
” of a vector.
This is a horizontal reduction for both,
but for AVX1 and AVX2, we need to do them piecewise
(whereas AVX-512 has reduction intrinsics).
The full source for the min
and max
functions is in Chapter 30 (Vectorization).
The main loop for “min
” in AVX1 is this:
__m128 sumdst = _mm_loadu_ps(&v[0]); // Initial 4 values for (int i = 4 /*not 0*/; i < n; i += 4) { __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits sumdst = _mm_min_ps(r1, sumdst); // dst = MIN(dst, r1) }
And the main loop for “max
” in AVX1 is almost identical:
__m128 sumdst = _mm_loadu_ps(&v[0]); // Initial 4 values for (int i = 4 /*not 0*/; i < n; i += 4) { __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits sumdst = _mm_max_ps(r1, sumdst); // dst = MAX(dst, r1) }
The two main loops for min
and max
are easily fused,
and this avoids one “load” intrinsic into the “r1
” register.
The fused main loop has 3 calls to intrinsics, rather than 4 calls (i.e. 2 per iteration
for each min
and max
function separately), so we have a 25% reduction in calls.
There isn't much to do for fusing the last step,
but this is not the critical part of the function.
Here is the final fused loop for AVX1 for min
and max
of a vector:
float aussie_vector_max_min_fusion_AVX1b(float v[], int n, float &fminout) { // Maximum and Minimum (horizontal) of a single vector yassert(n % 4 == 0); __m128 maxsumdst = _mm_loadu_ps(&v[0]); // Initial 4 values __m128 minsumdst = maxsumdst; // Initial 4 values for (int i = 4 /*not 0*/; i < n; i += 4) { __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits maxsumdst = _mm_max_ps(r1, maxsumdst); // dst = MAX(dst, r1) minsumdst = _mm_min_ps(r1, minsumdst); // dst = MIN(dst, r1) } // Find Min of the final 4 accumulators #define FMIN(x,y) ( (x) < (y) ? (x) : (y) ) float* farr = minsumdst.m128_f32; float fmin1 = FMIN(farr[0], farr[1]); float fmin2 = FMIN(farr[2], farr[3]); fminout = FMIN(fmin1, fmin2); // Find Max of the final 4 accumulators #define FMAX(x,y) ( (x) > (y) ? (x) : (y) ) float* farr = maxsumdst.m128_f32; float fmax1 = FMAX(farr[0], farr[1]); float fmax2 = FMAX(farr[2], farr[3]); return FMAX(fmax1, fmax2); }
• Next: • Up: Table of Contents |
The new AI programming book by Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++ |