Aussie AI
AVX Horizontal Intrinsics
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Horizontal operations perform arithmetic across the values within a single vector, rather than between corresponding elements of two vectors. AVX intrinsics exist for these “horizontal” operations, such as adding adjacent elements of a vector, or finding the maximum of the elements within a vector.
Horizontal SIMD instructions are typically designated with an “h” prefix on the operation name (e.g. “horizontal add” is “hadd”). More specifically, the intrinsic for 128-bit horizontal add is “_mm_hadd_ps” and it is “_mm256_hadd_ps” for 256 bits.
However, do not make the mistake of assuming that these horizontal AVX intrinsics are a “reduction” of a vector down to a single float (i.e. vector-to-scalar). I mean, they really should do exactly that, but that would be too good to be true. The horizontal intrinsic functions are still effectively “pairwise” operations for AVX and AVX-2, except the pairs are within the same vector (i.e. horizontal pairs).
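To make the pairwise behavior concrete, here is a minimal sketch of what a single call to _mm_hadd_ps produces. The helper name hadd_pairs is hypothetical, and the scalar branch is only an assumption-level fallback so that the sketch still builds when SSE3 is not enabled (e.g. without -msse3 or -mavx on GCC or Clang).

```cpp
#include <array>
#if defined(__SSE3__)
#include <immintrin.h>   // _mm_hadd_ps and the load/store intrinsics
#endif

// Hypothetical helper: returns the four pairwise sums that
// _mm_hadd_ps(a, b) produces, i.e. {a0+a1, a2+a3, b0+b1, b2+b3}.
std::array<float, 4> hadd_pairs(const float a[4], const float b[4])
{
    std::array<float, 4> out{};
#if defined(__SSE3__)
    __m128 va = _mm_loadu_ps(a);       // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b);
    __m128 r  = _mm_hadd_ps(va, vb);   // adds adjacent pairs within each input
    _mm_storeu_ps(out.data(), r);
#else
    // Scalar fallback (assumption: only used when SSE3 is not enabled).
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
#endif
    return out;
}
```

Note that the result is still a vector of four values, not a single float: two sums come from the first argument and two from the second.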
If you want to add all elements of a vector, or find the maximum, you will need multiple calls to these intrinsics, each time processing pairs of numbers, halving the number of elements you are examining at each iteration. Hence, for example, summing all the float values in a vector with AVX or AVX-2 uses a method of “shuffle-and-add” multiple times.
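The repeated halving can be sketched for a 256-bit register as follows. This is one possible variant rather than the book's own code: it uses _mm256_hadd_ps twice and then combines the two 128-bit halves, because the 256-bit horizontal add works separately inside each half. The name sum8 is hypothetical, and the scalar branch is a fallback for builds without AVX enabled (e.g. without -mavx).

```cpp
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: sums the 8 floats starting at p.
float sum8(const float* p)
{
#if defined(__AVX__)
    __m256 v = _mm256_loadu_ps(p);             // unaligned load of 8 floats
    v = _mm256_hadd_ps(v, v);                  // 8 values -> 4 distinct pair sums
    v = _mm256_hadd_ps(v, v);                  // 4 pair sums -> 1 total per 128-bit half
    __m128 lo = _mm256_castps256_ps128(v);     // total of elements 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);   // total of elements 4..7
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));  // combine the two halves
#else
    // Scalar fallback (assumption: only used when AVX is not enabled).
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) sum += p[i];
    return sum;
#endif
}
```

Three steps are needed for eight elements, since each step halves the number of partial sums.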
Thankfully, AVX-512 actually does have horizontal reductions that process all the elements in its 512-bit registers. Hence, the 512-bit horizontal add uses a different naming convention, with “reduce_add” in the intrinsic name (e.g. _mm512_reduce_add_ps is a summation reduction). In other words, the _mm512_reduce_add_ps intrinsic adds up all 16 float values in an AVX-512 register in a single call, although it is a convenience intrinsic that compilers typically expand into a short sequence of instructions rather than one hardware instruction.
This horizontal reduction summation is useful for vectorizing functions such as average, and could be used for vector dot products (i.e. do an AVX-512 SIMD vertical multiplication into a third vector of 16 float values, then a horizontal reduction to sum those 16 float values), although there's an even better way with FMA intrinsics.
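The multiply-then-reduce dot product described above might look like the following sketch for a single block of 16 floats. The function name dot16 is hypothetical, and the scalar branch is a fallback for builds without AVX-512 enabled (e.g. without -mavx512f); it is not the FMA-based version that the text mentions as the better approach.

```cpp
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: dot product of two arrays of exactly 16 floats.
float dot16(const float* a, const float* b)
{
#if defined(__AVX512F__)
    __m512 va   = _mm512_loadu_ps(a);      // unaligned load of 16 floats
    __m512 vb   = _mm512_loadu_ps(b);
    __m512 prod = _mm512_mul_ps(va, vb);   // vertical (element-wise) multiply
    return _mm512_reduce_add_ps(prod);     // horizontal reduction to one float
#else
    // Scalar fallback (assumption: only used when AVX-512 is not enabled).
    float sum = 0.0f;
    for (int i = 0; i < 16; ++i) sum += a[i] * b[i];
    return sum;
#endif
}
```

For longer vectors, this would be called in a loop over 16-element blocks, with the per-block results accumulated by the caller.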
Supported horizontal operations, either pairwise (AVX or AVX-2) or vector-to-scalar reductions (AVX-512), come in floating-point and integer versions for various element sizes. The pairwise intrinsics cover addition and subtraction (e.g. “hadd” and “hsub”), while the AVX-512 reductions cover primitives such as:
- Addition
- Maximum
- Minimum
- Bitwise operations
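For 256-bit registers, a horizontal maximum can be sketched with the same halving idea, using shuffles plus the ordinary vertical max intrinsics, since as far as I know AVX and AVX-2 offer no pairwise max counterpart to “hadd”. The name max8 is hypothetical, and the scalar branch is a fallback for builds without AVX enabled. On AVX-512, the _mm512_reduce_max_ps intrinsic covers the 16-element case directly.

```cpp
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: maximum of the 8 floats starting at p.
float max8(const float* p)
{
#if defined(__AVX__)
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);     // elements 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);   // elements 4..7
    __m128 m  = _mm_max_ps(lo, hi);            // 8 -> 4 candidates
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));    // 4 -> 2 candidates (lanes 0 and 1)
    // Broadcast lane 1 and compare it with lane 0: 2 -> 1 candidate.
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(m);
#else
    // Scalar fallback (assumption: only used when AVX is not enabled).
    float best = p[0];
    for (int i = 1; i < 8; ++i) {
        if (p[i] > best) best = p[i];
    }
    return best;
#endif
}
```

Swapping the max intrinsics for _mm_min_ps and _mm_min_ss would give the corresponding minimum.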