Aussie AI

AVX Horizontal Intrinsics

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

AVX Horizontal Intrinsics

Horizontal operations refer to arithmetic across the values within one vector. AVX intrinsics exist to do “horizontal” operations across the same vector, such as adding horizontal elements of a vector, or finding the maximum of pairs of elements within a vector.

Horizontal SIMD instructions are typically designated with a “h” prefix (e.g. “horizontal add” is “hadd”). More specifically, the intrinsic for 128-bit horizontal add is “_mm_hadd_ps” and it is “_mm256_hadd_ps” for 256-bits.

However, do not make the mistake of assuming that these horizontal AVX intrinsics are a “reduction” of a vector down to a single float (i.e. vector-to-scalar). I mean, they really should do exactly that, but that would be too good to be true. The horizontal intrinsic functions are still effectively “pairwise” operations for AVX and AVX-2, except the pairs are within the same vector (i.e. horizontal pairs). If you want to add all elements of a vector, or find the maximum, you will need multiple calls to these intrinsics, each time processing pairs of numbers, halving the number of elements you are examining at each iteration. Hence, for example, summing all the float values in a vector with AVX or AVX-2 uses a method of “shuffle-and-add” multiple times.

Thankfully, AVX-512 actually does have horizontal reductions that process all the elements in their 512 bit registers. Hence, the 512-bit horizontal add uses a different naming convention and uses the prefix of “reduce add” in the intrinsic name (e.g. _mm512_reduce_add_ps is a summation reduction). In other words, this reduction operates in parallel on all 16 float values in an AVX-512 register, and the _mm512_reduce_add_ps intrinsic can add up all 16 float values in one operation. This horizontal reduction summation is useful for vectorizing functions such as average, and could be used for vector dot products (i.e. do an AVX-512 SIMD vertical multiplication into a third vector of 16 float values, then a horizontal reduction to sum those 16 float values), although there's an even better way with FMA intrinsics.

Supported AVX horizontal operations for pairwise horizontal calculations (AVX or AVX-2) or vector-to-scalar reductions (AVX-512) include floating-point and integer versions, with various sizes, for primitives, such as:

  • Addition
  • Maximum
  • Minimum
  • Bitwise operations

 

Next:

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs

Generative AI in C++ The new AI programming book by Aussie AI co-founders:
  • AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++