Aussie AI
17. AVX Intrinsics
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Change before you have to.”
— Jack Welch
What are AVX Intrinsics?
AVX intrinsics are SIMD parallel instructions for x86 and x64 architectures. They are actually machine opcodes supported by the x86/x64 CPU, but are wrapped in the intrinsic prototypes for easy access from a C++ program.
The main advantage of SIMD instructions is that they are CPU-supported parallel optimizations. Hence, they do not require a GPU, and can even be used on a basic Windows laptop. The main downside is that their level of parallelism is nowhere near that of a high-end GPU.
There are multiple generations of AVX intrinsics based on x86/x64 CPU instructions. Different CPUs support different features, and exactly which intrinsic calls can be used will depend on the CPU on which your C++ is running. The basic AVX types are:
- AVX — 128-bit registers = 4 x 32-bit float values
- AVX-2 — 256-bit registers = 8 x 32-bit float values
- AVX-512 — 512-bit registers = 16 x 32-bit float values
- AVX-10 — 512-bit registers (with speedups)
The AVX intrinsics use C++ type names to declare variables for their registers.
The float types used to declare the registers in AVX using C++ all have a double-underscore prefix: “__m128” for 128-bit registers (4 floats), “__m256” for 256-bit registers (8 floats), and “__m512” for 512 bits (16 floats). Similarly, there are also register type names for int types (__m128i, __m256i, and __m512i), and types for “double” registers (__m128d, __m256d, and __m512d).
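For example, here is a minimal sketch (my own illustration, not from the book) of declaring each of these register types, using the “set1” intrinsics that fill every element with one value (these intrinsics are covered in the list below):

#include <immintrin.h>

void example_register_declarations()        // Hypothetical illustration
{
    __m128  a4  = _mm_set1_ps(1.0f);        // 4 x 32-bit float (128 bits)
    __m256  b8  = _mm256_set1_ps(1.0f);     // 8 x 32-bit float (256 bits)
    __m512  c16 = _mm512_set1_ps(1.0f);     // 16 x 32-bit float (512 bits, needs AVX-512)
    __m256i i8  = _mm256_set1_epi32(7);     // 8 x 32-bit int
    __m256d d4  = _mm256_set1_pd(1.0);      // 4 x 64-bit double
}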
AVX intrinsic functions and their types are declared as ordinary function prototypes in header files.
The header files that you may need to include for these intrinsics are <intrin.h>, <emmintrin.h>, and <immintrin.h>.
Useful AVX SIMD vector intrinsics for float types include:
- Initialize to all-zeros — _mm_setzero_ps, _mm256_setzero_ps
- Set all values to a single float — _mm_set1_ps, _mm256_set1_ps
- Set to 4 or 8 values — _mm_set_ps, _mm256_set_ps
- Load from arrays to AVX registers — _mm_loadu_ps, _mm256_loadu_ps
- Store registers back to float arrays — _mm_storeu_ps, _mm256_storeu_ps
- Addition — _mm_add_ps, _mm256_add_ps
- Multiplication — _mm_mul_ps (SSE), _mm256_mul_ps (AVX-2)
- Vector dot product — _mm_dp_ps, _mm256_dp_ps
- Fused Multiply-Add (FMA) — _mm_fmadd_ps, _mm256_fmadd_ps
- Horizontal addition (pairwise) — _mm_hadd_ps, _mm256_hadd_ps
Note that the names of the intrinsic functions have meaningful suffixes.
The “_ps” suffix means “packed single-precision” (i.e. float), whereas the “_pd” suffix means “packed double-precision” (i.e. double).
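As a quick illustration of how several of the intrinsics listed above combine, here is a minimal sketch (the function name is mine, not from the book, and _mm256_fmadd_ps additionally requires FMA support on the CPU) that computes an element-wise fused multiply-add over 8 floats:

#include <immintrin.h>

// Compute result[i] = a[i]*b[i] + c[i] for 8 floats (element-wise FMA)
void example_fma_8_floats(const float a[8], const float b[8], const float c[8], float result[8])
{
    __m256 ra = _mm256_loadu_ps(a);           // Load 8 floats into 256-bit registers
    __m256 rb = _mm256_loadu_ps(b);
    __m256 rc = _mm256_loadu_ps(c);
    __m256 fma = _mm256_fmadd_ps(ra, rb, rc); // (a * b) + c, fused, element-wise
    _mm256_storeu_ps(result, fma);            // Store 8 floats back to memory
}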
AVX Operations
The main SIMD instructions are called “vertical” instructions, by convention. They take one vector and a second vector (e.g. both are 128-bit), apply an operation element-wise in parallel, and put the result into a third register. In other words, they return the result of a “pair-wise” or “element-wise” operation on two vectors into a third vector.
For example, vertical addition requires two input vectors and will output a third vector with the sums.
AVX-512 SIMD addition will add two 512-bit registers full of float values on a paired element basis (i.e. adds 16 pairs of 32-bit float values), yielding a third 512-bit vector with the result (16 float values).
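A minimal sketch of this vertical addition (the function name is mine; it needs a CPU and compiler settings with AVX-512 support) might be:

#include <immintrin.h>

// Vertical SIMD addition: vresult[i] = v1[i] + v2[i] for 16 floats
void example_avx512_add_16_floats(const float v1[16], const float v2[16], float vresult[16])
{
    __m512 r1 = _mm512_loadu_ps(v1);     // Load 16 floats into a 512-bit register
    __m512 r2 = _mm512_loadu_ps(v2);
    __m512 sum = _mm512_add_ps(r1, r2);  // 16 element-wise additions in parallel
    _mm512_storeu_ps(vresult, sum);      // Store 16 floats back to memory
}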
Binary operations. The full list of binary AVX operations is very long. Supported AVX operations include:
- Multiplication
- Addition
- Subtraction
- Division
- Maximum
- Minimum
- Fused Multiply-Add (FMA)
- Bitwise operations
- ...and many more
Unary operations. AVX unary intrinsics apply a particular function to all elements of an AVX register in parallel, and return the resulting register. Supported AVX unary operations include:
- Clear to zero
- Set to a constant
- Casts
- Conversions
- Popcount (POPCNT)
- Leading-zero count (LZCNT)
Mathematical Functions. Simple float-to-float mathematical functions are effectively a type of unary operator. AVX supports a variety of these, either as vector hardware instructions or as vectorized math library intrinsics (such as Intel's SVML), including the following (a small square-root sketch appears after this list):
- Absolute value: abs
- Error function: erf
- Reciprocal
- Rounding, ceiling, floor
- Roots: sqrt (square root), cube root
- Inverted roots (e.g. invsqrt)
- Exponential: exp, exp10
- Logarithm: log, log10
- Trigonometric functions
- Hyperbolic functions
- Statistics (e.g. Cumulative Distribution Function)
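Square root is one with direct hardware support. Here is a minimal sketch (the function name is mine, not from the book) applying it element-wise to 8 floats:

#include <immintrin.h>

// Element-wise square root of 8 floats (a unary SIMD operation)
void example_sqrt_8_floats(const float v[8], float vresult[8])
{
    __m256 r = _mm256_loadu_ps(v);    // Load 8 floats
    __m256 rt = _mm256_sqrt_ps(r);    // 8 square roots computed in parallel
    _mm256_storeu_ps(vresult, rt);    // Store 8 floats back to memory
}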
AVX Horizontal Intrinsics
Horizontal operations refer to arithmetic across the values within one vector. AVX intrinsics exist to do “horizontal” operations across the same vector, such as adding horizontal elements of a vector, or finding the maximum of pairs of elements within a vector.
Horizontal SIMD instructions are typically designated with an “h” prefix (e.g. “horizontal add” is “hadd”). More specifically, the intrinsic for 128-bit horizontal add is “_mm_hadd_ps” and it is “_mm256_hadd_ps” for 256 bits.
However, do not make the mistake of assuming that these horizontal AVX intrinsics are a “reduction” of a vector down to a single float (i.e. vector-to-scalar). I mean, they really should do exactly that, but that would be too good to be true. The horizontal intrinsic functions are still effectively “pairwise” operations for AVX and AVX-2, except the pairs are within the same vector (i.e. horizontal pairs). If you want to add all elements of a vector, or find the maximum, you will need multiple calls to these intrinsics, each time processing pairs of numbers, halving the number of elements you are examining at each iteration. Hence, for example, summing all the float values in a vector with AVX or AVX-2 uses a method of “shuffle-and-add” multiple times.
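Here is a minimal 128-bit sketch of that approach (my own illustration, not from the book), summing all 4 floats in a vector with two horizontal adds:

#include <immintrin.h>

// Sum all 4 floats using repeated horizontal adds (shuffle-and-add style)
float example_sum_4_floats(const float v[4])
{
    __m128 r = _mm_loadu_ps(v);   // r = { v0, v1, v2, v3 }
    r = _mm_hadd_ps(r, r);        // r = { v0+v1, v2+v3, v0+v1, v2+v3 }
    r = _mm_hadd_ps(r, r);        // r = { v0+v1+v2+v3, ... }
    return _mm_cvtss_f32(r);      // Extract the lowest float
}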
Thankfully, AVX-512 actually does have horizontal reductions that process all the elements in their 512-bit registers. Hence, the 512-bit horizontal add uses a different naming convention, with “reduce_add” in the intrinsic name (e.g. _mm512_reduce_add_ps is a summation reduction). In other words, this reduction operates in parallel on all 16 float values in an AVX-512 register, and the _mm512_reduce_add_ps intrinsic can add up all 16 float values in one operation. This horizontal reduction summation is useful for vectorizing functions such as average, and could be used for vector dot products (i.e. do an AVX-512 SIMD vertical multiplication into a third vector of 16 float values, then a horizontal reduction to sum those 16 float values), although there's an even better way with FMA intrinsics.
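A minimal sketch of that multiply-then-reduce dot product (the function name is mine, not from the book) might look like:

#include <immintrin.h>

// Dot product of two 16-float vectors: vertical multiply, then horizontal reduction
float example_avx512_vecdot_16_floats(const float v1[16], const float v2[16])
{
    __m512 r1 = _mm512_loadu_ps(v1);      // Load 16 floats
    __m512 r2 = _mm512_loadu_ps(v2);
    __m512 prod = _mm512_mul_ps(r1, r2);  // 16 element-wise multiplications
    return _mm512_reduce_add_ps(prod);    // Horizontal reduction to a single float
}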
Supported AVX horizontal operations for pairwise horizontal calculations (AVX or AVX-2) or vector-to-scalar reductions (AVX-512) include floating-point and integer versions, in various sizes, for primitives such as:
- Addition
- Maximum
- Minimum
- Bitwise operations
Portability Checking of AVX Versions
The power of AVX support has changed over the years, with different CPUs having different capabilities, not only with AVX, AVX-2 and AVX-512, but also their sub-releases. And it's also a little unclear into the future, with reports that some of the newer Intel chips have AVX-512 disabled.
If you write some code using AVX-512 intrinsics, and compile your C++ into an executable with the AVX-512 flags on, and then it runs on a lower-capability CPU without AVX-512, what happens? Do the AVX-512 intrinsics fail, or are they simulated somehow so that they're slower but still work? Answer: kaboom on MSVS. In the MSVS IDE, if you try to call these intrinsics on a CPU that doesn't support it, you get “unhandled exception: illegal instruction.” In other words, the C++ compiler still emits the AVX-512 instruction codes, but they aren't valid, so it excepts at runtime.
Hence, the calls to AVX-512 are not emulated at run-time on lower-capability CPUs. And they aren't checked, either. That's up to you!
Dynamic test required: Firstly, you cannot use the preprocessor. You can't test #if or #ifdef for whether you've got AVX-512 in the CPU or not. You can use the preprocessor to distinguish between different platforms where you'll compile a separate binary (e.g. ARM Neon for phones or Apple M1/M2/M3 chipsets). But you cannot choose between AVX/AVX-2/AVX-512 at compile-time, unless you really plan to ship three separate binary executables. Well, you probably could do this if you really, really wanted to.
The other thing you don't really want to do is low-level testing of capabilities. You don't want to test a flag right in front of every AVX-512 intrinsic call. Otherwise, you'll lose most of the speedup benefits. Instead, you want this test done much higher up, and then have multiple versions of the higher-level kernel operations (e.g. vector add, vector multiply, vector dot product, etc.)
What this means is that you have to check the CPU's capabilities in your runtime code, at a very high level in your program. Hence, it is important to check that your platform has the AVX support that you need, such as via the “cpuid” intrinsic at program startup. Then you have a dynamic flag that specifies whether you have AVX-512 or not, and you can then choose between an AVX-2 dot product or an AVX-512 dot product, or whatever else, during execution.
Obviously, it gets a bit convoluted when you have to dynamically choose between versions for AVX, AVX-2 and AVX-512 (not to mention all the AVX sub-capabilities and also AVX-10 coming soon).
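As one possible sketch of the startup check (MSVC-flavored, using the __cpuidex intrinsic; the flag names and function are hypothetical, not from the book's codebase):

#include <intrin.h>   // __cpuidex on MSVC; GCC/Clang offer __get_cpuid_count in <cpuid.h>

static bool g_has_avx2 = false;
static bool g_has_avx512 = false;

void aussie_detect_avx_support()      // Call once at program startup
{
    int info[4] = { 0 };              // EAX, EBX, ECX, EDX
    __cpuidex(info, 7, 0);            // CPUID leaf 7, sub-leaf 0: extended features
    g_has_avx2   = (info[1] & (1 << 5))  != 0;   // EBX bit 5  = AVX2
    g_has_avx512 = (info[1] & (1 << 16)) != 0;   // EBX bit 16 = AVX-512F
    // Note: a production check should also confirm that the OS saves the wider
    // register state (OSXSAVE and the XGETBV/XCR0 bits); omitted here for brevity.
}

The kernels can then branch once at a high level, e.g. if g_has_avx512 is set, call the AVX-512 dot product kernel, otherwise fall back to the AVX-2 version.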
Example: Basic AVX SIMD Multiply
Let us do a basic element-wise SIMD multiply using AVX (version 1) and its 128-bit registers. This will do a paired vector multiply on arrays of 4 float numbers (i.e. 4 x 32-bit float = 128 bits). Each float in the resulting array is a pairwise multiplication of the elements in the two operands. This is how SIMD instructions work, by operating on each element of the array (i.e. “pairwise” or “element-wise”).
For example, a “vertical” multiply will take the 4 float values in one input array, multiply each of them by the corresponding float in the other input array of 4 float numbers, and then return a resulting output array with 4 float values.
For testing, let us assume we want to create an AVX function that multiplies 4 float values element-wise. The test code looks like:
float arr1[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
float arr2[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
float resultarr[4];
// Multiply element-wise
aussie_multiply_vectors(arr1, arr2, resultarr, 4);
Testing the results of the multiply as an element-wise multiply of each pair in the 4 float values (using my home-grown “ytestf” unit testing function that compares float numbers for equality):
ytestf(resultarr[0], 1.0f * 1.0f);  // Unit tests
ytestf(resultarr[1], 2.5f * 2.5f);
ytestf(resultarr[2], 3.14f * 3.14f);
ytestf(resultarr[3], 0.0f * 0.0f);
Here's the low-level C++ code that actually does the SIMD multiply using the “_mm_mul_ps” AVX intrinsic function:
#include <xmmintrin.h>
#include <intrin.h>

void aussie_avx_multiply_4_floats(float v1[4], float v2[4], float vresult[4])
{
    // Multiply 4x32-bit float in 128-bit AVX registers
    __m128 r1 = _mm_loadu_ps(v1);     // Load floats
    __m128 r2 = _mm_loadu_ps(v2);
    __m128 dst = _mm_mul_ps(r1, r2);  // AVX SIMD Multiply
    _mm_storeu_ps(vresult, dst);      // Convert back to floats
}
Explaining this code one line at a time:
1. The header files are included: <xmmintrin.h> and <intrin.h>.
2. The basic AVX register type is “__m128”, which is an AVX 128-bit register (i.e., it is 128 bits in the basic AVX version, not AVX-2 or AVX-512).
3. The variables “r1” and “r2” are declared as __m128 registers. The names “r1” and “r2” are not important, and are just variable names.
4. The intrinsic function “_mm_loadu_ps” is used to convert the arrays of 4 float values into the 128-bit register types, and the result is “loaded” into the “r1” and “r2” 128-bit variables.
5. Another 128-bit variable “dst” is declared to hold the results of the SIMD multiply. The name “dst” can be any variable name.
6. The main AVX SIMD multiply is performed by the “_mm_mul_ps” intrinsic function. The “_ps” suffix means “packed single-precision” (i.e. 32-bit float). This is where the rubber meets the road: the results of the element-wise multiplication of registers “r1” and “r2” are computed and saved into the “dst” register. It is analogous to the basic C++ expression: dst = r1 * r2;
7. The 128-bit result register variable “dst” is converted back to 32-bit float values (4 of them), by “storing” the 128 bits into the float array using the “_mm_storeu_ps” AVX intrinsic.
AVX Memory Alignment Issues
The above example glosses over the issue of managing “alignment” of memory addresses on byte boundaries with the “alignas” specifier. Some of the AVX SIMD intrinsic calls require that addresses are 16-byte aligned (i.e. this is effectively 128-bit alignment), which is not guaranteed by the C++ compiler. However, we've tolerated non-aligned addresses by using the “_mm_loadu_ps” and “_mm_storeu_ps” intrinsics, which work with either aligned or non-aligned addresses.
Note that alignment restriction requirements of AVX are somewhat in flux. Not all AVX intrinsics require alignment, and they are “relaxed” in many cases. There have also been some bugs in compiler toleration of non-aligned addresses in C++ intrinsics. Where required, the alignment needs are:
- AVX-1 — 16-byte alignment (128-bit).
- AVX-2 — 32-byte alignment (256-bit).
- AVX-512 — 64-byte alignment (512-bit).
Since we can sort out alignment at compile-time using the C++ “alignas” specifier and “aligned” type attributes, there is no performance penalty (except in terms of space) for ensuring greater compatibility across CPU platforms and compiler versions by preferring aligned addresses.
You can create your own macros to easily test pointer addresses for alignment by checking their remainder with the % operator. These examples use bitwise-and to replace the slow remainder operator:
#define aussie_is_aligned_16(ptr)  ((((unsigned long)(ptr)) & 15ul) == 0)
#define aussie_is_aligned_32(ptr)  ((((unsigned long)(ptr)) & 31ul) == 0)
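One way to use such a macro (a minimal sketch of mine, not the book's code) is to pick an aligned or unaligned store at runtime:

#include <xmmintrin.h>

// Store a 128-bit register to memory, choosing the faster aligned store when possible
void example_store_result(float* dst, __m128 reg)
{
    if (aussie_is_aligned_16(dst))
        _mm_store_ps(dst, reg);    // Fast path: 16-byte aligned address
    else
        _mm_storeu_ps(dst, reg);   // Fallback: unaligned store
}

In practice, as noted below, the unaligned intrinsics are usually just as fast on correctly aligned addresses, so this kind of per-call branching is rarely worth it.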
Although our code to multiply 4 float values tolerates non-alignment, it's a minor slug. The “_mm_storeu_ps” AVX intrinsic is slower if the addresses are not aligned, so we should fix the alignment for performance reasons.
There's also another “store” intrinsic to convert from 128 bits to 4 floats called “_mm_store_ps” (without the “u”) that runs faster, but does not tolerate non-aligned float arrays. Actually, “_mm_storeu_ps” is supposed to be equally as fast as “_mm_store_ps” if the address is correctly aligned, so we can still use that intrinsic if we prefer safety, but we need to change the variables to be aligned on 16-byte boundaries for a speedup.
To ensure alignment in C++, there is an “alignas” specifier for variable declarations. We can use “alignas(16)” to force C++ to create the variables with 16-byte alignment of the address where they are stored. For example, our unit test harness code could have ensured 16-byte alignment of all memory addresses via:
// Test with 16-byte alignment
alignas(16) float arr1[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
alignas(16) float arr2[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
alignas(16) float resultarr[4];
There are various non-standard alternatives to “alignas” in the various compilers. For example, MSVS has “__declspec(align(16))” with two prefix underscores, and GCC supports the “__attribute__((aligned(16)))” type attribute.
The AVX code for an alignment-requiring version is not much different, with minor changes to the names of the C++ intrinsics:
void aussie_avx_multiply_4_floats_aligned(float v1[4], float v2[4], float vresult[4])
{
    // Use 128-bit AVX registers to multiply 4x32-bit floats...
    __m128 r1 = _mm_loadu_ps(v1);     // Load floats into 128-bits
    __m128 r2 = _mm_loadu_ps(v2);
    __m128 dst = _mm_mul_ps(r1, r2);  // Multiply
    _mm_store_ps(vresult, dst);       // Aligned version convert to floats
}
Ideally, we'd like to ensure at compile-time that the function is only called with aligned addresses. The first attempt is to declare “vresult” above as “alignas(16)” for type checking of alignment issues, but it fails for function parameters.
Fortunately, there's another way using type attributes:
__attribute__((aligned(16)))
Another method is to define our own assertion that uses bitwise tests on the address instead:
#define is_aligned_16(ptr) ((((unsigned long int)(ptr)) & 15) == 0)
This tests that the address is a multiple of 16 using bitwise-and with 15, but the check happens at runtime and costs extra cycles.
AVX-2 SIMD Multiplication
Here is the AVX-2 version of pairwise SIMD multiply with intrinsics for 256-bit registers, which hold eight 32-bit float values.
void aussie_avx2_multiply_8_floats(float v1[8], float v2[8], float vresult[8])
{
    // Multiply 8x32-bit floats in 256-bit AVX2 registers
    __m256 r1 = _mm256_loadu_ps(v1);     // Load floats
    __m256 r2 = _mm256_loadu_ps(v2);
    __m256 dst = _mm256_mul_ps(r1, r2);  // Multiply (SIMD)
    _mm256_storeu_ps(vresult, dst);      // Convert to 8 floats
}
This is similar to the basic AVX 128-bit version, with some differences:
- The type for 256-bit registers is “__m256”.
- The AVX-2 loading intrinsic is “_mm256_loadu_ps”.
- The AVX-2 multiplication intrinsic is “_mm256_mul_ps”.
- The conversion back to float uses the AVX-2 intrinsic “_mm256_storeu_ps”.
AVX-512 SIMD Multiplication
Here is the basic 16-float SIMD vector multiplication using 512 bits in AVX-512.
void aussie_avx512_multiply_16_floats(float v1[16], float v2[16], float vresult[16])
{
    // Multiply 16x32-bit floats in 512-bit registers
    __m512 r1 = _mm512_loadu_ps(v1);     // Load 16 floats
    __m512 r2 = _mm512_loadu_ps(v2);
    __m512 dst = _mm512_mul_ps(r1, r2);  // Multiply (SIMD)
    _mm512_storeu_ps(vresult, dst);      // Convert to floats
}
Note that AVX-512 will fail with an “unhandled exception: illegal instruction” (e.g. in MSVS) if AVX-512 is not supported on your CPU.
Example: AVX 128-Bit Dot Product
The AVX instruction set has a vector dot product intrinsic that wraps an x86 dot product instruction. There are versions of the dot product intrinsic for AVX (128-bit), AVX-2 (256-bit) and AVX-512 (512-bit).
For basic AVX (128 bits), this is a full vector dot product of two vectors with 4 x 32-bit float numbers in each vector. One oddity is that although the result is a floating-point scalar (i.e. a single 32-bit float), it's still stored in a 128-bit register, and must be extracted using the “_mm_cvtss_f32” intrinsic. The example code looks like:
float aussie_avx_vecdot_4_floats(float v1[4], float v2[4])
{
    // AVX dot product: 2 vectors of 4x32-bit floats
    __m128 r1 = _mm_loadu_ps(v1);           // Load floats
    __m128 r2 = _mm_loadu_ps(v2);
    __m128 dst = _mm_dp_ps(r1, r2, 0xf1);   // Dot product (mask 0xf1: multiply all 4 elements, result in lowest element)
    float fret = _mm_cvtss_f32(dst);        // Extract float
    return fret;
}
Example: AVX-2 256-Bit Dot Product
Here is my attempt at the 256-bit version of a vector dot product of 8 float values using AVX-2 instructions, which seems like it should work:
float aussie_avx2_vecdot_8_floats_buggy(float v1[8], float v2[8])
{
    // AVX2 dot product: 2 vectors, 8x32-bit floats
    __m256 r1 = _mm256_loadu_ps(v1);          // Load floats
    __m256 r2 = _mm256_loadu_ps(v2);
    __m256 dst = _mm256_dp_ps(r1, r2, 0xf1);  // Bug!
    float fret = _mm256_cvtss_f32(dst);
    return fret;
}
But it doesn't! Instead of working on 8 pairs of float numbers, it does the vector dot product of only 4 pairs of float values, just like the first AVX code. The problem wasn't related to alignment to 256-bit blocks, because I added “alignas(32)” to the arrays passed in. It seems that the “_mm256_dp_ps” intrinsic doesn't actually do a 256-bit dot product, but instead behaves like the 128-bit “_mm_dp_ps” intrinsic applied separately to each 128-bit lane, producing two independent 4-float dot products (one in the lower lane and one in the upper lane). These are based on the VDPPS opcode in the x86 instruction set for 32-bit float values, and there is VDPPD for 64-bit double numbers. However, it seems that “_mm256_dp_ps” is not a full 256-bit reduction. Or maybe my code is just buggy!
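If that per-lane behavior is indeed the culprit, one possible fix (a sketch of mine, not from the book) is to keep the two per-lane dot products and then add them together at the end:

float aussie_avx2_vecdot_8_floats_fixed(float v1[8], float v2[8])
{
    // AVX2 dot product: two per-lane dot products, then add the lane results
    __m256 r1 = _mm256_loadu_ps(v1);          // Load 8 floats
    __m256 r2 = _mm256_loadu_ps(v2);
    __m256 dp = _mm256_dp_ps(r1, r2, 0xf1);   // One 4-float dot product per 128-bit lane
    __m128 lo = _mm256_castps256_ps128(dp);   // Lower-lane result
    __m128 hi = _mm256_extractf128_ps(dp, 1); // Upper-lane result
    __m128 sum = _mm_add_ss(lo, hi);          // Add the two partial dot products
    return _mm_cvtss_f32(sum);                // Extract the final float
}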
References
- Intel (2023), Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1, August 2023, 248966-Software-Optimization-Manual-V1-048.pdf
- Agner Fog (2023), Optimizing subroutines in assembly language, https://www.agner.org/optimize/optimizing_assembly.pdf
- Félix Cloutier (2023), x86 and amd64 instruction reference, https://www.felixcloutier.com/x86/
- Microsoft (2023), x86 intrinsics list, https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
- Intel (2023), Intel Intrinsics Guide, Version 3.6.6, May 10th, 2023, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Intel (2023), Intel C++ Compiler Classic Developer Guide, version 2021.10, https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-10/overview.html, PDF: https://cdrdv2.intel.com/v1/dl/getContent/781922?fileName=cpp-compiler_developer-guide-reference_2021.10-767249-781922.pdf