Aussie AI

What are AVX Intrinsics?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What are AVX Intrinsics?

AVX intrinsics are SIMD parallel instructions for x86 and x64 architectures. They are actually machine opcodes supported by the x86/x64 CPU, but are wrapped in the intrinsic prototypes for easy access from a C++ program.

The main advantage of SIMD instructions is that they are CPU-supported parallel optimizations. Hence, they do not require a GPU, and can even be used on a basic Windows laptop. The main downside is that their level of parallelism is nowhere near that of a high-end GPU.

There are multiple generations of AVX intrinsics, based on successive x86/x64 CPU instruction set extensions. Different CPUs support different features, and exactly which intrinsic calls can be used depends on the CPU on which your C++ code is running (a runtime feature check is sketched after the list below). The basic generations are:

  • SSE — 128-bit registers = 4 x 32-bit float values (the precursor to AVX)
  • AVX and AVX-2 — 256-bit registers = 8 x 32-bit float values
  • AVX-512 — 512-bit registers = 16 x 32-bit float values
  • AVX-10 — 512-bit registers (with further speedups)
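
Since support varies between CPUs, it is worth checking at runtime which AVX levels are available before calling the wider intrinsics. Below is a minimal sketch of such a check, assuming GCC or Clang (which provide the __builtin_cpu_supports built-in); MSVC programs would use the __cpuid intrinsic instead:

    #include <cstdio>

    int main()
    {
        // Query the CPU at runtime for each AVX feature level
        if (__builtin_cpu_supports("avx"))     printf("AVX supported\n");
        if (__builtin_cpu_supports("avx2"))    printf("AVX-2 supported\n");
        if (__builtin_cpu_supports("avx512f")) printf("AVX-512 (foundation) supported\n");
        return 0;
    }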

The AVX intrinsics use special C++ type names to declare variables for their registers. The float register types all have a double-underscore prefix: “__m128” for 128-bit registers (4 floats), “__m256” for 256-bit registers (8 floats), and “__m512” for 512-bit registers (16 floats). Similarly, there are register type names for int types (__m128i, __m256i, and __m512i), and for “double” registers (__m128d, __m256d, and __m512d).

AVX intrinsic functions and their types are declared as ordinary function prototypes in header files. The header files you may need for these intrinsics include <intrin.h>, <emmintrin.h>, and <immintrin.h>.
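
As a minimal sketch of declaring and initializing these register types (assuming a CPU and compiler with AVX support, e.g. building with -mavx on GCC/Clang or /arch:AVX on MSVC):

    #include <immintrin.h>   // AVX register types and intrinsics

    __m128  f4 = _mm_set1_ps(3.14f);        // 4 x float in a 128-bit register
    __m256  f8 = _mm256_set1_ps(3.14f);     // 8 x float in a 256-bit register
    __m256i i8 = _mm256_set1_epi32(42);     // 8 x 32-bit int in a 256-bit register
    __m256d d4 = _mm256_set1_pd(3.14);      // 4 x double in a 256-bit register
    // __m512 f16 = _mm512_set1_ps(3.14f);  // 16 x float, requires AVX-512 support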

Useful AVX SIMD vector intrinsics for float types include the following (a dot product example combining several of them appears after the list):

  • Initialize to all-zeros — _mm_setzero_ps, _mm256_setzero_ps
  • Set all values to a single float — _mm_set1_ps, _mm256_set1_ps
  • Set to 4 or 8 values — _mm_set_ps, _mm256_set_ps
  • Load from arrays to AVX registers — _mm_loadu_ps, _mm256_loadu_ps
  • Store registers back to float arrays — _mm_storeu_ps, _mm256_storeu_ps
  • Addition — _mm_add_ps, _mm256_add_ps
  • Multiplication — _mm_mul_ps (SSE), _mm256_mul_ps (AVX-2)
  • Vector dot product — _mm_dp_ps, _mm256_dp_ps
  • Fused Multiply-Add (FMA) — _mm_fmadd_ps, _mm256_fmadd_ps
  • Horizontal addition (pairwise) — _mm_hadd_ps, _mm256_hadd_ps
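
Putting several of these together, here is a minimal sketch of an AVX vector dot product over float arrays. It assumes the CPU supports AVX-2 and FMA (e.g. compile with -mavx2 -mfma on GCC/Clang or /arch:AVX2 on MSVC); the function name and the simplification that n is a multiple of 8 are illustrative, not code from the book:

    #include <immintrin.h>

    float dot_product_avx(const float* a, const float* b, int n)  // assumes n is a multiple of 8
    {
        __m256 sum = _mm256_setzero_ps();           // 8 running partial sums
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(&a[i]);     // load 8 floats (unaligned load)
            __m256 vb = _mm256_loadu_ps(&b[i]);
            sum = _mm256_fmadd_ps(va, vb, sum);     // sum += va * vb (fused multiply-add)
        }
        // Reduce the 8 partial sums down to a single float
        __m128 lo = _mm256_castps256_ps128(sum);    // lower 4 floats
        __m128 hi = _mm256_extractf128_ps(sum, 1);  // upper 4 floats
        __m128 s  = _mm_add_ps(lo, hi);             // 4 partial sums
        s = _mm_hadd_ps(s, s);                      // pairwise add: 2 sums
        s = _mm_hadd_ps(s, s);                      // pairwise add: 1 sum
        return _mm_cvtss_f32(s);
    }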

Note that the names of the intrinsic functions have meaningful suffixes. The “_ps” suffix means “packed-single-precision” (i.e. float), whereas the “_pd” suffix means “packed-double-precision” (i.e. double).

 
