Aussie AI
What are AVX Intrinsics?
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
AVX intrinsics are SIMD (Single Instruction, Multiple Data) parallel instructions for the x86 and x64 architectures. Each intrinsic corresponds to a machine instruction (or a short sequence of them) supported by the x86/x64 CPU. It is wrapped in a C/C++ function prototype so that it can be called directly from a C++ program.
The main advantage of SIMD instructions is that the parallelism is built into the CPU itself. Hence, they do not require a GPU, and can even be used on a basic Windows laptop. The main downside is that their level of parallelism is nowhere near that of a high-end GPU: a single AVX-512 instruction processes 16 float values, whereas a high-end GPU has thousands of arithmetic units working in parallel.
There are multiple generations of AVX intrinsics based on x86/x64 CPU instructions. Different CPUs support different features, and exactly which intrinsic calls can be used will depend on the CPU on which your C++ program is running. The basic AVX types are:
- SSE and AVX: 128-bit registers = 4 x 32-bit float values
- AVX and AVX-2: 256-bit registers = 8 x 32-bit float values (AVX added the 256-bit float operations; AVX-2 extended 256-bit operations to integers)
- AVX-512: 512-bit registers = 16 x 32-bit float values
- AVX-10: 512-bit registers (a newer, converged successor to AVX-512)
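Because the usable intrinsics depend on the CPU, programs that ship to many machines often check the CPU at run time and choose a code path. The sketch below assumes GCC or Clang on x86/x64: it relies on the compiler-specific `__builtin_cpu_supports` built-in and the `target("avx")` function attribute, and the function names (`add_arrays`, `add_arrays_avx`, `add_arrays_scalar`) are hypothetical. MSVC would need a different detection mechanism (for example, the `__cpuid` intrinsic).

```cpp
#include <immintrin.h>
#include <cstddef>

// Assumes GCC or Clang on x86/x64. The target("avx") attribute lets this one
// function use 256-bit AVX intrinsics even when the file is compiled without -mavx.
// It must only be called on a CPU that actually supports AVX.
__attribute__((target("avx")))
static void add_arrays_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                  // 8 floats per 256-bit register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];      // leftover elements
}

// Plain C++ fallback for CPUs without AVX.
static void add_arrays_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Hypothetical dispatcher: checks the CPU at run time and picks a version.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        add_arrays_avx(a, b, out, n);
    else
        add_arrays_scalar(a, b, out, n);
}
```

The same pattern extends to wider paths, such as checking `"avx2"` or `"avx512f"` before calling a 512-bit version.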
The AVX intrinsics use C++ type names to declare variables for their registers. The float register types all have a double-underscore prefix: “__m128” for 128-bit registers (4 floats), “__m256” for 256-bit registers (8 floats), and “__m512” for 512-bit registers (16 floats). Similarly, there are also register type names for int types (__m128i, __m256i, and __m512i), and types for “double” registers (__m128d, __m256d, and __m512d).
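The register types behave like ordinary fixed-size C++ values, which a few compile-time checks can confirm. The sketch below also shows a detail that often surprises newcomers: `_mm_set_ps` takes its arguments from the highest lane down to lane 0, so they end up reversed in memory order. The helper names `lanes` and `set_ps_lane_order` are hypothetical, chosen for illustration.

```cpp
#include <immintrin.h>  // declares __m128, __m128d, __m128i, __m256, ...
#include <array>

// Each register type has a fixed size: 16 bytes for 128-bit, 32 bytes for 256-bit.
static_assert(sizeof(__m128)  == 16, "__m128 holds 4 floats");
static_assert(sizeof(__m128d) == 16, "__m128d holds 2 doubles");
static_assert(sizeof(__m128i) == 16, "__m128i holds 128 bits of integers");
static_assert(sizeof(__m256)  == 32, "__m256 holds 8 floats");

// Copy the 4 float lanes of an __m128 into an ordinary array (lane 0 -> out[0]).
std::array<float, 4> lanes(__m128 v) {
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), v);
    return out;
}

std::array<float, 4> set_ps_lane_order() {
    // Arguments run from lane 3 down to lane 0, so memory order is {1, 2, 3, 4}.
    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    return lanes(v);
}
```

If you prefer to write values in memory order, the companion intrinsic `_mm_setr_ps` takes them lowest lane first.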
The AVX intrinsic functions are declared as ordinary function prototypes, and their register types as typedefs, in header files. The header files that you may need to include for these intrinsics are <intrin.h> (MSVC), <emmintrin.h> (SSE2 only), and <immintrin.h>, which declares all of the AVX families on GCC, Clang, and MSVC.
Useful AVX SIMD vector intrinsics for float types include:
- Initialize to all-zeros: _mm_setzero_ps, _mm256_setzero_ps
- Set all values to a single float: _mm_set1_ps, _mm256_set1_ps
- Set to 4 or 8 values: _mm_set_ps, _mm256_set_ps
- Load from arrays to AVX registers (unaligned): _mm_loadu_ps, _mm256_loadu_ps
- Store registers back to float arrays (unaligned): _mm_storeu_ps, _mm256_storeu_ps
- Addition: _mm_add_ps, _mm256_add_ps
- Multiplication: _mm_mul_ps (SSE), _mm256_mul_ps (AVX)
- Vector dot product: _mm_dp_ps, _mm256_dp_ps (the 256-bit version computes a separate dot product within each 128-bit half)
- Fused Multiply-Add (FMA): _mm_fmadd_ps, _mm256_fmadd_ps
- Horizontal addition (pairwise): _mm_hadd_ps, _mm256_hadd_ps
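Note that the names of the intrinsic functions have meaningful prefixes and suffixes. The prefix gives the register width: “_mm_” for 128-bit, “_mm256_” for 256-bit, and “_mm512_” for 512-bit. The “_ps” suffix means “packed single-precision” (i.e. float), whereas the “_pd” suffix means “packed double-precision” (i.e. double).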
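To see the suffix convention in practice, compare the same addition on the two element types. A 128-bit register holds 2 doubles with “_pd” or 4 floats with “_ps”. The sketch below uses only SSE2 intrinsics, which every x64 CPU supports, and the function names `add_pairs` and `add_quads` are hypothetical.

```cpp
#include <immintrin.h>

// "_pd": packed double-precision, 2 doubles per 128-bit register.
void add_pairs(const double* a, const double* b, double* out) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

// "_ps": packed single-precision, 4 floats per 128-bit register.
void add_quads(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```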
The new AI programming book by Aussie AI co-founders: Generative AI in C++ (available from Amazon).