Aussie AI
Example: Basic AVX SIMD Multiply
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Let us do a basic element-wise SIMD multiply using AVX (version 1) and its 128-bit registers. This will do a paired vector multiply of an array of 4 float numbers (i.e., 4 x 32-bit float = 128 bits). Each float in the resulting array is the pairwise multiplication of the corresponding elements in the two operands. This is how SIMD instructions work: they operate on each element of the array (i.e., “pairwise” or “element-wise”). For example, a “vertical” multiply takes the 4 float values in one input array, multiplies each of them by the corresponding float in the other input array of 4 float numbers, and then returns a resulting output array with 4 float values.
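The same element-wise multiply can be written as plain scalar C++: just a loop over the pairs. A minimal reference sketch (the function name here is illustrative, not from the book):

```cpp
#include <cassert>

// Scalar reference version of an element-wise ("vertical") multiply:
// out[i] = v1[i] * v2[i] for each of the n elements.
// (Illustrative helper; the name is hypothetical, not from the book.)
void multiply_vectors_scalar(const float* v1, const float* v2,
                             float* out, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = v1[i] * v2[i];  // One pairwise multiply per element
    }
}
```

A SIMD multiply computes all four of these products in a single instruction, instead of four separate loop iterations.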
For testing, let us assume we want to create an AVX function that multiplies 4 float values element-wise. The test code looks like:
float arr1[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
float arr2[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
float resultarr[4];
// Multiply element-wise
aussie_multiply_vectors(arr1, arr2, resultarr, 4);
Testing the results of the multiply as an element-wise multiply of each pair in the 4 float values (using my home-grown “ytestf” unit testing function that compares float numbers for equality):
ytestf(resultarr[0], 1.0f * 1.0f);  // Unit tests
ytestf(resultarr[1], 2.5f * 2.5f);
ytestf(resultarr[2], 3.14f * 3.14f);
ytestf(resultarr[3], 0.0f * 0.0f);
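The “ytestf” helper is the author's home-grown unit-test routine and its implementation isn't shown in this excerpt. A minimal stand-in, assuming it simply compares two float values and reports mismatches, might look like:

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical stand-in for the book's "ytestf" helper: compares two
// float values within a small tolerance and reports any mismatch.
// (Sketch only; the real ytestf implementation is not shown here.)
bool ytestf(float actual, float expected)
{
    const float epsilon = 1e-6f;  // Tolerance for float comparison
    if (std::fabs(actual - expected) > epsilon) {
        std::printf("TEST FAILED: got %f, expected %f\n", actual, expected);
        return false;
    }
    return true;
}
```

Exact equality comparison of float values is fragile in general, so this sketch uses a small tolerance rather than `==`.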
Here's the low-level C++ code that actually does the SIMD multiply using the “_mm_mul_ps” AVX intrinsic function:
#include <xmmintrin.h>
#include <intrin.h>

void aussie_avx_multiply_4_floats(
    float v1[4], float v2[4], float vresult[4])
{
    // Multiply 4x32-bit float in 128-bit AVX registers
    __m128 r1 = _mm_loadu_ps(v1);     // Load floats
    __m128 r2 = _mm_loadu_ps(v2);
    __m128 dst = _mm_mul_ps(r1, r2);  // AVX SIMD Multiply
    _mm_storeu_ps(vresult, dst);      // Convert back to floats
}
Explaining this code one line at a time:
1. The header files are included: <xmmintrin.h> and <intrin.h>.
2. The basic AVX register type is “__m128”, which is a 128-bit AVX register (i.e., it is 128 bits in the basic AVX version, not AVX-2 or AVX-512).
3. The variables “r1” and “r2” are declared as __m128 registers. The names “r1” and “r2” are not important; they are just variable names.
4. The intrinsic function “_mm_loadu_ps” is used to convert the arrays of 4 float values into the 128-bit register type, and the result is “loaded” into the “r1” and “r2” 128-bit variables.
5. Another 128-bit variable “dst” is declared to hold the results of the SIMD multiply. The name “dst” can be any variable name.
6. The main AVX SIMD multiply is performed by the “_mm_mul_ps” intrinsic function. The suffix “ps” means “packed single-precision” (i.e., packed 32-bit float values). This is where the rubber meets the road: the results of the element-wise multiplication of registers “r1” and “r2” are computed and saved into the “dst” register. It is analogous to the basic C++ expression: dst = r1 * r2;
7. The 128-bit result register variable “dst” is converted back to 32-bit float values (4 of them) by “storing” the 128 bits into the float array using the “_mm_storeu_ps” AVX intrinsic.
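Note that the earlier test code called a more general function, “aussie_multiply_vectors”, which takes a length argument. One way to build it on the same “_mm_mul_ps” kernel is to process the array in 128-bit chunks of 4 floats, with a scalar loop for any leftover elements. The sketch below is a possible reconstruction, not the book's actual implementation (and it omits <intrin.h>, which is MSVC-specific; <xmmintrin.h> alone suffices for these intrinsics on GCC/Clang):

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps, etc.

// Sketch of a general element-wise multiply for n floats, built on the
// same _mm_mul_ps intrinsic: SIMD for each chunk of 4, scalar remainder.
// (Reconstruction for illustration; not the book's actual code.)
void aussie_multiply_vectors(const float* v1, const float* v2,
                             float* vresult, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {  // Whole 128-bit chunks of 4 floats
        __m128 r1 = _mm_loadu_ps(&v1[i]);
        __m128 r2 = _mm_loadu_ps(&v2[i]);
        _mm_storeu_ps(&vresult[i], _mm_mul_ps(r1, r2));
    }
    for (; i < n; i++) {          // Scalar tail for leftover elements
        vresult[i] = v1[i] * v2[i];
    }
}
```

The “u” in “_mm_loadu_ps” and “_mm_storeu_ps” means unaligned: these intrinsics accept any address, whereas the aligned variants “_mm_load_ps” and “_mm_store_ps” require 16-byte alignment.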