Aussie AI

Example: Basic AVX SIMD Multiply

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

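Let us do a basic element-wise SIMD multiply using AVX (version 1) and its 128-bit registers. This performs a paired vector multiply on two arrays of 4 float numbers (i.e. 4 x 32-bit float = 128 bits). Each float in the resulting array is the product of the corresponding elements in the two operands. Strictly speaking, the 128-bit “__m128” type and the “_mm_mul_ps” intrinsic used here originated in the earlier SSE instruction set, which AVX-capable CPUs also support.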

This is how SIMD instructions work, by operating on each element of the array (i.e. “pairwise” or “element-wise”). For example, a “vertical” multiply will take the 4 float values in one input array, and multiply each of them by the corresponding float in the other input array of 4 float numbers, and then will return a resulting output array with 4 float values.
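
For comparison, here is a minimal sketch of the same element-wise multiply written as an ordinary scalar loop, one pair at a time. The function name “scalar_multiply_4_floats” is hypothetical and is not from the book; it only illustrates what the SIMD version computes in a single instruction.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference (not from the book): multiplies each pair
// of corresponding elements, one at a time, without any SIMD.
void scalar_multiply_4_floats(
    const float v1[4], const float v2[4], float vresult[4])
{
    assert(v1 != nullptr && v2 != nullptr && vresult != nullptr);
    for (std::size_t i = 0; i < 4; ++i) {
        vresult[i] = v1[i] * v2[i];   // Element i of each input array
    }
}
```

A SIMD multiply produces the same 4 outputs, but computes them all at once rather than in 4 loop iterations.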

For testing, let us assume we want to create an AVX function that multiplies 4 float values element-wise. The test code looks like:

    float arr1[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
    float arr2[4] = { 1.0f , 2.5f , 3.14f, 0.0f };
    float resultarr[4];
    // Multiply element-wise
    aussie_avx_multiply_4_floats(arr1, arr2, resultarr);

The results can then be tested against an element-wise multiply of each pair of the 4 float values (using my home-grown “ytestf” unit testing function that compares float numbers for equality):

    ytestf(resultarr[0], 1.0f * 1.0f);  // Unit tests
    ytestf(resultarr[1], 2.5f * 2.5f);
    ytestf(resultarr[2], 3.14f * 3.14f);
    ytestf(resultarr[3], 0.0f * 0.0f);
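
The “ytestf” function itself is not shown in this excerpt. As a rough sketch of how a float comparison helper might work, the code below compares two values within a small relative tolerance. The name “floats_nearly_equal” and the default tolerance are assumptions for illustration only, not the book's implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a float unit-test comparison (not the book's ytestf).
// Returns true if the two values agree within a small relative tolerance.
bool floats_nearly_equal(float actual, float expected, float tolerance = 1e-5f)
{
    assert(tolerance >= 0.0f);
    float diff = std::fabs(actual - expected);
    // Scale the tolerance by the larger magnitude, but never below 1.0,
    // so that values near zero are compared with an absolute tolerance.
    float scale = std::fmax(1.0f, std::fmax(std::fabs(actual), std::fabs(expected)));
    return diff <= tolerance * scale;
}
```

A tolerance-based comparison is a common choice for float tests in general, because results that are mathematically equal can differ in their last bits when computed by different routes.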

Here's the low-level C++ code that actually does the SIMD multiply using the “_mm_mul_ps” AVX intrinsic function:

    #include <xmmintrin.h>  // __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps
    #include <intrin.h>     // MSVC header; with GCC/Clang use <immintrin.h>

    void aussie_avx_multiply_4_floats(
        float v1[4], float v2[4], float vresult[4])
    {
        // Multiply 4x32-bit float in 128-bit AVX registers
        __m128 r1 = _mm_loadu_ps(v1);   // Load floats (unaligned load)
        __m128 r2 = _mm_loadu_ps(v2);
        __m128 dst = _mm_mul_ps(r1, r2);   // AVX SIMD Multiply
        _mm_storeu_ps(vresult, dst);  // Store back to floats
    }

Explaining this code one line at a time:

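    1. The header files are included: <xmmintrin.h> and <intrin.h>. Note that <intrin.h> is the Microsoft Visual C++ header; with GCC or Clang, <immintrin.h> is the usual equivalent.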

    2. The basic AVX register type is “__m128” which is an AVX 128-bit register (i.e., it is 128 bits in the basic AVX version, not AVX-2 or AVX-512).

    3. The variables “r1” and “r2” are declared as “__m128” registers. The names “r1” and “r2” are not important, and are just variable names.

    4. The intrinsic function “_mm_loadu_ps” is used to load each array of 4 float values into the 128-bit register type, and the results are “loaded” into the “r1” and “r2” 128-bit variables. The “u” means an unaligned load, so the arrays do not need to be 16-byte aligned.

    5. Another 128-bit variable “dst” is declared to hold the results of the SIMD multiply. The name “dst” can be any variable name.

    6. The main AVX SIMD multiply is performed by the “_mm_mul_ps” intrinsic function. The suffix “ps” means “packed single-precision” (i.e. multiple 32-bit float values packed into one register). This is where the rubber meets the road, and the results of the element-wise multiplication of registers “r1” and “r2” are computed and saved into the “dst” register. It is analogous to the basic C++ expression: dst = r1*r2;

    7. The 128-bit result register variable “dst” is converted back to 32-bit float values (4 of them), by “storing” the 128 bits into the float array using the “_mm_storeu_ps” AVX intrinsic.
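
The same load, multiply, and store steps can be repeated in a loop to handle longer arrays. The following is a minimal sketch under stated assumptions: the function name “multiply_float_arrays” and its parameters are hypothetical (not from the book), the target is an x86 CPU, and any leftover elements (when the length is not a multiple of 4) are handled by a plain scalar loop.

```cpp
#include <cassert>
#include <xmmintrin.h>   // __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper (not from the book): element-wise multiply of two
// float arrays of length n, processing 4 floats per SIMD instruction.
void multiply_float_arrays(
    const float* v1, const float* v2, float* vresult, int n)
{
    assert(n >= 0);
    int i = 0;
    // Main loop: each iteration handles one 128-bit chunk (4 floats)
    for (; i + 4 <= n; i += 4) {
        __m128 r1 = _mm_loadu_ps(v1 + i);     // Load 4 floats (unaligned)
        __m128 r2 = _mm_loadu_ps(v2 + i);
        __m128 dst = _mm_mul_ps(r1, r2);      // 4 multiplies at once
        _mm_storeu_ps(vresult + i, dst);      // Store 4 results
    }
    // Tail loop: the remaining 0 to 3 elements, one at a time
    for (; i < n; ++i) {
        vresult[i] = v1[i] * v2[i];
    }
}
```

The scalar tail loop keeps the sketch simple. It avoids reading or writing past the end of the arrays when n is not a multiple of 4.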


