Aussie AI

30. Vectorization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Your direction is more important than your speed.”

— Richard L. Evans.

 

 

What is Vectorization?

Vectorization is the name given to transforming a software loop from running sequentially over an array of data to performing the same computation fully in parallel, by sending the data to a GPU or to the CPU's SIMD extensions. Vectorization uses techniques from loop optimizations to transform loops into faster parallelizable versions, such as “unrolling” a loop into all its element-wise actions, and strip mining (also called “loop sectioning”), which breaks the array into segments that are the right size to fit in parallel into your GPU or CPU SIMD extensions. In theory, a good optimizing compiler can do vectorization optimizations automatically for simple loops, but often you have to do it yourself.

Vectorization with AVX Intrinsics

The AVX intrinsics are C++ built-in functions that wrap around SIMD instruction codes in the x86 instruction set. The basic AVX intrinsics operate on 128 bits (four 32-bit float values), AVX-2 on 256 bits (8 float values), and AVX-512 on 512 bits (surprise!), which is 16 float numbers. The upcoming AVX-10 (announced in July 2023) is also 512 bits, but with extra capabilities.

Obviously, since the largest number of floating-point values that can be parallelized is 16, the AVX intrinsics cannot fully vectorize a larger vector of many float values, such as an AI model with dimension 1024. Instead, we can use AVX intrinsics on segments of vectors, and thereby vectorize chunks of the right size to get a speedup.

Example: AVX Vectorized Dot Product

Here is the basic non-vectorized dot product computation without any optimization attempts.

    float aussie_vecdot_basic(float v1[], float v2[], int n)   
    {
        // Basic FLOAT vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

To use AVX to vectorize it, we need to unroll the loop first. Here's a simple vector dot product with its inner loop unrolled 4 times. This version assumes that n is a multiple of 4 rather than handling odd cases:

    float aussie_vecdot_unroll4_basic(float v1[], float v2[], int n)
    {
        // Loop-unrolled Vector dot product 
        if (n % 4 != 0) {
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        float sum = 0.0;
        for (int i = 0; i < n; ) {
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
            sum += v1[i] * v2[i]; i++;
        }
        return sum;
    }

So, now we can change those 4 unrolled multiplications into one AVX computation of the vector dot product of 4 float numbers.

    #include <intrin.h>   // MSVC intrinsics header (use <immintrin.h> on GCC/Clang)

    float aussie_vecdot_unroll_AVX1(float v1[], float v2[], int n)  
    {                
        // AVX-1 loop-unrolled (4 floats) vector dot product 
        if (n % 4 != 0) {
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        float sum = 0.0;
        for (int i = 0; i < n; i += 4) {
            // AVX1: Vector dot product of 2 vectors 
            //  ... process 4x32-bit floats in 128 bits
            __m128 r1 = _mm_loadu_ps(&v1[i]);   // Load floats into 128-bits
            __m128 r2 = _mm_loadu_ps(&v2[i]);
            __m128 dst = _mm_dp_ps(r1, r2, 0xf1); // Dot product
            sum += _mm_cvtss_f32(dst);
        }
        return sum;
    }

This basic AVX sequence of code for the 4-float dot product has been analyzed in a separate chapter. The main dot product computation is “_mm_dp_ps”, an AVX intrinsic that multiplies 4 pairs of 32-bit float numbers and then sums the products, all in one call. Note that the loop now iterates 4 at a time through the array of float values (i.e. “i+=4”) and the AVX intrinsic does the rest.

Here's the benchmark analysis showing that the AVX-vectorized version is more than twice as fast:

    FLOAT Vector dot product benchmarks:
    Time taken: Vecdot basic: 2805 ticks (2.81 seconds)
    Time taken: Vecdot AVX1 unroll (4 floats, 128-bits): 1142 ticks (1.14 seconds)

Fused Multiply-Add (FMA) in AVX-2. The AVX-2 FMA intrinsic takes 3 vectors, each of size 256-bits, multiplies two of them pair-wise, and then adds the third vector. Both the multiplication and addition are done in element-wise SIMD style. At first blush this sounds like doing a vector multiply and then adding a “bias” vector, and hence doesn't sound like a good optimization for the vector dot product. The SIMD pairwise multiplication is the first step of dot products, but the vector addition seems the opposite of what we want, which is “horizontal” addition of the products that result from the multiplications.

The default idea is doing a dot product of 8 float values, and then another one, and then adding each individual sum at the end. With that idea, the vertical addition in FMA is not what we want, and it looks like using SIMD multiplication and an extra horizontal addition would be better than using a single FMA intrinsic. However, we can make like Superman III...

Reverse it!

If you think about FMA not as a multiplication and then addition, but as “adding multiplications” in the reverse order, then there is a eureka moment: put the addition first. The idea is that we can maintain a vector of running sums, and then only do a single horizontal addition at the very end. It's kind of mind-bending, but here's the code:

    float aussie_vecdot_FMA_unroll_AVX2(float v1[], float v2[], int n)   
    {
        // AVX2 vecdot using FMA (Fused Multiply-Add) primitives
        if (n % 8 != 0) {
            yassert(n % 8 == 0);
            return 0.0; // fail
        }
        __m256 sumdst = _mm256_setzero_ps();   // Set accumulators to zero
        for (int i = 0; i < n; i += 8) {
            // AVX2: process 8x32-bit floats in 256 bits
            __m256 r1 = _mm256_loadu_ps(&v1[i]);   // Load floats into 256-bits
            __m256 r2 = _mm256_loadu_ps(&v2[i]);
            sumdst = _mm256_fmadd_ps(r1, r2, sumdst); // FMA of 3 vectors
        }

        // Add the final 8 accumulators manually
        float* farr = (float*)&sumdst;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7]; 
        return sum;
    }

How does this work? Well, we declare “sumdst” as a vector of 8 float numbers that maintains 8 parallel accumulators, initialized to all-zeros via the “_mm256_setzero_ps” intrinsic. In the main loop, we use “sumdst” to maintain a running sum in all 8 of those parallel accumulators across multiple segments of the vector. One accumulator sums the products at array indices 0,8,16,..., the next accumulator sums the products at indices 1,9,17,..., and so on. We use the FMA intrinsic (“_mm256_fmadd_ps” in AVX2) to do the SIMD multiplication, but rather than trying to add the 8 resulting products together, we add each product to a separate accumulator. This works very neatly, because the combined FMA intrinsic does all of this in SIMD parallelism. Only at the very end, after the main loop, do we do a horizontal add of the 8 parallel accumulators to get the final sum.

This idea works surprisingly well, and is gratifying since I couldn't get the AVX-2 256-bit version with the dot product “_mm256_dp_ps” intrinsic to run correctly on 8 float values (that intrinsic operates on the two 128-bit lanes independently, which complicates extracting a single sum). Here's the benchmarking, which shows that AVX-2 using FMA on 8 float values in parallel runs much faster than the AVX1 unrolled vector dot product using the “_mm_dp_ps” intrinsic on 4 float values.

    FLOAT Vector dot product benchmarks: (N=1024, Iter=1000000)
    Vecdot basic: 2961 ticks (2.96 seconds)
    Vecdot AVX1 unroll (4 floats, 128-bits): 1169 ticks (1.17 seconds)
    Vecdot AVX1 FMA (4 floats, 128-bits): 1314 ticks (1.31 seconds)
    Vecdot AVX2 FMA (8 floats, 256-bits): 783 ticks (0.78 seconds)

Note that we can improve on the horizontal addition at the very end. The example code just uses basic C++ with 7 additions and 8 array index computations. This last computation should really use the AVX “hadd” intrinsics, combined with lane extraction, to reduce the 8 float values to one.
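As a sketch of that improvement (assuming an AVX-2 accumulator register named “sumdst” as in the code above; untested here), one common sequence adds the two 128-bit lanes and then applies “_mm_hadd_ps” twice:

        // Sketch: horizontal add of the 8 floats in the __m256 accumulator
        __m128 lo  = _mm256_castps256_ps128(sumdst);    // Lower 4 floats
        __m128 hi  = _mm256_extractf128_ps(sumdst, 1);  // Upper 4 floats
        __m128 par = _mm_add_ps(lo, hi);                // 4 partial sums
        par = _mm_hadd_ps(par, par);                    // 2 partial sums
        par = _mm_hadd_ps(par, par);                    // Final sum in element 0
        float sum = _mm_cvtss_f32(par);                 // Extract to a scalar float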

Example: AVX Vector Sum Reduction

Let us suppose we need to calculate the sum of all the elements of a vector. This is a “reduction” that has dimensions “vector-to-scalar.” Here is a basic naive C++ version without any optimizations:

    float aussie_vector_sum(float v[], int n)  // Summation
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += v[i];
        }
        return sum;
    }

AVX vector reductions have some issues in the earlier versions. Although AVX has SIMD instructions to add two vectors in parallel, it struggles with a “reduction” operation like this. AVX and AVX-2 do have “horizontal add” (“hadd”) intrinsics, but these only do pairwise additions within a single vector, rather than adding all the elements. AVX-512 has a “reduce add” intrinsic (“_mm512_reduce_add_ps”) that horizontally adds 16 float numbers, which works a lot better.

For AVX and AVX-2, are we stuck with doing multiple calls to the pairwise “hadd” intrinsics? No, there's a non-obvious way to use the “vertical add” intrinsics in parallel. We can do “in parallel” squared. It's almost like we're doing math inside a computer.

The trick is to use the AVX registers as a set of 4 parallel accumulators (AVX 128 bits) or 8 parallel accumulators (AVX-2's 256 bits). In this way, we can defer the “hadd” until the very end, and since it's not in the critical loop, its performance hardly matters. Here's the code for AVX-1 with 128-bit registers:

    float aussie_vector_sum_AVX1(float v[], int n)   // Summation (horizontal) of a single vector
    {
        if (n % 4 != 0) {  // Safety
            yassert(n % 4 == 0);
            return 0.0; // fail
        }

        __m128 sumdst = _mm_setzero_ps();   // Set accumulators to zero
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_add_ps(r1, sumdst); // SUM = SUM + V
        }

        // Add the final 4 accumulators manually
        float* farr = sumdst.m128_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3];
        return sum;
    }

The AVX-2 version is faster, because it processes 8 float values at a time. This uses the same strategy of 8 parallel accumulators and a loop unrolling factor of 8 (i.e. the loop incrementer is now “i+=8”). Here's the C++ code:

    float aussie_vector_sum_AVX2(float v[], int n)   // Summation (horizontal) of a single vector
    {
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }

        __m256 sumdst = _mm256_setzero_ps();   // Set 8 accumulators to zero
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats into 256-bits
            sumdst = _mm256_add_ps(r1, sumdst); // SUM = SUM + V
        }

        // Add the final 8 accumulators manually
        float* farr = sumdst.m256_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7]; 
        return sum;
    }

I've been lazy in not bothering to optimize the final horizontal addition. A small extra speedup is probably available by using the “hadd” intrinsics a few times in a row to drop it down from 8 accumulators to a single float. If this were AVX-512, we could use the horizontal reduction “_mm512_reduce_add_ps” intrinsic for the summation at the end (adding 16 partial sums of type float).
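For comparison, here is a minimal AVX-512 sketch of the same summation (the function name is illustrative and this version is untested here; it assumes n is a multiple of 16):

    float aussie_vector_sum_AVX512(float v[], int n)   // Summation (sketch)
    {
        __m512 sumdst = _mm512_setzero_ps();     // Set 16 accumulators to zero
        for (int i = 0; i < n; i += 16) {
            __m512 r1 = _mm512_loadu_ps(&v[i]);  // Load 16 floats into 512 bits
            sumdst = _mm512_add_ps(r1, sumdst);  // SUM = SUM + V
        }
        return _mm512_reduce_add_ps(sumdst);     // Horizontal add of the 16 partial sums
    }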

Loop Peeling Optimization: Another inefficiency with these AVX addition routines is that they needlessly perform an addition with zero in the first iteration. Effectively, we need to do “loop peeling” to handle the first loop iteration differently. This is the slow first iteration of the AVX2 vector sum:

    __m256 sumdst = _mm256_setzero_ps();   // Set 8 accumulators to zero
    for (int i = 0; i < n; i += 8) {
        // ... 
    }

Loop peeling says to replace the zero initialization with a load of the first 8 values from the vector. The loop then starts its first iteration at i=8 instead of i=0, skipping what had been the first addition:

    __m256 sumdst = _mm256_loadu_ps(&v[0]);  // Get first 8 values
    for (int i = 8 /*not 0!*/; i < n; i += 8) {
        // ... same
    }
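Putting the two fragments together, a sketch of the fully peeled AVX-2 summation might look like this (the function name is illustrative; it assumes n is a non-zero multiple of 8):

    float aussie_vector_sum_AVX2_peeled(float v[], int n)   // Summation (sketch with loop peeling)
    {
        yassert(n >= 8 && n % 8 == 0);
        __m256 sumdst = _mm256_loadu_ps(&v[0]);   // Peeled first iteration: load first 8 values
        for (int i = 8; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load next 8 floats into 256-bits
            sumdst = _mm256_add_ps(r1, sumdst);   // SUM = SUM + V
        }
        // Add the final 8 accumulators manually
        float* farr = (float*)&sumdst;
        return farr[0] + farr[1] + farr[2] + farr[3]
             + farr[4] + farr[5] + farr[6] + farr[7];
    }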

AVX Vector Max and Min Reductions

The need to find a minimum or maximum of a vector's elements is similar to a summation reduction. Again, AVX1 and AVX2 don't have proper “reduction” intrinsics for max or min, but we can compute them in parallel by keeping a running min or max value of 4 or 8 float values (i.e. analogous to parallel accumulators when doing summation). The AVX intrinsics are:

  • MIN: _mm_min_ps, _mm256_min_ps
  • MAX: _mm_max_ps, _mm256_max_ps

Here is the AVX1 version of MAX vector reduction:

    float aussie_vector_max_AVX1(float v[], int n)   
    {
        // Maximum (horizontal) of a single vector
        if (n % 4 != 0) {
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial values
        for (int i = 4 /*not 0*/; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_max_ps(r1, sumdst); // dst = MAX(dst, r1)
        }
        // Find Max of the final 4 accumulators
        float* farr = sumdst.m128_f32;
        float fmax = farr[0];
        if (farr[1] > fmax) fmax = farr[1];
        if (farr[2] > fmax) fmax = farr[2];
        if (farr[3] > fmax) fmax = farr[3];
        return fmax;
    }

And here is the analogous AVX2 version of MAX vector reduction:

    float aussie_vector_max_AVX2(float v[], int n)
    {
        // Maximum (horizontal) of a single vector
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }
        __m256 sumdst = _mm256_loadu_ps(&v[0]);   // Initial 8 values
        for (int i = 8/*not 0*/; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats into 256-bits
            sumdst = _mm256_max_ps(r1, sumdst); // dst = MAX(dst, r1)
        }

        // Find Max of the final 8 accumulators
        float* farr = sumdst.m256_f32;
        float fmax = farr[0];
        if (farr[1] > fmax) fmax = farr[1];
        if (farr[2] > fmax) fmax = farr[2];
        if (farr[3] > fmax) fmax = farr[3];
        if (farr[4] > fmax) fmax = farr[4];
        if (farr[5] > fmax) fmax = farr[5];
        if (farr[6] > fmax) fmax = farr[6];
        if (farr[7] > fmax) fmax = farr[7];
        return fmax;
    }

The MIN versions are very similar. They use the “min” AVX intrinsics, and the final steps use “<” not “>” operations. Here's the AVX1 version of a MIN vector reduction:

    float aussie_vector_min_AVX1(float v[], int n)
    {
        // Minimum (horizontal) of a single vector
        if (n % 4 != 0) {
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_loadu_ps(&v[0]);   // Initial values
        for (int i = 4 /*not 0*/; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats into 128-bits
            sumdst = _mm_min_ps(r1, sumdst); // dst = MIN(dst, r1)
        }
        // Find Min of the final 4 accumulators
        float* farr = sumdst.m128_f32;
        float fmin = farr[0];
        if (farr[1] < fmin) fmin = farr[1];
        if (farr[2] < fmin) fmin = farr[2];
        if (farr[3] < fmin) fmin = farr[3];
        return fmin;
    }

This is the AVX2 version of a MIN vector reduction:

    float aussie_vector_min_AVX2(float v[], int n)   // Minimum (horizontal) of a single vector
    {
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }
        __m256 sumdst = _mm256_loadu_ps(&v[0]);   // Initial 8 values
        for (int i = 8/*not 0*/; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]); // Load floats into 256-bits
            sumdst = _mm256_min_ps(r1, sumdst); // dst = MIN(dst, r1)
        }

        // Find Min of the final 8 accumulators
        float* farr = sumdst.m256_f32;
        float fmin = farr[0];
        if (farr[1] < fmin) fmin = farr[1];
        if (farr[2] < fmin) fmin = farr[2];
        if (farr[3] < fmin) fmin = farr[3];
        if (farr[4] < fmin) fmin = farr[4];
        if (farr[5] < fmin) fmin = farr[5];
        if (farr[6] < fmin) fmin = farr[6];
        if (farr[7] < fmin) fmin = farr[7];
        return fmin;
    }

These versions are not especially optimized. AVX-512 would allow us to further vectorize to 16 float values. Also, the final computation of the maximum or minimum of 8 float numbers is far from optimal. One option is to keep using the SIMD “min”/“max” intrinsics on shuffled registers (pairwise, a few times) to reduce within the register. Or we can at least avoid some comparisons by doing the scalar comparisons pairwise. Here's the alternative for the AVX1 minimum computation:

        // Find Min of the final 4 accumulators
        #define FMIN(x,y)  ( (x) < (y) ? (x) : (y) )
        float* farr = sumdst.m128_f32;
        float fmin1 = FMIN(farr[0], farr[1]);
        float fmin2 = FMIN(farr[2], farr[3]);
        float fmin = FMIN(fmin1, fmin2);
        return fmin;
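Alternatively, here is a sketch of doing that final step with the SIMD “min” intrinsic and shuffles, keeping the values in a register until the last extraction (untested here):

        // Sketch: horizontal MIN of the 4 floats in the __m128 register
        __m128 high = _mm_movehl_ps(sumdst, sumdst);   // Move upper 2 floats to lower half
        __m128 m = _mm_min_ps(sumdst, high);           // Pairwise min: 2 candidates remain
        m = _mm_min_ss(m, _mm_shuffle_ps(m, m, 1));    // Min of the last 2 candidates
        return _mm_cvtss_f32(m);                       // Extract the final minimum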

These functions can also have their main loops further improved. Other basic optimizations would include using loop pointer arithmetic to remove the index variable “i” and also unrolling the loop body multiple times.

Vectorized Sum-of-Squares Reduction

The sum of the squares of a vector's elements has various applications in our AI Engine. Firstly, it can be used to compute the magnitude of a vector. Secondly, the sum-of-squares is used in various normalization functions, as part of computing the variance from the sum-of-squares of the differences between the values and the mean. The RMS factor in RMSNorm is also computed from a sum-of-squares (the square root of its mean).

The method to add up the sum-of-squares for a vector reduction to a single float is very similar to a simple summation reduction. The idea for AVX1 and AVX2 is to keep 4 or 8 running sum accumulators, and then add them up at the final step.

Here is the AVX1 version of sum-of-squares of a vector:

    float aussie_vector_sum_squares_AVX1(float v[], int n)  
    {
        // Summation of squares of all elements
        if (n % 4 != 0) { // Safety check (no extra cases)
            yassert(n % 4 == 0);
            return 0.0; // fail
        }
        __m128 sumdst = _mm_setzero_ps();   // Zero accumulators
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]); // Load floats
            __m128 sqr = _mm_mul_ps(r1, r1);   // Square (V*V)
            sumdst = _mm_add_ps(sqr, sumdst); // SUM = SUM + V*V
        }
        // Add the final 4 accumulators manually
        float* farr = sumdst.m128_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3];
        return sum;
    }

And here is the AVX2 version of sum-of-squares:

    float aussie_vector_sum_squares_AVX2(float v[], int n)  
    {
        // Summation of squares of all elements
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }

        __m256 sumdst = _mm256_setzero_ps();   // Zero accumulators
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats
            __m256 sqr = _mm256_mul_ps(r1, r1);   // Square (V*V)
            sumdst = _mm256_add_ps(sqr, sumdst); // SUM = SUM + V*V
        }

        // Add the final 8 accumulators manually
        float* farr = sumdst.m256_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7];
        return sum;
    }

Various optimizations can be further applied to these versions. Like the summation reduction, these loops needlessly add zero in the first iteration, and loop peeling should be used to split out the first iteration separately. The final horizontal addition of 4 or 8 float values should be optimized. AVX-512 could be used for greater parallelism across 16 float numbers. Finally, basic loop optimizations of pointer arithmetic and loop unrolling could be applied.
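One further idea (a sketch, not part of the benchmarks above) is to fuse the multiply and the add in the AVX-2 inner loop using the same FMA intrinsic as the dot product version:

    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);        // Load 8 floats
        sumdst = _mm256_fmadd_ps(r1, r1, sumdst);  // SUM = SUM + V*V (fused multiply-add)
    }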

Vectorized Multiply Vector by Scalar

The requirement to multiply a vector by a scalar is common, such as when scaling a vector by a constant factor. Division by a scalar is also handled by multiplying by the reciprocal (e.g. as needed for Softmax). Multiplication by a scalar is amenable to vectorization because the naive C++ version is very simple:

    void aussie_vector_multiply_scalar(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        for (int i = 0; i < n; i++) {
            v[i] *= c;
        }
    }

Loop Pointer Arithmetic. First, we can try the basic C++ optimization of pointer arithmetic:

    void aussie_vector_multiply_scalar_pointer_arith(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        for (; n > 0; n--, v++) {
            *v *= c;
        }
    }

AVX1 multiply-by-scalar: There is no special scalar multiplication opcode in AVX or AVX-2, but we can populate a constant register (128-bit or 256-bit) with multiple copies of the scalar (i.e. _mm_set1_ps or _mm256_set1_ps), and we need do this only once. We can then use the SIMD multiply intrinsics in the unrolled loop section. The AVX 128-bit vector multiplication by scalar becomes:

    void aussie_vector_multiply_scalar_AVX1(float v[], int n, float c)  
    { 
        const __m128 rscalar = _mm_set1_ps(c);  // Vector of scalars
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats
            __m128 dst = _mm_mul_ps(r1, rscalar); // Multiply by scalars
            _mm_store_ps(&v[i], dst);  // Store back to floats (requires 16-byte alignment)
        }
    }

AVX2 multiply-by-scalar: Even faster is to use 8 parallel multiplications with AVX-2's 256-bit registers. The AVX-1 version is simply changed to use the “__m256” type and the analogous AVX-2 intrinsics. The new code looks like:

    void aussie_vector_multiply_scalar_AVX2(float v[], int n, float c)  
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector of scalars
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats
            __m256 dst = _mm256_mul_ps(r1, rscalar); // Multiply by scalars
            _mm256_store_ps(&v[i], dst);  // Store back to floats (requires 32-byte alignment)
        }
    }

Combining AVX-2 with pointer arithmetic. Finally, we can get a small extra benefit by adding pointer arithmetic optimizations to the AVX-2 parallelized version. The new code is:

    void aussie_vector_multiply_scalar_AVX2_pointer_arith(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        const __m256 rscalar = _mm256_set1_ps(c);  // vector full of scalars...
        for (; n > 0; n -= 8, v += 8) {
            __m256 r1 = _mm256_loadu_ps(v);   // Load floats into 256-bits
            __m256 dst = _mm256_mul_ps(r1, rscalar);   // Multiply by scalars
            _mm256_store_ps(v, dst);  // Store back to floats (requires 32-byte alignment)
        }
    }

Benchmarking results. In theory, the AVX-2 intrinsics could parallelize the computation by 8 times, but benchmarking showed that it only achieved about a 4-times speedup (presumably the loop is limited more by memory accesses than by the multiplications).

    Vector-scalar operation benchmarks (N=1024, ITER=1000000):
    Vector mult-scalar C++: 1412 ticks (1.41 seconds)
    Vector mult-scalar pointer-arith: 995 ticks (0.99 seconds)
    Vector mult-scalar AVX1: 677 ticks (0.68 seconds)
    Vector mult-scalar AVX2: 373 ticks (0.37 seconds)
    Vector mult-scalar AVX2 + pointer arith: 340 ticks (0.34 seconds)

Vectorized Add Scalar

The code to vectorize an “add-scalar” operation is almost identical to “multiply-scalar” operations, except that “add” intrinsics are used. Here is the AVX-1 version with “_mm_add_ps”:

    void aussie_vector_add_scalar_AVX1(float v[], int n, float c)
    {
        // Add scalar constant to all vector elements
        const __m128 rscalar = _mm_set1_ps(c);  // Set up vector full of scalars...
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats into 128-bits
            __m128 dst = _mm_add_ps(r1, rscalar);   // Add scalars
            _mm_store_ps(&v[i], dst);  // store back to floats
        }
    }

And this is the analogous AVX-2 version using the “_mm256_add_ps” intrinsic:

    void aussie_vector_add_scalar_AVX2(float v[], int n, float c)  // Add scalar constant to all vector elements
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // vector full of scalars...
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats into 256-bits
            __m256 dst = _mm256_add_ps(r1, rscalar);   // Add scalars
            _mm256_store_ps(&v[i], dst);  // Store back to floats (requires 32-byte alignment)
        }
    }

Vectorized RELU with Max Intrinsics

The RELU activation function simply converts negatives to zero, leaving positives unchanged. This is algebraically equivalent to max(x,0), which can be implemented in AVX like a “max-scalar” operation.

To vectorize RELU applied to a whole vector of float elements, we are effectively doing a SIMD max operation with a scalar zero (i.e., 0.0). Hence, the code is very similar to vectorization of add-scalar, but uses the “_mm_max_ps” intrinsic.

The AVX1 version of vectorized RELU looks like:

    void aussie_vector_reluize_AVX1(float v[], int n)   // Apply RELU to each element (sets negatives to zero)
    {
        if (n % 4 != 0) {
            yassert(n % 4 == 0);
            return; // fail
        }
        const __m128 rzeros = _mm_set1_ps(0.0f);  // Set up vector full of zeros...
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats into 128-bits
            __m128 dst = _mm_max_ps(r1, rzeros);   // MAX(r1,0)
            _mm_store_ps(&v[i], dst);  // store back to floats
        }
    }

And here is the AVX2 version doing 8 float elements at a time using the “_mm256_max_ps” intrinsic:

    void aussie_vector_reluize_AVX2(float v[], int n)  // Apply RELU to each element (sets negatives to zero)
    {
        if (n % 8 != 0) {
            yassert(n % 8 == 0);
            return; // fail
        }
        const __m256 rzeros = _mm256_set1_ps(0.0f);  // vector full of zeros...
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats into 256-bits
            __m256 dst = _mm256_max_ps(r1, rzeros);   // MAX(R1, 0)
            _mm256_store_ps(&v[i], dst);  // store back to floats
        }
    }

Vectorization of Exponentiation

The expf function is very expensive to call, but exponentiation of entire vectors of float values is required in several parts of AI engines, such as activation functions and Softmax normalization. Fortunately, there are AVX intrinsics for SIMD exponentiation operations on small vectors (i.e. 4 float values for AVX-1 and 8 float values for AVX-2). Strictly speaking, these are “SVML” intrinsics supplied by the Intel and Microsoft compilers, rather than single hardware opcodes, but they are still dramatically faster than calling expf in a loop.

The basic C++ version to apply expf to every element of a vector, and store the result in the original vector, looks like this:

    void aussie_vector_expf(float v[], int n)   
    {
        // Apply EXPF (exponential) to each element
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]);
        }
    }

Loop Pointer Arithmetic. Applying the basic C++ optimization of pointer arithmetic, the new code is:

    void aussie_vector_expf_pointer_arith(float v[], int n)
    {
        for (; n > 0; n--, v++) {
            *v = expf(*v);
        }
    }

AVX1 SIMD exponentiation of 4 values: There is an AVX intrinsic called “_mm_exp_ps” to exponentiate 4 float values in parallel using the 128-bit registers. Here's the new vector exponentiation code with loop unrolling every 4 elements and AVX1 vectorization:

    void aussie_vector_expf_AVX1(float v[], int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load floats into 128-bits
            __m128 dst = _mm_exp_ps(r1);   // Exponentiate (expf)
            _mm_store_ps(&v[i], dst);  // Store back to floats (requires 16-byte alignment)
        }
    }

AVX2 SIMD exponentiation of 8 values: The AVX2 intrinsic is “_mm256_exp_ps” to exponentiate 8 elements in parallel using the 256-bit registers. The new code with loop unrolling every 8 values and AVX-2 intrinsics becomes:

    void aussie_vector_expf_AVX2(float v[], int n)  // Apply EXPF (exponential) to each element
    {
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats into 256-bits
            __m256 dst = _mm256_exp_ps(r1);    // Exponentiate (expf)
            _mm256_store_ps(&v[i], dst);  // Store back to floats (requires 32-byte alignment)
        }
    }

Benchmarking results. The results of optimization of exponentiation are striking! AVX1 is massively faster, cutting out 97% of the original computation time, and then AVX2 is faster still. It's almost like hardware is faster than software. Who knew?

    Vector-exponentiation operation benchmarks (N=1024, ITER=100000):
    Vector expf basic: 6695 ticks (6.70 seconds)
    Vector expf pointer-arith: 6395 ticks (6.39 seconds)
    Vector expf AVX1: 260 ticks (0.26 seconds)
    Vector expf AVX2: 124 ticks (0.12 seconds)

Vectorization of Lookup Tables

The use of lookup-tables is already a powerful speed optimization, but we can double down by adding vectorization. The AVX SIMD instruction sets include a variety of “gather” intrinsics that perform parallel array lookups from a vector of integer indices, using a base address.

The basic algorithm we're going to use for AVX SIMD optimization of a LUT precalculation of some mathematical function is as follows (a code sketch follows below):

  • Offline: Precalculate a big LUT for 24 bits with 2^24 elements using non-AVX basic C++ methods.
  • Input: vector of 4 float values (AVX-1) or 8 float values (AVX-2).
  • Use a cast to treat these float arrays as arrays of integers.
  • Load these “int” arrays into an AVX register.
  • AVX shift right by 8 with the “_mm_srli_epi32” (128-bit) or “_mm256_srli_epi32” (256-bit) intrinsic, which shifts right and fills with zero bits, so that the values are now 24-bit numbers in 32 bits, with a zero sign bit (hence, all indices are positive integers).
  • AVX “gather” with base on the LUT array, and scale of 4 (i.e., float byte size).
  • Store the AVX register results back into an array of float values.
  • Output: vector of 4/8 float values with the LUT-calculated function.

Note that we can use a smaller (or bigger) LUT than 24 bits simply by modifying the bitshift counts.
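Here is a minimal sketch of the gather-based lookup for AVX-2 (untested here; the table name “g_lut24” and the function name are illustrative, and the 2^24-entry table is assumed to have been precalculated offline):

    extern float g_lut24[1 << 24];   // Precalculated 24-bit LUT (e.g. for expf)

    void aussie_vector_lut24_AVX2(float v[], int n)   // Sketch: LUT lookup, 8 floats at a time
    {
        for (int i = 0; i < n; i += 8) {
            __m256i bits = _mm256_loadu_si256((const __m256i*)&v[i]); // Treat floats as 32-bit ints
            __m256i idx = _mm256_srli_epi32(bits, 8);          // Shift right 8: 24-bit positive indices
            __m256 dst = _mm256_i32gather_ps(g_lut24, idx, 4); // Gather 8 floats; scale 4 = sizeof(float)
            _mm256_storeu_ps(&v[i], dst);                      // Store the looked-up values back
        }
    }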

LUTs with AVX Shuffle. Another way to implement a LUT in AVX is to use “shuffle” operations on another register. This only works for small lookup tables that have few enough elements to fit inside AVX registers. In other words, this can be fast, but only for 16 or 32 elements in the LUT for AVX-2, or more if you use AVX-512. This optimization is unlikely to be relevant to computing the massive 16-bit or 24-bit LUTs that we need for AI mathematical functions.

AVX SIMD Pointer Dereferences. A corollary to the AVX LUT “gather” functionality is that it can possibly be used to vectorize arrays of pointers, where the pointers are aimed directly at the data without any intervening lookup-table. For example, suppose we have an array of pointers to float (i.e. rather than an array of integer indices), and we want to dereference these addresses to generate the corresponding array of float values. This is analogous to using a lookup table, but with a base address of zero. Hence, we could potentially use AVX “gather” intrinsics with a zero base address and the integer offsets equal to the addresses (i.e. the pointers converted to integers). The x86 platform has 64-bit pointers, so 64-bit integer index offsets are required in the “gather” intrinsic. For example, the AVX2 “_mm256_i64gather_epi32” and “_mm256_i64gather_ps” intrinsics seem to be along these lines with 64-bit indices. I haven't actually tested this approach to check if it works.

Auto-Vectorization and Restricted Pointers

Modern C++ compilers attempt to automatically vectorize simple loops. Basic loop structures can be unrolled by optimizers, either partially or fully, and then sent to hardware acceleration automatically.

One of the most important hints to the compiler is a “restrict” designation on pointer variables. Ironically, the benefit of restrict is to limit what you can code, but also to allow unrestricted use of the pointers by the optimizer.

The restrict attribute is a type qualifier that tells the C++ compiler that a given pointer or array variable is not an “alias” for any other pointer. There are various loop transformations and vectorization optimizations that cannot be performed if the compiler has to be conservative and assume that aliasing could occur.

One of the main uses of restrict is on pointer or array function parameters, because arrays are pointers in this context. For example, if a function takes two vector parameters (e.g. vector addition), declaring both parameters as restrict tells the compiler that neither pointer will ever point into the other vector.
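Here is a sketch of such a vector addition with restricted parameters, using the non-standard “__restrict” spelling that the major compilers accept (the function name is illustrative):

    void aussie_vector_add_restrict(float* __restrict dst, const float* __restrict src, int n)
    {
        for (int i = 0; i < n; i++) {
            dst[i] += src[i];   // Compiler can vectorize freely: dst and src never overlap
        }
    }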

Note that this use of the word “aliasing” refers to two pointers referring to the same object or array (i.e. the pointers are aliases of each other). There is another unrelated but similar use of the term in C++ “aliases” for declarations, which means one function or type with two alias names.

The “restrict” keyword is merely a hint to the optimizer, and recalcitrant C++ compilers are free to ignore the advice. In fact, “restrict” isn't even valid C++, because it's part of C, but not yet in the C++ standard. Nevertheless, various compilers support it or similar extensions like __restrict__, so it can be used in C++ programs.

Restricted pointers don't always need to be marked as such. In some usages, the use of “const” can allow the compiler to infer non-aliasing of parameters, but it probably doesn't hurt to declare it with “restrict” as well. Note also that the C++ compiler is free to assume non-aliasing of pointers of different types, because it is undefined behavior if they are aliases. This is known as the “strict aliasing rule” and this assumption can be disabled in GCC via the option “-fno-strict-aliasing”.

The C++ compiler doesn't really check if you are lying (to yourself). If you tell the compiler that pointers are restricted, and then pass in two aliased pointers, the behavior of your program is “undefined” and there aren't likely to be any compilation errors or runtime warnings. So, don't do that.

The correct declaration of a “restrict” pointer is:

    int * restrict ptr;  // Correct

This is actually incorrect:

    int restrict * ptr;   // Wrong
    restrict int * ptr;   // Also wrong

The syntax for array parameters has the keyword inside the square brackets:

    void myfunc(int arr[restrict]);

Read-only functions. Note that read-only functions don't really need to use the restrict keyword. For example, the calculation of a vector dot product of two arrays doesn't really have an aliasing problem, since neither of the vectors is changed.

Restricted references. The “restrict” type specifier can be used on references, as well as pointers and arrays. This is helpful for some of the issues with aliasing between references in pass-by-reference function parameters. But this usage of restrict for references isn't very important for auto-vectorization optimizations.

Restricted “this” pointer. GCC also supports specifying that the class object “this” pointer is unaliased by marking the function body with the “__restrict__” keyword. This is placed after the closing right parenthesis of the function parameters (i.e. similar to a const member function declaration). The declaration looks like:

    void MyClass::myfunc(int x) __restrict__;

Portability of restrict. The portability of the “restrict” keyword is still somewhat lacking. The C99 standard added the “restrict” keyword to the C language, but it was not added to C++, and has not been widely supported by C++ compilers. The reason that restrict is standardized in C and not C++ is not an oversight; there are a large number of problematic issues in merging it into various C++ language features.

Nevertheless, various non-standard features have been added to C++ compilers and can improve their levels of vectorization optimization. GCC has supported two different keywords, __restrict__ and __restrict, with the same intended meaning. MSVS supports __restrict on pointers, and also has __declspec(restrict), which declares that a function returns an unaliased pointer.
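One common workaround is a project-level macro that maps to whichever spelling the compiler supports. A sketch (the macro name is illustrative):

    #if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
    #define AUSSIE_RESTRICT __restrict   // MSVC, GCC and Clang all accept this spelling
    #else
    #define AUSSIE_RESTRICT              // Unknown compiler: restrict is only an optimization hint
    #endif

    void aussie_vector_add(float* AUSSIE_RESTRICT dst, const float* AUSSIE_RESTRICT src, int n);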

Some compilers have other related keywords for different types of aliasing. Microsoft also has __declspec(noalias), which has a slightly different meaning to restricted pointers, in that it tells the optimizer that a function does not modify global state in a hidden way, other than via its parameters, which is a different type of aliasing. GCC also has a “may_alias” type attribute that permits a pointer to be aliased, to suppress some strict-aliasing warnings (and presumably also won't be optimized very much!).

 
