Vectorized & Fused Loop Softmax
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
What about vectorization applied to these fused loops? Can we do better than using the two vectorized loops in sequence? Can we merge the exponentiation and summation into a single unrolled loop and vectorize that using AVX intrinsics? I'm just teasing you. Of course, we can!
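For context, the un-vectorized version of this fusion is just one pass that exponentiates each element in place and accumulates the sum as it goes. Here's a minimal scalar sketch of that idea (illustrative only, not the book's own listing):

#include <math.h>

// Scalar sketch of fused expf + summation (one pass, illustrative)
float fused_expf_sum_scalar(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);  // Exponentiate in place
        sum += v[i];        // Accumulate the denominator in the same pass
    }
    return sum;
}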
Here is “kernel fusion” of the vector expf and vector summation into a fused expf-summation kernel.
I coded this for both AVX1 and AVX2, and the two versions are very similar in structure. Here is the code for AVX2:
float aussie_vector_fused_expf_sum_AVX2(float v[], int n)   // Fused EXPF and SUMMATION of a single vector
{
    if (n % 8 != 0) {  // Safety check (no extra cases)
        yassert(n % 8 == 0);
        return 0.0;  // fail
    }
    __m256 sumdst = _mm256_setzero_ps();  // Set accumulators to zero
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);       // Load floats into 256-bits
        __m256 expdst = _mm256_exp_ps(r1);        // Exponentiate (expf)
        sumdst = _mm256_add_ps(expdst, sumdst);   // SUM = SUM + V
    }
    // Add the final 8 accumulators manually
    float* farr = sumdst.m256_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3]
              + farr[4] + farr[5] + farr[6] + farr[7];
    return sum;
}
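One portability note: the member access sumdst.m256_f32 relies on the MSVC-specific layout of the __m256 type, and _mm256_exp_ps is an SVML intrinsic (available with the Microsoft and Intel compilers, but not in base GCC or Clang). A more portable way to finish the horizontal sum is to spill the register to a temporary array; here is a sketch, assuming the rest of the routine is unchanged:

#include <immintrin.h>

// Portable horizontal sum of the 8 float accumulators in a __m256
// (sketch; avoids the MSVC-only .m256_f32 member)
static inline float horizontal_sum_avx2(__m256 acc)
{
    float farr[8];
    _mm256_storeu_ps(farr, acc);   // Spill the 8 accumulators to memory
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += farr[i];
    return sum;
}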
And here is the AVX2 code that uses that fused expf-summation routine as one loop, with a multiply-by-scalar afterwards.
void aussie_vector_softmax_fused_exp_sum_mult_AVX2(float v[], int n)   // Softmax with EXP and SUM and MULT in AVX2
{
    yassert(n % 8 == 0);
    float denom = aussie_vector_fused_expf_sum_AVX2(v, n);  // Element-wise expf...
    if (denom == 0.0) {
        yassert(denom != 0.0);
        return;  // fail (should not occur)
    }
    float recip = 1.0f / denom;
    aussie_vector_multiply_scalar_AVX2(v, n, recip);
}
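The final call is to the book's AVX2 multiply-by-scalar routine from an earlier section. Its body isn't shown here, but a routine of that kind presumably looks something like this sketch (assuming the same n % 8 == 0 precondition; the actual aussie_vector_multiply_scalar_AVX2 may differ in detail):

#include <immintrin.h>

// Sketch of an AVX2 multiply-by-scalar kernel (hypothetical stand-in
// for aussie_vector_multiply_scalar_AVX2)
void vector_multiply_scalar_AVX2_sketch(float v[], int n, float c)
{
    __m256 cvec = _mm256_set1_ps(c);       // Broadcast the scalar to all 8 lanes
    for (int i = 0; i < n; i += 8) {       // Assumes n % 8 == 0
        __m256 r = _mm256_loadu_ps(&v[i]); // Load 8 floats
        r = _mm256_mul_ps(r, cvec);        // Multiply each by c
        _mm256_storeu_ps(&v[i], r);        // Store back in place
    }
}

Multiplying by the reciprocal 1.0f / denom, rather than dividing each element by denom, replaces n divisions with one division and n multiplications, which is the usual win since floating-point division is much slower than multiplication.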