Aussie AI
Softmax C++ Optimizations
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
The Softmax function is an inefficient computation by default because it has two scans of the vector and each scan does an expensive operation (exponentiation and division). First, the whole vector is scanned to compute the sum of the exponential of each value. Then the whole vector is re-scanned to divide the exponential of each element by this sum-of-exponentials.
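For reference, the formula being computed for each element is:

    softmax(v)[i] = expf(v[i]) / (expf(v[0]) + expf(v[1]) + ... + expf(v[n-1]))

The numerator is the exponential of each element, and the denominator is the sum-of-exponentials over the whole vector.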
Here is a naive implementation of Softmax in C++:
#include <math.h>  // Declare expf()
// ....

float aussie_vector_sum_of_exponentials(float v[], int n)
{
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        float e = expf(v[i]);
        yassert(e >= 0.0);
        sum += e;
    }
    return sum;
}

void aussie_vector_softmax_basic(float v[], int n)
{
    float denom = aussie_vector_sum_of_exponentials(v, n);  // Denominator...
    if (denom == 0.0) {
        yassert(denom != 0.0);
        return;  // fail (should not occur)
    }
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]) / denom;
    }
}
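As a quick usage sketch (the test harness and values below are illustrative additions, and yassert is assumed to be defined, e.g. as a simple assert wrapper), the basic version converts a vector of logits into probabilities in-place, and the results should sum to 1:

#include <stdio.h>

int main()
{
    float logits[4] = { 1.0f, 2.0f, 3.0f, 4.0f };  // Arbitrary example inputs
    aussie_vector_softmax_basic(logits, 4);        // Softmax in-place
    float total = 0.0f;
    for (int i = 0; i < 4; i++) {
        printf("%f\n", logits[i]);
        total += logits[i];
    }
    printf("Sum = %f\n", total);  // Should be approximately 1.0
    return 0;
}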
Reciprocal Multiplication. One simple speed improvement is to multiply by the reciprocal instead of using floating-point division:
float recip = 1.0f / denom;
for (int i = 0; i < n; i++) {
    v[i] = expf(v[i]) * recip;  // Scale by reciprocal multiplication
}

Common sub-expression elimination. If we look carefully, we can note that expf is called twice on v[i] for every element, and expf is quite an expensive mathematical function to be messing with. A faster method is to compute expf(v[i]) once and then leave it in the vector to use again, thereby avoiding the second expf call.
aussie_vector_expf(v, n);  // Element-wise expf...
float denom = aussie_vector_sum(v, n);  // Denominator...
// ....
float recip = 1.0f / denom;
for (int i = 0; i < n; i++) {
    v[i] *= recip;  // NOTE: v[i] is already expf'd
}
This uses “aussie_vector_expf” to exponentiate each element of the vector. A naive implementation is:
void aussie_vector_expf(float v[], int n)   // Apply EXPF (exponential) to each element
{
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);
    }
}
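The summation helper “aussie_vector_sum” is called above but not shown in this excerpt; a minimal version along the same lines would be:

float aussie_vector_sum(float v[], int n)   // Sum all elements of the vector
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum;
}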
Fused Loop Softmax. The above code has two initial loops doing exponentiation and summation. There's an opportunity for “loop fusion” here by merging the two loops in “aussie_vector_expf” and “aussie_vector_sum” into one loop that exponentiates and sums as it goes. The call becomes:
float denom = aussie_vector_expf_and_sum(v, n); // Exponentiate & sum
Here is how the fused “exponentiate-and-sum” function looks in simple C++, without any further optimizations:
float aussie_vector_expf_and_sum(float v[], int n)   // Apply EXPF (exponential) to each element
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);
        sum += v[i];
    }
    return sum;
}
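Putting these pieces together, the Softmax with loop fusion and reciprocal multiplication looks something like this (the function name aussie_vector_softmax_fused is just an illustrative name for this sketch):

void aussie_vector_softmax_fused(float v[], int n)
{
    float denom = aussie_vector_expf_and_sum(v, n);  // Exponentiate & sum (fused loop)
    if (denom == 0.0f) {
        yassert(denom != 0.0f);
        return;  // fail (should not occur)
    }
    float recip = 1.0f / denom;
    for (int i = 0; i < n; i++) {
        v[i] *= recip;  // Scale by reciprocal (v[i] is already expf'd)
    }
}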
More fusion? The Softmax algorithm still has two loops, but we have difficulty fusing the first part (“exponentiate and sum”) with the second part (“scale by the sum”). The second loop has to wait for the sum, which it uses as the scale factor (as a reciprocal). Hence, it doesn't work to “scale as we go,” because then we'd have to go back and re-scale the earlier vector elements anyway. I can't really see a good way to avoid this roadblock to fusion.