Aussie AI

25. Softmax

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“We must combine the toughness of the serpent
with the softness of the dove,
a tough mind and a tender heart.”

— Martin Luther King, Jr.

What is Softmax?

Softmax is a relatively simple component of the Transformer. There are no tensors or matrix multiplications. All it does is operate on a vector of numbers and change the numbers. It is a type of “normalization” (like BatchNorm or LayerNorm in the prior chapter) but Softmax is used for many different reasons in a Transformer.

The purpose of Softmax is to take a vector of calculated values and normalize them into probabilities in a new output vector. After Softmax, the output vector contains a new normalized set of values that all add up to 1, and they represent the probabilities of the token/word associated with each vector element.

The Softmax algorithm is basically:

  • Exponentiate each vector element.
  • Add up the exponentials.
  • Divide every vector element by this sum.

In fewer words, scale by the sum of the exponentials.

So, why do we need all the exponentials? The idea is that the input vector contains “logits,” which are logarithms of probabilities, so we are exponentiating each one to bring it out of log-domain into real-domain. Then with the division step, we are normalizing them all so that they are probabilities that total exactly 1.
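
In standard mathematical notation (where x_i denotes the i-th input logit and n the vector length), the Softmax formula is:

    \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}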

Inputs, Outputs and Dimensions

To understand Softmax, let's examine its inputs and outputs in more detail. Overall, Softmax is a “vector-to-vector” algorithm. The input and output vectors have the same dimension (for the output logits at the decoding step, this is the vocabulary size).

Softmax is not an “element-wise” vector operation. The change to each element in the vector depends on all the elements in the vector, not only on the current element. For example, it adds up all the elements to use as a scaling factor (after exponentiating them).

The input to Softmax is a vector of floating-point numbers containing logits. These are coming out of the model's calculation layers, and are a rough representation of word probabilities.

However, the input vector is a bit messy. Firstly, the values are in the “log-domain” rather than being real probabilities; in other words, they are the logarithms of probabilities. Secondly, the values are not “normalized,” so there are numbers outside the range 0...1, including large numbers and negatives. Thirdly, the numbers don't nicely add up to 1 the way disjoint probabilities should.

The output of Softmax is a beautiful vector that's perfect in every way, with harps playing softly in the background. The log-domain has been fixed by exponentiation. All of the numbers are scaled into 0...1, and they all add up to 1 in total like a good probability distribution of disjoint events. The output vector from Softmax fills every Statistician's heart with joy.
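
As a small worked example with illustrative numbers: for an input logits vector of (2.0, 1.0, 0.1), Softmax computes:

    Logits:         2.0     1.0     0.1
    Exponentials:   7.39    2.72    1.11    (sum = 11.21)
    Probabilities:  0.66    0.24    0.10    (sum = 1.0)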

Softmax and Temperature

One important use of Softmax is in the decoding step. At the end of each pass through the decoder, the Softmax function normalizes the logits before a decoding algorithm uses them to choose the next output token. As part of this method, the Softmax function is usually changed to a “scaled Softmax” that uses an extra parameter called the “temperature.”

What is the temperature? The purpose of the temperature parameter is as a hyper-parameter that influences the level of randomness or unpredictability in the output. A higher setting for temperature means that the decoder is more likely to output the lower-probability tokens (i.e., it has a fever and says silly stuff). If the temperature is low, the decoder is mostly going to output the highest probability token, meaning it is much less random (like a cold-hearted robot).

What is the value of the temperature? The temperature is a positive floating-point number, which may be between 0 and 1, or greater than 1. A temperature of exactly zero cannot be used, as it would cause divide-by-zero errors. A temperature of 1.0 doesn't change the Softmax function at all (i.e. it continues harmlessly without scaling). Since the logits are scaled by the reciprocal of the temperature, the effect is to make randomness higher with a larger temperature setting (so it runs “hotter” and gets more “bubbly”). A temperature below 1.0 spreads the logits further apart, which reduces the randomness of the output. A temperature above 1.0 contracts the logits towards each other, making the lower-probability tokens more likely to be chosen (although still with some randomness), thereby increasing output randomness.
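
As a rough sketch of where the temperature fits in (the function name here is illustrative, not code from later chapters), the logits are simply divided by the temperature before the usual exponentiate-and-normalize steps:

    #include <math.h>  // expf()

    // Illustrative sketch: Softmax with a temperature parameter.
    // Each logit is divided by the temperature (i.e. scaled by its
    // reciprocal) before the usual exponentiation and normalization.
    void softmax_with_temperature(float v[], int n, float temperature)
    {
        // temperature must be positive (zero would divide by zero)
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i] / temperature);  // scale, then exponentiate
            sum += v[i];
        }
        float recip = 1.0f / sum;
        for (int i = 0; i < n; i++) {
            v[i] *= recip;  // normalize so the outputs total 1
        }
    }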

Softmax C++ Optimizations

The Softmax function is an inefficient computation by default because it has two scans of the vector and each scan does an expensive operation (exponentiation and division). First, the whole vector is scanned to compute the sum of the exponential of each value. Then the whole vector is re-scanned to divide each vector element by this sum-of-exponentials.

Here is a naive implementation of Softmax in C++:

    #include <math.h>  // Declare expf()
    ....
    float aussie_vector_sum_of_exponentials(float v[], int n)
    {
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            float e = expf(v[i]);
            yassert(e >= 0.0);
            sum += e;
        }
        return sum;
    }

    void aussie_vector_softmax_basic(float v[], int n)
    {
        float denom = aussie_vector_sum_of_exponentials(v, n);  // Denominator...
        if (denom == 0.0) {
            yassert(denom != 0.0);
            return;  // fail (should not occur)
        }
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]) / denom;
        }
    }
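
For example, an illustrative call on three logits converts them into probabilities that total 1:

    float logits[3] = { 2.0f, 1.0f, 0.1f };  // illustrative input logits
    aussie_vector_softmax_basic(logits, 3);
    // logits[] now holds approximately { 0.66f, 0.24f, 0.10f }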

Reciprocal Multiplication. One simple speed improvement is to multiply by the reciprocal instead of using floating-point division:

    float recip = 1.0f / denom;
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]) * recip;   // Scale by reciprocal multiplication
    }

Common sub-expression elimination. If we look carefully, we can note that expf is called twice on v[i] for every element, and expf is quite an expensive mathematical function to be messing with. A faster method is to compute expf(v[i]) once and then leave it in the vector to use again, thereby avoiding the second expf call.

    aussie_vector_expf(v, n);  // Element-wise expf...
    float denom = aussie_vector_sum(v, n);  // Denominator...
    // ....
    float recip = 1.0f / denom;
    for (int i = 0; i < n; i++) {
        v[i] *= recip;  // NOTE: v[i] is already expf'd
    }

This uses “aussie_vector_expf” to exponentiate each element of the vector. A naive implementation is:

    void aussie_vector_expf(float v[], int n)   
    {
        // Apply EXPF (exponential) to each element
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]);
        }
    }

Fused Loop Softmax. The above code has two initial loops doing exponentiation and summation. There's an opportunity for “loop fusion” here by merging the two loops in “aussie_vector_expf” and “aussie_vector_sum” into one loop that exponentiates and sums as it goes. The call becomes:

   float denom = aussie_vector_expf_and_sum(v, n);  // Exponentiate & sum

This is how the fused version of “exponentiate-and-sum” looks in a simple C++ version without any optimizations:

    float aussie_vector_expf_and_sum(float v[], int n)   // Apply EXPF to each element and sum the results
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]);
            sum += v[i];
        }
        return sum;
    }

More fusion? The Softmax algorithm still has two loops, but we have difficulty fusing the first part (“exponentiate and sum”) with the second part (“scale by the sum”). The second loop has to wait for the completed sum, which it uses (as a reciprocal) for the scale factor. Hence, it doesn't work to “scale as we go,” because then we'd have to go back and re-scale the earlier vector elements anyway. I can't really see a good way to avoid this roadblock to fusion.

Vectorized Softmax

The Softmax code has two loops that run sequentially: summing the exponentials, and scaling by the sum's reciprocal. Both loops are candidates for vectorization. The only real problem is we can't fuse the two loops into one, because the second loop needs the result of the first loop as the scaling factor.

Second things first. The second loop is easy to vectorize because it's just multiplying a vector by a scalar. The second loop does not have any exponentiation, because the first loop has stored the exponentiated values in the vector, so there is only a scaling multiplication by the reciprocal.

    for (int i = 0; i < n; i++) {
        v[i] *= recip;  // NOTE: v[i] is already expf'd
    }

Vectorizing exponentials. The first loop has exponentiation and also summing of the results. That sounds like it's going to be expensive, but fast vectorized exponentials are readily available. Although x86 has no single SIMD opcode for the exponential, compilers such as the Intel compiler and recent MSVC provide SVML-style AVX intrinsic functions that compute “expf” across all the SIMD lanes at once.

Vectorized Softmax with AVX

The AVX intrinsics use x86 SIMD instructions to operate on multiple float values or integers at once (e.g. 4 float values for AVX-1, 8 float values for AVX-2, 16 float values for AVX-512). Surprisingly, there are AVX SIMD exponential function intrinsics, to apply “expf” to multiple elements of a vector in parallel.

Example: Softmax with AVX exponential and summation. We can vectorize both these loops separately using AVX intrinsics. Vectorized versions of expf and summation were examined in the hardware acceleration chapter. The version for AVX1 becomes:

    void aussie_vector_softmax_exponentiate_and_sum_AVX1(float v[], int n)
    {
        yassert(n % 4 == 0);
        aussie_vector_expf_AVX1(v, n);  // AVX1-accelerated expf...
        float denom = aussie_vector_sum_AVX1(v, n);  // AVX1-accelerated sum
        if (denom == 0.0) {
            yassert(denom != 0.0);
            return;  // fail (should not occur)
        }
        float recip = 1.0f / denom;
        for (int i = 0; i < n; i++) {
            v[i] *= recip;  // NOTE: v[i] is already expf'd
        }
    }
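
For reference, here is a minimal sketch of what an AVX1 (128-bit) vectorized expf helper could look like, assuming the compiler provides the SVML-style _mm_exp_ps intrinsic (e.g. the Intel compiler or recent MSVC); the book's own aussie_vector_expf_AVX1 routine appears in the hardware acceleration chapter:

    #include <immintrin.h>  // AVX intrinsics (SVML exp where supported)

    // Sketch only: apply expf to each element, 4 floats at a time.
    void vector_expf_avx1_sketch(float v[], int n)
    {
        // assumes n is a multiple of 4
        for (int i = 0; i < n; i += 4) {
            __m128 r = _mm_loadu_ps(&v[i]);  // load 4 floats into 128 bits
            __m128 e = _mm_exp_ps(r);        // SVML exponential intrinsic
            _mm_storeu_ps(&v[i], e);         // store the 4 results back
        }
    }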

Actually, that's only vectorized two of the three loops; the final multiply-by-scalar is still a plain sequential C++ loop. The next section shows an AVX2 version where that third loop is also done with AVX, using the multiply-by-scalar routine from the vectorization chapter.

Vectorized & Fused Loop Softmax

What about vectorization applied to these fused loops? Can we do better than using the two vectorized loops in sequence? Can we merge the exponentiation and summation into a single loop and vectorize that using AVX intrinsics? I'm just teasing you. Of course, we can!

Here is “kernel fusion” of the vector expf and vector summation into a fused-expf-summation kernel. I coded this for both AVX1 and AVX2, with both very similar in structure. Here is the code for AVX2:

    float aussie_vector_fused_expf_sum_AVX2(float v[], int n)   
    {
        // Fused EXPF and SUMMATION of a single vector
        if (n % 8 != 0) { // Safety check (no extra cases)
            yassert(n % 8 == 0);
            return 0.0; // fail
        }
        __m256 sumdst = _mm256_setzero_ps();   // Set accumulators to zero
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load floats into 256-bits
            __m256 expdst = _mm256_exp_ps(r1);    // Exponentiate (expf)
            sumdst = _mm256_add_ps(expdst, sumdst); // SUM = SUM + V
        }
        // Add the final 8 accumulators manually
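        // (note: the m256_f32 member used below is MSVC-specific; on other
        //  compilers, store the register into a local float array instead)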
        float* farr = sumdst.m256_f32;
        float sum = farr[0] + farr[1] + farr[2] + farr[3]
                  + farr[4] + farr[5] + farr[6] + farr[7];
        return sum;
    }

And here is the AVX2 Softmax code that uses that fused expf-summation routine as its first loop, with an AVX2 multiply-by-scalar afterwards:

    void aussie_vector_softmax_fused_exp_sum_mult_AVX2(float v[], int n) 
    {
        // Softmax with EXP and SUM and MULT in AVX2
        yassert(n % 8 == 0);
        float denom = aussie_vector_fused_expf_sum_AVX2(v, n);  // Element-wise expf...
        if (denom == 0.0) {
            yassert(denom != 0.0);
            return;  // fail (should not occur)
        }
        float recip = 1.0f / denom;
        aussie_vector_multiply_scalar_AVX2(v, n, recip);
    }
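
The multiply-by-scalar helper comes from the vectorization chapter; a minimal sketch of what such an AVX2 routine can look like (illustrative name and code, not necessarily the book's exact version) is:

    // Sketch only: multiply every element by a scalar, 8 floats at a time.
    void vector_multiply_scalar_avx2_sketch(float v[], int n, float c)
    {
        // assumes n is a multiple of 8
        __m256 cvec = _mm256_set1_ps(c);        // broadcast the scalar
        for (int i = 0; i < n; i += 8) {
            __m256 r = _mm256_loadu_ps(&v[i]);  // load 8 floats
            r = _mm256_mul_ps(r, cvec);         // multiply by the scalar
            _mm256_storeu_ps(&v[i], r);         // store the 8 results back
        }
    }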

Softmax Benchmarking Results

Here are the results from benchmarking 100,000 calls to the various Softmax versions on a vector with 1024 elements, covering both the sequential and the AVX parallel versions.

    Softmax benchmarks (N=1024, ITER=100000)
    Softmax basic: 13186 ticks (13.19 seconds)
    Softmax reciprocal: 12986 ticks (12.99 seconds)
    Softmax expf-first: 6977 ticks (6.98 seconds)
    Softmax expf-sum-fused: 6682 ticks (6.68 seconds)
    Softmax expf with AVX1: 1095 ticks (1.09 seconds)
    Softmax expf/sum AVX1: 910 ticks (0.91 seconds)
    Softmax fused expf/sum AVX1: 1095 ticks (1.09 seconds)
    Softmax fused expf/sum/mult AVX1: 831 ticks (0.83 seconds)
    Softmax expf with AVX2: 538 ticks (0.54 seconds)
    Softmax expf/sum AVX2: 306 ticks (0.31 seconds)
    Softmax fused expf/sum AVX2: 252 ticks (0.25 seconds)
    Softmax fused expf/sum/mult AVX2: 176 ticks (0.18 seconds)

Interestingly, fusing the expf and summation kernels was actually worse for AVX1, but it was faster for AVX2. Otherwise, our speedups were as we would expect, with the triple-AVX optimizations of expf, summation, and multiply-by-scalar (reciprocal) getting the best results by far. The triple-vectorized AVX2 version is 73 times faster than the naive C++ sequential version, using about 1.4% of its CPU time cost. And we haven't even tried AVX-512 optimization yet!

Softmax Overflow and Underflow

Note that a simplified version of Softmax has been used in the code examples of this chapter for ease of explanation. Not only is this computation still quite slow (even with the redundant expf calls removed), it's also prone to overflow and underflow. The real computation of Softmax needs to be rearranged algebraically and scaled to avoid these problems.
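
As a minimal sketch of the standard fix (the function name is illustrative, and this is not one of the benchmarked versions above): subtracting the maximum element from every logit before exponentiating leaves the mathematical result unchanged, but keeps every expf argument at or below zero, which avoids overflow:

    #include <math.h>  // expf()

    // Sketch only: numerically safer Softmax via max-subtraction.
    void softmax_stable_sketch(float v[], int n)
    {
        float vmax = v[0];
        for (int i = 1; i < n; i++) {
            if (v[i] > vmax) vmax = v[i];  // find the maximum element
        }
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i] - vmax);  // arguments <= 0, so expf(...) <= 1
            sum += v[i];
        }
        float recip = 1.0f / sum;
        for (int i = 0; i < n; i++) {
            v[i] *= recip;  // normalize to probabilities summing to 1
        }
    }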

This scaled computation can then be optimized using many of the same methods as for the naive Softmax version, as above. Further optimizations may include the use of calls to hardware acceleration APIs, pre-computed lookup tables as approximations for the expf function, and converting the loops to pointer arithmetic.

Softmax Optimization Research

The Softmax function is a significant cost in Transformer inference because it is part of the attention mechanism, whereas it was less of a bottleneck in earlier neural network architectures. A vanilla Softmax implementation is very expensive because it involves computing the exponentials of all of the elements of the logits vector. Various attempts have been made to optimize and approximate Softmax calculations, including:

  • Softmax code optimizations (sequential)
  • Vectorized Softmax (parallelization)
  • Softmax approximations
  • Integer-only Softmax
  • Pruned Softmax (removal)
  • Fused Softmax (kernel fusion)
  • Softmax replacements (use different functions)

Related Research Areas: Note that there are several other areas of theory that are relevant to Softmax optimizations and approximations. The denominator of the Softmax formula is a “sum of exponentials,” and this type of calculation also appears in Logarithmic Number System (LNS) addition. The sum-of-exponentials calculation also appears in “log-sum-exp networks,” which are somewhat related to “tropical algebra.” The area of “max-plus networks” may also be relevant to Softmax approximation research.

 

