Aussie AI

Vectorization of Exponentiation

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


The expf function is very expensive to call, but exponentiation of entire vectors of float values is required in several parts of AI engines, such as activation functions and Softmax normalization. Helpfully, there are AVX intrinsics for SIMD exponentiation operations on small vectors (i.e., 4 float values for AVX-1 and 8 float values for AVX-2). Note that these intrinsics are not single x86 opcodes: the compiler expands them into short routines from its SIMD math library (Intel's SVML, also supported by MSVC).

The basic C++ version to apply expf to every element of a vector, and store the result in the original vector, looks like this:

    void aussie_vector_expf(float v[], int n)   
    {
        // Apply EXPF (exponential) to each element
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]);
        }
    }

Loop pointer arithmetic: Applying the basic C++ optimization of pointer arithmetic, the new code is:

    void aussie_vector_expf_pointer_arith(float v[], int n)
    {
        for (; n > 0; n--, v++) {
            *v = expf(*v);
        }
    }
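Both scalar versions should produce identical results, since they make exactly the same expf calls in the same order. A quick sanity check (the helper below is hypothetical, not from the book) confirms the pointer-arithmetic rewrite is a pure refactoring:

```cpp
#include <cmath>

// Basic version: apply expf to each element in place.
void aussie_vector_expf(float v[], int n)
{
    for (int i = 0; i < n; i++) v[i] = expf(v[i]);
}

// Pointer-arithmetic version.
void aussie_vector_expf_pointer_arith(float v[], int n)
{
    for (; n > 0; n--, v++) *v = expf(*v);
}

// Hypothetical helper: run both versions on copies of the same data
// and check the outputs match bitwise (same expf calls, same order).
bool expf_versions_match(const float* src, int n)
{
    float a[64], b[64];                // sketch assumes n <= 64
    for (int i = 0; i < n; i++) a[i] = b[i] = src[i];
    aussie_vector_expf(a, n);
    aussie_vector_expf_pointer_arith(b, n);
    for (int i = 0; i < n; i++) {
        if (a[i] != b[i]) return false;
    }
    return true;
}
```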

AVX1 SIMD exponentiation of 4 values: There is an AVX intrinsic called “_mm_exp_ps” that exponentiates 4 float values in parallel using the 128-bit registers. Here's the new vector exponentiation code, processing 4 elements per loop iteration with AVX1:

    void aussie_vector_expf_AVX1(float v[], int n)
    {
        // Assumes n is a multiple of 4
        for (int i = 0; i < n; i += 4) {
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned)
            __m128 dst = _mm_exp_ps(r1);       // Exponentiate all 4 (expf)
            _mm_storeu_ps(&v[i], dst);         // Store 4 results (unaligned)
        }
    }
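The loop above assumes n is a multiple of 4, and _mm_exp_ps is supplied by the compiler's SVML library (Intel compilers and MSVC), not by GCC or Clang. A hedged sketch that handles leftover elements and falls back to scalar expf when SVML is unavailable (the macro guards below are one plausible way to detect it):

```cpp
#include <cmath>
#if defined(__AVX__) && (defined(__INTEL_COMPILER) || defined(_MSC_VER))
#include <immintrin.h>
#endif

void aussie_vector_expf_AVX1_safe(float v[], int n)
{
    int i = 0;
#if defined(__AVX__) && (defined(__INTEL_COMPILER) || defined(_MSC_VER))
    for (; i + 4 <= n; i += 4) {           // Main SIMD loop, 4 at a time
        __m128 r1 = _mm_loadu_ps(&v[i]);   // Unaligned load of 4 floats
        __m128 dst = _mm_exp_ps(r1);       // SVML exponentiation
        _mm_storeu_ps(&v[i], dst);         // Unaligned store of 4 results
    }
#endif
    for (; i < n; i++) {                   // Leftovers (or full scalar fallback)
        v[i] = expf(v[i]);
    }
}
```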

AVX2 SIMD exponentiation of 8 values: The AVX2 intrinsic is “_mm256_exp_ps”, which exponentiates 8 elements in parallel using the 256-bit registers. The new code, processing 8 values per loop iteration with AVX2 intrinsics, becomes:

    void aussie_vector_expf_AVX2(float v[], int n)
    {
        // Assumes n is a multiple of 8
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned)
            __m256 dst = _mm256_exp_ps(r1);       // Exponentiate all 8 (expf)
            _mm256_storeu_ps(&v[i], dst);         // Store 8 results (unaligned)
        }
    }
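Since _mm256_exp_ps has the same SVML availability caveat, one option is a small compile-time dispatch wrapper that picks the widest usable path and always finishes with a scalar tail. This is a sketch under the assumption that the macro guards below correctly identify SVML support:

```cpp
#include <cmath>
#if (defined(__AVX__) || defined(__AVX2__)) && (defined(__INTEL_COMPILER) || defined(_MSC_VER))
#include <immintrin.h>
#endif

void aussie_vector_expf_dispatch(float v[], int n)
{
    int i = 0;
#if defined(__AVX2__) && (defined(__INTEL_COMPILER) || defined(_MSC_VER))
    for (; i + 8 <= n; i += 8) {              // 8 floats per iteration
        __m256 r1 = _mm256_loadu_ps(&v[i]);
        _mm256_storeu_ps(&v[i], _mm256_exp_ps(r1));
    }
#elif defined(__AVX__) && (defined(__INTEL_COMPILER) || defined(_MSC_VER))
    for (; i + 4 <= n; i += 4) {              // 4 floats per iteration
        __m128 r1 = _mm_loadu_ps(&v[i]);
        _mm_storeu_ps(&v[i], _mm_exp_ps(r1));
    }
#endif
    for (; i < n; i++) v[i] = expf(v[i]);     // Scalar tail (or full fallback)
}
```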

Benchmarking results. The results of optimizing exponentiation are striking! AVX1 is massively faster, cutting out about 96% of the original computation time, and AVX2 roughly halves it again. It's almost like parallel hardware is faster than sequential software. Who knew?

    Vector-exponentiation operation benchmarks (N=1024, ITER=100000):
    Vector expf basic: 6695 ticks (6.70 seconds)
    Vector expf pointer-arith: 6395 ticks (6.39 seconds)
    Vector expf AVX1: 260 ticks (0.26 seconds)
    Vector expf AVX2: 124 ticks (0.12 seconds)
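The timing harness itself is not shown in this excerpt; a minimal sketch (the harness name and use of std::chrono rather than the book's tick counter are assumptions) looks like:

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Basic version from above: apply expf to each element in place.
void aussie_vector_expf(float v[], int n)
{
    for (int i = 0; i < n; i++) v[i] = expf(v[i]);
}

// Hypothetical harness: time `iters` calls of `fn` on an n-element vector,
// returning elapsed wall-clock seconds.
double benchmark_vector_op(void (*fn)(float[], int), int n, int iters)
{
    std::vector<float> v(n, 0.001f);    // Arbitrary starting data
    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; it++) {
        fn(v.data(), n);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A call such as benchmark_vector_op(aussie_vector_expf, 1024, 100000) mirrors the N=1024, ITER=100000 setup of the numbers above.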

 
