Aussie AI

Vectorization of Exponentiation

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.
The expf function is very expensive to call, but exponentiation of entire vectors of float values is required in several parts of AI engines, such as activation functions and Softmax normalization. Surprisingly, on x86 there are AVX intrinsics for SIMD exponentiation of small vectors (4 float values using AVX's 128-bit registers, and 8 float values using AVX2's 256-bit registers); these exponentiation intrinsics are provided via Intel's Short Vector Math Library (SVML) support in the compiler, rather than mapping to a single hardware opcode.
The basic C++ version to apply expf to every element of a vector, and store the result in the original vector, looks like this:

void aussie_vector_expf(float v[], int n)   // Apply EXPF (exponential) to each element
{
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);
    }
}
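For context, here is a sketch of how such a vector exponentiation routine is used inside Softmax normalization (one of the engine hot spots mentioned above). The max-subtraction step is the standard numerical-stability trick, not necessarily how the book's engine implements it, and softmax_sketch is a hypothetical name for illustration:

```cpp
#include <cmath>

// Basic vector exponentiation, as above.
static void aussie_vector_expf(float v[], int n)
{
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);
    }
}

// Sketch: Softmax normalization built on vector exponentiation.
// Subtracting the maximum first prevents expf() overflow for large inputs.
void softmax_sketch(float v[], int n)
{
    float mx = v[0];
    for (int i = 1; i < n; i++) {
        if (v[i] > mx) mx = v[i];
    }
    for (int i = 0; i < n; i++) {
        v[i] -= mx;                 // Shift so the largest element is 0
    }
    aussie_vector_expf(v, n);       // The vectorizable hot spot
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += v[i];
    }
    for (int i = 0; i < n; i++) {
        v[i] /= sum;                // Normalize to probabilities summing to 1
    }
}
```

Because the expf loop dominates the cost of this function, speeding it up with SIMD intrinsics speeds up the whole Softmax.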
Loop pointer arithmetic. Applying the basic C++ optimization of pointer arithmetic, the new code is:

void aussie_vector_expf_pointer_arith(float v[], int n)
{
    for (; n > 0; n--, v++) {
        *v = expf(*v);
    }
}
AVX1 SIMD exponentiation of 4 values: There is an AVX intrinsic called “_mm_exp_ps” that exponentiates 4 float values in parallel using the 128-bit registers.
Here's the new vector exponentiation code with loop unrolling every 4 elements and AVX1 vectorization (note that it assumes n is a multiple of 4):

void aussie_vector_expf_AVX1(float v[], int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats into a 128-bit register
        __m128 dst = _mm_exp_ps(r1);       // Exponentiate (expf) all 4 lanes
        _mm_storeu_ps(&v[i], dst);         // Store 4 floats back (unaligned version, matching the load)
    }
}
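The unaligned load and store intrinsics work on any address, but the faster aligned variants (e.g. _mm_store_ps) require 16-byte alignment, and AVX2's 256-bit versions require 32 bytes. A sketch of allocating suitably aligned vector storage with C++17's std::aligned_alloc follows; the helper name is hypothetical, and note that MSVC does not provide std::aligned_alloc (it offers _aligned_malloc instead):

```cpp
#include <cstdlib>
#include <cstdint>

// Sketch: allocate a float vector aligned for SSE/AVX aligned loads and
// stores. 32-byte alignment satisfies both 128-bit and 256-bit accesses.
// std::aligned_alloc requires the size to be a multiple of the alignment,
// so the byte count is rounded up. Free the result with std::free().
float* alloc_aligned_vector(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    size_t rounded = (bytes + 31u) & ~(size_t)31u;   // round up to multiple of 32
    return (float*)std::aligned_alloc(32, rounded);
}
```

With storage allocated this way, the aligned store intrinsics can be used safely in the SIMD loops.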
AVX2 SIMD exponentiation of 8 values: The AVX2 intrinsic is “_mm256_exp_ps”, which exponentiates 8 elements in parallel using the 256-bit registers.
The new code with loop unrolling every 8 values and AVX2 intrinsics becomes:

void aussie_vector_expf_AVX2(float v[], int n)   // Apply EXPF (exponential) to each element
{
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats into a 256-bit register
        __m256 dst = _mm256_exp_ps(r1);       // Exponentiate (expf) all 8 lanes
        _mm256_storeu_ps(&v[i], dst);         // Store 8 floats back (unaligned version, matching the load)
    }
}
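Both SIMD versions above silently assume n is a multiple of the vector width. A common pattern handles arbitrary lengths by processing full blocks with intrinsics and finishing the remainder with scalar code. In this sketch, plain expf stands in for the _mm256_exp_ps block body so that it compiles without SVML support; the function name is hypothetical:

```cpp
#include <cmath>

// Sketch: vector exponentiation for arbitrary n, with a scalar tail.
void aussie_vector_expf_with_tail(float v[], int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        // With SVML support, this block body would be:
        //   load 8 floats (_mm256_loadu_ps), _mm256_exp_ps, store 8 floats.
        for (int j = 0; j < 8; j++) {
            v[i + j] = expf(v[i + j]);
        }
    }
    for (; i < n; i++) {           // Scalar tail: the remaining n % 8 elements
        v[i] = expf(v[i]);
    }
}
```

The tail loop runs at most 7 iterations, so its scalar cost is negligible for large vectors.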
Benchmarking results. The results of optimization of exponentiation are striking! AVX1 is massively faster, cutting out about 96% of the original computation time, and AVX2 is faster still. It's almost like hardware is faster than software. Who knew?
Vector-exponentiation operation benchmarks (N=1024, ITER=100000):

Vector expf basic:          6695 ticks (6.70 seconds)
Vector expf pointer-arith:  6395 ticks (6.39 seconds)
Vector expf AVX1:            260 ticks (0.26 seconds)
Vector expf AVX2:            124 ticks (0.12 seconds)
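A timing harness along these lines can reproduce such comparisons; this is an assumed sketch using std::chrono (the book's harness and its tick counts are not reproduced by this code), with the same N=1024 vector size:

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Basic vector exponentiation, as above.
static void aussie_vector_expf(float v[], int n)
{
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);
    }
}

// Sketch: time repeated calls to a vector expf routine, returning seconds.
// The vector is reset each iteration so repeated exponentiation doesn't
// overflow the float values to infinity.
double benchmark_expf(int n, int iterations)
{
    std::vector<float> v(n, 0.5f);
    auto start = std::chrono::steady_clock::now();
    for (int iter = 0; iter < iterations; iter++) {
        aussie_vector_expf(v.data(), n);
        v.assign(n, 0.5f);   // reset input (adds a small constant overhead)
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

To compare implementations fairly, run each candidate with the same n and iteration count on an otherwise idle machine; absolute numbers will vary with CPU and compiler flags.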