Vectorized & Fused Loop Softmax
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
What about vectorization applied to these fused loops? Can we do better than using the two vectorized loops in sequence? Can we merge the exponentiation and summation into a single unrolled loop and vectorize that using AVX intrinsics? I'm just teasing you. Of course, we can!
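For context, the un-vectorized version of this fusion is just one pass that exponentiates each element in place and accumulates the sum as it goes. Here's a minimal scalar sketch of that idea (illustrative only, not the book's own listing):

#include <math.h>

// Scalar sketch of fused expf + summation (one pass, illustrative)
float fused_expf_sum_scalar(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        v[i] = expf(v[i]);  // Exponentiate in place
        sum += v[i];        // Accumulate the denominator in the same pass
    }
    return sum;
}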
Here is “kernel fusion” of the vector expf and vector summation into a fused expf-summation kernel.
I coded this for both AVX1 and AVX2, and the two versions are very similar in structure. Here is the code for AVX2:
float aussie_vector_fused_expf_sum_AVX2(float v[], int n)   // Fused EXPF and SUMMATION of a single vector
{
    if (n % 8 != 0) {  // Safety check (no extra cases)
        yassert(n % 8 == 0);
        return 0.0;  // fail
    }
    __m256 sumdst = _mm256_setzero_ps();  // Set accumulators to zero
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);       // Load floats into 256-bits
        __m256 expdst = _mm256_exp_ps(r1);        // Exponentiate (expf)
        sumdst = _mm256_add_ps(expdst, sumdst);   // SUM = SUM + V
    }
    // Add the final 8 accumulators manually
    float* farr = sumdst.m256_f32;
    float sum = farr[0] + farr[1] + farr[2] + farr[3]
              + farr[4] + farr[5] + farr[6] + farr[7];
    return sum;
}
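One portability note: the member access sumdst.m256_f32 relies on the MSVC-specific layout of the __m256 type, and _mm256_exp_ps is an SVML intrinsic (available with the Microsoft and Intel compilers, but not in base GCC or Clang). A more portable way to finish the horizontal sum is to spill the register to a temporary array; here is a sketch, assuming the rest of the routine is unchanged:

#include <immintrin.h>

// Portable horizontal sum of the 8 float accumulators in a __m256
// (sketch; avoids the MSVC-only .m256_f32 member)
static inline float horizontal_sum_avx2(__m256 acc)
{
    float farr[8];
    _mm256_storeu_ps(farr, acc);   // Spill the 8 accumulators to memory
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += farr[i];
    return sum;
}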
And here is the AVX2 code that uses that fused expf-summation routine as one loop, with a multiply-by-scalar afterwards.
void aussie_vector_softmax_fused_exp_sum_mult_AVX2(float v[], int n)   // Softmax with EXP and SUM and MULT in AVX2
{
    yassert(n % 8 == 0);
    float denom = aussie_vector_fused_expf_sum_AVX2(v, n);  // Element-wise expf...
    if (denom == 0.0) {
        yassert(denom != 0.0);
        return;  // fail (should not occur)
    }
    float recip = 1.0f / denom;
    aussie_vector_multiply_scalar_AVX2(v, n, recip);
}
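The final call is to the book's AVX2 multiply-by-scalar routine from an earlier section. Its body isn't shown here, but a routine of that kind presumably looks something like this sketch (assuming the same n % 8 == 0 precondition; the actual aussie_vector_multiply_scalar_AVX2 may differ in detail):

#include <immintrin.h>

// Sketch of an AVX2 multiply-by-scalar kernel (hypothetical stand-in
// for aussie_vector_multiply_scalar_AVX2)
void vector_multiply_scalar_AVX2_sketch(float v[], int n, float c)
{
    __m256 cvec = _mm256_set1_ps(c);       // Broadcast the scalar to all 8 lanes
    for (int i = 0; i < n; i += 8) {       // Assumes n % 8 == 0
        __m256 r = _mm256_loadu_ps(&v[i]); // Load 8 floats
        r = _mm256_mul_ps(r, cvec);        // Multiply each by c
        _mm256_storeu_ps(&v[i], r);        // Store back in place
    }
}

Multiplying by the reciprocal 1.0f / denom, rather than dividing each element by denom, replaces n divisions with one division and n multiplications, which is the usual win since floating-point division is much slower than multiplication.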