Aussie AI
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Vectorized RELU with Max Intrinsics
The RELU activation function simply converts negatives to zero, leaving positives unchanged. This is algebraically equivalent to max(x,0), which can be implemented in AVX as a "max-scalar" operation.
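For comparison, a plain scalar version of RELU over a vector is just a loop that clamps each negative element to zero. The helper name below follows the naming style of the other functions in this book, but it is a hypothetical sketch rather than code from the original source:

// Hypothetical scalar baseline (illustrative only)
void aussie_vector_reluize_basic(float v[], int n)
{
    for (int i = 0; i < n; i++) {
        if (v[i] < 0.0f) v[i] = 0.0f;   // RELU: max(v[i], 0.0)
    }
}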
To vectorize RELU applied to a whole vector of float elements, we are effectively doing a SIMD max operation against a scalar zero (i.e., 0.0). Hence, the code is very similar to the vectorization of add-scalar, but uses the "_mm_max_ps" intrinsic.
The AVX1 version of vectorized RELU looks like:
void aussie_vector_reluize_AVX1(float v[], int n)   // Apply RELU to each element (sets negatives to zero)
{
    if (n % 4 != 0) {
        yassert(n % 4 == 0);
        return;  // fail
    }
    const __m128 rzeros = _mm_set1_ps(0.0f);     // Set up vector full of zeros...
    for (int i = 0; i < n; i += 4) {
        __m128 r1 = _mm_loadu_ps(&v[i]);         // Load floats into 128-bits
        __m128 dst = _mm_max_ps(r1, rzeros);     // MAX(r1, 0)
        _mm_storeu_ps(&v[i], dst);               // Store back to floats (unaligned store to match the unaligned load)
    }
}
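Here is a minimal usage sketch, assuming the function above is compiled in the same translation unit with AVX support enabled and <immintrin.h> included; the only requirement is that the array length is a multiple of 4:

#include <immintrin.h>
#include <cstdio>

int main()
{
    float v[8] = { -1.5f, 2.0f, -0.25f, 3.0f, 0.0f, -7.0f, 4.5f, -0.5f };
    aussie_vector_reluize_AVX1(v, 8);                  // 8 is a multiple of 4
    for (int i = 0; i < 8; i++) printf("%g ", v[i]);   // prints: 0 2 0 3 0 0 4.5 0
    printf("\n");
    return 0;
}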
And here is the AVX2 version, which processes 8 float elements at a time using the "_mm256_max_ps" intrinsic:
void aussie_vector_reluize_AVX2(float v[], int n)   // Apply RELU to each element (sets negatives to zero)
{
    if (n % 8 != 0) {
        yassert(n % 8 == 0);
        return;  // fail
    }
    const __m256 rzeros = _mm256_set1_ps(0.0f);     // Vector full of zeros...
    for (int i = 0; i < n; i += 8) {
        __m256 r1 = _mm256_loadu_ps(&v[i]);         // Load floats into 256-bits
        __m256 dst = _mm256_max_ps(r1, rzeros);     // MAX(r1, 0)
        _mm256_storeu_ps(&v[i], dst);               // Store back to floats (unaligned store to match the unaligned load)
    }
}
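Both versions simply fail via the assertion when n is not a multiple of the SIMD width. One way to remove that restriction, sketched below as a hypothetical variant rather than code from the book, is to follow the AVX2 body with a scalar tail loop for the leftover elements:

#include <immintrin.h>

// Hypothetical variant: AVX2 body plus a scalar tail, so n need not be a multiple of 8
void aussie_vector_reluize_AVX2_anylen(float v[], int n)
{
    const __m256 rzeros = _mm256_set1_ps(0.0f);   // Vector full of zeros
    int i = 0;
    for (; i + 8 <= n; i += 8) {                  // Vectorized body: 8 floats per iteration
        __m256 r1 = _mm256_loadu_ps(&v[i]);
        _mm256_storeu_ps(&v[i], _mm256_max_ps(r1, rzeros));
    }
    for (; i < n; i++) {                          // Scalar tail: remaining 0..7 elements
        if (v[i] < 0.0f) v[i] = 0.0f;
    }
}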