Vectorized Multiply Vector by Scalar

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.
Multiplying a vector by a scalar is a common requirement when applying scaling vectors. Division by a scalar can be handled the same way, by multiplying by the reciprocal (e.g., as needed for Softmax). Multiplication by a scalar is amenable to vectorization because the naive C++ version is very simple:
void aussie_vector_multiply_scalar(float v[], int n, float c)
{
    // Multiply all vector elements by constant
    for (int i = 0; i < n; i++) {
        v[i] *= c;
    }
}
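As a concrete example of the reciprocal trick mentioned above, division of a vector by a scalar can be coded as one multiplication per element, computing the reciprocal only once. This is a minimal sketch, not from the original text; the function name merely follows the naming convention above:

void aussie_vector_divide_scalar(float v[], int n, float c)
{
    // Divide all vector elements by a constant (hypothetical helper)
    float recip = 1.0f / c;     // Compute the reciprocal once
    for (int i = 0; i < n; i++) {
        v[i] *= recip;          // Multiply is cheaper than divide
    }
}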
Loop Pointer Arithmetic. First, we can try the basic C++ optimization of pointer arithmetic:
void aussie_vector_multiply_scalar_pointer_arith(float v[], int n, float c)
{
    // Multiply all vector elements by constant
    for (; n > 0; n--, v++) {
        *v *= c;
    }
}
AVX1 multiply-by-scalar:
There is no special scalar-multiplication opcode in AVX or AVX-2, but we can populate a constant register (128-bit or 256-bit) with multiple copies of the scalar (i.e., _mm_set1_ps or _mm256_set1_ps), and we need do this only once. We can then use the SIMD multiply intrinsics in the unrolled loop section. The AVX 128-bit vector multiplication by scalar becomes:
void aussie_vector_multiply_scalar_AVX1(float v[], int n, float c)
{
    const __m128 rscalar = _mm_set1_ps(c);      // Vector of 4 copies of the scalar
    for (int i = 0; i < n; i += 4) {            // Assumes n is a multiple of 4
        __m128 r1 = _mm_loadu_ps(&v[i]);        // Load 4 floats (unaligned)
        __m128 dst = _mm_mul_ps(r1, rscalar);   // Multiply by scalars
        _mm_storeu_ps(&v[i], dst);              // Store 4 floats (unaligned)
    }
}
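Note that the loop above processes 4 elements per iteration and assumes n is a multiple of 4. If that is not guaranteed, one common approach is a scalar cleanup loop for the leftover elements. This is a sketch under that assumption, not from the original text; the function name is illustrative:

#include <immintrin.h>   // AVX intrinsics

void aussie_vector_multiply_scalar_AVX1_any_n(float v[], int n, float c)
{
    const __m128 rscalar = _mm_set1_ps(c);      // Vector of 4 copies of the scalar
    int i = 0;
    for (; i + 4 <= n; i += 4) {                // Main SIMD loop, 4 floats at a time
        __m128 r1 = _mm_loadu_ps(&v[i]);        // Load 4 floats (unaligned)
        __m128 dst = _mm_mul_ps(r1, rscalar);   // Multiply by scalars
        _mm_storeu_ps(&v[i], dst);              // Store 4 floats (unaligned)
    }
    for (; i < n; i++) {
        v[i] *= c;                              // Scalar cleanup for leftovers
    }
}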
AVX2 multiply-by-scalar:
Even faster is to use 8 parallel multiplications with AVX-2's 256-bit registers. The AVX-1 version is simply changed to use the “__m256” type and the analogous AVX-2 intrinsics. The new code looks like:
void aussie_vector_multiply_scalar_AVX2(float v[], int n, float c)
{
    const __m256 rscalar = _mm256_set1_ps(c);       // Vector of 8 copies of the scalar
    for (int i = 0; i < n; i += 8) {                // Assumes n is a multiple of 8
        __m256 r1 = _mm256_loadu_ps(&v[i]);         // Load 8 floats (unaligned)
        __m256 dst = _mm256_mul_ps(r1, rscalar);    // Multiply by scalars
        _mm256_storeu_ps(&v[i], dst);               // Store 8 floats (unaligned)
    }
}
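If the vector's memory is guaranteed to be 32-byte aligned, the faster aligned intrinsics (_mm256_load_ps and _mm256_store_ps) can be used instead of the unaligned versions. One way to guarantee this is at allocation time. This is a sketch using the standard _mm_malloc/_mm_free intrinsics; the helper names are illustrative, not from the original text:

#include <immintrin.h>

float* aussie_vector_alloc_aligned(int n)
{
    // 32-byte alignment matches the 256-bit AVX-2 registers
    return (float*)_mm_malloc(n * sizeof(float), 32);
}

void aussie_vector_free_aligned(float* v)
{
    _mm_free(v);    // _mm_malloc must be paired with _mm_free
}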
Combining AVX-2 with pointer arithmetic. Finally, we can get a small extra benefit by adding pointer arithmetic optimizations to the AVX-2 parallelized version. The new code is:
void aussie_vector_multiply_scalar_AVX2_pointer_arith(float v[], int n, float c)
{
    // Multiply all vector elements by constant
    const __m256 rscalar = _mm256_set1_ps(c);       // Vector full of scalars...
    for (; n > 0; n -= 8, v += 8) {                 // Assumes n is a multiple of 8
        __m256 r1 = _mm256_loadu_ps(v);             // Load 8 floats into 256 bits
        __m256 dst = _mm256_mul_ps(r1, rscalar);    // Multiply by scalars
        _mm256_storeu_ps(v, dst);                   // Store 8 floats (unaligned)
    }
}
Benchmarking results. In theory, the AVX-2 intrinsics could speed up the computation by a factor of 8, but benchmarking showed only about a 4-times speedup over the naive C++ version.
Vector-scalar operation benchmarks (N=1024, ITER=1000000):
Vector mult-scalar C++:                   1412 ticks (1.41 seconds)
Vector mult-scalar pointer-arith:          995 ticks (0.99 seconds)
Vector mult-scalar AVX1:                   677 ticks (0.68 seconds)
Vector mult-scalar AVX2:                   373 ticks (0.37 seconds)
Vector mult-scalar AVX2 + pointer arith:   340 ticks (0.34 seconds)
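For reference, a minimal timing harness of the kind that could produce numbers like these would wrap ITER calls in a loop and report clock() ticks. This is an assumed sketch, not the book's actual benchmarking code, and the function name is illustrative:

#include <time.h>
#include <stdio.h>

void benchmark_vector_scalar(const char* name,
    void (*fn)(float*, int, float), float v[], int n)
{
    const int ITER = 1000000;               // Iterations, as in the results above
    clock_t before = clock();
    for (int iter = 0; iter < ITER; iter++) {
        fn(v, n, 3.14f);                    // Arbitrary scalar multiplier
    }
    clock_t elapsed = clock() - before;
    printf("%s: %ld ticks (%.2f seconds)\n",
        name, (long)elapsed, (double)elapsed / (double)CLOCKS_PER_SEC);
}

Each variant would then be timed with a call such as benchmark_vector_scalar("Vector mult-scalar AVX2", aussie_vector_multiply_scalar_AVX2, v, 1024).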