Vectorized Multiply Vector by Scalar

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Multiplying a vector by a scalar is a common requirement when applying scaling factors. Division by a scalar can also be handled this way by multiplying by the reciprocal (e.g., as needed for Softmax); a sketch of that trick follows the code below. Multiplication by a scalar is amenable to vectorization because the naive C++ version is very simple:

    void aussie_vector_multiply_scalar(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        for (int i = 0; i < n; i++) {
            v[i] *= c;
        }
    }
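
As an aside, scalar division can reuse this same routine by multiplying by the reciprocal, which is the usual trick for Softmax normalization. A minimal sketch (the function name here is illustrative, not from the book):

    void aussie_vector_divide_scalar(float v[], int n, float c)
    {
        // Divide all vector elements by constant (assumes c != 0)
        aussie_vector_multiply_scalar(v, n, 1.0f / c);
    }

This replaces n floating-point divisions with one division and n multiplications, which is typically faster because division is a slow operation on most CPUs.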

Loop Pointer Arithmetic. First, we can try the basic C++ optimization of pointer arithmetic:

    void aussie_vector_multiply_scalar_pointer_arith(float v[], int n, float c)  
    {
        // Multiply all vector elements by constant
        for (; n > 0; n--, v++) {
            *v *= c;
        }
    }
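
This version streams through the array with a moving pointer and a decrementing count, removing the per-iteration index computation of v[i]. Modern optimizing compilers often generate similar code for both forms, so the size of any gain is compiler-dependent; the benchmarks at the end of this section show the measured effect.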

AVX1 multiply-by-scalar: There is no special scalar-multiplication opcode in AVX or AVX-2, but we can populate a constant register (128-bit or 256-bit) with multiple copies of the scalar (i.e., _mm_set1_ps or _mm256_set1_ps), and we need only do this once, outside the loop. We can then use the SIMD multiply intrinsics in the loop body. The AVX 128-bit vector multiplication by scalar becomes:

    void aussie_vector_multiply_scalar_AVX1(float v[], int n, float c)
    {
        const __m128 rscalar = _mm_set1_ps(c);  // Vector of scalars
        for (int i = 0; i < n; i += 4) {  // Assumes n is a multiple of 4
            __m128 r1 = _mm_loadu_ps(&v[i]);   // Load 4 floats (unaligned)
            __m128 dst = _mm_mul_ps(r1, rscalar); // Multiply by scalars
            _mm_storeu_ps(&v[i], dst);  // Store 4 floats (unaligned)
        }
    }
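
Note that this loop processes 4 floats per iteration and quietly assumes that n is a multiple of 4. One way to handle arbitrary lengths is a scalar cleanup loop after the vectorized main loop; a sketch (this variant is an assumption, not from the book):

    void aussie_vector_multiply_scalar_AVX1_any_n(float v[], int n, float c)
    {
        const __m128 rscalar = _mm_set1_ps(c);  // Vector of scalars
        int i = 0;
        for (; i + 4 <= n; i += 4) {  // Vectorized main loop
            __m128 r1 = _mm_loadu_ps(&v[i]);  // Load 4 floats
            _mm_storeu_ps(&v[i], _mm_mul_ps(r1, rscalar));
        }
        for (; i < n; i++) {  // Scalar cleanup for leftover elements
            v[i] *= c;
        }
    }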

AVX2 multiply-by-scalar: Even faster is to use 8 parallel multiplications with AVX-2's 256-bit registers. The AVX1 version is simply changed to use the __m256 type and the analogous _mm256 intrinsics. The new code looks like:

    void aussie_vector_multiply_scalar_AVX2(float v[], int n, float c)
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector of scalars
        for (int i = 0; i < n; i += 8) {  // Assumes n is a multiple of 8
            __m256 r1 = _mm256_loadu_ps(&v[i]);   // Load 8 floats (unaligned)
            __m256 dst = _mm256_mul_ps(r1, rscalar); // Multiply by scalars
            _mm256_storeu_ps(&v[i], dst);  // Store 8 floats (unaligned)
        }
    }
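
Both versions above pair unaligned loads with unaligned stores, which works for any float array. If the vector is known to be 32-byte aligned (e.g., allocated with _mm_malloc), the aligned load/store intrinsics can be used instead; a sketch under that alignment assumption (the function name is illustrative):

    // Sketch: assumes v is 32-byte aligned and n is a multiple of 8
    void aussie_vector_multiply_scalar_AVX2_aligned(float v[], int n, float c)
    {
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector of scalars
        for (int i = 0; i < n; i += 8) {
            __m256 r1 = _mm256_load_ps(&v[i]);  // Aligned load
            _mm256_store_ps(&v[i], _mm256_mul_ps(r1, rscalar));  // Aligned store
        }
    }

    // Allocating suitably aligned memory:
    //   float *v = (float*)_mm_malloc(1024 * sizeof(float), 32);
    //   ...
    //   _mm_free(v);

On most recent x86 CPUs, the unaligned intrinsics run at full speed when the data happens to be aligned, so the aligned variants mostly serve as a loud assertion of the alignment assumption: they fault on a misaligned pointer rather than silently running.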

Combining AVX-2 with pointer arithmetic. Finally, we can get a small extra benefit by adding pointer arithmetic optimizations to the AVX-2 parallelized version. The new code is:

    void aussie_vector_multiply_scalar_AVX2_pointer_arith(float v[], int n, float c)
    {
        // Multiply all vector elements by constant
        const __m256 rscalar = _mm256_set1_ps(c);  // Vector full of scalars
        for (; n > 0; n -= 8, v += 8) {  // Assumes n is a multiple of 8
            __m256 r1 = _mm256_loadu_ps(v);   // Load 8 floats into 256 bits
            __m256 dst = _mm256_mul_ps(r1, rscalar);   // Multiply by scalars
            _mm256_storeu_ps(v, dst);  // Store 8 floats (unaligned)
        }
    }

Benchmarking results. In theory, the AVX-2 intrinsics could parallelize the computation 8 ways, but benchmarking showed only about a 4-times speedup over the naive C++ version:

    Vector-scalar operation benchmarks (N=1024, ITER=1000000):
    Vector mult-scalar C++: 1412 ticks (1.41 seconds)
    Vector mult-scalar pointer-arith: 995 ticks (0.99 seconds)
    Vector mult-scalar AVX1: 677 ticks (0.68 seconds)
    Vector mult-scalar AVX2: 373 ticks (0.37 seconds)
    Vector mult-scalar AVX2 + pointer arith: 340 ticks (0.34 seconds)
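
For context, numbers like these can be gathered with a simple clock()-based timing harness; a minimal sketch (illustrative only, not the book's exact benchmark code):

    #include <ctime>
    #include <cstdio>

    // Run one multiply-by-scalar routine ITER times over an N-element vector
    // and report elapsed clock ticks and seconds.
    void time_mult_scalar(void (*fn)(float[], int, float),
        const char* name, float v[], int n, int iter)
    {
        clock_t before = clock();
        for (int i = 0; i < iter; i++) {
            fn(v, n, 1.001f);  // Arbitrary scalar constant
        }
        clock_t ticks = clock() - before;
        printf("%s: %ld ticks (%.2f seconds)\n",
            name, (long)ticks, (double)ticks / (double)CLOCKS_PER_SEC);
    }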

 
