Aussie AI

Example: AVX-2 256-Bit Dot Product

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Here is my attempt at the 256-bit version of a vector dot product of 8 float values using AVX-2 instructions, which seems like it should work:

    float aussie_avx2_vecdot_8_floats_buggy(
        float v1[8], float v2[8])
    {
        // AVX2 dot product: 2 vectors, 8x32-bit floats
        __m256 r1 = _mm256_loadu_ps(v1); // Load floats
        __m256 r2 = _mm256_loadu_ps(v2);
        __m256 dst = _mm256_dp_ps(r1, r2, 0xf1); // Bug!
        float fret = _mm256_cvtss_f32(dst); 
        return fret;
    }

But it doesn't! Instead of working on 8 pairs of float numbers, it computes the dot product of only the first 4 pairs of float values, just like the earlier 128-bit AVX code. The problem wasn't alignment to 256-bit blocks, because I added “alignas(32)” to the arrays passed in. The real issue is how the “_mm256_dp_ps” intrinsic works. It maps to the 256-bit form of the VDPPS opcode in the x86 instruction set for 32-bit float values (there is also VDPPD for 64-bit double numbers), but VDPPS operates on each 128-bit lane independently: with the mask 0xf1, it computes two separate 4-float dot products, one per lane, and stores each partial sum in the lowest element of its own lane. The “_mm256_cvtss_f32” call then extracts only the lowest element of the whole register, which is the dot product of the first 4 pairs; the other partial sum is left sitting unused in the upper lane.

 
