Tiled Matrix-Vector Multiplication

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

A more detailed analysis of the matrix-vector algorithm shows that it is not optimal in at least three areas:

  • Data locality
  • Pipelining AVX intrinsic arithmetic
  • Redundant loads

The data locality of the 2x2 tiled version is better, but more improvement is possible, starting with the use of AVX intrinsics inside the “sub-kernel” for each tile. The “load, calculate, store” instruction sequences in the earlier non-tiled AVX-optimized versions do not allow the AVX intrinsics to pipeline naturally, computing multiple sums or FMA operations in near-parallel. And the entire input vector is re-loaded for every row of the matrix. So, we need to examine improvements in all three aspects.

A tiled sub-kernel is the main way to fix data locality and pipelining. Improving data locality is largely inherent to tiling. Pipelining can be improved by unrolling the tiled sub-kernel and reordering the loads and stores so they don't stall the AVX arithmetic.
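For illustration, here is a minimal sketch of what such an unrolled sub-kernel might look like with AVX intrinsics, using four independent accumulators so the FMA operations on different rows can pipeline. This is a sketch under assumptions, not the book's benchmarked kernel: the names avx_subkernel_4rows and hsum256 are hypothetical, and it requires AVX2 and FMA support (e.g. compile with -mavx2 -mfma):

    #include <immintrin.h>  // AVX2/FMA intrinsics

    // Horizontal sum of the 8 floats in an AVX register
    static float hsum256(__m256 x)
    {
        __m128 lo = _mm256_castps256_ps128(x);    // low 4 floats
        __m128 hi = _mm256_extractf128_ps(x, 1);  // high 4 floats
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        return _mm_cvtss_f32(lo);
    }

    // Sub-kernel: dot products of 4 matrix rows against the vector,
    // with 4 independent accumulators that can pipeline in near-parallel.
    void avx_subkernel_4rows(const float* r0, const float* r1,
        const float* r2, const float* r3,
        const float v[], int n, float out4[4])
    {
        __m256 sum0 = _mm256_setzero_ps();
        __m256 sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps();
        __m256 sum3 = _mm256_setzero_ps();
        for (int col = 0; col < n; col += 8) {     // assumes n % 8 == 0
            __m256 vv = _mm256_loadu_ps(&v[col]);  // one vector load shared by 4 rows
            sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(&r0[col]), vv, sum0);
            sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(&r1[col]), vv, sum1);
            sum2 = _mm256_fmadd_ps(_mm256_loadu_ps(&r2[col]), vv, sum2);
            sum3 = _mm256_fmadd_ps(_mm256_loadu_ps(&r3[col]), vv, sum3);
        }
        out4[0] = hsum256(sum0);
        out4[1] = hsum256(sum1);
        out4[2] = hsum256(sum2);
        out4[3] = hsum256(sum3);
    }

Note that this style also attacks the redundant-load problem: each 8-float chunk of the vector is loaded once and reused across four rows.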

Can we avoid redundant vector loads? Every element of every row must be accessed at least once, so the redundant loads of the vector suggest modifying the algorithm to work on a subsection of the vector across all of the matrix rows. This suggests an inversion of the main nested loops of the algorithm. However, that runs into a major problem: it destroys cache locality by scanning down the columns of the matrix. I benchmarked this loop interchange idea, and it actually increased execution time. Maybe we should use the transpose of the matrix, so that it's in column-major order when scanning its columns? No, that's actually just going back to the original algorithm without the loop interchange.
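For illustration, the interchanged version looks something like this minimal sketch (the function name is hypothetical and this is not the exact code that was benchmarked; it assumes the book's ymatrix type and <cstring> for memset):

    void aussie_matmul_vector_interchanged(const ymatrix m, const float v[], int n, float vout[])
    {
        // Loop-interchanged matrix-by-vector: each vector element is
        // loaded only once, but the inner loop scans DOWN a column of m.
        memset(vout, 0, sizeof(float) * n);
        for (int col = 0; col < n; col++) {
            float fcol = v[col];  // vector element loaded once, reused for all rows
            for (int row = 0; row < n; row++) {
                vout[row] += m[row][col] * fcol;  // stride of n floats through m (cache-unfriendly)
            }
        }
    }

Each pass of the inner loop touches n different cache lines of m (and re-reads all of vout), which is why this version benchmarks slower despite removing the redundant vector loads.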

Anyway, a better plan seems to be to reduce the redundant loading by using temporary calculations inside the tile sub-kernel. Here is what a tiled/blocked algorithm using 2x2 tiles looks like in basic sequential C++:

    void aussie_matmul_vector_tiled_2x2_better(const ymatrix m, const float v[], int n, float vout[])
    {
        // Tiled/blocked matrix-by-vector using 2x2 tiling (faster sub-kernel)
        yassert(n % 2 == 0);
        for (int row = 0; row < n; row += 2) {
            vout[row] = 0.0f;
            vout[row + 1] = 0.0f;
            for (int col = 0; col < n; col += 2) {
                vout[row] += (m[row][col] * v[col])          // row+0, col+0
                    + (m[row][col + 1] * v[col + 1]);        // row+0, col+1
                vout[row + 1] += (m[row + 1][col] * v[col])  // row+1, col+0
                    + (m[row + 1][col + 1] * v[col + 1]);    // row+1, col+1
            }
        }
    }

One minor improvement is to use memset to clear the whole output vector to zero, rather than individual assignments; I added this to the 4x4 tiled version. Another minor improvement is removing the “common sub-expressions” of v[col] and v[col+1]: this gave no noticeable improvement in the 2x2 tiled version, but about a 10% improvement in the 4x4 tiled version. The computations of m[row], m[row+1], and so on, can also be hoisted out of the inner loop, giving another 10% gain for the 4x4 tiled version. The C++ code for the 4x4 tiled version with a fully unrolled 4x4 sub-kernel now looks like:

    void aussie_matmul_vector_tiled_4x4_CSE2(const ymatrix m, const float v[], int n, float vout[])
    {
        // Tiled/blocked matrix-by-vector using 4x4 tiling
        yassert(n % 4 == 0);
        memset(vout, 0, sizeof(float) * n);  // clear all n outputs at once (needs <cstring>)
        for (int row = 0; row < n; row += 4) {
            // Hoist the row pointers out of the inner loop
            const float* rowvec = &m[row][0];
            const float* rowvec1 = &m[row + 1][0];
            const float* rowvec2 = &m[row + 2][0];
            const float* rowvec3 = &m[row + 3][0];
            for (int col = 0; col < n; col += 4) {
                // Common sub-expressions: load each vector element once per tile
                float fcol0 = v[col];
                float fcol1 = v[col + 1];
                float fcol2 = v[col + 2];
                float fcol3 = v[col + 3];
                vout[row] +=
                      (rowvec[col] * fcol0)      // row+0, col+0
                    + (rowvec[col + 1] * fcol1)  // row+0, col+1
                    + (rowvec[col + 2] * fcol2)  // row+0, col+2
                    + (rowvec[col + 3] * fcol3)  // row+0, col+3
                    ;
                vout[row + 1] +=
                      (rowvec1[col] * fcol0)      // row+1, col+0
                    + (rowvec1[col + 1] * fcol1)  // row+1, col+1
                    + (rowvec1[col + 2] * fcol2)  // row+1, col+2
                    + (rowvec1[col + 3] * fcol3)  // row+1, col+3
                    ;
                vout[row + 2] +=
                      (rowvec2[col] * fcol0)      // row+2, col+0
                    + (rowvec2[col + 1] * fcol1)  // row+2, col+1
                    + (rowvec2[col + 2] * fcol2)  // row+2, col+2
                    + (rowvec2[col + 3] * fcol3)  // row+2, col+3
                    ;
                vout[row + 3] +=
                      (rowvec3[col] * fcol0)      // row+3, col+0
                    + (rowvec3[col + 1] * fcol1)  // row+3, col+1
                    + (rowvec3[col + 2] * fcol2)  // row+3, col+2
                    + (rowvec3[col + 3] * fcol3)  // row+3, col+3
                    ;
            }
        }
    }
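Note that this excerpt relies on the book's ymatrix type and yassert macro, which are defined elsewhere. A minimal stand-in that would let the functions above compile might look like this (the NMAX size and these definitions are hypothetical assumptions, not the book's actual ones):

    #include <cassert>
    #include <cstring>   // memset, used by the 4x4 version

    const int NMAX = 1024;              // hypothetical dimension for this sketch
    typedef float ymatrix[NMAX][NMAX];  // assumed definition (not shown in this excerpt)
    #define yassert(expr) assert(expr)  // assumed to wrap the standard assert

With definitions like these preceding the functions, a call is simply aussie_matmul_vector_tiled_4x4_CSE2(m, v, NMAX, vout), with m declared static or heap-allocated, since NMAX x NMAX floats would overflow a typical stack frame.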
