Aussie AI
Rectangular Matrix-Vector Multiplication
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Rectangular Matrix-Vector Multiplication
The general case of a rectangular matrix multiplied by a vector is a little trickier, but not a lot. If our matrix is MxN and the vector is size N, then the output vector has size M. Note that two of the dimensions must match: the columns of the matrix and the elements of the input vector are both N. However, this dimension N “disappears” and the size of the output vector depends only on M. The pseudocode:
MAT[M][N] * VIN[N] -> VOUT[M]
The rectangular matrix-vector multiplication is almost identical to square matrix-vector computations. Each element of the output vector is a dot product of a matrix row with the input vector. Again, we note that the length of each matrix row (N) must match the size of the input vector (N), or else we cannot compute it. I mean, we could still compute it with mismatched dimensions, such as by assuming that the shorter one (matrix row or input vector) had zeros in the missing elements, but that sounds a little buggy.
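As a concrete illustration, here is a minimal C++ sketch of a rectangular matrix-vector multiply, assuming the M x N matrix is stored flat in row-major order (the function name and the plain float pointers are just for illustration, rather than the ymatrix type used by the square kernels below):

// Rectangular matrix-vector multiply: MAT[M][N] * VIN[N] -> VOUT[M]
// Assumes "mat" is an M x N matrix stored flat in row-major order.
void rect_matrix_vector_multiply(const float* mat, const float* vin, float* vout, int m, int n)
{
    for (int row = 0; row < m; row++) {
        const float* rowvec = &mat[row * n];  // Start of this matrix row
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += rowvec[k] * vin[k];  // Dot product of one row with the input vector
        }
        vout[row] = sum;  // One output element per matrix row
    }
}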
Matrix-Matrix Multiplication
Now let's look at matrix-matrix multiplication, whereas above we looked at matrix-vector multiplication. The proper MatMul and GEMM kernels are coded for full matrix-matrix multiplication.
Matrix multiplication results in another matrix as the output. For the simple case of two square matrices of the same size, the resulting output matrix is also of the same dimensions. In pseudocode:
M1[N][N] * M2[N][N] -> MOUT[N][N]
Algorithmic Complexity. The naive implementation of a matrix-matrix multiplication via three nested loops is a cubic algorithm, with O(N^3) complexity. The well-known Strassen algorithm has complexity of about O(N^2.807), which looks like such a massive improvement. Other algorithms such as the Coppersmith-Winograd algorithm and numerous sub-variants have better asymptotic complexity, but with a high constant overhead, making them impractical for anything but very large values of N.
Basic Matrix-Matrix Multiplication. The basic algorithm for matrix multiplication is three nested loops. There is nothing fancy here: this is just coding up the basic matrix multiplication method that you forgot the second you finished your Senior math exam. If you don't believe me, check it out on Wikipedia. Here's the C++ code:
void aussie_matmul_matrix_basic(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
{
    // Matrix-Matrix multiplication basic naive n^3 algorithm...
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += (m1[row][k] * m2[k][col]);
            }
            mout[row][col] = sum;
        }
    }
}
The two outer loops scan the rows of the first matrix and the columns of the second matrix. The innermost of the three loops is doing a vector dot product computation over the “k” index variable. However, it's not a normal vector-vector dot product. Instead, it's the dot product of one “horizontal” vector, which is a row of the first matrix, with a second “vertical” vector, which is a column of the second matrix. Hence, the number of columns in the first matrix must equal the number of rows in the second matrix, which is true here because we're assuming that both matrices are square. The “k” variable is spinning down the n elements of a row and a column at the same time. Every element of the NxN output matrix requires a vector dot product calculation like this.
Vectorization. From the point-of-view of AI engines, none of these matrix multiplication algorithms is especially good, because they are all sequential rather than parallel algorithms. Neither the naive cubic version nor the Strassen algorithm is what we need. What we need for GPUs and CPU SIMD intrinsics are vectorizable algorithms for matrix-matrix multiplication. Unfortunately, the above simple triple-nested matrix multiplication algorithm is not one of them, because the non-contiguous storage of the second matrix's columns hampers vectorization.
Memory layout problems for matrix-matrix multiplication: The layout of memory for matrix-matrix multiplications is not as fortuitous as it was for matrix-vector multiplications. Each computation in matrix-matrix multiplication is a vector dot product of a row of the first matrix with a column of the second matrix. Each row of the first matrix is happily stored in contiguous memory, but the columns of the second matrix are not. In fact, the “stride” between two elements of a column of a matrix is a very large number of bytes in the default memory layout.
The default storage of matrices and two-dimensional arrays in C++ is called “row-major” storage layout. Row-major storage has each row in contiguous memory. The rows are stored one at a time, top to bottom, and adjacent elements in a row are also adjacent memory addresses. Columns are a second-class citizen in row-major layout, and you have to jump around to find adjacent elements of a column vector.
The alternative storage method is “column-major” storage layout where the columns are stored in contiguous memory, and it's the rows that are in the smoker's carriage at the back of the train. However, column-major is not the default C++ storage mode.
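To make the “stride” problem concrete, here is a tiny standalone sketch (with an illustrative 4x4 float matrix) showing how far apart adjacent elements are in memory under default C++ row-major storage:

// A small demonstration of row-major memory strides in C++ (illustrative only).
#include <cstdio>

int main()
{
    const int N = 4;
    float m[N][N] = {};   // Default C++ row-major storage: each row is contiguous
    // Horizontally adjacent elements (same row) are one float apart in memory.
    printf("Stride along a row: %zu bytes\n", sizeof(float));     // 4 bytes
    // Vertically adjacent elements (same column) are a whole row apart.
    printf("Stride down a column: %zu bytes\n", sizeof(m[0]));    // N * 4 = 16 bytes
    return 0;
}

For a real 2048x2048 float matrix, the stride down a column is 8192 bytes, which is why naive column accesses thrash the memory cache.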
Hence, to vectorize a matrix-matrix multiplication, we want to keep the first matrix in row-major storage, but we need to rearrange the storage of the second matrix to be column-major storage, rather than the default row-major storage. Column-major storage would help vectorize the columns with each column element in adjacent memory locations. The first matrix is fine, but we want the second matrix to be stored in a mirror image of itself.
Hmm, a mirror and a matrix. What does that sound like? A transposed matrix.
Pseudo-Transposed Second Matrix. The simplest way to get column-major order of a matrix (especially if square) is to use the transpose of the matrix, and modify the internals of the matrix multiplication function to pretend that the transpose is actually the column-major storage of the original second matrix. I call it the “fake transpose” method, which is a bit of a misnomer because it is the actual transposed matrix, but we modify the matrix multiplication code to access it with the index logic reversed.
Confusing? Yes, I felt the same way, but if you follow it through carefully, you can see that the transpose is really very similar to storing the original matrix in column-major order, where each column element is stored in adjacent memory. The columns of the original problematic matrix become fake rows in the fake transpose, stored in sequential memory addresses. So, for square matrices, we can take the transpose of a matrix, and it's like the matrix has been converted into column-major storage. However, we also need to change the C++ code in the matrix multiplication kernel, because it assumes row-major storage of both matrices, but now we've got row-major storage only for the first matrix, and column-major storage for the second one (our fake transpose).
The main point of optimization with a transpose is that the column becomes a contiguous vector from a row in the transposed matrix. Here's what the matrix multiplication algorithm looks like when it's working on a “fake” transpose:
void aussie_matmul_matrix_fake_transpose(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
{
    // Matrix-Matrix naive n^3 algorithm on a TRANSPOSE...
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m1[row][0];
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            const float* colvec = &m2[col][0];  // Row!
            for (int k = 0; k < n; k++) {
                sum += (rowvec[k] * colvec[k]);
            }
            mout[row][col] = sum;
        }
    }
}
Note that the above code assumes the transpose has already been computed. However, it is viable to compute a new transpose matrix in a preliminary step and still be faster, because transposing a matrix only adds an extra O(N^2) time to compute the transpose (and N^2 storage space to store it temporarily), whereas the main matrix multiplication is O(N^3) time.
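For example, a wrapper that does the O(N^2) transpose as a preliminary step and then calls the O(N^3) kernel might look like this minimal sketch (the wrapper name is hypothetical; the aussie_matrix_transpose_basic function is shown later in this chapter, and the scratch matrix for the transpose is supplied by the caller):

// Sketch of a wrapper: transpose the second matrix first (O(n^2) work),
// then run the O(n^3) "fake transpose" MatMul kernel on it.
// "scratch" is caller-provided temporary storage for the transpose.
void aussie_matmul_matrix_via_transpose(const ymatrix m1, const ymatrix m2, int n, ymatrix mout, ymatrix scratch)
{
    aussie_matrix_transpose_basic(m2, n, scratch);               // O(n^2) preliminary step
    aussie_matmul_matrix_fake_transpose(m1, scratch, n, mout);   // O(n^3) main MatMul
}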
Perhaps surprisingly, this transpose method is much faster even without any vectorization. Because the column vectors are accessed in sequential order from contiguous memory, there is much better data locality for the memory cache, and also for any predictive pipelining happening in the cache. Here's the benchmark comparison:
Matrix-Matrix multiplication (MatMul) benchmarks (N=2048, ITER=1):
Matrix-matrix mult basic: 69479 ticks (69.48 seconds)
Matrix-matrix fake transpose: 47469 ticks (47.47 seconds)
The transpose method is 31% faster with an unchanged basic MatMul algorithm. And all we did was permute two indices in a two-dimensional array. This code does exactly the same arithmetic computations as the naive version, but accesses memory in a different order, giving us a cache speedup.
There are various other small coding optimizations that can improve the transposed MatMul method further. The loop body could be partially unrolled by 4 or 8 iterations (or more). Here's the C++ code of the version with an unrolling factor of 8 iterations:
void aussie_matmul_matrix_fake_transpose_unrolled8(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
{
    // Transpose Matrix-Matrix multiplication with 8 iteration unrolling...
    yassert(n % 8 == 0);
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m1[row][0];
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            const float* colvec = &m2[col][0];
            for (int k = 0; k < n; k += 8) {
                sum += (rowvec[k] * colvec[k])
                     + (rowvec[k + 1] * colvec[k + 1])
                     + (rowvec[k + 2] * colvec[k + 2])
                     + (rowvec[k + 3] * colvec[k + 3])
                     + (rowvec[k + 4] * colvec[k + 4])
                     + (rowvec[k + 5] * colvec[k + 5])
                     + (rowvec[k + 6] * colvec[k + 6])
                     + (rowvec[k + 7] * colvec[k + 7]);
            }
            mout[row][col] = sum;
        }
    }
}
Here are the benchmark results:
Matrix-Matrix multiplication (MatMul) benchmarks (N=2048, ITER=1):
Matrix-matrix fake transpose unrolled 4: 15221 ticks (15.22 seconds)
Matrix-matrix fake transpose unrolled 8: 12151 ticks (12.15 seconds)
Further tweaks are possible. The internal loop could be fully unrolled for a known vector size. Also, the initialization “sum=0.0f” could be removed by peeling the first iteration and starting the loop at “k=1”. Pointer arithmetic could be used to avoid loop indices and the double bracket accesses. However, these are small fry, and we're now on the hunt for the Spanish mackerel of MatMul optimizations: vectorization.
Vectorized MatMul
Cache speedup is not the only benefit of the transpose method. Once we have column-major storage for the second matrix, then both the rows of the first matrix and the columns of the second matrix are in contiguous memory. The computation is a normal vector dot product again on two vectors stored as arrays in memory (i.e. “rowvec” and “colvec” in the C++ code above). Hence, we can use all of our standard vector dot product speedups again, including vectorization and hardware acceleration.
As an example, here's the AVX-2 vectorization of the transpose method using the FMA 256-bit intrinsics to do the vector dot product in parallel (see Chapter 17 for this AVX vector dot product code). This parallelizes the dot product by 8 elements at a time:
void aussie_matmul_matrix_fake_transpose_vecdot_AVX2(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
{
    // AVX2 Matrix-Matrix multiplication
    yassert(n % 8 == 0);
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m1[row][0];
        for (int col = 0; col < n; col++) {
            const float* colvec = &m2[col][0];
            mout[row][col] = aussie_vecdot_FMA_unroll_AVX2(rowvec, colvec, n);
        }
    }
}
Here are the benchmark results:
Matrix-Matrix multiplication (MatMul) benchmarks (N=2048, ITER=1):
Matrix-matrix fake transpose AVX1: 19522 ticks (19.52 seconds)
Matrix-matrix fake transpose AVX2: 12747 ticks (12.75 seconds)
If anything, these AVX results are disappointing. Basic loop unrolling techniques (in the prior section) did better than AVX1 and the same as AVX2 vectorization. However, we haven't used AVX optimally inside the sequential code here. The AVX intrinsic calls should be moved up into the loop body without any function call overhead (i.e. inlining the function manually). I coded up that idea, and it made almost zero difference! I guess the C++ compiler is already inlining it, or function call overhead is a tiny percentage.
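A sketch of that manually inlined idea looks something like this, with the standard AVX2 FMA intrinsics written directly into the inner loop (illustrative only; the function name is made up and the exact benchmarked code may differ):

#include <immintrin.h>   // AVX2/FMA intrinsics

// Sketch of the "manually inlined" idea: the FMA dot product is written
// directly inside the MatMul loops instead of calling a vecdot function.
void matmul_fake_transpose_inline_AVX2(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
{
    yassert(n % 8 == 0);
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m1[row][0];
        for (int col = 0; col < n; col++) {
            const float* colvec = &m2[col][0];
            __m256 sumvec = _mm256_setzero_ps();         // 8 running partial sums
            for (int k = 0; k < n; k += 8) {
                __m256 a = _mm256_loadu_ps(&rowvec[k]);  // Load 8 floats from the row
                __m256 b = _mm256_loadu_ps(&colvec[k]);  // Load 8 floats from the (transposed) column
                sumvec = _mm256_fmadd_ps(a, b, sumvec);  // sumvec += a * b (element-wise FMA)
            }
            float partials[8];
            _mm256_storeu_ps(partials, sumvec);          // Horizontal add of the 8 partial sums
            mout[row][col] = partials[0] + partials[1] + partials[2] + partials[3]
                           + partials[4] + partials[5] + partials[6] + partials[7];
        }
    }
}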
Further parallelization speedups would include using AVX-512 or AVX-10 intrinsics for vectorizing 16 elements in parallel. Also desirable are various further optimizations of the sequential code around any AVX intrinsics. The inner “col” loop could be fully or partially unrolled with multiple AVX sequences and/or optimized with pointer arithmetic.
Loop Tiled/Blocked MatMul
The triple-nested MatMul version with the vectorized inner loop is still nowhere near what is possible. There are three more ways to increase throughput:
- Data locality within the matrices.
- Pipelining of the SIMD instructions.
- Avoiding repeated loads of the same data.
The data locality of the basic AVX transposed MatMul algorithm is still far from optimal, although we fixed the most egregious problem by using the transpose. The algorithm is simply scanning down all of the dimensions, without really any attempt to maintain data locality.
The method of calling AVX intrinsics is simply doing “load, FMA, store” repeatedly along blocks of 4 or 8 elements, which does not allow for the natural pipelining of the FMA instructions. The loads and stores are interrupting the flow of computation.
Thirdly, if you look carefully at the “load” operations that are happening in the sequence, you realize that it is repeatedly loading the same regions of the matrices.
Tiling or blocking the MatMul loops is far more effective. The basic idea is that instead of scanning sequentially, we process smaller square or rectangular “tiles” or “blocks” of the data, one at a time. Refer to Chapter 15 for the basic idea of tiling/blocking optimizations of nested loops. Data locality is the main aim of a tiled algorithm, but it also helps us achieve better pipelining of SIMD instructions, because we can load all the data in, and then perform multiple arithmetic operations on it without any intervening loads or stores. And since a tiled MatMul is iterating more carefully over smaller blocks of data within the matrices, there's also less redundant loading of the data overall.
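Here's a minimal sketch of a loop-tiled version of the transposed MatMul, kept scalar for clarity (the tile size and the function name are illustrative; it reuses the ymatrix type and yassert from the earlier code):

// Minimal sketch of a loop-tiled (blocked) transposed MatMul.
// Assumes m2 is the "fake transpose" (column-major second matrix),
// and that n is a multiple of TILE. Scalar arithmetic only, for clarity.
void matmul_fake_transpose_tiled(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
{
    const int TILE = 64;                 // Tile size: tune so tiles fit in cache
    yassert(n % TILE == 0);
    for (int row = 0; row < n; row++)    // Zero the output first, since we accumulate into it
        for (int col = 0; col < n; col++)
            mout[row][col] = 0.0f;
    for (int rt = 0; rt < n; rt += TILE) {          // Tile over rows of m1
        for (int ct = 0; ct < n; ct += TILE) {      // Tile over columns of the output
            for (int kt = 0; kt < n; kt += TILE) {  // Tile over the shared k dimension
                // Process one TILE x TILE x TILE block: good data locality
                for (int row = rt; row < rt + TILE; row++) {
                    for (int col = ct; col < ct + TILE; col++) {
                        float sum = 0.0f;
                        for (int k = kt; k < kt + TILE; k++) {
                            sum += m1[row][k] * m2[col][k];   // m2 indexed as a transpose
                        }
                        mout[row][col] += sum;   // Accumulate partial dot products
                    }
                }
            }
        }
    }
}

In a real tiled kernel, the innermost block computation would itself be vectorized with AVX intrinsics, and the tile size tuned to the cache sizes.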
Multiplying Rectangular Matrices
AI Engines don't always use square matrices, and in fact, several of the main matrices have rectangular dimensions. For example, the QKV matrices in the attention algorithm have two different dimensions.
For multiplying two rectangular matrices, of sizes MxN and NxP, we get an output matrix of size MxP (i.e. the inner N dimension disappears). In pseudocode style:
M1[M][N] * M2[N][P] -> MOUT[M][P]
Note that P=1 is the case of matrix-vector multiplication, because an Nx1 matrix is actually a vector with N rows of a single element (i.e., one column).
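Here's a minimal sketch of the basic rectangular matrix-matrix multiply, using flat row-major storage and plain float pointers (illustrative names, since the square ymatrix type doesn't fit the rectangular case):

// Rectangular matrix-matrix multiply: M1[M][N] * M2[N][P] -> MOUT[M][P]
// All matrices are stored flat in row-major order.
void rect_matmul_basic(const float* m1, const float* m2, float* mout, int m, int n, int p)
{
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < p; col++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += m1[row * n + k] * m2[k * p + col];   // Row of m1 dot column of m2
            }
            mout[row * p + col] = sum;
        }
    }
}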
Fast Matrix Multiplication Theory
The main techniques for faster matrix multiplication of general matrices include:
- Strassen's algorithm
- Winograd's algorithm
- Fast Fourier Transform (FFT) methods
Matrix multiplications can also be sped up by restricting our AI engine to only use matrices that are of special types:
- Low-rank matrix factorization
- Sparse matrices
- Special matrix methods (e.g. Butterfly matrices, Monarch matrices, etc.)
Each of these specialized matrix types can have a faster matrix multiplication kernel than using the all-purpose GEMM kernel. For example, sparse matrices can be stored in a compacted permuted-tuple format, with parallelization of permutation arrays for computation.
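As one illustration, a common compacted sparse format is Compressed Sparse Row (CSR), which stores only the non-zero values and their column indices (this is a generic CSR sketch, not necessarily the exact permuted-tuple layout mentioned above):

// Sketch of a sparse matrix-vector multiply using the CSR format:
// "values" holds the non-zero elements, "colidx" their column indices,
// and "rowstart" holds the index in "values" where each row begins
// (with rowstart[nrows] one past the last non-zero).
void sparse_matvec_csr(const float* values, const int* colidx, const int* rowstart,
                       int nrows, const float* vin, float* vout)
{
    for (int row = 0; row < nrows; row++) {
        float sum = 0.0f;
        for (int i = rowstart[row]; i < rowstart[row + 1]; i++) {
            sum += values[i] * vin[colidx[i]];   // Only non-zero elements contribute
        }
        vout[row] = sum;
    }
}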
Multiplying by Transpose
The transpose of a matrix is used in matrix multiplications in some parts of the AI Engine. For example, this occurs with the QKV matrix computations inside the attention heads, where the transpose of K is used, usually denoted as K^T in the algebraic formula.
Note that this is the actual algebraic use of the real transpose, as opposed to the idea of using a “fake transpose” to get column-major storage of matrices for easier vectorization. The code to compute the transpose of a matrix is shown below for a square matrix:
void aussie_matrix_transpose_basic(const ymatrix m1, int n, ymatrix transpose)
{
    // Transpose: put the transposed matrix into the output matrix (square matrix)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            transpose[j][i] = m1[i][j];
        }
    }
}
The funny thing is that if we want to multiply a “real” transpose as the second matrix in some computation, then the original non-transposed matrix is the “fake transpose” of the “real” transpose. How awkward! But it's actually good, because we usually already have the original matrix in memory, and we don't even need to compute the (real) transpose. Instead, to do a MatMul of a matrix with this real transpose, we can instead use the original matrix as the second operand in the kernel that is based on the column-major storage of a fake transpose. Oh, dear, I feel like it's all circular and I'm digging myself into a word pit here! But it all works out in the end, and it's fast, which is really the one and only thing.
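In code, the point is simply that to compute X*K^T we can pass the original K straight into the fake-transpose kernel, with no transpose ever materialized. A minimal sketch (the wrapper name is hypothetical):

// Sketch: computing MOUT = X * K^T without ever materializing K^T.
// The fake-transpose kernel already reads its second argument as if it were
// column-major storage, and K is exactly the column-major layout of K^T.
void matmul_by_transpose_of(const ymatrix x, const ymatrix k, int n, ymatrix mout)
{
    aussie_matmul_matrix_fake_transpose(x, k, n, mout);   // No O(n^2) transpose step needed
}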
Approximate Matrix Multiplication
Approximate Matrix Multiplication (AMM) refers to a variety of complicated model optimization techniques that replace matrix multiplications with various approximations that avoid the cost of arithmetic multiplication, trading off some accuracy. These methods are usually distinct from quantization methods, are not specific to certain subclasses of matrices, and invoke more advanced mathematics from the theory of matrices.
Note that these algorithms apply at the high level of how matrices are multiplied with other matrices or with vectors (e.g. avoiding some vector dot products), whereas there are also low-level optimizations of the arithmetic operation of multiplying two numbers (see advanced mathematics in Chapter 53). These two classes of approximation research are not the same, and are actually orthogonal to each other.