Aussie AI

34. MatMul/GEMM

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Show me the money.”

Jerry Maguire, 1996.

 

 

What is MatMul?

This is where the rubber meets the road. The core of the method used by AI Engines to trick us into thinking they're smart is in the matrix multiplications. This is where billions of weights get applied to whatever words you prompted into the AI. The words get changed to numbers (i.e. tokens), and they get crunched up through many layers of tensor kernels, hung out to dry in a normalization blender, buried in peat moss, and then munged up into tiny fractional probabilities, and out pops the answer. Simples.

Tensors are just matrices on steroids, and matrix multiplications are the basis of neural network processing. Typically, AI matrix multiplication is shortened to “MatMul” or referred to as “GEMM” (General Matrix Multiplication). If you're writing your own C++ matrix multiplication code, you're writing a “MatMul algorithm” or a “GEMM kernel.”

Note that MatMul and GEMM both refer to matrix-matrix multiplication. The simpler case of matrix-vector multiplication is called Vector Matrix Multiplication (VMM). Also, the “general” in GEMM means that any type of matrices may be multiplied, as distinct from special cases such as “sparse matrix multiplication” or triangular matrices.

The main bottleneck in Transformers is the multiplication operation deep inside the nested loops of the matrix multiplier. If you write a Transformer in C++ that runs without a GPU, the main time cost will be the “*” operator on floating-point numbers, deep inside a matrix multiplication function.

The main optimization of MatMul in modern times is hardware acceleration, running lots of those multiplication computations in parallel. Using tensors scales matrix multiplication up into a third dimension for increased parallelization, but it's all matrix algebra underneath. Well, actually, it's vector dot products under that, too.

Matrix-Vector Multiplication

Matrix multiplication by a vector gives another vector. Let us consider the simple case first, where the matrix is square with dimensions NxN and the vector is also of size N. The matrix has N rows and N columns, and the input vector has N elements. The resulting output vector will also have N elements. Conceptually, in pseudocode:

    MAT[N][N] * VIN[N] -> VOUT[N]

It's not immediately obvious, or at least, I don't remember my High School Math teacher mentioning it, but matrix-vector multiplication is a bunch of vector dot product computations. We need to do a vector dot product for each of the elements of the output vector. Each element is a dot product of a matrix row times the input vector. Note that the dimensions match for a dot product, with N matrix rows and N elements in the input vector.
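
The code examples below lean on a basic vector dot product helper named aussie_vecdot_basic. As a reminder of what such a helper does, here is a minimal sketch (the actual version defined earlier in the book may differ in minor details):

    float aussie_vecdot_basic(const float v1[], const float v2[], int n)
    {
        // Basic vector dot product: sum of pairwise products (sketch)
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
                sum += v1[i] * v2[i];
        }
        return sum;
    }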

Complexity of Matrix-Vector Multiplication. The algorithmic complexity of matrix-vector multiplication is quadratic in N, whereas matrix-matrix multiplication is cubic in N. The basic matrix-vector multiplication scans N rows of the matrix, with each row element performing a computation against each of the N elements of the vector, giving two nested loops with an overall O(N^2) cost.

Memory layout: One important point for the efficiency of matrix-vector multiplication is that the default memory layout has contiguous addresses for both the matrix row and the vector. Obviously, a vector is just a sequence of memory with all the elements in series. Not so obviously, a row of a matrix, when stored as a C++ two-dimensional array, is also a contiguous set of data (i.e. a matrix row is like a vector). Hence, the dot product multiplication of a matrix row and the input vector is simply scanning forward along contiguous addresses for both of its inputs, which makes it easy to vectorize.

Spot the Buggy MatMul

Have a look at this code for a matrix-vector multiplication using vector dot product. It took me a long time to realize what was wrong with this. Can you spot the bug?

    void aussie_matmul_vector_basic1_buggy(ymatrix m, float v[], int n)
    {
        // Basic matrix-by-vector using vector dot products..
        for (int i = 0; i < n; i++) {
                float* rowvector = &m[i][0];
                float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
                v[i] = sum;
        }
    }

The bug is a kind of aliasing problem here:

    v[i] = sum;  // Bug!

It looks correct, but it's wrong. The computation of v[i] is setting its value in the middle of the loop, and then going around for the next matrix row, which will then use that newly calculated v[i] value as if it was part of the input vector. Because I'm misusing “v” as both the input and output vector, parts of the output vector will get used as the input vector. It's a very insidious type of aliasing bug, and many of my simple unit tests with zero matrices and identity matrices were still succeeding. It's my fault for trying to do matrix-vector multiplication as an element-wise vector method. The solution is simple: matrix-vector multiplication needs a third operand for the output vector.

Optimizing Matrix-Vector Multiplication

The fixed-up version of matrix-vector multiplication with row-wise vector dot products simply outputs to another separate destination vector operand.

    void aussie_matmul_vector_basic_out1(const ymatrix m, const float v[], int n, float vout[])
    {
        // Basic matrix-by-vector using vector dot products..
        for (int i = 0; i < n; i++) {
                const float* rowvector = &m[i][0];
                float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
                vout[i] = sum;
        }
    }

Nested Loop Matrix-Vector Version: The same matrix-vector multiplication algorithm in the form of two nested loops is below. This is flattening the call to the lower-level vector dot product function and putting its inner summation loop directly inside the outer main loop. The basic C++ code looks like:

    void aussie_matmul_vector_basic_out2(const ymatrix m, const float v[], int n, float vout[])
    {
        // Basic matrix-by-vector using nested loops..
        for (int row = 0; row < n; row++) {
                float sum = 0.0f;
                for (int col = 0; col < n; col++) {
                        sum += (m[row][col] * v[col]);
                }
                vout[row] = sum;
        }
    }

Optimizations of matrix-vector multiplication. Various ways to optimize the naive nested loop matrix-vector multiplication suggest themselves:

  • Hoisting loop-invariant code (loop code motion) of the “m[row]” expression.
  • Loop pointer arithmetic for both loops.
  • Loop unrolling of the inner loop to unroll 4, 8 or more iterations (a sketch of the 4-iteration version follows this list).
  • Loop tiling to unroll a 2x2 tile/block.
  • Vectorization using the AVX1/AVX2 vector dot product versions we already examined.
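
As an illustration of the inner-loop unrolling idea, here is a sketch of the 4-iteration unrolled version (the function name is hypothetical, and the benchmarked versions below may differ in minor details):

    void aussie_matmul_vector_unrolled4_sketch(const ymatrix m, const float v[], int n, float vout[])
    {
        // Matrix-by-vector with the inner dot product loop unrolled by 4 (sketch)
        yassert(n % 4 == 0);
        for (int row = 0; row < n; row++) {
                const float* rowvec = &m[row][0];
                float sum = 0.0f;
                for (int col = 0; col < n; col += 4) {
                        sum += (rowvec[col] * v[col])
                                + (rowvec[col + 1] * v[col + 1])
                                + (rowvec[col + 2] * v[col + 2])
                                + (rowvec[col + 3] * v[col + 3])
                                ;
                }
                vout[row] = sum;
        }
    }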

I tried coding several more of these optimizations and here are the benchmarks:

    Matrix-Vector multiplication (MatMulVec) benchmarks (N=2048, ITER=300):
    Matrix-vector nested loops: 3480 ticks (3.48 seconds)
    Matrix-vector nested loops hoisted: 3489 ticks (3.49 seconds)
    Matrix-vector nested ptr-arith: 3415 ticks (3.42 seconds)
    Matrix-vector unrolled inner (4): 1166 ticks (1.17 seconds)
    Matrix-vector unrolled inner (8): 938 ticks (0.94 seconds)
    Matrix-vector nested tiled 2x2: 1995 ticks (2.00 seconds)
    Matrix-vector vecdot AVX1 DP: 1414 ticks (1.41 seconds)
    Matrix-vector vecdot AVX2 FMA: 929 ticks (0.93 seconds)

Interestingly, code hoisting and loop pointer arithmetic were a waste of effort. Loop tiling did better than the original, but its speedup probably comes primarily from the effect of loop unrolling rather than data locality or cache hit rates, since simpler loop unrolling did better. Note that the AVX1 version used the “dot product” intrinsic but AVX2 used the FMA intrinsic, for reasons covered in Chapter 17. Simple loop unrolling also did as well as AVX2 hardware vectorization, probably because the AVX1 and AVX2 versions were simply calling the vector dot product functions, so they still had function call overhead. Hence, this algorithm can be further optimized by inlining to remove the AVX function call overhead, combining AVX intrinsics with unrolling of the inner loop, and then some minor final tweaks such as pointer arithmetic.

Tiled Matrix-Vector Multiplication

A more detailed analysis of the matrix-vector algorithm shows that it is not optimal in at least three areas:

  • Data locality
  • Pipelining AVX intrinsic arithmetic
  • Redundant loads

The data locality of the 2x2 tiled version is better, but more improvement is possible, starting with the use of AVX intrinsics inside the “sub-kernel” for the tile. The AVX instruction sequences of “load, calculate, store” in the earlier non-tiled AVX-optimized versions do not allow the natural instruction pipelining of the AVX intrinsics to calculate multiple sums or FMA operations with near-parallel pipelining. And the entire input vector is getting re-loaded repeatedly for every row of the matrix. So, we need to examine improvements in all three areas.

A tiled sub-kernel is the main way to fix data locality and pipelining. Improving data locality is somewhat inherent to tiling. The pipelining can be improved by unrolling the tiled sub-kernel and reordering the loads and stores so they don't block the arithmetic of AVX intrinsics.

Can we avoid redundant vector loads? Since it's unavoidable to access every element of every row at least once, the redundant loads of the vector suggest that we should modify the algorithm so as to work on a subsection of the vector for each of the matrix rows. This suggests an interchange of the main nested loops of the algorithm. However, that runs into a major problem: it destroys cache locality, because it scans down the columns of the matrix. I benchmarked this loop interchange idea, and it actually increased execution time. Maybe we should use the transpose of the matrix, so that it's in column-major order when scanning its columns? No, that's actually just going back to the original algorithm without the loop interchange.

Anyway, a better plan seems to be to reduce the redundant loading by using temporary calculations inside the tile sub-kernel. Here is what a basic tiled/blocked algorithm using 2x2 tiles looks like in plain sequential C++:

    void aussie_matmul_vector_tiled_2x2_better(const ymatrix m, const float v[], int n, float vout[])
    {
        // Tiled/blocked matrix-by-vector using 2x2 tiling.. (faster sub-kernel)
        yassert(n % 2 == 0);
        for (int row = 0; row < n; row += 2) {
                vout[row] = 0.0f;
                vout[row + 1] = 0.0f;
                for (int col = 0; col < n; col += 2) {
                        vout[row] += (m[row][col] * v[col])  // row+0, col + 0
                            + (m[row][col + 1] * v[col + 1]) // row+0, col + 1
                                ;
                        vout[row + 1] += (m[row + 1][col] * v[col])   // row+1, col + 0
                                + (m[row + 1][col + 1] * v[col + 1])  // row+1, col + 1
                                ; 
                }
        }
    }

One minor improvement would be to use memset to clear the whole output vector to zero, rather than individual assignments, which I added to the 4x4 tiled version. Another minor improvement is removing the “common sub-expressions” of v[col] and v[col+1]; I tried this with no improvement noted in the 2x2 tiled version, but about a 10% improvement in the 4x4 tiled version. The computations of m[row] and m[row+1], etc., can also be hoisted out of the inner loop, giving another 10% gain for the 4x4 tiled version. The C++ code for the 4x4 tiled version with a fully unrolled 4x4 sub-kernel now looks like:

    void aussie_matmul_vector_tiled_4x4_CSE2(const ymatrix m, const float v[], int n, float vout[])
    {
        // Tiled/blocked matrix-by-vector using 4x4 tiling..
        yassert(n % 4 == 0);
        memset(vout, 0, sizeof(float) * n);
        for (int row = 0; row < n; row += 4) {
                const float* rowvec = &m[row][0];
                const float* rowvec1 = &m[row + 1][0];
                const float* rowvec2 = &m[row + 2][0];
                const float* rowvec3 = &m[row + 3][0];
                for (int col = 0; col < n; col += 4) {
                        float fcol0 = v[col];
                        float fcol1 = v[col + 1];
                        float fcol2 = v[col + 2];
                        float fcol3 = v[col + 3];
                        vout[row] +=
                                (rowvec[col] * fcol0) // row+0, col + 0
                                + (rowvec[col + 1] * fcol1) // row+0, col + 1
                                + (rowvec[col + 2] * fcol2) // row+0, col + 2
                                + (rowvec[col + 3] * fcol3) // row+0, col + 3
                                ;
                        vout[row + 1] +=
                                (rowvec1[col] * fcol0) // row+1, col + 0
                                + (rowvec1[col + 1] * fcol1) // row+1, col + 1
                                + (rowvec1[col + 2] * fcol2) // row+1, col + 2
                                + (rowvec1[col + 3] * fcol3) // row+1, col + 3
                                ;
                        vout[row + 2] +=
                                (rowvec2[col] * fcol0) // row+2, col + 0
                                + (rowvec2[col + 1] * fcol1) // row+2, col + 1
                                + (rowvec2[col + 2] * fcol2) // row+2, col + 2
                                + (rowvec2[col + 3] * fcol3) // row+2, col + 3
                                ;
                        vout[row + 3] +=
                                (rowvec3[col] * fcol0) // row+3, col + 0
                                + (rowvec3[col + 1] * fcol1) // row+3, col + 1
                                + (rowvec3[col + 2] * fcol2) // row+3, col + 2
                                + (rowvec3[col + 3] * fcol3) // row+3, col + 3
                                ;
                }
        }
    }

Rectangular Matrix-Vector Multiplication

The general case of a rectangular matrix multiplied by a vector is a little trickier, but not a lot. If our matrix is MxN and the vector is size N, then the output vector has size M. Note that two of the dimensions must match: the columns of the matrix and the elements of the input vector are both N. However, this dimension N “disappears” and the size of the output vector depends only on M. The pseudocode:

    MAT[M][N] * VIN[N] -> VOUT[M]

The rectangular matrix-vector multiplication is almost identical to square matrix-vector computations. Each element of the output vector is a dot product of a matrix row with the input vector. Again, we note that the dimensions of the matrix rows (N) must match the size of the input vector (N), or else we cannot compute it. I mean, we could still compute it with mismatched dimensions, such as by assuming that the shorter one (matrix row or input vector) had zeros in the missing elements, but that sounds a little buggy.
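
For completeness, here is a sketch of the rectangular matrix-vector case using flat row-major arrays (the signature is hypothetical, since the square ymatrix type used elsewhere in this chapter fixes both dimensions):

    void aussie_matmul_vector_rect_sketch(const float* m, const float v[], int rows, int cols, float vout[])
    {
        // Rectangular matrix-by-vector (sketch): m is rows x cols in row-major order,
        // v has cols elements, and vout has rows elements.
        for (int row = 0; row < rows; row++) {
                const float* rowvec = m + (row * cols);  // start of this matrix row
                float sum = 0.0f;
                for (int col = 0; col < cols; col++) {
                        sum += rowvec[col] * v[col];
                }
                vout[row] = sum;
        }
    }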

Matrix-Matrix Multiplication

Now let's look at matrix-matrix multiplication, whereas above we looked at matrix-vector multiplication. The proper MatMul and GEMM kernels are coded for full matrix-matrix multiplication.

Matrix multiplication results in another matrix as the output. For the simple case of two square matrices of the same size, the resulting output matrix is also of the same dimensions. In pseudocode:

    M1[N][N] * M2[N][N] -> MOUT[N][N]

Algorithmic Complexity. The naive implementation of a matrix-matrix multiplication via three nested loops is a cubic algorithm, with O(N^3) complexity. The well-known Strassen algorithm has complexity of approximately O(N^2.81), which looks like a significant improvement on paper. Other algorithms such as the Coppersmith-Winograd algorithm and numerous sub-variants have better asymptotic complexity, but with a high constant overhead, making them impractical for anything but very large values of N.

Basic Matrix-Matrix Multiplication. The basic algorithm for matrix multiplication is three nested loops. There is nothing fancy here: this is just coding up the basic matrix multiplication method that you forgot the second you finished your Senior math exam. If you don't believe me, check it out on Wikipedia. Here's the C++ code:

    void aussie_matmul_matrix_basic(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // Matrix-Matrix multiplication basic naive n^3 algorithm...
        for (int row = 0; row < n; row++) {
                for (int col = 0; col < n; col++) {
                        float sum = 0.0f;
                        for (int k = 0; k < n; k++) {
                                sum += (m1[row][k] * m2[k][col]);
                        }
                        mout[row][col] = sum;
                }
        }
    }

The two outer loops are scanning the rows of the first matrix, and the columns of the second matrix. The innermost of the three loops is doing a vector dot product computation over the “k” index variable. However, it's not a normal vector-vector dot product. Instead, it's the dot product of one “horizontal” vector, which is a row of the first matrix, and of a second “vertical” vector, which is a column of the second matrix. Hence, the number of columns in the first matrix must equal the number of rows in the second matrix, which is true here because we're assuming that both matrices are square. Hence, the “k” variable is spinning down the n elements of a row and a column at the same time. Every element of the NxN output matrix requires a vector dot product calculation like this.

Vectorization. From the point-of-view of AI engines, none of these matrix multiplication algorithms are especially good, because they are all sequential, rather than parallel, algorithms. Neither the naive cubic version nor the Strassen algorithm is what we need. What we need for GPUs and CPU SIMD intrinsics are vectorizable algorithms for matrix-matrix multiplication. Unfortunately, the simple triple-nested matrix multiplication algorithm above is not one of them, because the non-contiguous storage of the second matrix hampers vectorization.

Memory layout problems for matrix-matrix multiplication: The layout of memory for matrix-matrix multiplications is not as fortuitous as it was for matrix-vector multiplications. Each computation in matrix-matrix multiplication is a vector dot product of a row of the first matrix with a column of the second matrix. Each row of the first matrix is happily stored in contiguous memory, but the columns of the second matrix are not. In fact, the “stride” between two elements of a column of a matrix is a very large number of bytes in the default memory layout.

The default storage of matrices and two-dimensional arrays in C++ is called “row-major” storage layout. Row-major storage has each row in contiguous memory. The rows are stored one at a time, top to bottom, and adjacent elements in a row are also adjacent memory addresses. Columns are a second-class citizen in row-major layout, and you have to jump around to find adjacent elements of a column vector.
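
As a concrete illustration of row-major layout, assuming a plain two-dimensional float array:

    static float m[1024][1024];  // row-major: 1024 rows, each row stored contiguously

    // Adjacent elements within a row are adjacent addresses (4 bytes apart for float):
    //     &m[i][j + 1] == &m[i][j] + 1
    // Adjacent elements within a column are a whole row apart (1024 floats, i.e. 4096 bytes):
    //     &m[i + 1][j] == &m[i][j] + 1024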

The alternative storage method is “column-major” storage layout where the columns are stored in contiguous memory, and it's the rows that are in the smoker's carriage at the back of the train. However, column-major is not the default C++ storage mode.

Hence, to vectorize a matrix-matrix multiplication, we want to keep the first matrix in row-major storage, but we need to rearrange the storage of the second matrix to be column-major storage, rather than the default row-major storage. Column-major storage would help vectorize the columns with each column element in adjacent memory locations. The first matrix is fine, but we want the second matrix to be stored in a mirror image of itself.

Hmm, a mirror and a matrix. What does that sound like? A transposed matrix.

Pseudo-Transposed Second Matrix. The simplest way to get column-major order of a matrix (especially if square) is to use the transpose of the matrix, and modify the internals of the matrix multiplication function to pretend that the transpose is actually the column-major storage of the original second matrix. I call it the “fake transpose” method, which is a bit of a misnomer because it is the actual transposed matrix, but we modify the matrix multiplication code to access it with reversed logic indices.

Confusing? Yes, I felt the same way, but if you follow it through carefully, you can see that the transpose is really very similar to storing the original matrix in column-major order, where each column element is stored in adjacent memory. The columns of the original problematic matrix become fake rows in the fake transpose, stored in sequential memory addresses. So, for square matrices, we can take the transpose of a matrix, and it's like the matrix has been converted into column major storage. However, we also need to change the C++ code in the matrix multiplication kernel, because it assumes row-major order storage of both matrices, but now we've got row-major storage only for the first matrix, and column-major storage for the second one (our fake transpose).

The main point of optimization with a transpose is that the column becomes a contiguous vector from a row in the transposed matrix. Here's what the matrix multiplication algorithm looks like when it's working on a “fake” transpose:

    void aussie_matmul_matrix_fake_transpose(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // Matrix-Matrix naive n^3 algorithm on a TRANSPOSE...
        for (int row = 0; row < n; row++) {
                const float* rowvec = &m1[row][0];
                for (int col = 0; col < n; col++) {
                        float sum = 0.0f;
                        const float* colvec = &m2[col][0];  // Row!
                        for (int k = 0; k < n; k++) {
                                sum += (rowvec[k] * colvec[k]);
                        }
                        mout[row][col] = sum;
                }
        }
    }

Note that the above code assumes the transpose has already been computed. However, it is viable to compute a new transpose matrix in a preliminary step and still be faster, because transposing a matrix only adds an extra O(N^2) time to compute the transpose (and N^2 storage space to store it temporarily), whereas the main matrix multiplication is O(N^3) time.
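
For example, a simple wrapper could transpose the second matrix into scratch storage and then call the kernel above (a sketch; the wrapper name and the static scratch matrix are my own assumptions, and aussie_matrix_transpose_basic is shown later in this chapter):

    void aussie_matmul_matrix_via_transpose_sketch(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // Transpose the second matrix first (O(n^2)), then run the O(n^3) kernel
        static ymatrix m2transposed;  // static scratch to avoid a huge stack frame (not thread-safe)
        aussie_matrix_transpose_basic(m2, n, m2transposed);
        aussie_matmul_matrix_fake_transpose(m1, m2transposed, n, mout);
    }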

Perhaps surprisingly, this transpose method is much faster even without any vectorization. Because the column vectors are accessed in sequential order from contiguous memory, there is much better data locality for the memory cache, and also for any predictive pipelining happening in the cache. Here's the benchmark comparison:

    Matrix-Matrix multiplication (MatMul) benchmarks (N=2048, ITER=1):
    Matrix-matrix mult basic: 69479 ticks (69.48 seconds)
    Matrix-matrix fake transpose: 47469 ticks (47.47 seconds)

The transpose method is 31% faster with an unchanged basic MatMul algorithm. And all we did was permute two indices in a two-dimensional array. This code does exactly the same arithmetic computations as the naive version, but accesses memory in a different order, giving us a cache speedup.

There are various other small coding optimizations that can improve the transposed MatMul method further. The loop body could be partially unrolled by 4 or 8 iterations (or more). Here's the C++ code of the version with an unrolling factor of 8 iterations:

    void aussie_matmul_matrix_fake_transpose_unrolled8(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // Transpose Matrix-Matrix multiplication with 8 iteration unrolling...
        yassert(n % 8 == 0);
        for (int row = 0; row < n; row++) {
                const float* rowvec = &m1[row][0];
                for (int col = 0; col < n; col++) {
                        float sum = 0.0f;
                        const float* colvec = &m2[col][0];
                        for (int k = 0; k < n; k += 8) {
                                sum += (rowvec[k] * colvec[k])
                                        + (rowvec[k + 1] * colvec[k + 1])
                                        + (rowvec[k + 2] * colvec[k + 2])
                                        + (rowvec[k + 3] * colvec[k + 3])
                                        + (rowvec[k + 4] * colvec[k + 4])
                                        + (rowvec[k + 5] * colvec[k + 5])
                                        + (rowvec[k + 6] * colvec[k + 6])
                                        + (rowvec[k + 7] * colvec[k + 7])
                                        ;
                        }
                        mout[row][col] = sum;
                }
        }
    }

Here are the benchmark results:

    Matrix-Matrix multiplication (MatMul) benchmarks (N=2048, ITER=1):
    Matrix-matrix fake transpose unrolled 4: 15221 ticks (15.22 seconds)
    Matrix-matrix fake transpose unrolled 8: 12151 ticks (12.15 seconds)

Further tweaks are possible. The internal loop could be fully unrolled for a known vector size. Also, the initialization “sum=0.0f” could be removed by peeling the first iteration of the inner loop and starting the loop one unrolled block later (at “k=8” in the 8-way unrolled version). Pointer arithmetic could be used to avoid loop indices and the double bracket accesses. However, these are small fry, and we're now on the hunt for the Spanish mackerel of MatMul optimizations: vectorization.
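
For instance, the inner dot product loop of the 8-way unrolled version could be rewritten with pointer arithmetic roughly like this fragment (variable names are illustrative):

    const float* pr = rowvec;
    const float* pc = colvec;
    const float* pend = rowvec + n;
    float sum = 0.0f;
    for ( ; pr != pend; pr += 8, pc += 8) {
            sum += (pr[0] * pc[0]) + (pr[1] * pc[1])
                    + (pr[2] * pc[2]) + (pr[3] * pc[3])
                    + (pr[4] * pc[4]) + (pr[5] * pc[5])
                    + (pr[6] * pc[6]) + (pr[7] * pc[7]);
    }
    mout[row][col] = sum;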

Vectorized MatMul

Cache speedup is not the only benefit of the transpose method. Once we have column-major storage for the second matrix, then both the rows of the first matrix, and the columns of the second matrix are in contiguous memory. The computation is a normal vector dot product again on two vectors stored as arrays in memory (i.e. “rowvec” and “colvec” in the C++ code above). Hence, we can use all of our standard vector dot product speedups again, including vectorization and hardware acceleration.

As an example, here's the AVX-2 vectorization of the transpose method using the FMA 256-bit intrinsics to do the vector dot product in parallel (see Chapter 17 for this AVX vector dot product code). This parallelizes the dot product by 8 elements at a time:

    void aussie_matmul_matrix_fake_transpose_vecdot_AVX2(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // AVX2 Matrix-Matrix multiplication 
        yassert(n % 8 == 0);
        for (int row = 0; row < n; row++) {
                const float* rowvec = &m1[row][0];
                for (int col = 0; col < n; col++) {
                        const float* colvec = &m2[col][0];
                        mout[row][col] = aussie_vecdot_FMA_unroll_AVX2(rowvec, colvec, n);
                }
        }
    }

Here are the benchmark results:

    Matrix-Matrix multiplication (MatMul) benchmarks (N=2048, ITER=1):
    Matrix-matrix fake transpose AVX1: 19522 ticks (19.52 seconds)
    Matrix-matrix fake transpose AVX2: 12747 ticks (12.75 seconds)

If anything, these AVX results are disappointing. Basic loop unrolling techniques (in the prior section) did better than AVX1 and the same as AVX2 vectorization. However, we haven't used AVX optimally inside the sequential code here. The AVX intrinsic calls should be moved up into the loop body without any function call overhead (i.e. inlining the function manually). I coded up that idea, and it made almost zero difference! I guess the C++ compiler is already inlining it, or function call overhead is a tiny percentage.
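
For reference, a manually inlined version might look roughly like the sketch below, with the AVX2 FMA intrinsics placed directly in the MatMul loop body (a hypothetical reconstruction, not necessarily the exact code that was benchmarked):

    #include <immintrin.h>  // AVX2/FMA intrinsics

    void aussie_matmul_matrix_fake_transpose_AVX2_inlined_sketch(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // AVX2 FMA dot product inlined directly into the MatMul loop body (sketch)
        yassert(n % 8 == 0);
        for (int row = 0; row < n; row++) {
                const float* rowvec = &m1[row][0];
                for (int col = 0; col < n; col++) {
                        const float* colvec = &m2[col][0];  // a row of the (fake) transpose
                        __m256 sums = _mm256_setzero_ps();
                        for (int k = 0; k < n; k += 8) {
                                __m256 a = _mm256_loadu_ps(rowvec + k);
                                __m256 b = _mm256_loadu_ps(colvec + k);
                                sums = _mm256_fmadd_ps(a, b, sums);  // sums += a * b, 8 lanes at once
                        }
                        float partial[8];
                        _mm256_storeu_ps(partial, sums);  // spill the 8 lanes for a horizontal sum
                        mout[row][col] = partial[0] + partial[1] + partial[2] + partial[3]
                                + partial[4] + partial[5] + partial[6] + partial[7];
                }
        }
    }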

Further parallelization speedups would include using AVX-512 or AVX-10 intrinsics for vectorizing 16 elements in parallel. Also desirable are various further optimizations of the sequential code around any AVX intrinsics. The inner “col” loop could be fully or partially unrolled with multiple AVX sequences and/or optimized with pointer arithmetic.

Loop Tiled/Blocked MatMul

The triple-nested MatMul version with the vectorized inner loop is still nowhere near what is possible. There are three more ways to increase throughput:

  • Data locality within the matrices.
  • Pipelining of the SIMD instructions.
  • Avoiding repeated loads of the same data.

The data locality of the basic AVX transposed MatMul algorithm is still far from optimal, although we fixed the most egregious problem by using the transpose. The algorithm is simply scanning down all of the dimensions, without really any attempt to maintain data locality.

The method of calling AVX intrinsics is simply doing “load, FMA, store” repeatedly along blocks of 4 or 8 elements, which does not allow for the natural pipelining of the FMA instructions. The loads and stores are interrupting the flow of computation.

Thirdly, if you look carefully at the “load” operations that are happening in the sequence, you realize that it is repeatedly loading the same regions of the matrices.

Tiling or blocking the MatMul loops is far more effective. The basic idea is that instead of scanning sequentially, we process smaller square or rectangular “tiles” or “blocks” of the data, one at a time. Refer to Chapter 15 for the basic idea of tiling/blocking optimizations of nested loops. Data locality is the main aim of a tiled algorithm, but it also helps us achieve better pipelining of SIMD instructions, because we can load all the data in, and then perform multiple arithmetic operations on it without any intervening loads or stores. And since a tiled MatMul is iterating more carefully over smaller blocks of data within the matrices, there's also less redundant loading of the data overall.
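
To make the idea concrete, here is a sketch of a loop-tiled MatMul in plain sequential C++ (the function name and the tile size of 32 are my own choices; a production kernel would also use AVX intrinsics inside the sub-kernel):

    void aussie_matmul_matrix_tiled_sketch(const ymatrix m1, const ymatrix m2, int n, ymatrix mout)
    {
        // Tiled/blocked MatMul: process TILE x TILE blocks so the data stays hot in cache (sketch)
        const int TILE = 32;  // tile size; tune for the cache sizes of the target CPU
        yassert(n % TILE == 0);
        for (int row = 0; row < n; row++) {
                memset(&mout[row][0], 0, sizeof(float) * n);  // clear the output matrix
        }
        for (int ii = 0; ii < n; ii += TILE) {
                for (int kk = 0; kk < n; kk += TILE) {
                        for (int jj = 0; jj < n; jj += TILE) {
                                // Sub-kernel: accumulate one TILE x TILE block of the product
                                for (int i = ii; i < ii + TILE; i++) {
                                        for (int k = kk; k < kk + TILE; k++) {
                                                float a = m1[i][k];  // reused across the whole j loop
                                                for (int j = jj; j < jj + TILE; j++) {
                                                        mout[i][j] += a * m2[k][j];
                                                }
                                        }
                                }
                        }
                }
        }
    }

Note that this version works directly on the row-major second matrix, because the innermost loop scans along a row of m2 contiguously; the tiling keeps the blocks of both matrices and the output in cache while they are being reused.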

Multiplying Rectangular Matrices

AI Engines don't always use square matrices, and in fact, several of the main matrices have rectangular dimensions. For example, the QKV matrices in the attention algorithm have two different dimensions.

For multiplying two rectangular matrices, of sizes MxN and NxP, we get an output matrix of size MxP (i.e. the inner N dimensions disappear). In pseudocode style:

    M1[M][N] * M2[N][P] -> MOUT[M][P]

Note that P=1 is the case of matrix-vector multiplication, because an Nx1 matrix is actually a vector with N rows of a single element (i.e., one column).
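
Here is a minimal sketch of the rectangular MatMul on flat row-major arrays (again a hypothetical signature, since the ymatrix type in this chapter is square):

    void aussie_matmul_matrix_rect_sketch(const float* m1, const float* m2, int M, int N, int P, float* mout)
    {
        // Rectangular MatMul (sketch): m1 is MxN, m2 is NxP, mout is MxP, all row-major
        for (int i = 0; i < M; i++) {
                for (int j = 0; j < P; j++) {
                        float sum = 0.0f;
                        for (int k = 0; k < N; k++) {
                                sum += m1[i * N + k] * m2[k * P + j];
                        }
                        mout[i * P + j] = sum;
                }
        }
    }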

Fast Matrix Multiplication Theory

The main techniques for faster matrix multiplication of general matrices include:

  • Strassen's algorithm
  • Winograd's algorithm
  • Fast Fourier Transform (FFT) methods

Matrix multiplications can also be sped up by restricting our AI engine to only use matrices that are of special types:

  • Low-rank matrix factorization
  • Sparse matrices
  • Special matrix methods (e.g. Butterfly matrices, Monarch matrices, etc.)

Each of these specialized matrix types can have a faster matrix multiplication kernel than using the all-purpose GEMM kernel. For example, sparse matrices can be stored in a compacted permuted-tuple format, with parallelization of permutation arrays for computation.

Multiplying by Transpose

The transpose of a matrix is used in matrix multiplications in some parts of the AI Engine. For example, this occurs with the QKV matrix computations inside the attention heads, where the transpose of K is used, usually denoted as KT in the algebraic formula.

Note that this is the actual algebraic use of the real transpose, as opposed to the idea of using a “fake transpose” to get column-major storage of matrices for easier vectorization. The code to compute the transpose of a matrix is shown below for a square matrix:

    void aussie_matrix_transpose_basic(const ymatrix m1, int n, ymatrix transpose)
    {
        // Transpose: put the transposed matrix into the output matrix (square matrix)
        for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                        transpose[j][i] = m1[i][j];
                }
        }
    }

The funny thing is that if we want to multiply a “real” transpose as the second matrix in some computation, then the original non-transposed matrix is the “fake transpose” of the “real” transpose. How awkward! But it's actually good, because we usually already have the original matrix in memory, and we don't even need to compute the (real) transpose. Instead, to do a MatMul of a matrix with this real transpose, we can instead use the original matrix as the second operand in the kernel that is based on the column-major storage of a fake transpose. Oh, dear, I feel like it's all circular and I'm digging myself into a word pit here! But it all works out in the end, and it's fast, which is really the one and only thing.
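
In code terms, the observation boils down to a one-liner: the fake-transpose kernel above already computes its first operand times the transpose of its second operand, so passing the original (untransposed) matrix gives the product with its real transpose. A usage sketch, where Q, K and scores are hypothetical ymatrix variables:

    // To compute Q times K-transpose (as in the attention heads), there is no need
    // to materialize the transpose of K: the fake-transpose kernel reads its second
    // operand row-by-row, which is exactly the column order of K-transpose.
    aussie_matmul_matrix_fake_transpose(Q, K, n, scores);  // scores = Q * K^T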

Approximate Matrix Multiplication

Approximate Matrix Multiplication (AMM) refers to a variety of complicated model optimization techniques that replace matrix multiplications with various approximations that avoid the cost of arithmetic multiplication, trading off some accuracy. These methods are usually distinct from quantization methods, are not specific to certain subclasses of matrices, and involve more advanced mathematics in the theory of matrices.

Note that these algorithms apply at the high level of how matrices are multiplied with other matrices or with vectors (e.g. avoiding some vector dot products), whereas there are also low-level optimizations of the arithmetic operation of multiplying two numbers (see advanced mathematics in Chapter 53). These two classes of approximation research are not the same, and are actually orthogonal to each other.

References on MatMul

  1. Ulrich Drepper (2007), What Every Programmer Should Know About Memory, November 21, 2007, http://people.redhat.com/drepper/cpumemory.pdf
  2. Kazushige Goto (2008), Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, Volume 34, Issue 3, Article No.: 12, May 2008, pp 1–25, https://doi.org/10.1145/1356052.1356053, PDF: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
  3. Harald Prokop (1999), Cache-Oblivious Algorithms, Masters Thesis, MIT, June 1999, http://supertech.csail.mit.edu/papers/Prokop99.pdf
  4. Intel (2023), Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1, August 2023, 248966-Software-Optimization-Manual-V1-048.pdf
  5. Agner Fog (2022), Vector Class Library (VCL), https://www.agner.org/optimize/vcl_manual.pdf
  6. Sergey Slotin (2022), Matrix Multiplication, Algorithmica, https://en.algorithmica.org/hpc/algorithms/matmul/ Code: https://github.com/algorithmica-org/algorithmica/blob/master/content/english/hpc/algorithms/matmul.md

 

Next: Chapter 35. Lookup Tables and Precomputation

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs
