Aussie AI

Optimizing Matrix Multiplication Algorithms and MatMul/GEMM Kernels

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

Matrix Multiplication Research

With papers on this topic since the 1970s, you'd think we'd be done optimizing matrix multiplication code by now. Although the earliest algorithms were sequential, even the parallel algorithms for matrix multiplication have decades of theory behind them. And yet there is a continual stream of research papers on MatMul/GEMM optimizations.

Feel free to blame the AI bubble for this. AI engines are built on matrix multiplication, and it occurs in all of the standard LLM components, including the attention and FFN blocks.

Note that attention and FFN blocks occur in every layer of an LLM.

There's quite a lot of work involved in optimizing attention and FFNs. The reason for the ongoing research is that there are many nuances and special cases. Some examples:

  • Sparse versus dense matrices (i.e. SpMM versus GEMM algorithms).
  • Matrix-matrix versus matrix-vector multiplications.
  • Rectangular versus square matrices, including tall or flat rectangular matrices.
  • Special shapes of matrices (e.g. triangular).

Optimizations to the matrix processing code must also deal with:

  • Compute-bound versus memory-bound algorithms.
  • Different computational versus memory costs in different hardware architectures (e.g., CPUs, GPUs, NPUs).
  • Generalization from 2D matrices to 3D tensors (i.e., "tensor slices").
  • Integer versus floating-point arithmetic.
  • Quantization to fewer bits, such as 8-bit integer or 4-bit kernels (also reduced-bit floating-point FP8 and even FP4 kernels).
  • Permutation-based vector/matrix/tensor algorithms.
  • Sparse matrix storage compression formats (there are various in-memory methods).
  • Non-negative matrices (e.g., activations arising after RELU, or fixed non-negatives from scaling the weight matrices).

There are also some features specific to AI engine algorithms and their use of matrices:

  • The prefill phase (prompt processing) is compute-bound. In the decoding phase, the FFNs are compute-bound because their weight matrices are static, whereas the QKV attention computations are memory-bound because the KV cache is not static data.
  • FFNs involve two matrix multiplications separated by an element-wise activation function, so kernel fusion is possible (see the sketch after this list).
  • Attention uses a particular QKV algorithm, for which there are many specialty kernels: e.g., Grouped Query Attention (GQA), Multi-Query Attention (MQA), Flash Attention, Paged Attention, etc.
  • Transposed matrices are used in the QKV algorithm, adding another wrinkle.
  • Embedding and unembedding matrices are used to convert between tokens and embedding vectors.
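
As mentioned above, an FFN block is two matrix multiplications separated by an element-wise activation. To make that structure concrete, here is a minimal C++ sketch with the activation fused into the first matmul loop. The function name, the dimensions, and the choice of RELU are illustrative assumptions for this sketch, not taken from any particular engine.

    #include <vector>
    #include <algorithm>

    // Minimal FFN block sketch: y = W2 * RELU(W1 * x), with the RELU fused into
    // the first matmul loop so the intermediate vector is written only once.
    std::vector<float> ffn_forward(const std::vector<float>& W1, // [hidden x dim], row-major
                                   const std::vector<float>& W2, // [dim x hidden], row-major
                                   const std::vector<float>& x,  // [dim]
                                   int dim, int hidden) {
        std::vector<float> h(hidden);
        for (int i = 0; i < hidden; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < dim; ++j)
                sum += W1[i * dim + j] * x[j];
            h[i] = std::max(sum, 0.0f);   // fused RELU activation
        }
        std::vector<float> y(dim);
        for (int i = 0; i < dim; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < hidden; ++j)
                sum += W2[i * hidden + j] * h[j];
            y[i] = sum;
        }
        return y;
    }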

And there are some theoretical alternative methods of handling matrix multiplications:

  • Low-rank matrix factorization algorithms (e.g., SVD).
  • Approximate matrix multiplication.
  • Element-wise multiplication (Hadamard product).

Since many of these techniques are orthogonal, you can see the combinatorial growth of the various sub-cases. We might be reading GEMM papers for a while!

Optimizing Matrix Multiplication

Matrix multiplications are the basis of neural network processing. The main bottleneck is the multiplication operation deep inside the nested loops of the matrix multiplier. The main optimization in modern times is hardware acceleration, running many of those computations in parallel.

The main techniques to use in faster MatMul/GEMM kernels include:

  • GPU versions of kernels
  • Advanced theoretical matrix multiplication algorithms (e.g. Strassen's)
  • Approximate matrix multiplication

Researchers have been optimizing matrix multiplication for decades, so you might think it's all done and dusted. Well, firstly, a lot of that older research covers sequential algorithms rather than parallel ones. Furthermore, most of that research focused on reducing the number of multiplications (i.e., FLOPs), whereas the cost of memory access is a greater issue for very large matrices. Hence, there's a steady stream of matrix multiplication optimization papers, and no end in sight. Some areas of focus for this AI-related matrix multiplication research include:

Several of the components of a basic AI Transformer engine (for LLMs) use matrix multiplications in a specialized way. See also the more specific optimizations of these components:

Naive Matrix Multiplication Algorithm

The simplest algorithm to compute matrix multiplication is a cubic-complexity algorithm, which involves three nested loops. There are various papers with examples:
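
For concreteness, a minimal C++ version of this naive three-loop algorithm is sketched below, assuming square n x n matrices in row-major order (the function and variable names are illustrative):

    // Naive O(n^3) matrix multiplication: C = A * B for n x n row-major matrices.
    void matmul_naive(const float* A, const float* B, float* C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];  // innermost multiply-add is the bottleneck
                C[i * n + j] = sum;
            }
        }
    }

Note that the inner loop strides down a column of B, which is non-contiguous in row-major storage; this memory access pattern is what the tiling and transpose optimizations discussed below aim to fix.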

Survey Papers on Matrix Multiplication Kernels (MatMul/GEMM)

Survey papers on matrix multiplication include:

Surveys on Sparse Matrix Multiplication

Review papers on sparse matrix multiplication (SpMM):

Sparse Matrix Kernels

Sparse matrices are those with lots of zeros and few non-zero values. There are various ways to optimize MatMul/GEMM for sparsity.
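
As one illustration, a sparse matrix stored in Compressed Sparse Row (CSR) format keeps only the non-zero values, so a kernel can skip all the zero multiplications entirely. Below is a minimal C++ sketch of a CSR sparse matrix-vector product; the CsrMatrix struct is an assumed layout for this example, not any particular library's format.

    #include <vector>

    // Compressed Sparse Row (CSR) storage: only the non-zero values are kept.
    struct CsrMatrix {
        int rows;
        std::vector<int> row_ptr;    // size rows+1: start of each row in values/col_idx
        std::vector<int> col_idx;    // column index of each non-zero value
        std::vector<float> values;   // the non-zero values themselves
    };

    // Sparse matrix-vector product y = A * x, skipping all zero entries.
    std::vector<float> spmv_csr(const CsrMatrix& A, const std::vector<float>& x) {
        std::vector<float> y(A.rows, 0.0f);
        for (int i = 0; i < A.rows; ++i) {
            float sum = 0.0f;
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                sum += A.values[k] * x[A.col_idx[k]];
            y[i] = sum;
        }
        return y;
    }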

Tiled MatMul/GEMM
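
Tiling (also called blocking) reorganizes the matmul loops so that small sub-blocks of the matrices are reused while they are still resident in cache, which mainly helps the memory-bound cases. A simple C++ sketch is below; the tile size of 32 is an arbitrary example value that would normally be tuned for the target hardware.

    #include <algorithm>

    // Tiled (blocked) matrix multiplication: C += A * B for n x n row-major matrices.
    // The caller must zero-initialize C. The tile size is an illustrative value only.
    void matmul_tiled(const float* A, const float* B, float* C, int n) {
        const int TILE = 32;  // example tile size; tune for the target cache sizes
        for (int ii = 0; ii < n; ii += TILE) {
            for (int kk = 0; kk < n; kk += TILE) {
                for (int jj = 0; jj < n; jj += TILE) {
                    int imax = std::min(ii + TILE, n);
                    int kmax = std::min(kk + TILE, n);
                    int jmax = std::min(jj + TILE, n);
                    for (int i = ii; i < imax; ++i) {
                        for (int k = kk; k < kmax; ++k) {
                            float a = A[i * n + k];
                            for (int j = jj; j < jmax; ++j)
                                C[i * n + j] += a * B[k * n + j];  // contiguous access to B and C
                        }
                    }
                }
            }
        }
    }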

Research papers on tiled MatMul/GEMM optimizations:

CUDA Matrix Multiplication

There are various papers with examples of coding Matmul/GEMM in CUDA C++:

Lookup Table (LUT) for MatMul/GEMM

The use of table lookup can sometimes reduce or remove the number of multiplication operations. This method has been used in various MatMul/GEMM kernels, usually when they are quantized to smaller ranges than floating-point.
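
As a simplified illustration of the idea, suppose both the weights and activations are quantized to 4-bit codes (16 possible values each); then all 16x16 possible products can be precomputed once, and the inner loop of a dot product becomes table lookups and additions with no multiplications. This is only a sketch of the general approach under those assumptions, not a reproduction of any particular published LUT kernel (dequantization scales are omitted for brevity).

    #include <cstdint>
    #include <vector>

    // Sketch: dot product of two 4-bit quantized vectors via a precomputed lookup table.
    // Both vectors hold quantized codes in the range 0..15 and are assumed to have
    // equal length; per-channel dequantization scales are omitted for simplicity.
    int32_t dot_product_lut(const std::vector<uint8_t>& w_codes,
                            const std::vector<uint8_t>& x_codes) {
        // Precompute all 16 x 16 possible products (could be done once, offline).
        static int32_t lut[16][16];
        static bool initialized = false;
        if (!initialized) {
            for (int a = 0; a < 16; ++a)
                for (int b = 0; b < 16; ++b)
                    lut[a][b] = a * b;
            initialized = true;
        }
        // The inner loop is now lookups and additions, with no multiplications.
        int32_t sum = 0;
        for (size_t i = 0; i < w_codes.size(); ++i)
            sum += lut[w_codes[i]][x_codes[i]];
        return sum;
    }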

Research on LUT MatMul/GEMM algorithms:

Matrix-Vector Multiplication

Matrix-vector multiplication is also called Vector-Matrix Multiplication (VMM) or General Matrix-Vector multiplication (GEMV).
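
For reference, a minimal C++ matrix-vector product over a row-major matrix looks like the following; each output element is a vector dot product of one matrix row with the input vector (names are illustrative):

    #include <vector>

    // Matrix-vector product y = A * x for a (rows x cols) row-major matrix A.
    std::vector<float> gemv(const std::vector<float>& A, const std::vector<float>& x,
                            int rows, int cols) {
        std::vector<float> y(rows);
        for (int i = 0; i < rows; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < cols; ++j)
                sum += A[i * cols + j] * x[j];   // dot product of row i with x
            y[i] = sum;
        }
        return y;
    }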

Transpose Optimizations in Matrix Multiplication

One of the basic methods to optimize matrix multiplication is to operate on a matrix's transpose, because this leads to accesses of contiguous memory blocks, which are faster (i.e., it's a memory access speed optimization, not a reduction in computations). The issue is that matrices can be stored in row-major or column-major ordering, and the two matrices in a matrix-matrix computation are best accessed in opposite orderings. It can even be beneficial in some cases to compute the transpose on-the-fly (which has quadratic complexity), storing it as a temporary matrix, just to speed up the full matrix-matrix operation (which has cubic complexity). Hence, there are various optimizations, including:

  • Fast computation of the transpose.
  • Caching a pre-computed transpose.
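
As a simple C++ illustration, if the second matrix has already been transposed (so BT holds B with rows and columns swapped), then the inner loop reads both operands sequentially in memory, turning every output element into a scan of two contiguous rows:

    // Matrix multiplication C = A * B, where BT is the transpose of B (both row-major).
    // The inner loop now accesses both A and BT contiguously (row-wise).
    void matmul_transposed_b(const float* A, const float* BT, float* C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                const float* arow = &A[i * n];
                const float* brow = &BT[j * n];
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += arow[k] * brow[k];   // sequential access to both operands
                C[i * n + j] = sum;
            }
        }
    }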

Research papers on handling a transpose, or creating one, for faster MatMul include:

Matrix Multiplication Optimization in AI Models

Most AI inference algorithms are based on matrix multiplications, or similarly on vector dot products. The primary optimization is to parallelize matrix or vector operations using hardware acceleration in GPUs or other chips. However, several improvements to matrix and vector algorithms have also been found. Research papers specific to matrix multiplication include:

Fast Matrix Multiplication Computation Theory

The main techniques for faster multiplication of general matrices include:

  • Strassen's algorithm
  • Winograd's algorithm
  • Fast Fourier Transform (FFT) methods

Matrix multiplications can also be sped up by using specialized matrices:

General Matrix Multiplication Research

Papers on the low-level algorithms for optimizing matrix multiplication for general matrices (i.e. all matrices, not just a subtype):

Strassen Matrix Multiplication

The Strassen method is based on a way to multiply 2x2 matrices using seven arithmetic multiplications, rather than the usual eight. Generalizing this idea recursively leads to a faster way to multiply general matrices. This reduces the asymptotic complexity of matrix-matrix multiplication from cubic O(n^3) (where 3 is log base 2 of 8) to approximately O(n^2.81) (where 2.81 is log base 2 of 7).
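
To illustrate the core trick, here is the classic seven-multiplication scheme for a single 2x2 product in C++. In the full Strassen algorithm the same formulas are applied recursively, with each "element" being a sub-matrix block rather than a scalar; this snippet shows only the base formulas.

    // Strassen's seven-multiplication scheme for a 2x2 product C = A * B.
    void strassen_2x2(const float A[2][2], const float B[2][2], float C[2][2]) {
        float m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
        float m2 = (A[1][0] + A[1][1]) * B[0][0];
        float m3 = A[0][0] * (B[0][1] - B[1][1]);
        float m4 = A[1][1] * (B[1][0] - B[0][0]);
        float m5 = (A[0][0] + A[0][1]) * B[1][1];
        float m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
        float m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
        C[0][0] = m1 + m4 - m5 + m7;   // only 7 multiplications are used, not 8
        C[0][1] = m3 + m5;
        C[1][0] = m2 + m4;
        C[1][1] = m1 - m2 + m3 + m6;
    }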

Winograd Matrix Multiplication

Research papers on using the Winograd or the Coppersmith-Winograd algorithm:

FFT Matrix Multiplication

Research papers using the Fast Fourier Transform (FFT) for matrix multiplication:

  • Zlateski, A., Jia, Z., Li, K., Durand, F., 2019, The Anatomy of Efficient FFT and Winograd Convolutions on Modern CPUs, Proceedings of the ACM International Conference on Supercomputing (ICS ’19), Association for Computing Machinery, pp. 414–424, PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
  • Alexandre Siron, Sean Molesky, 25 Jun 2024, A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication, https://arxiv.org/abs/2406.17981
  • Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308

Approximate Matrix Multiplication

Approximate Matrix Multiplication (AMM) refers to a family of model optimization techniques that replace matrix multiplications with various approximations, avoiding the cost of arithmetic multiplication at the expense of some accuracy. These methods are usually distinct from quantization methods, are not specific to certain subclasses of matrices, and invoke more advanced mathematics from the theory of matrices.

Note that these algorithms apply at the high level of how matrices are multiplied with other matrices or with vectors (e.g., by avoiding some vector dot products), whereas there are also low-level optimizations of the arithmetic operation of multiplying two numbers (see advanced mathematics). These two classes of research are not the same, and are actually orthogonal to each other.
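
As one simple example of the flavor of these methods, randomized column-row sampling approximates A*B by summing only k of the n rank-one outer products (a column of A times the corresponding row of B) and rescaling. The C++ sketch below uses uniform sampling for simplicity; practical versions of this classic AMM technique typically sample proportionally to column/row norms.

    #include <random>
    #include <vector>

    // Approximate C ~= A * B by sampling k of the n outer products (column t of A
    // times row t of B) uniformly at random, rescaling by n/k to keep the estimate
    // unbiased. A is (rows x n) row-major, B is (n x cols) row-major.
    std::vector<float> approx_matmul_sampled(const std::vector<float>& A,
                                             const std::vector<float>& B,
                                             int rows, int n, int cols, int k) {
        std::vector<float> C(static_cast<size_t>(rows) * cols, 0.0f);
        std::mt19937 rng(12345);                      // fixed seed for reproducibility
        std::uniform_int_distribution<int> pick(0, n - 1);
        const float scale = static_cast<float>(n) / static_cast<float>(k);
        for (int s = 0; s < k; ++s) {
            int t = pick(rng);                        // sampled inner-dimension index
            for (int i = 0; i < rows; ++i) {
                float a = A[i * n + t] * scale;
                for (int j = 0; j < cols; ++j)
                    C[i * cols + j] += a * B[t * cols + j];
            }
        }
        return C;
    }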

Research papers on approximate matrix multiplication include:

More AI Research

Read more about: