Aussie AI
Cached or Precomputed Transpose
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Matrix multiplications can be much faster if they operate on the transpose, because the transpose stores each column of the original matrix at sequential memory addresses (in row-major C++ storage). A MatMul/GEMM kernel is much faster if it can send sequential blocks of data to the GPU, and a CPU-only matrix multiplication is also faster because of the data locality benefits that speed up memory accesses.
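To see why locality matters, here is a minimal sketch of a naive MatMul kernel (hypothetical function name, assuming square row-major float matrices); the inner loop walks down a column of B, so each access jumps a full row ahead in memory:

// Naive MatMul sketch: C = A * B, all matrices n x n, row-major.
void matmul_naive(const float* A, const float* B, float* C, int n) {
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += A[row * n + k] * B[k * n + col];  // strided (non-sequential) access to B
            }
            C[row * n + col] = sum;
        }
    }
}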
The value of using a transpose of a matrix is so significant that we can calculate the transpose on the fly if we need it. Creating a transpose costs O(N^2), but we are speeding up an O(N^3) MatMul operation, so the gain outweighs the extra cost. We can then optimize further by caching the transpose for reuse next time.
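Here is a minimal sketch of the O(N^2) transpose plus a simple one-shot cache (hypothetical function names, assuming a single square row-major float matrix and a single-threaded caller):

#include <vector>

// O(N^2) transpose of a row-major n x n matrix: BT[j][i] = B[i][j].
void transpose(const float* B, float* BT, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            BT[j * n + i] = B[i * n + j];
}

// Compute the transpose on first use, then return the cached copy thereafter.
const float* cached_transpose(const float* B, int n) {
    static std::vector<float> cache;   // assumption: one matrix, not thread-safe
    if (cache.empty()) {
        cache.resize((size_t)n * n);
        transpose(B, cache.data(), n);
    }
    return cache.data();
}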
On the other hand, why not precompute the transpose? If it's the transpose of a weight matrix, then it's known at compile-time (i.e., pre-inference time), and we can fully precompute it and store it with the rest of the model. Thus, a significant way to optimize matrix multiplications is to store both versions of the matrix in memory: the original matrix and its transpose. This can help to speed up inference by:
(a) avoiding the need to compute the transpose on-the-fly, and
(b) having the transpose already laid out in contiguous memory for pipelining and dataflow efficiency (see the sketch after this list).
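As a sketch of point (b), here is the naive kernel rewritten to read a precomputed transpose BT (hypothetical names, same assumptions as above); both operands are now scanned sequentially in the inner loop:

// MatMul using a precomputed transpose: C = A * B, where BT is B transposed.
void matmul_transposed(const float* A, const float* BT, float* C, int n) {
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            const float* arow = &A[row * n];     // row of A, contiguous
            const float* btrow = &BT[col * n];   // column of B, stored contiguously in BT
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += arow[k] * btrow[k];       // sequential accesses on both sides
            }
            C[row * n + col] = sum;
        }
    }
}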
Note that this transpose caching method doesn't work as well for training or fine-tuning, because the weights in the matrix change often, and our cached transpose would become out-of-date. Further details of using a precomputed transpose in MatMul are covered in Chapter 34.