Loop Tiling or Blocking
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
When you hear about a “tiled MatMul” or a “blocked GEMM,” these terms refer to the “tiling” or “blocking” optimization method. MatMul is matrix multiplication and GEMM is General Matrix Multiplication (i.e. the same thing). Tiling is the optimization that most applies to speeding up matrix or tensor multiplication in AI engines.
This optimization applies to two-dimensional or three-dimensional data, such as matrices or tensors. When you hear “tiles” or “blocks,” think squares or rectangles of data. For example, if you have a 512x512 matrix, then a tiled algorithm might act on 16x16 sized chunks, one at a time. The one-dimensional equivalent, processing sub-parts of a one-dimensional array, is called “strip mining,” “loop sectioning,” or often simply “vectorization.”
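As a quick illustration, here is a minimal sketch of strip mining a one-dimensional clearing loop (this example is not from the book; the function name is hypothetical and the code assumes the array size is an exact multiple of the strip size):

void aussie_clear_vector_strips(float v[], int n)
{
    const int STRIP = 4;  // Strip size (e.g. 4 floats fill one 128-bit SIMD register)
    // Assumes n is an exact multiple of STRIP (no leftover elements to handle)
    for (int i = 0; i < n; i += STRIP) {
        // Inner strip loop: a candidate for replacement by a single SIMD operation
        for (int k = i; k < i + STRIP; k++) {
            v[k] = 0.0f;
        }
    }
}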
In other words, tiling means operating on small subsections of a matrix. If you hear “tiled tensor” that could mean two-dimensional data (i.e. just a fancy name for a matrix), or alternatively it might refer to three-dimensional data, in which case, don't think anything or else your head will hurt.
Loop tiling is a method of executing sub-parts of nested loops in a way that maximizes data locality, increases cache utilization, and improves parallel execution. This is also called “loop blocking” because it processes the data a “block” at a time, although the term “tiling” is more widely used in AI research. The square or rectangular two-dimensional sub-partitions of the data are called “tiles” or “blocks.”
The same number of arithmetic operations are performed in a tiled versus non-tiled algorithm. However, there should be fewer loads of the data from memory with tiling. The downside is that tiling introduces additional loop overhead. In fact, rather than flattening nested loops over a 2-D array (e.g. 512x512), tiling often introduces additional levels of nesting! The two small loops that spin through the 16x16 square shape of a single “tile” or “block” are often newly added inner loops. So, loop tiling often adds two new layers of nested loops inside your already-nested loops. It makes you wonder how it can even be faster!
Example: Tiled Matrix Clear:
For these examples, there is a “ymatrix” type:
typedef float ymatrix[ROWS][COLUMNS];
If we forget about memset, here is the simple code to clear a matrix one element at a time in a brute-force nested loop (non-tiled):
void aussie_clear_matrix(ymatrix m)
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLUMNS; j++) {
            m[i][j] = 0.0;
        }
    }
}
Now we decide to add a 4x4 square tile optimization to this code. The result is an extra two levels of nested loops. Here is the basic code, which assumes that the row and column dimensions are exact multiples of the tile size, so there are no leftover cases to handle:
void aussie_clear_matrix_tiled(ymatrix m)
{
    const int TILEX = 4;  // 4x4 tile size
    const int TILEY = 4;
    static_assert(ROWS % TILEX == 0, "Exact X");
    static_assert(COLUMNS % TILEY == 0, "Exact Y");
    for (int i = 0; i < ROWS; i += TILEX) {
        for (int j = 0; j < COLUMNS; j += TILEY) {
            // Do the 4x4 tile...
            for (int tx = i; tx < i + TILEX; tx++) {
                for (int ty = j; ty < j + TILEY; ty++) {
                    m[tx][ty] = 0.0f;
                }
            }
        }
    }
}
Unrolled Tiles.
One follow-up optimization trick with a tiled loop algorithm is to apply loop unrolling to the two inner loops. This avoids the overhead of the two extra inner loops, but retains the data locality benefits of tiling. This optimization results in a fully “unrolled tile” computation without any extra inner loops. In the above example, the two inner loops of a 4x4 tile would be replaced with 16 unrolled computations in sequence. Or, for a vectorized version, a fully unrolled tile would be 4 sequential calls to vectorized intrinsics that each do 4 operations in parallel (e.g. 128-bit SSE/AVX intrinsics each do 4 float operations in parallel).
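For illustration, here is a minimal sketch of both variants of the 4x4 tile body from the earlier clearing example (these helper functions are not from the book; the vectorized version assumes x86 128-bit SSE intrinsics and that each matrix row is contiguous in memory):

#include <immintrin.h>  // SSE intrinsics (for the vectorized variant)

// Fully unrolled 4x4 tile: 16 assignments in sequence, no inner loops.
void aussie_clear_tile_4x4_unrolled(ymatrix m, int i, int j)
{
    m[i][j] = 0.0f;   m[i][j+1] = 0.0f;   m[i][j+2] = 0.0f;   m[i][j+3] = 0.0f;
    m[i+1][j] = 0.0f; m[i+1][j+1] = 0.0f; m[i+1][j+2] = 0.0f; m[i+1][j+3] = 0.0f;
    m[i+2][j] = 0.0f; m[i+2][j+1] = 0.0f; m[i+2][j+2] = 0.0f; m[i+2][j+3] = 0.0f;
    m[i+3][j] = 0.0f; m[i+3][j+1] = 0.0f; m[i+3][j+2] = 0.0f; m[i+3][j+3] = 0.0f;
}

// Vectorized unrolled 4x4 tile: 4 sequential intrinsic calls, each storing 4 floats.
void aussie_clear_tile_4x4_vectorized(ymatrix m, int i, int j)
{
    __m128 zeros = _mm_set1_ps(0.0f);
    _mm_storeu_ps(&m[i][j], zeros);
    _mm_storeu_ps(&m[i+1][j], zeros);
    _mm_storeu_ps(&m[i+2][j], zeros);
    _mm_storeu_ps(&m[i+3][j], zeros);
}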
Example: Tiled Matrix Multiplication: Tiling techniques are widely used inside neural network code to improve the efficiency of MatMuls and thereby get better throughput of tensor calculations from a GPU. Matrix multiplication is a good candidate for this optimization because it has O(n^3) arithmetic calculations, but uses only O(n^2) data. Hence, a naive matrix multiplication algorithm that doesn't address locality will re-load the same data from memory many times, whereas a tiled algorithm can reuse the same data more efficiently.
A tiled version of MatMul processes “tiles” or “blocks” of each matrix one at a time (i.e. small square or rectangular sections), with the aim of keeping small parts of the matrix in the memory cache while they are processed. The algorithm progresses across the matrix a tile/block at a time, rather than scanning all the way down one dimension (row or column). The same number of multiplication operations are performed as a non-tiled MatMul, but data locality and cache freshness should improve the overall speed.
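Here is a minimal sketch of a tiled MatMul over the same ymatrix type (this is not the book's implementation; it assumes square matrices with ROWS equal to COLUMNS, and dimensions that are exact multiples of the tile size):

void aussie_matmul_tiled(ymatrix a, ymatrix b, ymatrix c)
{
    const int TILE = 4;  // 4x4 tiles
    static_assert(ROWS == COLUMNS, "Square matrices assumed");
    static_assert(ROWS % TILE == 0, "Exact tiles assumed");
    aussie_clear_matrix(c);  // Zero the output before accumulating (defined earlier)
    // Walk the matrices a tile at a time, so each small block stays in cache while reused
    for (int i0 = 0; i0 < ROWS; i0 += TILE) {
        for (int j0 = 0; j0 < COLUMNS; j0 += TILE) {
            for (int k0 = 0; k0 < COLUMNS; k0 += TILE) {
                // Accumulate the product of one tile of 'a' and one tile of 'b'
                for (int i = i0; i < i0 + TILE; i++) {
                    for (int k = k0; k < k0 + TILE; k++) {
                        float aik = a[i][k];
                        for (int j = j0; j < j0 + TILE; j++) {
                            c[i][j] += aik * b[k][j];
                        }
                    }
                }
            }
        }
    }
}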