Optimizing Matrix-Vector Multiplication
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
The fixed-up version of matrix-vector multiplication with row-wise vector dot products simply writes its results to a separate destination vector operand.
void aussie_matmul_vector_basic_out1(const ymatrix m, const float v[], int n, float vout[])
{
    // Basic matrix-by-vector using vector dot products
    for (int i = 0; i < n; i++) {
        const float* rowvector = &m[i][0];
        float sum = aussie_vecdot_basic(rowvector, v, n);  // Dot product
        vout[i] = sum;
    }
}
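The aussie_vecdot_basic helper is defined elsewhere in the book and not shown in this excerpt; a minimal sketch of a basic float vector dot product in the same style (the book's actual version may differ) would be:

float aussie_vecdot_basic(const float v1[], const float v2[], int n)
{
    // Basic float vector dot product: sum of pairwise products
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += v1[i] * v2[i];
    }
    return sum;
}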
Nested Loop Matrix-Vector Version: The same matrix-vector multiplication algorithm written as two nested loops is below. This flattens the call to the lower-level vector dot product function, putting its inner summation loop directly inside the main outer loop. The basic C++ code looks like:
void aussie_matmul_vector_basic_out2(const ymatrix m, const float v[], int n, float vout[])
{
    // Basic matrix-by-vector using nested loops
    for (int row = 0; row < n; row++) {
        float sum = 0.0f;
        for (int col = 0; col < n; col++) {
            sum += (m[row][col] * v[col]);
        }
        vout[row] = sum;
    }
}
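The ymatrix type is defined elsewhere in the book. As a hypothetical usage sketch, assuming it is a fixed-size square float matrix typedef (the names AUSSIE_MATRIX_SIZE and the typedef below are illustrative, not the book's):

const int AUSSIE_MATRIX_SIZE = 2048;  // Hypothetical; matches N in the benchmarks below
typedef float ymatrix[AUSSIE_MATRIX_SIZE][AUSSIE_MATRIX_SIZE];  // Assumed typedef

static ymatrix m;                         // static avoids a 16MB stack allocation
static float v[AUSSIE_MATRIX_SIZE];       // Input vector
static float vout[AUSSIE_MATRIX_SIZE];    // Output vector
// ... initialize m and v ...
// aussie_matmul_vector_basic_out2(m, v, AUSSIE_MATRIX_SIZE, vout);  // vout = m * v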
Optimizations of Matrix-Vector Multiplication: Various ways to optimize the naive nested-loop matrix-vector multiplication suggest themselves:
- Hoisting loop-invariant code (loop code motion) of the “m[row]” expression (see the first sketch after this list).
- Loop pointer arithmetic for both loops (combined with hoisting in the same sketch).
- Loop unrolling of the inner loop to unroll 4, 8 or more iterations (second sketch below).
- Loop tiling to unroll a 2x2 tile/block (third sketch below).
- Vectorization using the AVX1/AVX2 vector dot product versions we already examined (an AVX1-style sketch is below).
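As an illustration of the first two ideas, here is a hedged sketch (not the book's benchmarked code) that hoists the loop-invariant m[row] subexpression out of the inner loop and replaces the inner loop index with pointer arithmetic:

void aussie_matmul_vector_hoisted_ptr(const ymatrix m, const float v[], int n, float vout[])
{
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m[row][0];  // Hoisted loop-invariant m[row]
        const float* vptr = v;
        const float* endptr = &rowvec[n];  // One past the end of the row
        float sum = 0.0f;
        for (; rowvec != endptr; rowvec++, vptr++) {
            sum += (*rowvec) * (*vptr);    // Pointer arithmetic, no index
        }
        vout[row] = sum;
    }
}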
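A sketch of the inner loop unrolled 4 iterations at a time (it assumes n is a multiple of 4; a general version needs extra code for leftover iterations):

void aussie_matmul_vector_unrolled4(const ymatrix m, const float v[], int n, float vout[])
{
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m[row][0];
        float sum = 0.0f;
        for (int col = 0; col < n; col += 4) {  // 4 multiply-adds per iteration
            sum += rowvec[col]     * v[col]
                 + rowvec[col + 1] * v[col + 1]
                 + rowvec[col + 2] * v[col + 2]
                 + rowvec[col + 3] * v[col + 3];
        }
        vout[row] = sum;
    }
}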
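A sketch of 2x2 loop tiling, where both loops advance two at a time and each inner iteration computes a 2x2 block of multiply-adds (assumes n is even):

void aussie_matmul_vector_tiled2x2(const ymatrix m, const float v[], int n, float vout[])
{
    for (int row = 0; row < n; row += 2) {
        float sum0 = 0.0f, sum1 = 0.0f;  // Accumulators for two output rows
        for (int col = 0; col < n; col += 2) {
            sum0 += m[row][col]         * v[col]
                  + m[row][col + 1]     * v[col + 1];
            sum1 += m[row + 1][col]     * v[col]
                  + m[row + 1][col + 1] * v[col + 1];
        }
        vout[row] = sum0;
        vout[row + 1] = sum1;
    }
}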
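And a sketch of the AVX1 approach using the _mm256_dp_ps “dot product” intrinsic (the book's Chapter 17 versions may differ); it assumes x86 AVX support and that n is a multiple of 8:

#include <immintrin.h>  // AVX intrinsics

float aussie_vecdot_avx1_dp(const float v1[], const float v2[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(&v1[i]);
        __m256 b = _mm256_loadu_ps(&v2[i]);
        __m256 dp = _mm256_dp_ps(a, b, 0xF1);  // Dot product within each 128-bit lane
        float res[8];
        _mm256_storeu_ps(res, dp);
        sum += res[0] + res[4];  // One partial dot product per lane
    }
    return sum;
}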
I tried coding several more of these optimizations and here are the benchmarks:
Matrix-Vector multiplication (MatMulVec) benchmarks (N=2048, ITER=300):

Matrix-vector nested loops: 3480 ticks (3.48 seconds)
Matrix-vector nested loops hoisted: 3489 ticks (3.49 seconds)
Matrix-vector nested ptr-arith: 3415 ticks (3.42 seconds)
Matrix-vector unrolled inner (4): 1166 ticks (1.17 seconds)
Matrix-vector unrolled inner (8): 938 ticks (0.94 seconds)
Matrix-vector nested tiled 2x2: 1995 ticks (2.00 seconds)
Matrix-vector vecdot AVX1 DP: 1414 ticks (1.41 seconds)
Matrix-vector vecdot AVX2 FMA: 929 ticks (0.93 seconds)
Interestingly, code hoisting and loop pointer arithmetic were a waste of effort. Loop tiling did better than the original, but its speedup probably comes primarily from the effect of loop unrolling rather than from data locality or cache hit rates, since simpler loop unrolling did better. Note that the AVX1 version used the “dot product” intrinsic, whereas the AVX2 version used the FMA intrinsic, for reasons covered in Chapter 17. Simple loop unrolling also did as well as AVX2 hardware vectorization, probably because the AVX1 and AVX2 versions were simply calling the vector dot product functions, so they still had function call overhead. Hence, this algorithm can be further optimized by inlining to remove the AVX function call overhead, combining AVX intrinsics with unrolling of the inner loop, and then some minor final tweaks such as pointer arithmetic (a combined sketch follows).
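As a hedged sketch of that combination (not the book's final code): the AVX2 FMA dot product inlined directly into the outer loop, so there is no per-row function call. It assumes n is a multiple of 8 and AVX2/FMA support (e.g. compile with -mavx2 -mfma on GCC/Clang):

#include <immintrin.h>  // AVX2/FMA intrinsics

void aussie_matmul_vector_avx2_inlined(const ymatrix m, const float v[], int n, float vout[])
{
    for (int row = 0; row < n; row++) {
        const float* rowvec = &m[row][0];
        __m256 sums = _mm256_setzero_ps();
        for (int col = 0; col < n; col += 8) {
            __m256 a = _mm256_loadu_ps(&rowvec[col]);
            __m256 b = _mm256_loadu_ps(&v[col]);
            sums = _mm256_fmadd_ps(a, b, sums);  // sums += a * b (8 floats at a time)
        }
        // Horizontal sum of the 8 partial accumulators
        float partials[8];
        _mm256_storeu_ps(partials, sums);
        float sum = 0.0f;
        for (int k = 0; k < 8; k++) sum += partials[k];
        vout[row] = sum;
    }
}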