Aussie AI
Example: Fused VMM-add-bias
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Kernel fusion combines, or merges, the code for two operations into a single merged operation. For example, a typical linear layer will do a vector-matrix multiplication (“VMM”), followed by adding the bias vector (“add-bias”).
Fusion does not change the operands. The VMM takes two operands: a weights matrix and a vector (calculated internal data). The add-bias is a vector addition that takes two operands: a bias vector and the vector computed by the VMM. In the non-fused method, the VMM first multiplies the weights matrix by the input vector, producing a new vector; the add-bias operation then adds the bias vector to that vector, outputting the final resulting vector. In pseudocode, it looks like:
vout = vector_matrix_multiply(weight_matrix, vinput);
vfinal = add_bias(vout, vbias);
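In C++, the two unfused operations might look like the following sketch (the function signatures and the row-major `std::vector`-of-`std::vector` matrix layout are illustrative assumptions, not the book's actual library code):

```cpp
#include <cstddef>
#include <vector>

// Unfused VMM: dot product of each weight-matrix row with the input vector.
// Matrix is row-major: one inner vector per output element.
std::vector<float> vector_matrix_multiply(
        const std::vector<std::vector<float>>& weights,
        const std::vector<float>& vinput)
{
    std::vector<float> vout(weights.size(), 0.0f);
    for (std::size_t i = 0; i < weights.size(); ++i) {
        for (std::size_t j = 0; j < vinput.size(); ++j) {
            vout[i] += weights[i][j] * vinput[j];  // accumulate row i
        }
    }
    return vout;  // intermediate vector, passed on to add_bias
}

// Unfused add-bias: a second, separate pass over the whole vector.
std::vector<float> add_bias(
        const std::vector<float>& vout,
        const std::vector<float>& vbias)
{
    std::vector<float> vfinal(vout.size());
    for (std::size_t i = 0; i < vout.size(); ++i) {
        vfinal[i] = vout[i] + vbias[i];
    }
    return vfinal;
}
```

Note that the output of `vector_matrix_multiply` is a full temporary vector (`vout`) that exists only to be consumed by `add_bias`.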
Both the VMM and the add-bias operate on the same vector of propagated internal data, so these two operations can be “fused” into a combined “VMM-add-bias” operator. The result is a “three-way” fused operator, which has three inputs (the weights matrix, the bias vector, and the incoming calculated vector), and a single vector output.
vfinal = fused_vmm_add_bias(weight_matrix, vinput, vbias);
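A fused version in C++ could be sketched as below (again, the names and matrix layout are illustrative assumptions). The bias addition is folded into the tail of the VMM loop, so there is one pass over the output and no intermediate vector:

```cpp
#include <cstddef>
#include <vector>

// Fused VMM-add-bias: each output element is computed in a single pass,
// adding the bias element right after the row's dot product, with no
// intermediate "vout" vector allocated.
std::vector<float> fused_vmm_add_bias(
        const std::vector<std::vector<float>>& weights,
        const std::vector<float>& vinput,
        const std::vector<float>& vbias)
{
    std::vector<float> vfinal(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < vinput.size(); ++j) {
            sum += weights[i][j] * vinput[j];  // dot product of row i
        }
        // Add the bias after the dot product, in the same order of
        // operations as the unfused version, so the floating-point
        // result is bit-identical to VMM followed by add-bias.
        vfinal[i] = sum + vbias[i];
    }
    return vfinal;
}
```

Because the additions happen in the same order as the unfused sequence, this sketch produces bit-identical floating-point results.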
The fused VMM-add-bias operation is not an approximation. The output is exactly the same vector that would result from sequentially doing a VMM and then an add-bias.
How is this any faster? It's true that the fused operator does the same number of calculations in terms of multiplications and additions (i.e. the same FLOPs, assuming floating-point data). The speedup comes from “fusing” the add-bias calculations into the same code as the VMM, which reduces computation overhead, such as scanning through the vector twice. Fusion also eliminates an entire vector of temporary data (the intermediate “vout” result of the VMM before it goes into the add-bias), which reduces data transfers and improves memory usage and dataflow, also benefiting wall clock speed.