Aussie AI

Example: Fused VMM-add-bias

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


The way that kernel fusion works is to combine or merge the code for two operations into a single merged operation. For example, a typical linear layer will do a vector-matrix multiplication (“VMM”), followed by adding the bias vector (“add-bias”).

The same operands are used as without fusion. The VMM is a vector-matrix multiplication that takes two operands: a weights matrix and a vector (calculated internal data). The add-bias is a vector addition that takes two operands: a bias vector and the calculated vector (the output of the VMM). In the non-fused method, the VMM first multiplies the weights matrix by the input vector, creating a new vector. The add-bias operation then adds the bias vector to that vector, producing the final resulting vector. In pseudocode, it looks like:

    vout = vector_matrix_multiply(weight_matrix, vinput);
    vfinal = add_bias(vout, vbias);
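The two unfused steps can be sketched in C++ like this. This is a minimal illustration, not the book's actual kernel code: the `std::vector`-of-rows matrix layout and the function signatures are assumptions made for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused step 1: vector-matrix multiply (VMM).
// Each output element is the dot product of one matrix row with the input vector.
std::vector<float> vector_matrix_multiply(
        const std::vector<std::vector<float>>& W,
        const std::vector<float>& vinput) {
    std::vector<float> vout(W.size(), 0.0f);
    for (std::size_t i = 0; i < W.size(); ++i) {
        assert(W[i].size() == vinput.size());
        for (std::size_t j = 0; j < vinput.size(); ++j)
            vout[i] += W[i][j] * vinput[j];
    }
    return vout;  // temporary intermediate vector
}

// Unfused step 2: add-bias, a second full pass over the vector.
std::vector<float> add_bias(const std::vector<float>& vout,
                            const std::vector<float>& vbias) {
    assert(vout.size() == vbias.size());
    std::vector<float> vfinal(vout.size());
    for (std::size_t i = 0; i < vout.size(); ++i)
        vfinal[i] = vout[i] + vbias[i];
    return vfinal;
}
```

Note that the intermediate `vout` vector is fully materialized and then scanned again by `add_bias` — this is the overhead that fusion removes.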

Both VMM and add-bias are working on the same vector of propagated internal data, and these two operations can be “fused” to create a combined “VMM-add-bias” operator. The result is a “three-way” fused operator, which has three inputs (the weights matrix, the bias vector, and the incoming calculated vector), and a single vector output.

    vfinal = fused_vmm_add_bias(weight_matrix, vinput, vbias);

The result of the merged VMM-add-bias combined operation is not approximated. The output is exactly the same vector that would result from sequentially doing a VMM and then an add-bias.
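A fused version can be sketched as follows, again as an illustrative example rather than a production kernel (same assumed matrix layout as above). The bias element is folded in as each output element is computed, so the intermediate vector never exists:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused VMM-add-bias: three inputs, one output, one pass.
// Initializing the accumulator with the bias element merges the
// add-bias step into the VMM loop at no extra cost.
std::vector<float> fused_vmm_add_bias(
        const std::vector<std::vector<float>>& W,
        const std::vector<float>& vinput,
        const std::vector<float>& vbias) {
    assert(W.size() == vbias.size());
    std::vector<float> vfinal(W.size());
    for (std::size_t i = 0; i < W.size(); ++i) {
        float sum = vbias[i];  // start from the bias instead of zero
        for (std::size_t j = 0; j < vinput.size(); ++j)
            sum += W[i][j] * vinput[j];
        vfinal[i] = sum;  // written once; no temporary "vout" vector
    }
    return vfinal;
}
```

Because floating-point addition of the same terms in the same order is performed, each output element is bit-identical to the unfused result of the corresponding VMM row sum followed by the bias addition.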

How is this any faster? The fused operator performs the same number of multiplications and additions (i.e., the same FLOPs, assuming floating-point data). The speedup comes from merging the add-bias calculations into the same code as the VMM, which removes computation overhead such as scanning through the vector twice. Fusion also eliminates an entire vector of temporary data (the intermediate “vout” result of the VMM before it goes into the add-bias), which reduces data transfers and improves memory usage and dataflow, further benefiting wall clock speed.
