Aussie AI
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Multiplication Optimizations
Multiplication is the foremost bottleneck in both training and inference of neural networks and Transformer architectures. Most models rely on matrix multiplications, whether expressed as tensor operations or convolutions; these break down into vector dot products, which in turn are sequences of "multiply-and-add" operations (called "multiply-accumulate" or MAC). The multiplication part is more expensive than the addition.
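To make the MAC pattern concrete, here is a minimal C++ sketch of a vector dot product, the inner kernel of matrix multiplication. The function name and the plain float loop are illustrative only, not the optimized kernels a real inference engine would use.

    #include <cstddef>

    // Dot product of two float vectors: each loop iteration is one
    // multiply-accumulate (MAC) -- a multiply followed by an add.
    float dot_product(const float* a, const float* b, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            sum += a[i] * b[i];   // the multiply dominates the cost of each MAC
        }
        return sum;
    }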
Over the years of AI research, various ideas have been proposed for optimizing multiplication, including:
- Hardware-accelerated multiplication (lately, this is a GPU's bread-and-butter)
- Advanced floating-point formats (e.g. bfloat16)
- Faster multiplication arithmetic algorithms
- Approximate multiplication arithmetic algorithms
- Integer multiplication instead of floating-point (see quantization)
- Faster matrix multiplication algorithms (e.g. low-rank matrices, tensor decomposition)
- Avoiding or reducing multiplications (e.g. zero-multiplication models, pruning, zero skipping, sparsity, etc.); see the zero-skipping sketch after this list
- Advanced mathematical number systems (e.g. logarithmic number systems, where multiplication becomes addition)
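As an illustration of the "avoiding multiplications" group above, the following is a hedged sketch of zero skipping inside a dot product: whenever a weight is zero, the multiplication is skipped entirely. The function name is hypothetical, and real sparse kernels typically use compressed storage formats rather than testing every element.

    #include <cstddef>

    // Zero-skipping dot product: avoid the multiply whenever the weight is zero.
    // Only worthwhile when the weights are sparse enough that the branch
    // is cheaper than the multiplications it saves.
    float dot_product_zero_skip(const float* weights, const float* activations,
                                std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (weights[i] != 0.0f) {      // skip the MAC for zero weights
                sum += weights[i] * activations[i];
            }
        }
        return sum;
    }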
Although modern programmers take multiplying two integers for granted, there are complicated algorithms at work behind the scenes (i.e., inside the chips). Early algorithms include Karatsuba multiplication (1962), Toom-Cook multiplication, the Schönhage–Strassen algorithm, and contributions by Knuth. The improvement and parallelization of such algorithms is fundamental to GPU and hardware accelerator design. Using such algorithms for software acceleration of model inference seems unlikely to beat hardware acceleration.
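To show the flavor of these algorithms, here is a small, illustrative C++ sketch of Karatsuba's trick: multiplying two 32-bit integers using three half-width multiplications instead of the schoolbook four. This is only a one-level demonstration, not code from the book; real big-integer libraries apply the split recursively.

    #include <cstdint>
    #include <cstdio>

    // Karatsuba: split x and y into 16-bit halves and form the 64-bit
    // product from three smaller multiplications instead of four.
    uint64_t karatsuba32(uint32_t x, uint32_t y) {
        uint64_t x1 = x >> 16, x0 = x & 0xFFFF;
        uint64_t y1 = y >> 16, y0 = y & 0xFFFF;
        uint64_t z2 = x1 * y1;                            // high halves
        uint64_t z0 = x0 * y0;                            // low halves
        uint64_t z1 = (x1 + x0) * (y1 + y0) - z2 - z0;    // cross terms via one multiply
        return (z2 << 32) + (z1 << 16) + z0;              // recombine the partial products
    }

    int main() {
        uint32_t a = 123456789u, b = 987654321u;
        printf("%llu\n", (unsigned long long)karatsuba32(a, b));
        printf("%llu\n", (unsigned long long)a * b);      // check against a direct 64-bit multiply
        return 0;
    }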