Aussie AI
Dynamic Inference Optimizations
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Whereas model compression makes static changes to a model, numerous "dynamic" optimizations target the runtime inference code instead. Each AI model architecture has slightly different features in its inference loop, but the underlying code is highly iterative: it loops over multiple layers, each of which in turn loops over many matrices or tensors of weights.
Numerous methods have been proposed to reduce the number of computations or to replace them with simpler arithmetic. A short list of such optimizations includes:
- Early exits of loops (dynamically skipping layers)
- Dynamic pruning (depth pruning, width pruning, length pruning)
- Zero-multiplication models
- Integer-only arithmetic quantization
- Loop vectorization optimizations (e.g. hardware acceleration, loop tiling, etc.)
- Sparsification
- Matrix factorization (low-rank) and matrix algebra
- Non-autoregression (parallelized output of multiple tokens per iteration)
- General programming loop optimizations (e.g. loop unrolling, parallelization, etc.)
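To make the first item concrete, here is a minimal sketch of an early-exit inference loop in C++. Everything here is hypothetical: the "layer" is stubbed as a simple scaling transform, and the confidence estimate is a toy heuristic, whereas a real model would run attention and feed-forward computations and estimate confidence from the output distribution. The point is only the control flow: once confidence crosses a threshold, the remaining layers are skipped dynamically.

```cpp
#include <cstddef>
#include <vector>

// Stub layer: scales the activations (a real layer would do
// attention plus feed-forward computations over weight tensors).
static void run_layer(std::vector<float>& acts) {
    for (float& a : acts) a *= 0.9f;
}

// Stub confidence estimate: fraction of activations below 0.5
// (a real estimator might use the softmax of the output logits).
static float confidence(const std::vector<float>& acts) {
    std::size_t low = 0;
    for (float a : acts) {
        if (a < 0.5f) ++low;
    }
    return static_cast<float>(low) / static_cast<float>(acts.size());
}

// Runs up to num_layers layers, exiting early once the confidence
// threshold is reached. Returns the number of layers executed.
int infer_with_early_exit(std::vector<float>& acts,
                          int num_layers, float threshold) {
    for (int i = 0; i < num_layers; ++i) {
        run_layer(acts);
        if (confidence(acts) >= threshold) {
            return i + 1;  // early exit: skip the remaining layers
        }
    }
    return num_layers;  // no early exit triggered
}
```

The same loop skeleton is where several of the other listed optimizations would live: dynamic depth pruning also skips layers, and loop unrolling or vectorization would apply inside `run_layer`.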
The other major optimization strategy is to use multiple models. Some of the research ideas include:
- Model selection algorithms
- Mixture of experts
- Big-little architectures
- Speculative decoding
- Consensus decoding
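As an illustration of the multi-model idea, here is a hypothetical sketch of a big-little architecture in C++. Both models are stubs (the names `little_model` and `big_model` and the confidence heuristic are assumptions for illustration); the essential pattern is that a small, fast model answers first, and the query is escalated to the large, expensive model only when the little model's confidence falls below a cutoff.

```cpp
#include <string>
#include <utility>

// A model's answer together with its self-reported confidence.
struct Result {
    std::string answer;
    float confidence;
};

// Stub little model: cheap and fast, but only confident on "easy"
// queries (here, toy heuristic: short queries are easy).
static Result little_model(const std::string& query) {
    float conf = (query.size() <= 10) ? 0.95f : 0.40f;
    return { "little:" + query, conf };
}

// Stub big model: slow and costly, but always confident.
static Result big_model(const std::string& query) {
    return { "big:" + query, 0.99f };
}

// Big-little dispatch: returns the chosen answer and a flag saying
// whether the big model had to be consulted.
std::pair<std::string, bool> answer(const std::string& query,
                                    float cutoff) {
    Result r = little_model(query);
    if (r.confidence >= cutoff) {
        return { r.answer, false };  // little model suffices
    }
    return { big_model(query).answer, true };  // escalate to big model
}
```

Speculative decoding uses a related division of labor, with the small model drafting tokens that the large model then verifies in parallel rather than answering whole queries on its own.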
See the following chapters for more detailed information on many of these research topics.