Aussie AI
Inference Loop Optimizations
-
Last Updated 2nd September, 2023
-
by David Spuler, Ph.D.
Changing the actual C++ code that executes the inference algorithm on the weights is an interesting optimization idea. The inference loop is the main code during inference that iterates through the various layers of the model. Applying the well-known loop optimizations from general coding to this inference loop yields various inference loop optimizations.
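To make the discussion concrete, here is a minimal sketch of that layer loop, assuming a generic Transformer-style model; the apply_layer function is a hypothetical stand-in for the real per-layer tensor computations, not any particular framework's API.

    #include <vector>

    using Vec = std::vector<float>;

    // Hypothetical stand-in for one layer's computation (attention,
    // feed-forward, etc.); a real layer is dominated by matrix-vector
    // operations on that layer's weight tensors.
    Vec apply_layer(const Vec& activations, int layer_index)
    {
        (void)layer_index;    // Would select this layer's weights
        return activations;   // Placeholder: real tensor math goes here
    }

    // The inference loop: iterate through the model's layers in order,
    // feeding each layer's output activations into the next layer.
    Vec run_inference(Vec activations, int num_layers)
    {
        for (int layer = 0; layer < num_layers; ++layer) {
            activations = apply_layer(activations, layer);
        }
        return activations;   // Final activations for the output head
    }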
Dynamic Inference Optimizations
Inference loop optimizations are inherently dynamic algorithms. They can rely on pre-inference optimizations, such as quantization, but this section focuses on changes to the inference logic at runtime.
Each AI model architecture has slightly different features in its inference loop, but the underlying code is highly iterative: it loops across multiple layers, which in turn loop across many matrices or tensors of weights. Optimizations may include:
- Integer-only quantization (see quantization)
- Reducing multiplications (see zero-multiplication inference)
- Early exits of loops (dynamically skipping layers)
- Loop optimizations (e.g. loop unrolling, loop fusion, loop tiling, parallelization, as often done by frameworks/compilers; a fusion sketch appears after this list)
- Dynamic pruning (see pruning)
- Sparsification
- Submatrix identification (see matrix algebra)
- Matrix factorization (low-rank)
- Mixture of experts
- Non-autoregression (parallelizing to output multiple tokens per iteration)
This document addresses only the optimizations that are directly related to the loop code that executes inference.
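As a simple illustration of the kind of loop-level change involved, the sketch below fuses two elementwise passes over an activation vector (a bias addition followed by a RELU activation) into a single loop, so each element is loaded and stored once instead of twice; the function names are illustrative, not from any particular framework.

    #include <algorithm>
    #include <cstddef>

    // Unfused: two separate passes over the activation vector.
    void bias_relu_unfused(float v[], const float bias[], std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            v[i] += bias[i];                  // Pass 1: add bias
        for (std::size_t i = 0; i < n; ++i)
            v[i] = std::max(v[i], 0.0f);      // Pass 2: RELU activation
    }

    // Fused: one pass, halving the memory traffic over the vector.
    void bias_relu_fused(float v[], const float bias[], std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            v[i] = std::max(v[i] + bias[i], 0.0f);
    }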
Early Exit of Inference Layer Loop
Early exit is quitting the main inference loop at one of the layers. It is a form of dynamic layer pruning, since it skips (prunes) some of the model layers. Read more about: Early exit.
There are other optimizations of the layer loops: layer pruning, layer skipping, layer reordering, and layer fusion.
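Here is a minimal sketch of the early-exit control flow, assuming a hypothetical per-layer confidence estimator; real early-exit schemes typically attach small classifier heads to intermediate layers, but the loop structure is the same.

    #include <functional>
    #include <vector>

    using Vec = std::vector<float>;

    // Early exit: stop the layer loop once an intermediate result is
    // already confident enough, dynamically pruning the remaining layers.
    Vec run_inference_early_exit(
        Vec activations, int num_layers, float threshold,
        const std::function<Vec(const Vec&, int)>& apply_layer,
        const std::function<float(const Vec&)>& confidence)
    {
        for (int layer = 0; layer < num_layers; ++layer) {
            activations = apply_layer(activations, layer);
            if (confidence(activations) >= threshold) {
                break;   // Early exit: skip all later layers
            }
        }
        return activations;
    }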
Other Dynamic Inference Loop Optimizations
Early exit and layer skipping are not the only dynamic loop optimizations for inference algorithms. Other general loop transformations include:
- Loop unrolling (see the dot product sketch after this list)
- Loop tiling
- Loop reordering
- Loop strip-mining (partitioning)
- Loop interchange
- Loop reversal
- Loop interleaving
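As one example, here is a sketch of loop unrolling applied to the vector dot product at the heart of inference kernels; unrolling by 4 cuts the loop-test overhead per element and gives the compiler and CPU pipeline independent multiply-adds to schedule (an optimizing compiler may also do this itself).

    #include <cstddef>

    // Basic dot product: one loop test per multiply-add.
    float dot_basic(const float a[], const float b[], std::size_t n)
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    // Unrolled by 4: fewer loop tests, more scheduling freedom.
    float dot_unrolled4(const float a[], const float b[], std::size_t n)
    {
        float sum = 0.0f;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            sum += a[i]     * b[i];
            sum += a[i + 1] * b[i + 1];
            sum += a[i + 2] * b[i + 2];
            sum += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i)            // Handle the leftover elements
            sum += a[i] * b[i];
        return sum;
    }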
Some other papers on loop optimizations include:
- J. Shen, Y. Wang, P. Xu, Y. Fu, Z. Wang, Y. Lin, 2020, Fractional skipping: Towards finer-grained dynamic CNN inference, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, February 7-12, 2020, AAAI Press, pp. 5700-5708, https://aaai.org/ojs/index.php/AAAI/article/view/6025, https://arxiv.org/abs/2001.00705
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf
Dynamic inference optimizations are not limited to loop optimizations. See also dynamic layer pruning, dynamic layer skipping, dynamic channel pruning, dynamic token pruning, dynamic head pruning, and other dynamic strategies under model pruning. Read the full list of inference optimizations.
More AI Research
Read more about:
- Long List of Optimizations
- Inference Optimizations
- Code Optimizations
- Approximate Computing
- Advanced AI Mathematics
- Matrix Algebra