Aussie AI

Inference Loop Optimizations

  • Last Updated 2nd September, 2023
  • by David Spuler, Ph.D.

Changing the actual C++ code that executes the inference algorithm on the weights is an interesting optimization idea. The inference loop is the main code during inference that iterates through the various layers of the model. Applying the various well-known loop optimizations from general coding to this inference loop creates various inference loop optimizations.
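For concreteness, here is a minimal sketch of what such an inference loop can look like in C++, assuming a simple layer-by-layer model; the struct and function names are hypothetical placeholders, not the API of any particular inference engine.

    #include <vector>

    struct Layer {
        // Weights for one layer (attention, FFN, etc.) omitted for brevity.
    };

    // Placeholder for the per-layer computation over that layer's weights.
    std::vector<float> layer_forward(const Layer& /*layer*/,
                                     const std::vector<float>& input) {
        return input;  // real code would run attention + FFN here
    }

    // The inference layer loop: each iteration feeds the previous layer's
    // activations into the next layer.
    std::vector<float> run_inference(const std::vector<Layer>& layers,
                                     std::vector<float> activations) {
        for (const Layer& layer : layers) {
            activations = layer_forward(layer, activations);
        }
        return activations;
    }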

Dynamic Inference Optimizations

Inference loop optimizations are inherently dynamic algorithms. They can rely on pre-inference optimizations, such as quantization, but this section focuses on changes to the inference logic at runtime.

Each different AI model architecture has slightly different features in its inference loop, but the underlying code is very iterative across multiple layers, which in turn loop across many matrices or tensors of weights. Many optimizations can be applied to these loops.
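Inside each layer, the innermost loops are typically matrix-vector (or matrix-matrix) operations over the layer's weights. Below is a minimal sketch of such an inner loop, assuming a row-major weight matrix; the function name and layout are illustrative assumptions.

    #include <vector>

    // Multiply a rows-by-cols weight matrix W (row-major) by input vector x.
    std::vector<float> matvec(const std::vector<float>& W,
                              const std::vector<float>& x,
                              int rows, int cols) {
        std::vector<float> y(rows, 0.0f);
        for (int r = 0; r < rows; ++r) {        // loop over output elements
            float sum = 0.0f;
            for (int c = 0; c < cols; ++c) {    // loop over one row of weights
                sum += W[r * cols + c] * x[c];
            }
            y[r] = sum;
        }
        return y;
    }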

This document addresses only the optimizations that are directly related to the loop code that executes inference.

Early Exit of Inference Layer Loop

Early exit means quitting the main inference loop early at one of the layers. It is a form of dynamic layer pruning, since it skips (prunes) some of the model layers. Read more about: Early exit.
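As a rough illustration, early exit can be implemented as a conditional break out of the layer loop, driven by some confidence estimate computed from the intermediate activations. The confidence function below is only a stand-in (the largest activation value); real early-exit schemes typically attach a small classifier to each candidate exit layer.

    #include <algorithm>
    #include <vector>

    struct Layer { /* per-layer weights omitted */ };

    std::vector<float> layer_forward(const Layer&, const std::vector<float>& x) {
        return x;  // placeholder for the real per-layer computation
    }

    // Placeholder confidence estimate over the intermediate activations.
    float exit_confidence(const std::vector<float>& activations) {
        return activations.empty()
                   ? 0.0f
                   : *std::max_element(activations.begin(), activations.end());
    }

    std::vector<float> run_with_early_exit(const std::vector<Layer>& layers,
                                           std::vector<float> activations,
                                           float threshold) {
        for (const Layer& layer : layers) {
            activations = layer_forward(layer, activations);
            // Early exit: quit the layer loop once the confidence estimate
            // clears the threshold, skipping (pruning) the remaining layers.
            if (exit_confidence(activations) >= threshold) {
                break;
            }
        }
        return activations;
    }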

There are other optimizations of the layer loop: layer pruning, layer skipping, layer reordering, and layer fusion.
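For example, layer skipping can be sketched as a per-layer gate inside the same loop. The run/skip mask below stands in for whatever gating policy (static or dynamic) decides which layers to run; the names are hypothetical.

    #include <vector>

    struct Layer { /* per-layer weights omitted */ };

    std::vector<float> layer_forward(const Layer&, const std::vector<float>& x) {
        return x;  // placeholder for the real per-layer computation
    }

    std::vector<float> run_with_layer_skipping(const std::vector<Layer>& layers,
                                               const std::vector<bool>& run_mask,
                                               std::vector<float> activations) {
        for (size_t i = 0; i < layers.size(); ++i) {
            if (!run_mask[i]) {
                continue;  // skipped layer: activations pass through unchanged
            }
            activations = layer_forward(layers[i], activations);
        }
        return activations;
    }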

Other Dynamic Inference Loop Optimizations

Early exit and layer skipping are not the only dynamic loop optimizations for inference algorithms. Other general loop optimizations also apply, such as loop unrolling, loop fusion, loop fission, loop tiling, and loop-invariant code motion.
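As one illustration, here is classic loop unrolling (by a factor of 4) applied to the kind of dot-product loop that dominates inference kernels; this is a sketch of the technique, not a tuned implementation.

    // Dot product with the inner loop unrolled by 4: fewer loop-control
    // operations per multiply-add than a one-element-per-iteration loop.
    float dot_product_unrolled(const float* a, const float* b, int n) {
        float sum = 0.0f;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            sum += a[i] * b[i]
                 + a[i + 1] * b[i + 1]
                 + a[i + 2] * b[i + 2]
                 + a[i + 3] * b[i + 3];
        }
        // Cleanup loop for leftover elements when n is not a multiple of 4.
        for (; i < n; ++i) {
            sum += a[i] * b[i];
        }
        return sum;
    }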

Some other papers on loop optimizations include:

  • J. Shen, Y. Wang, P. Xu, Y. Fu, Z. Wang, Y. Lin, 2020, Fractional skipping: Towards finer-grained dynamic CNN inference, The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, February 7-12, 2020, AAAI Press, pp. 5700–5708, https://aaai.org/ojs/index.php/AAAI/article/view/6025, https://arxiv.org/abs/2001.00705
  • Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf

Dynamic inference optimizations are not limited to loop optimizations. See also dynamic layer pruning, dynamic layer skipping, dynamic channel pruning, dynamic token pruning, dynamic head pruning, and other dynamic strategies under model pruning. Read the full list of inference optimizations.
