Aussie AI
Dynamic Inference Optimizations
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Whereas model compression makes static changes to a model, numerous "dynamic" optimizations target the runtime inference code instead. Each AI model architecture has slightly different features in its inference loop, but the underlying code is highly iterative: it loops over multiple layers, each of which in turn loops over many matrices or tensors of weights.
Numerous methods have been proposed to reduce the number of computations or to replace them with simpler arithmetic. A short list of such optimizations includes:
- Early exits of loops (dynamically skipping layers)
- Dynamic pruning (depth pruning, width pruning, length pruning)
- Zero-multiplication models
- Integer-only arithmetic quantization
- Loop vectorization optimizations (e.g. hardware acceleration, loop tiling, etc.)
- Sparsification
- Matrix factorization (low-rank) and matrix algebra
- Non-autoregression (parallelized output of multiple tokens per iteration)
- General programming loop optimizations (e.g. loop unrolling, parallelization, etc.)
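To make the first item concrete, here is a minimal sketch of an early-exit inference loop in C++. Everything here is hypothetical: the "layer" is stubbed as a simple scaling transform, and the confidence estimate is a toy heuristic, whereas a real model would run attention and feed-forward computations and estimate confidence from the output distribution. The point is only the control flow: once confidence crosses a threshold, the remaining layers are skipped dynamically.

```cpp
#include <cstddef>
#include <vector>

// Stub layer: scales the activations (a real layer would do
// attention plus feed-forward computations over weight tensors).
static void run_layer(std::vector<float>& acts) {
    for (float& a : acts) a *= 0.9f;
}

// Stub confidence estimate: fraction of activations below 0.5
// (a real estimator might use the softmax of the output logits).
static float confidence(const std::vector<float>& acts) {
    std::size_t low = 0;
    for (float a : acts) {
        if (a < 0.5f) ++low;
    }
    return static_cast<float>(low) / static_cast<float>(acts.size());
}

// Runs up to num_layers layers, exiting early once the confidence
// threshold is reached. Returns the number of layers executed.
int infer_with_early_exit(std::vector<float>& acts,
                          int num_layers, float threshold) {
    for (int i = 0; i < num_layers; ++i) {
        run_layer(acts);
        if (confidence(acts) >= threshold) {
            return i + 1;  // early exit: skip the remaining layers
        }
    }
    return num_layers;  // no early exit triggered
}
```

The same loop skeleton is where several of the other listed optimizations would live: dynamic depth pruning also skips layers, and loop unrolling or vectorization would apply inside `run_layer`.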
The other major optimization strategy is to use multiple models. Some of the research ideas include:
- Model selection algorithms
- Mixture of experts
- Big-little architectures
- Speculative decoding
- Consensus decoding
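As an illustration of the multi-model idea, here is a hypothetical sketch of a big-little architecture in C++. Both models are stubs (the names `little_model` and `big_model` and the confidence heuristic are assumptions for illustration); the essential pattern is that a small, fast model answers first, and the query is escalated to the large, expensive model only when the little model's confidence falls below a cutoff.

```cpp
#include <string>
#include <utility>

// A model's answer together with its self-reported confidence.
struct Result {
    std::string answer;
    float confidence;
};

// Stub little model: cheap and fast, but only confident on "easy"
// queries (here, toy heuristic: short queries are easy).
static Result little_model(const std::string& query) {
    float conf = (query.size() <= 10) ? 0.95f : 0.40f;
    return { "little:" + query, conf };
}

// Stub big model: slow and costly, but always confident.
static Result big_model(const std::string& query) {
    return { "big:" + query, 0.99f };
}

// Big-little dispatch: returns the chosen answer and a flag saying
// whether the big model had to be consulted.
std::pair<std::string, bool> answer(const std::string& query,
                                    float cutoff) {
    Result r = little_model(query);
    if (r.confidence >= cutoff) {
        return { r.answer, false };  // little model suffices
    }
    return { big_model(query).answer, true };  // escalate to big model
}
```

Speculative decoding uses a related division of labor, with the small model drafting tokens that the large model then verifies in parallel rather than answering whole queries on its own.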
See the following chapters for more detailed information on many of these research topics.