Aussie AI
Index of Inference Optimization Techniques
This page is a long alphabetical list and index of the many and varied inference optimization techniques. See also these articles:
- What's Hot in Inference Optimization?
- 500+ Techniques for LLM Inference Optimization
- Long List of LLM Optimization Techniques
- Inference Optimization Research Blog
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
A
- Activation function approximation
- Activation function reordering
- Activation fusion (fused RELU/GELU/SwiGLU)
- Adaptive inference algorithms (see Dynamic inference)
- Adder networks
- Add-as-integer networks
- Addition optimizations
- Aggressive decoding
- Alignment (see Safety)
- Approximate activation functions
- Approximate addition
- Approximate attention heads
- Approximate caching
- Approximate computing (generally)
- Approximate division
- Approximate layers
- Approximate matrix multiplication
- Approximate multiplication
- Approximate normalization
- Approximate Softmax
- Approximate top-k algorithms
- Approximate Transformer
- Architectures (Transformers)
- ASICs (see Hardware acceleration)
- Attention head approximation
- Attention head pruning
- Attention head caching (see KV caching)
- Attention optimizations
- Autoregressive algorithms (avoiding)
B
- Backpropagation
- Batch normalization (see Norm optimizations)
- Bayesian neural networks
- bcmp intrinsic function
- bcopy intrinsic function
- Beam search decoding
- Bfloat16 number representation
- Bias (see Safety)
- Bias fusion ("fused add-bias")
- Bias vector pruning
- Big-little architectures
- Binary quantization
- _BitScanForward intrinsic function
- _BitScanReverse intrinsic function
- Bitserial operations
- Bitshift-add networks
- Bitshift power-of-two quantization (see the sketch below)
- Bitwise floating point arithmetic
- Bitwise neural networks
- Bitwise operator inference
- Bitwise operator optimizations
- Block pruning
- Blockwise parallel decoding
- Bloom filters (see also hashing)
- Builtin functions
- Butterfly matrices
- Byte-wise intrinsic functions
- bzero intrinsic function
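To illustrate the bitshift power-of-two (logarithmic) quantization entries above, here is a minimal C++ sketch, assuming weights are quantized to a signed power of two so that each multiplication in a dot product becomes an integer shift. The Pow2Weight struct and function names are illustrative only, not from any particular framework.

```cpp
// Illustrative sketch: power-of-two (logarithmic) weight quantization,
// where each weight is stored as a sign and an exponent so that the
// multiply in a dot product becomes an integer bitshift.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Pow2Weight {      // hypothetical storage format
    int8_t sign;         // +1 or -1
    int8_t exponent;     // weight magnitude ~= 2^exponent
};

Pow2Weight quantize_pow2(float w) {
    Pow2Weight q;
    q.sign = (w < 0.0f) ? -1 : 1;
    q.exponent = (int8_t)std::lround(std::log2(std::fabs(w) + 1e-12f));
    return q;
}

// Dot product of int32 activations with power-of-two weights:
// multiplication is replaced by a left/right shift.
int64_t dot_pow2(const std::vector<int32_t>& x, const std::vector<Pow2Weight>& w) {
    int64_t sum = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        int64_t term = (w[i].exponent >= 0)
            ? ((int64_t)x[i] << w[i].exponent)
            : ((int64_t)x[i] >> -w[i].exponent);
        sum += w[i].sign * term;
    }
    return sum;
}

int main() {
    std::vector<int32_t> x = {10, 20, 30};
    std::vector<Pow2Weight> w = {quantize_pow2(0.5f), quantize_pow2(-2.0f), quantize_pow2(1.0f)};
    std::printf("dot = %lld\n", (long long)dot_pow2(x, w));  // 5 - 40 + 30 = -5
}
```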
C
- Caching (of computed results)
- Caching vector dot products (see the sketch below)
- Caching (KV)
- Cascades
- Cell phones (inference) (see AI phones and On-device inference)
- Channel pruning
- Chatbot
- Checkpointing (see recomputation)
- Cloud inference servers
- Clustering (Weights)
- clz (count leading zeros) intrinsic function
- Co-design (hardware-software)
- Code optimizations (generally)
- Collaborative inference (see also ensemble inference)
- Common case first (see Conditional computation)
- Common subexpression elimination
- Compilers (ML Compilers)
- Computation reuse
- Conditional computation
- Consensus decoding
- Constant folding
- Context caching
- Context compression
- Context length
- Context pruning (see token pruning)
- Contrastive decoding
- Convolution optimizations
- copysign intrinsic function
- CPU (see Hardware acceleration)
- CPU inference optimization (see also AI PCs)
- Cross attention
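To illustrate caching of computed results (e.g., caching vector dot products and computation reuse), here is a minimal C++ sketch, assuming the vectors involved are identified by stable integer ids; the DotCache structure and its keying scheme are illustrative only.

```cpp
// Illustrative sketch of computation reuse: memoize dot products between
// vectors identified by stable integer ids, so a repeated (query, key)
// pair is looked up rather than recomputed.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

struct DotCache {                     // hypothetical cache structure
    std::map<std::pair<int, int>, float> memo;

    float get(int qid, const Vec& q, int kid, const Vec& k) {
        auto key = std::make_pair(qid, kid);
        auto it = memo.find(key);
        if (it != memo.end()) return it->second;   // cache hit: reuse
        float d = dot(q, k);                       // cache miss: compute once
        memo[key] = d;
        return d;
    }
};

int main() {
    DotCache cache;
    Vec q = {1, 2, 3}, k = {4, 5, 6};
    std::printf("%f\n", cache.get(0, q, 7, k));  // computed
    std::printf("%f\n", cache.get(0, q, 7, k));  // reused from the cache
}
```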
D
- Data reuse
- Dataset distillation
- Debugging AI framework code
- Deep learning compilers
- Depth pruning (see also Layer pruning)
- Depth-wise separable convolutions
- Desktop PC inference (see AI PCs and On-device inference)
- Device inference (for phones/PCs)
- Difference squares
- Distillation (knowledge distillation)
- Division optimizations
- Division quantization
- Division vs reciprocal multiplication (see the sketch below)
- Dot product optimization
- Double-base number system (DBNS)
- Dual pruning (depth+width pruning)
- Dyadic numbers
- Dyadic quantization
- Dynamic inference
- Dynamic NAS
- Dynamic number systems
- Dynamic pruning
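To illustrate the division-vs-reciprocal-multiplication entry above, here is a minimal C++ sketch: one division computes the reciprocal of a loop-invariant denominator, and the per-element divisions become cheaper multiplications. Function names are illustrative.

```cpp
// Illustrative sketch: division vs reciprocal multiplication.
// Dividing every element by the same denominator is replaced by one
// division (computing the reciprocal) plus N cheaper multiplications.
#include <cstdio>
#include <vector>

void normalize_slow(std::vector<float>& v, float denom) {
    for (float& x : v) x = x / denom;          // N divisions
}

void normalize_fast(std::vector<float>& v, float denom) {
    const float recip = 1.0f / denom;          // one division, hoisted out of the loop
    for (float& x : v) x = x * recip;          // N multiplications
}

int main() {
    std::vector<float> a = {2.0f, 4.0f, 6.0f};
    std::vector<float> b = a;
    normalize_slow(a, 2.0f);
    normalize_fast(b, 2.0f);
    std::printf("%f %f\n", a[1], b[1]);        // both print 2.0
}
```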
E
- Early exit (see also layer pruning)
- Easy case first (see Conditional computation)
- Edge devices (inference) (see AI phones and AI PCs)
- Element-wise multiplication models (Hadamard)
- Embeddings pruning
- End-to-end integer inference
- End-to-end logarithmic model
- Ensemble architectures (multi-model AI engines)
- Ensemble Knowledge Distillation
- Environment (see Green AI)
- Ethics (see Safety)
- Extreme quantization (see Binary quantization)
F
- Fairness (see Safety)
- Fake quantization
- FFN pruning
- ffs (find first set bit) intrinsic function
- Filter pruning (see width pruning)
- Fixed point number system
- Flash Attention
- Flash Decoding
- Floating point bitwise arithmetic
- Floating-point intrinsic functions
- Floating point number system (FLP)
- fma (fused multiply-add) intrinsic function
- FP4 quantization
- FP8 quantization
- FP16 quantization
- FP32
- FPGA (see Hardware acceleration)
- Frameworks (inference)
- frexp intrinsic function (extract mantissa/exponent)
- Fused activation functions (fused RELU/GELU/SwiGLU) (see the sketch below)
- Fused bias ("fused add-bias")
- Fused BatchNorm
- Fused GELU
- Fused LayerNorm
- Fused layers (see also parameter sharing)
- Fused MHA
- Fused Multiply-Add (FMA)
- Fused normalization (e.g. "fused LayerNorm")
- Fused RELU
- Fused Softmax
- Fused SwiGLU
- Fused transpose
- Fission (kernel operator)
- Fusion (kernel operator)
- Fuzzy logic
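To illustrate kernel operator fusion of activation functions (e.g., a fused RELU), here is a minimal C++ sketch, assuming a simple matrix-vector product: the fused version applies RELU as each output element is produced, avoiding a second pass over the output vector. Function names are illustrative, not from any particular framework.

```cpp
// Illustrative sketch of activation fusion: apply RELU to each output
// element as it is produced by the matrix-vector kernel, rather than
// in a separate second pass over the output vector.
#include <algorithm>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<float>>;
using Vec = std::vector<float>;

// Unfused: one kernel for the matrix-vector product, a second kernel for RELU.
Vec matvec_then_relu(const Matrix& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += W[i][j] * x[j];
    for (float& v : y) v = std::max(0.0f, v);        // separate RELU pass
    return y;
}

// Fused: RELU is applied as each output element is finalized.
Vec matvec_relu_fused(const Matrix& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < x.size(); ++j)
            sum += W[i][j] * x[j];
        y[i] = std::max(0.0f, sum);                  // fused RELU
    }
    return y;
}

int main() {
    Matrix W = {{1, -1}, {-2, 1}};
    Vec x = {3, 1};
    Vec a = matvec_then_relu(W, x), b = matvec_relu_fused(W, x);
    std::printf("%f %f | %f %f\n", a[0], a[1], b[0], b[1]);   // 2 0 | 2 0
}
```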
G
- Generalized speculative decoding
- Governance (see Safety)
- GPU (see Hardware acceleration)
- Gradient checkpointing (see recomputation)
- Gradient descent
- Graph compilers
- Graph search algorithms
- Greedy decoding
- Green AI
- Group Query Attention (GQA)
- Grouped convolutions
H
- Hadamard multiplication models
- Hallucinations (see Safety)
- Hardware acceleration
- Hardware-software co-design
- Hashing
- Hashing vectors (Locality-Sensitive Hashing)
- Head pruning
- Hybrid cloud-on-device inference
- Hybrid number systems
- Hybrid optimization techniques
- Hybrid pruning (see Dual pruning)
- Hybrid Transformer architectures
- Hyper-Parameter Optimization (HPO) (see Neural Architecture Search (NAS))
I
- IEEE 754 floating point
- ilogb intrinsic function (extract exponent)
- Incremental algorithms
- Incremental inference (see Caching)
- Inference Cache
- Inference optimizations (generally)
- Integer arithmetic
- Integer-only activation functions
- Integer-only inference (end-to-end)
- Integer-only normalization
- Integer-only Softmax
- Integer-only-arithmetic quantization
- Integer quantization
- INT2 quantization (2-bit)
- INT3 quantization (3-bit)
- INT4 quantization (4-bit)
- INT8 quantization
- INT16 quantization
- INT32 quantization
- Intrinsic functions
J
K
- Kernel operator fission (splitting)
- Kernel operator fusion (merging)
- Kernel optimizations (see inference optimization)
- Kernel tiling
- Knowledge distillation (KD)
- KV caching (see the sketch below)
- KV Cache Compression
- KV Cache Quantization
- KV caching in early exit
- KV cache sparsity
- KV cache token pruning
- KV cache eviction policies
- KV cache layer fusion
- KV cache layer pruning
- KV cache reuse
- KV cache global (multi-query KV caching)
- KV cache prefix
- KV cache session-based (multi-turn KV caching)
- KV cache substring (Lengthwise-fused KV caching)
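To illustrate basic KV caching, here is a minimal C++ sketch of one attention head during autoregressive decoding, assuming the per-token query, key, and value vectors are already projected: each step appends one (key, value) pair to the cache and attends over the cached entries instead of recomputing them. The KVCache structure and function names are illustrative.

```cpp
// Illustrative sketch of KV caching for one attention head: keys and values
// of previous tokens are stored, so each new decoding step only appends one
// (key, value) pair and scores the new query against the cache.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<float>;

struct KVCache {                 // hypothetical per-head cache
    std::vector<Vec> keys;
    std::vector<Vec> values;
};

float dot(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// One decoding step: append the new token's key/value, then attend
// over all cached positions with a softmax of scaled dot-product scores.
Vec attend_step(KVCache& cache, const Vec& q, const Vec& k, const Vec& v) {
    cache.keys.push_back(k);
    cache.values.push_back(v);
    size_t n = cache.keys.size(), d = q.size();

    std::vector<float> score(n);
    float maxs = -1e30f;
    for (size_t t = 0; t < n; ++t) {
        score[t] = dot(q, cache.keys[t]) / std::sqrt((float)d);
        maxs = std::max(maxs, score[t]);
    }
    float denom = 0.0f;
    for (size_t t = 0; t < n; ++t) { score[t] = std::exp(score[t] - maxs); denom += score[t]; }

    Vec out(d, 0.0f);
    for (size_t t = 0; t < n; ++t)
        for (size_t i = 0; i < d; ++i)
            out[i] += (score[t] / denom) * cache.values[t][i];
    return out;
}

int main() {
    KVCache cache;
    Vec o1 = attend_step(cache, {1, 0}, {1, 0}, {0.5f, 0.5f});
    Vec o2 = attend_step(cache, {0, 1}, {0, 1}, {1.0f, 0.0f});
    std::printf("%f %f\n", o2[0], o2[1]);
}
```

Note that the cache grows linearly with the number of generated tokens, which is why the eviction, compression, and quantization entries above exist.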
L
- Laptop AI inference (see AI PCs and On-device inference)
- Laws (see Safety)
- Lazy evaluation (see Conditional computation)
- Layer approximation
- Layer fusion
- Layer pruning
- Layer reordering
- Layer skipping
- ldexp intrinsic function (float bitshift)
- Length extension (context window)
- Length pruning
- Linear attention optimizations
- Local attention
- Locality-Sensitive Hashing (LSH) (vector hashing)
- Logarithmic approximate multiplication
- Logarithmic number system (LNS)
- Logarithmic quantization
- logb intrinsic function (extract exponent)
- Log-sum-exp (LSE) networks
- Long context models
- Lookahead decoding
- Lookup decoding
- Look-up tables (LUTs)
- Loop coalescing
- Loop collapsing
- Loop distribution
- Loop fission
- Loop fusion (see also kernel fusion)
- Loop interchange
- Loop-invariant code motion ("hoisting")
- Loop normalization
- Loop optimizations
- Loop parallelization (see Vectorization)
- Loop peeling
- Loop perforation
- Loop reordering
- Loop reversal
- Loop sectioning (strip mining)
- Loop sentinel
- Loop skewing
- Loop splitting
- Loop spreading
- Loop strip mining
- Loop tiling
- Loop unrolling (see the sketch below)
- Loop unswitching (see Loop distribution)
- Lottery ticket hypothesis (see Pruning)
- Low-bit integer quantization
- LoRA (Low-Rank Adaptation)
- Low-rank matrices
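To illustrate loop unrolling, here is a minimal C++ sketch of a dot product unrolled by a factor of four, with a scalar cleanup loop for the remainder; the unroll factor and function name are illustrative.

```cpp
// Illustrative sketch of loop unrolling: the dot-product loop is unrolled
// by a factor of 4, with a scalar cleanup loop handling leftover elements.
#include <cstdio>
#include <vector>

float dot_unrolled4(const float* a, const float* b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {             // unrolled body: 4 elements per iteration
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < n; ++i) sum += a[i] * b[i];   // cleanup loop for the remainder
    return sum;
}

int main() {
    std::vector<float> a = {1, 2, 3, 4, 5}, b = {1, 1, 1, 1, 1};
    std::printf("%f\n", dot_unrolled4(a.data(), b.data(), a.size()));  // 15
}
```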
M
- Machine learning compilers
- Magnitude pruning
- Market research
- Matrix algebra optimizations
- Matrix multiplication algorithms
- Max-Plus networks (see also Tropical algebra)
- Medical ethics (see Safety)
- memcmp byte-compare intrinsic function
- memcpy byte-copy intrinsic function
- memmove byte-move intrinsic function
- Memoization (see Caching)
- Memory optimization
- memset intrinsic function
- MHA
- MHA fusion
- Min-Max-Plus networks
- MiniMax algebra
- Mitchell approximate multiplication
- Mixed-precision quantization
- Mixture of Experts (MoE)
- Mobile phones (inference) (see AI phones and On-device inference)
- Model compression
- Model selection
- modf intrinsic function (float fractional part)
- Monarch matrices
- Monte Carlo method (see Probabilistic algorithms)
- Morphological network (see Max-Plus networks)
- Movement pruning
- Multi-AI (see Ensemble)
- Multi-dimensional logarithmic number system (MDLNS)
- Multi-dimensional pruning
- Multi-Head Attention (MHA)
- Multi-head attention fusion
- Multi-Query Attention (MQA)
- Multi-threading (parallelization)
- Multi-turn KV cache (session-based KV caching)
- Multiple-Base Number System (MBNS)
- Multiplication algorithms
- Multiplication approximation
- Multiplication-free inference
- Multiply by reciprocal
N
- n-gram parallel decoding
- NAND bitwise operator
- Narrow networks (see Slim networks)
- Negative skipping
- Neural Architecture Search (NAS)
- Neural text degeneration (see Greedy decoding)
- nlz (number of leading zeros) intrinsic function
- Non-autoregressive algorithms
- NOR bitwise operator
- Normalization approximation
- Normalization fusion (e.g. "fused LayerNorm")
- Normalization pruning
- Normalization reordering
- Notebook AI inference (see AI PCs and On-device inference)
- Nucleus sampling (see Top-p sampling)
- Number systems
O
- Offloading
- Offloading (recomputation)
- On-device inference (for phones/PCs)
- Operator fission (splitting operators)
- Operator fusion (combining operators)
P
- Padding
- Padding avoidance
- Paged Attention
- Parallel decoding
- Parallelization
- Parameter sharing (aka "weight sharing")
- Partitioning
- PC devices (inference) (see also On-device inference)
- Permutation arrays
- Ph.D. thesis topics
- Phone AI devices (inference) (see also On-device inference)
- Pipelining
- Pointer arithmetic
- Pointers vs references
- Popcount intrinsic function (see also Bitserial operations)
- Portability
- Posit number system (PNS)
- Positional encoding optimization
- Positional encoding pruning
- Post-Activation (reordering)
- Post-norm (Normalization reordering)
- Post-Training Quantization (PTQ)
- Power-of-two quantization (logarithmic quantization)
- Pre-Activation (reordering)
- Precomputation (see also Weight precomputations, and the sketch below)
- Prefix KV cache
- Pre-norm (Normalization reordering)
- Probabilistic optimizations
- Prompt compression
- Prompt lookup decoding
- Pruning
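To illustrate precomputation via look-up tables, here is a minimal C++ sketch, assuming INT8-quantized activations: since there are only 256 possible inputs, a GELU table is precomputed once and applied by indexing. The GeluTable structure and the scale value are illustrative.

```cpp
// Illustrative sketch of precomputation: for quantized INT8 activations
// there are only 256 possible inputs, so an activation function (GELU here)
// can be precomputed into a lookup table and applied by indexing.
#include <cmath>
#include <cstdint>
#include <cstdio>

static float gelu(float x) {   // tanh approximation of GELU
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

struct GeluTable {             // hypothetical precomputed table
    float table[256];
    explicit GeluTable(float scale) {
        for (int q = -128; q <= 127; ++q)
            table[q + 128] = gelu(q * scale);   // dequantize, then apply GELU
    }
    float lookup(int8_t q) const { return table[(int)q + 128]; }
};

int main() {
    GeluTable lut(0.05f);                        // scale chosen for illustration only
    std::printf("%f %f\n", lut.lookup(40), gelu(40 * 0.05f));  // same value
}
```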
Q
- Quadratic attention complexity
- Quantization (see the INT8 sketch below)
- Quantization, KV Cache
- Quantization-Aware Training (QAT)
- QLoRA (Quantized Low-Rank Adaptation)
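To illustrate basic integer quantization, here is a minimal C++ sketch of symmetric per-tensor INT8 post-training quantization: a single scale maps floats into [-127, 127], and dequantization multiplies back by the scale. This is the generic symmetric scheme, not any specific toolkit's implementation.

```cpp
// Illustrative sketch of symmetric per-tensor INT8 quantization:
// quantize with a single scale factor, dequantize by multiplying back.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct QuantizedTensor {            // hypothetical container
    std::vector<int8_t> q;
    float scale;
};

QuantizedTensor quantize_int8(const std::vector<float>& w) {
    float maxabs = 0.0f;
    for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
    QuantizedTensor out;
    out.scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;   // per-tensor scale
    out.q.reserve(w.size());
    for (float x : w) {
        int v = (int)std::lround(x / out.scale);
        out.q.push_back((int8_t)std::clamp(v, -127, 127));
    }
    return out;
}

float dequantize(const QuantizedTensor& t, size_t i) {
    return t.q[i] * t.scale;        // approximate reconstruction of the float weight
}

int main() {
    std::vector<float> w = {0.12f, -0.5f, 0.31f, 0.05f};
    QuantizedTensor t = quantize_int8(w);
    std::printf("w[1]=%f ~ %f\n", w[1], dequantize(t, 1));
}
```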
R
- Random number seeding
- Reciprocal multiplication
- References vs pointers
- Regulation (see Safety)
- remainder intrinsic function (floating-point remainder)
- Reordering layers
- Reordering loops
- Research questions
- Residue number system (RNS)
- Reuse computations
- Risks (see Safety)
S
- Safety
- Scalar product optimization
- scalbn intrinsic function (float bitshift)
- Self speculative decoding
- Semi-Logarithmic Number System (SLNS)
- Sentinel loop optimization
- Session KV cache (multi-turn KV caching)
- Shallow decoder transformer
- Sharing weights (see Parameter sharing)
- Shift-add networks
- signbit intrinsic function
- Simple case first (see Conditional computation)
- Simulated quantization
- Skipping
- Skipping layers
- Skipping negatives
- Skipping zero
- Sliding window attention
- Slimmable networks (width pruning)
- Smartphones (inference) (see AI phones and On-device inference)
- Softmax alternatives
- Softmax approximation
- Softmax fusion
- Softmax pruning
- Software acceleration (optimizations)
- Software frameworks (inference)
- Software-hardware co-design
- Sparse attention
- Sparse matrices (sparsification)
- Speculative decoding (see also Transformer architectures)
- Speculative decoding (generalized)
- Static pruning
- Stochastic quantization
- Stochastic algorithms (see Probabilistic algorithms)
- Strength Reduction
- Structured pruning
- Superposed decoding
- Swarm ensemble architectures
T
- Table lookup
- Temperature (Softmax)
- Tensor decomposition (see Low-rank matrix factorization)
- Ternary quantization
- Thin networks (see Slim networks)
- Tiling (kernel)
- Tiling (loop)
- Token merging
- Token pruning
- Token skipping (see Token pruning)
- Top-k decoding (see the sketch below)
- Top-p decoding
- Top-p sampling (decoding)
- Training optimizations
- Transformer architecture optimizations
- Tree attention
- Tree speculative decoding
- Trigonometric approximate inference
- Triple-axis pruning (see also Dual pruning)
- Tropical algebra (see also Max-Plus networks)
- Type consistency
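To illustrate top-k decoding with a Softmax temperature, here is a minimal C++ sketch: logits are scaled by 1/temperature, only the k largest are kept, and one token is sampled from the renormalized truncated distribution. The values and function name are illustrative.

```cpp
// Illustrative sketch of top-k decoding with temperature: logits are scaled
// by 1/temperature, the k largest are kept, softmax-normalized, and one
// token is sampled from the truncated distribution.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int sample_top_k(const std::vector<float>& logits, int k, float temperature, std::mt19937& rng) {
    // Sort token indices by logit, descending.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });
    k = std::min<int>(k, (int)idx.size());

    // Softmax over the k best, with temperature and max-subtraction for stability.
    float maxl = logits[idx[0]] / temperature;
    std::vector<float> prob(k);
    float denom = 0.0f;
    for (int i = 0; i < k; ++i) {
        prob[i] = std::exp(logits[idx[i]] / temperature - maxl);
        denom += prob[i];
    }

    // Sample from the truncated, renormalized distribution.
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    float r = uni(rng) * denom, cum = 0.0f;
    for (int i = 0; i < k; ++i) {
        cum += prob[i];
        if (r <= cum) return idx[i];
    }
    return idx[k - 1];
}

int main() {
    std::mt19937 rng(42);
    std::vector<float> logits = {2.0f, 0.5f, -1.0f, 1.5f, 0.0f};
    std::printf("sampled token = %d\n", sample_top_k(logits, 3, 0.8f, rng));
}
```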
U
- Unnatural instructions (data sets)
- Unstructured pruning (see also Magnitude pruning)
V
- Vanilla Transformer
- Vector dot product caching
- Vector dot product optimization
- Vector hashing
- Vectorization (see the sketch below)
- Vision neural network
- Vocabulary size
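To illustrate vectorization, here is a minimal C++ sketch of a dot product written with AVX2/FMA intrinsics, processing eight floats per iteration; it assumes an x86-64 CPU with AVX2 and FMA support (e.g., compile with -mavx2 -mfma). Compilers can often auto-vectorize the equivalent scalar loop, so explicit intrinsics are only one way to get SIMD.

```cpp
// Illustrative sketch of vectorization: an AVX2/FMA dot product processing
// 8 floats per iteration, with a scalar loop for the remainder.
// Assumes an x86-64 CPU with AVX2 and FMA (compile with -mavx2 -mfma).
#include <immintrin.h>
#include <cstdio>
#include <vector>

float dot_avx2(const float* a, const float* b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);     // acc += va * vb (8 lanes at once)
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float x : lanes) sum += x;             // horizontal reduction of the lanes
    for (; i < n; ++i) sum += a[i] * b[i];      // scalar remainder loop
    return sum;
}

int main() {
    std::vector<float> a(20, 1.0f), b(20, 2.0f);
    std::printf("%f\n", dot_avx2(a.data(), b.data(), a.size()));  // 40
}
```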
W
- Weight clustering (see the sketch below)
- Weight precomputations
- Weight pruning (see Magnitude pruning and Model pruning)
- Weight sharing (see Parameter sharing)
- Weightless Neural Networks (WNNs)
- Width pruning
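To illustrate weight clustering, here is a minimal C++ sketch: each weight is replaced by an index into a small codebook of shared values, and inference reconstructs weights by table lookup. The codebook here is given directly; in practice it would typically come from k-means over the weights. Structure and function names are illustrative.

```cpp
// Illustrative sketch of weight clustering: each weight is stored as an
// index into a small codebook of shared centroid values, and inference
// reconstructs the weight with a table lookup.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct ClusteredWeights {               // hypothetical clustered format
    std::vector<float> codebook;        // shared centroid values
    std::vector<uint8_t> index;         // one small index per weight
};

// Assign each weight to its nearest centroid (codebook assumed given;
// in practice the codebook would come from k-means over the weights).
ClusteredWeights cluster(const std::vector<float>& w, const std::vector<float>& codebook) {
    ClusteredWeights cw;
    cw.codebook = codebook;
    for (float x : w) {
        size_t best = 0;
        for (size_t c = 1; c < codebook.size(); ++c)
            if (std::fabs(x - codebook[c]) < std::fabs(x - codebook[best])) best = c;
        cw.index.push_back((uint8_t)best);
    }
    return cw;
}

float weight_at(const ClusteredWeights& cw, size_t i) {
    return cw.codebook[cw.index[i]];    // lookup instead of storing the full float
}

int main() {
    std::vector<float> w = {0.11f, -0.48f, 0.02f, 0.52f};
    ClusteredWeights cw = cluster(w, {-0.5f, 0.0f, 0.5f});
    std::printf("%f -> %f\n", w[3], weight_at(cw, 3));   // 0.52 -> 0.5
}
```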
X
Y
- Yield operation
Z