Aussie AI
Index of Inference Optimization Techniques
This page is a long alphabetical list and index of the many and varied inference optimization techniques. See also these articles:
- What's Hot in Inference Optimization?
- 500+ Techniques for LLM Inference Optimization
- Long List of LLM Optimization Techniques
- Inference Optimization Research Blog
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
A
- Activation function approximation
- Activation function reordering
- Activation fusion (fused RELU/GELU/SwiGLU)
- Adaptive inference algorithms (see Dynamic inference)
- Adder networks
- Add-as-integer networks
- Addition optimizations
- Aggressive decoding
- Alignment (see Safety)
- Approximate activation functions
- Approximate addition
- Approximate attention heads
- Approximate caching
- Approximate computing (generally)
- Approximate division
- Approximate layers
- Approximate matrix multiplication
- Approximate multiplication
- Approximate normalization
- Approximate Softmax
- Approximate top-k algorithms
- Approximate Transformer
- Architectures (Transformers)
- ASICs (see Hardware acceleration)
- Attention head approximation
- Attention head pruning
- Attention head caching (see KV caching)
- Attention optimizations
- Autoregressive algorithms (avoiding)
B
- Backpropagation
- Batch normalization (see Norm optimizations)
- Bayesian neural networks
- bcmp intrinsic function
- bcopy intrinsic function
- Beam search decoding
- Bfloat16 number representation
- Bias (see Safety)
- Bias fusion ("fused add-bias")
- Bias vector pruning
- Big-little architectures
- Binary quantization
- _BitScanForward intrinsic function
- _BitScanReverse intrinsic function
- Bitserial operations
- Bitshift-add networks
- Bitshift power-of-two quantization (see the sketch below)
- Bitwise floating point arithmetic
- Bitwise neural networks
- Bitwise operator inference
- Bitwise operator optimizations
- Block pruning
- Blockwise parallel decoding
- Bloom filters (see also hashing)
- Builtin functions
- Butterfly matrices
- Byte-wise intrinsic functions
- bzero intrinsic function
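To illustrate the bitshift power-of-two (logarithmic) quantization entries above, here is a minimal C++ sketch, assuming weights are quantized to a signed power of two so that each multiplication in a dot product becomes an integer shift. The Pow2Weight struct and function names are illustrative only, not from any particular framework.

```cpp
// Illustrative sketch: power-of-two (logarithmic) weight quantization,
// where each weight is stored as a sign and an exponent so that the
// multiply in a dot product becomes an integer bitshift.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Pow2Weight {      // hypothetical storage format
    int8_t sign;         // +1 or -1
    int8_t exponent;     // weight magnitude ~= 2^exponent
};

Pow2Weight quantize_pow2(float w) {
    Pow2Weight q;
    q.sign = (w < 0.0f) ? -1 : 1;
    q.exponent = (int8_t)std::lround(std::log2(std::fabs(w) + 1e-12f));
    return q;
}

// Dot product of int32 activations with power-of-two weights:
// multiplication is replaced by a left/right shift.
int64_t dot_pow2(const std::vector<int32_t>& x, const std::vector<Pow2Weight>& w) {
    int64_t sum = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        int64_t term = (w[i].exponent >= 0)
            ? ((int64_t)x[i] << w[i].exponent)
            : ((int64_t)x[i] >> -w[i].exponent);
        sum += w[i].sign * term;
    }
    return sum;
}

int main() {
    std::vector<int32_t> x = {10, 20, 30};
    std::vector<Pow2Weight> w = {quantize_pow2(0.5f), quantize_pow2(-2.0f), quantize_pow2(1.0f)};
    std::printf("dot = %lld\n", (long long)dot_pow2(x, w));  // 5 - 40 + 30 = -5
}
```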
C
- Caching (of computed results)
- Caching vector dot products (see the sketch below)
- Caching (KV)
- Cascades
- Cell phones (inference) (see AI phones and On-device inference)
- Channel pruning
- Chatbot
- Checkpointing (see recomputation)
- Cloud inference servers
- Clustering (Weights)
- clz (count leading zeros) intrinsic function
- Co-design (hardware-software)
- Code optimizations (generally)
- Collaborative inference (see also ensemble inference)
- Common case first (see Conditional computation)
- Common subexpression elimination
- Compilers (ML Compilers)
- Computation reuse
- Conditional computation
- Consensus decoding
- Constant folding
- Context caching
- Context compression
- Context length
- Context pruning (see token pruning)
- Contrastive decoding
- Convolution optimizations
- copysign intrinsic function
- CPU (see Hardware acceleration)
- CPU inference optimization (see also AI PCs)
- Cross attention
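To illustrate caching of computed results (e.g., caching vector dot products and computation reuse), here is a minimal C++ sketch, assuming the vectors involved are identified by stable integer ids; the DotCache structure and its keying scheme are illustrative only.

```cpp
// Illustrative sketch of computation reuse: memoize dot products between
// vectors identified by stable integer ids, so a repeated (query, key)
// pair is looked up rather than recomputed.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

struct DotCache {                     // hypothetical cache structure
    std::map<std::pair<int, int>, float> memo;

    float get(int qid, const Vec& q, int kid, const Vec& k) {
        auto key = std::make_pair(qid, kid);
        auto it = memo.find(key);
        if (it != memo.end()) return it->second;   // cache hit: reuse
        float d = dot(q, k);                       // cache miss: compute once
        memo[key] = d;
        return d;
    }
};

int main() {
    DotCache cache;
    Vec q = {1, 2, 3}, k = {4, 5, 6};
    std::printf("%f\n", cache.get(0, q, 7, k));  // computed
    std::printf("%f\n", cache.get(0, q, 7, k));  // reused from the cache
}
```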
D
- Data reuse
- Dataset distillation
- Debugging AI framework code
- Deep learning compilers
- Depth pruning (see also Layer pruning)
- Depth-wise separable convolutions
- Desktop PC inference (see AI PCs and On-device inference)
- Device inference (for phones/PCs)
- Difference squares
- Distillation (knowledge distillation)
- Division optimizations
- Division quantization
- Division vs reciprocal multiplication (see the sketch below)
- Dot product optimization
- Double-base number system (DBNS)
- Dual pruning (depth+width pruning)
- Dyadic numbers
- Dyadic quantization
- Dynamic inference
- Dynamic NAS
- Dynamic number systems
- Dynamic pruning
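To illustrate the division-vs-reciprocal-multiplication entry above, here is a minimal C++ sketch: one division computes the reciprocal of a loop-invariant denominator, and the per-element divisions become cheaper multiplications. Function names are illustrative.

```cpp
// Illustrative sketch: division vs reciprocal multiplication.
// Dividing every element by the same denominator is replaced by one
// division (computing the reciprocal) plus N cheaper multiplications.
#include <cstdio>
#include <vector>

void normalize_slow(std::vector<float>& v, float denom) {
    for (float& x : v) x = x / denom;          // N divisions
}

void normalize_fast(std::vector<float>& v, float denom) {
    const float recip = 1.0f / denom;          // one division, hoisted out of the loop
    for (float& x : v) x = x * recip;          // N multiplications
}

int main() {
    std::vector<float> a = {2.0f, 4.0f, 6.0f};
    std::vector<float> b = a;
    normalize_slow(a, 2.0f);
    normalize_fast(b, 2.0f);
    std::printf("%f %f\n", a[1], b[1]);        // both print 2.0
}
```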
E
- Early exit (see also layer pruning)
- Easy case first (see Conditional computation)
- Edge devices (inference) (see AI phones and AI PCs)
- Element-wise multiplication models (Hadamard)
- Embeddings pruning
- End-to-end integer inference
- End-to-end logarithmic model
- Ensemble architectures (multi-model AI engines)
- Ensemble Knowledge Distillation
- Environment (see Green AI)
- Ethics (see Safety)
- Extreme quantization (see Binary quantization)
F
- Fairness (see Safety)
- Fake quantization
- FFN pruning
- ffs (find first set bit) intrinsic function
- Filter pruning (see width pruning)
- Fixed point number system
- Flash Attention
- Flash Decoding
- Floating point bitwise arithmetic
- Floating-point intrinsic functions
- Floating point number system (FLP)
- fma (fused multiply-add) intrinsic function
- FP4 quantization
- FP8 quantization
- FP16 quantization
- FP32
- FPGA (see Hardware acceleration)
- Frameworks (inference)
- frexp intrinsic function (extract mantissa/exponent)
- Fused activation functions (fused RELU/GELU/SwiGLU) (see the sketch below)
- Fused bias ("fused add-bias")
- Fused BatchNorm
- Fused GELU
- Fused LayerNorm
- Fused layers (see also parameter sharing)
- Fused MHA
- Fused Multiply-Add (FMA)
- Fused normalization (e.g. "fused LayerNorm")
- Fused RELU
- Fused Softmax
- Fused SwiGLU
- Fused transpose
- Fission (kernel operator)
- Fusion (kernel operator)
- Fuzzy logic
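To illustrate kernel operator fusion of activation functions (e.g., a fused RELU), here is a minimal C++ sketch, assuming a simple matrix-vector product: the fused version applies RELU as each output element is produced, avoiding a second pass over the output vector. Function names are illustrative, not from any particular framework.

```cpp
// Illustrative sketch of activation fusion: apply RELU to each output
// element as it is produced by the matrix-vector kernel, rather than
// in a separate second pass over the output vector.
#include <algorithm>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<float>>;
using Vec = std::vector<float>;

// Unfused: one kernel for the matrix-vector product, a second kernel for RELU.
Vec matvec_then_relu(const Matrix& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += W[i][j] * x[j];
    for (float& v : y) v = std::max(0.0f, v);        // separate RELU pass
    return y;
}

// Fused: RELU is applied as each output element is finalized.
Vec matvec_relu_fused(const Matrix& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < x.size(); ++j)
            sum += W[i][j] * x[j];
        y[i] = std::max(0.0f, sum);                  // fused RELU
    }
    return y;
}

int main() {
    Matrix W = {{1, -1}, {-2, 1}};
    Vec x = {3, 1};
    Vec a = matvec_then_relu(W, x), b = matvec_relu_fused(W, x);
    std::printf("%f %f | %f %f\n", a[0], a[1], b[0], b[1]);   // 2 0 | 2 0
}
```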
G
- Generalized speculative decoding
- Governance (see Safety)
- GPU (see Hardware acceleration)
- Gradient checkpointing (see recomputation)
- Gradient descent
- Graph compilers
- Graph search algorithms
- Greedy decoding
- Green AI
- Group Query Attention (GQA)
- Grouped convolutions
H
- Hadamard multiplication models
- Hallucinations (see Safety)
- Hardware acceleration
- Hardware-software co-design
- Hashing
- Hashing vectors (Locality-Sensitive Hashing)
- Head pruning
- Hybrid cloud-on-device inference
- Hybrid number systems
- Hybrid optimization techniques
- Hybrid pruning (see Dual pruning)
- Hybrid Transformer architectures
- Hyper-Parameter Optimization (HPO) (see Neural Architecture Search (NAS))
I
- IEEE 754 floating point
- ilogb intrinsic function (extract exponent)
- Incremental algorithms
- Incremental inference (see Caching)
- Inference Cache
- Inference optimizations (generally)
- Integer arithmetic
- Integer-only activation functions
- Integer-only inference (end-to-end)
- Integer-only normalization
- Integer-only Softmax
- Integer-only-arithmetic quantization
- Integer quantization
- INT2 quantization (2-bit)
- INT3 quantization (3-bit)
- INT4 quantization (4-bit)
- INT8 quantization
- INT16 quantization
- INT32 quantization
- Intrinsic functions
J
K
- Kernel operator fission (splitting)
- Kernel operator fusion (merging)
- Kernel optimizations (see inference optimization)
- Kernel tiling
- Knowledge distillation (KD)
- KV caching (see the sketch below)
- KV Cache Compression
- KV Cache Quantization
- KV caching in early exit
- KV cache sparsity
- KV cache token pruning
- KV cache eviction policies
- KV cache layer fusion
- KV cache layer pruning
- KV cache reuse
- KV cache global (multi-query KV caching)
- KV cache prefix
- KV cache session-based (multi-turn KV caching)
- KV cache substring (Lengthwise-fused KV caching)
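To illustrate basic KV caching, here is a minimal C++ sketch of one attention head during autoregressive decoding, assuming the per-token query, key, and value vectors are already projected: each step appends one (key, value) pair to the cache and attends over the cached entries instead of recomputing them. The KVCache structure and function names are illustrative.

```cpp
// Illustrative sketch of KV caching for one attention head: keys and values
// of previous tokens are stored, so each new decoding step only appends one
// (key, value) pair and scores the new query against the cache.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<float>;

struct KVCache {                 // hypothetical per-head cache
    std::vector<Vec> keys;
    std::vector<Vec> values;
};

float dot(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// One decoding step: append the new token's key/value, then attend
// over all cached positions with a softmax of scaled dot-product scores.
Vec attend_step(KVCache& cache, const Vec& q, const Vec& k, const Vec& v) {
    cache.keys.push_back(k);
    cache.values.push_back(v);
    size_t n = cache.keys.size(), d = q.size();

    std::vector<float> score(n);
    float maxs = -1e30f;
    for (size_t t = 0; t < n; ++t) {
        score[t] = dot(q, cache.keys[t]) / std::sqrt((float)d);
        maxs = std::max(maxs, score[t]);
    }
    float denom = 0.0f;
    for (size_t t = 0; t < n; ++t) { score[t] = std::exp(score[t] - maxs); denom += score[t]; }

    Vec out(d, 0.0f);
    for (size_t t = 0; t < n; ++t)
        for (size_t i = 0; i < d; ++i)
            out[i] += (score[t] / denom) * cache.values[t][i];
    return out;
}

int main() {
    KVCache cache;
    Vec o1 = attend_step(cache, {1, 0}, {1, 0}, {0.5f, 0.5f});
    Vec o2 = attend_step(cache, {0, 1}, {0, 1}, {1.0f, 0.0f});
    std::printf("%f %f\n", o2[0], o2[1]);
}
```

Note that the cache grows linearly with the number of generated tokens, which is why the eviction, compression, and quantization entries above exist.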
L
- Laptop AI inference (see AI PCs and On-device inference)
- Laws (see Safety)
- Lazy evaluation (see Conditional computation)
- Layer approximation
- Layer fusion
- Layer pruning
- Layer reordering
- Layer skipping
- ldexp intrinsic function (float bitshift)
- Length extension (context window)
- Length pruning
- Linear attention optimizations
- Local attention
- Locality-Sensitive Hashing (LSH) (vector hashing)
- Logarithmic approximate multiplication
- Logarithmic number system (LNS)
- Logarithmic quantization
- logb intrinsic function (extract exponent)
- Log-sum-exp (LSE) networks
- Long context models
- Lookahead decoding
- Lookup decoding
- Look-up tables (LUTs)
- Loop coalescing
- Loop collapsing
- Loop distribution
- Loop fission
- Loop fusion (see also kernel fusion)
- Loop interchange
- Loop-invariant code motion ("hoisting")
- Loop normalization
- Loop optimizations
- Loop parallelization (see Vectorization)
- Loop peeling
- Loop perforation
- Loop reordering
- Loop reversal
- Loop sectioning (strip mining)
- Loop sentinel
- Loop skewing
- Loop splitting
- Loop spreading
- Loop strip mining
- Loop tiling
- Loop unrolling (see the sketch below)
- Loop unswitching (see Loop distribution)
- Lottery ticket hypothesis (see Pruning)
- Low-bit integer quantization
- LoRA (Low-Rank Adaptation)
- Low-rank matrices
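To illustrate loop unrolling, here is a minimal C++ sketch of a dot product unrolled by a factor of four, with a scalar cleanup loop for the remainder; the unroll factor and function name are illustrative.

```cpp
// Illustrative sketch of loop unrolling: the dot-product loop is unrolled
// by a factor of 4, with a scalar cleanup loop handling leftover elements.
#include <cstdio>
#include <vector>

float dot_unrolled4(const float* a, const float* b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {             // unrolled body: 4 elements per iteration
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < n; ++i) sum += a[i] * b[i];   // cleanup loop for the remainder
    return sum;
}

int main() {
    std::vector<float> a = {1, 2, 3, 4, 5}, b = {1, 1, 1, 1, 1};
    std::printf("%f\n", dot_unrolled4(a.data(), b.data(), a.size()));  // 15
}
```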
M
- Machine learning compilers
- Magnitude pruning
- Market research
- Matrix algebra optimizations
- Matrix multiplication algorithms
- Max-Plus networks (see also Tropical algebra)
- Medical ethics (see Safety)
- memcmp byte-compare intrinsic function
- memcpy byte-copy intrinsic function
- memmove byte-move intrinsic function
- Memoization (see Caching)
- Memory optimization
- memset intrinsic function
- MHA
- MHA fusion
- Min-Max-Plus networks
- MiniMax algebra
- Mitchell approximate multiplication
- Mixed-precision quantization
- Mixture of Experts (MoE)
- Mobile phones (inference) (see AI phones and On-device inference)
- Model compression
- Model selection
- modf intrinsic function (float fractional part)
- Monarch matrices
- Monte Carlo method (see Probabilistic algorithms)
- Morphological network (see Max-Plus networks)
- Movement pruning
- Multi-AI (see Ensemble)
- Multi-dimensional logarithmic number system (MDLNS)
- Multi-dimensional pruning
- Multi-Head Attention (MHA)
- Multi-head attention fusion
- Multi-Query Attention (MQA)
- Multi-threading (parallelization)
- Multi-turn KV cache (session-based KV caching)
- Multiple-Base Number System (MBNS)
- Multiplication algorithms
- Multiplication approximation
- Multiplication-free inference
- Multiply by reciprocal
N
- n-gram parallel decoding
- NAND bitwise operator
- Narrow networks (see Slim networks)
- Negative skipping
- Neural Architecture Search (NAS)
- Neural text degeneration (see Greedy decoding)
- nlz (number of leading zeros) intrinsic function
- Non-autoregressive algorithms
- NOR bitwise operator
- Normalization approximation
- Normalization fusion (e.g. "fused LayerNorm")
- Normalization pruning
- Normalization reordering
- Notebook AI inference (see AI PCs and On-device inference)
- Nucleus sampling (see Top-p sampling)
- Number systems
O
- Offloading
- Offloading (recomputation)
- On-device inference (for phones/PCs)
- Operator fission (splitting operators)
- Operator fusion (combining operators)
P
- Padding
- Padding avoidance
- Paged Attention
- Parallel decoding
- Parallelization
- Parameter sharing (aka "weight sharing")
- Partitioning
- PC devices (inference) (see also On-device inference)
- Permutation arrays
- Ph.D. thesis topics
- Phone AI devices (inference) (see also On-device inference)
- Pipelining
- Pointer arithmetic
- Pointers vs references
- Popcount intrinsic function (see also Bitserial operations)
- Portability
- Posit number system (PNS)
- Positional encoding optimization
- Positional encoding pruning
- Post-Activation (reordering)
- Post-norm (Normalization reordering)
- Post-Training Quantization (PTQ)
- Power-of-two quantization (logarithmic quantization)
- Pre-Activation (reordering)
- Precomputation (see also Weight precomputations, and the sketch below)
- Prefix KV cache
- Pre-norm (Normalization reordering)
- Probabilistic optimizations
- Prompt compression
- Prompt lookup decoding
- Pruning
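To illustrate precomputation via look-up tables, here is a minimal C++ sketch, assuming INT8-quantized activations: since there are only 256 possible inputs, a GELU table is precomputed once and applied by indexing. The GeluTable structure and the scale value are illustrative.

```cpp
// Illustrative sketch of precomputation: for quantized INT8 activations
// there are only 256 possible inputs, so an activation function (GELU here)
// can be precomputed into a lookup table and applied by indexing.
#include <cmath>
#include <cstdint>
#include <cstdio>

static float gelu(float x) {   // tanh approximation of GELU
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

struct GeluTable {             // hypothetical precomputed table
    float table[256];
    explicit GeluTable(float scale) {
        for (int q = -128; q <= 127; ++q)
            table[q + 128] = gelu(q * scale);   // dequantize, then apply GELU
    }
    float lookup(int8_t q) const { return table[(int)q + 128]; }
};

int main() {
    GeluTable lut(0.05f);                        // scale chosen for illustration only
    std::printf("%f %f\n", lut.lookup(40), gelu(40 * 0.05f));  // same value
}
```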
Q
- Quadratic attention complexity
- Quantization (see the INT8 sketch below)
- Quantization, KV Cache
- Quantization-Aware Training (QAT)
- QLoRA (Quantized Low-Rank Adaptation)
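To illustrate basic integer quantization, here is a minimal C++ sketch of symmetric per-tensor INT8 post-training quantization: a single scale maps floats into [-127, 127], and dequantization multiplies back by the scale. This is the generic symmetric scheme, not any specific toolkit's implementation.

```cpp
// Illustrative sketch of symmetric per-tensor INT8 quantization:
// quantize with a single scale factor, dequantize by multiplying back.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct QuantizedTensor {            // hypothetical container
    std::vector<int8_t> q;
    float scale;
};

QuantizedTensor quantize_int8(const std::vector<float>& w) {
    float maxabs = 0.0f;
    for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
    QuantizedTensor out;
    out.scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;   // per-tensor scale
    out.q.reserve(w.size());
    for (float x : w) {
        int v = (int)std::lround(x / out.scale);
        out.q.push_back((int8_t)std::clamp(v, -127, 127));
    }
    return out;
}

float dequantize(const QuantizedTensor& t, size_t i) {
    return t.q[i] * t.scale;        // approximate reconstruction of the float weight
}

int main() {
    std::vector<float> w = {0.12f, -0.5f, 0.31f, 0.05f};
    QuantizedTensor t = quantize_int8(w);
    std::printf("w[1]=%f ~ %f\n", w[1], dequantize(t, 1));
}
```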
R
- Random number seeding
- Reciprocal multiplication
- References vs pointers
- Regulation (see Safety)
- remainder intrinsic function (floating-point remainder)
- Reordering layers
- Reordering loops
- Research questions
- Residue number system (RNS)
- Reuse computations
- Risks (see Safety)
S
- Safety
- Scalar product optimization
- scalbn intrinsic function (float bitshift)
- Self speculative decoding
- Semi-Logarithmic Number System (SLNS)
- Sentinel loop optimization
- Session KV cache (multi-turn KV caching)
- Shallow decoder transformer
- Sharing weights (see Parameter sharing)
- Shift-add networks
- signbit intrinsic function
- Simple case first (see Conditional computation)
- Simulated quantization
- Skipping
- Skipping layers
- Skipping negatives
- Skipping zero
- Sliding window attention
- Slimmable networks (width pruning)
- Smartphones (inference) (see AI phones and On-device inference)
- Softmax alternatives
- Softmax approximation
- Softmax fusion
- Softmax pruning
- Software acceleration (optimizations)
- Software frameworks (inference)
- Software-hardware co-design
- Sparse attention
- Sparse matrices (sparsification)
- Speculative decoding (see also Transformer architectures)
- Speculative decoding (generalized)
- Static pruning
- Stochastic quantization
- Stochastic algorithms (see Probabilistic algorithms)
- Strength Reduction
- Structured pruning
- Superposed decoding
- Swarm ensemble architectures
T
- Table lookup
- Temperature (Softmax)
- Tensor decomposition (see Low-rank matrix factorization)
- Ternary quantization
- Thin networks (see Slim networks)
- Tiling (kernel)
- Tiling (loop)
- Token merging
- Token pruning
- Token skipping (see Token pruning)
- Top-k decoding (see the sketch below)
- Top-p decoding
- Top-p sampling (decoding)
- Training optimizations
- Transformer architecture optimizations
- Tree attention
- Tree speculative decoding
- Trigonometric approximate inference
- Triple-axis pruning (see also Dual pruning)
- Tropical algebra (see also Max-Plus networks)
- Type consistency
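To illustrate top-k decoding with a Softmax temperature, here is a minimal C++ sketch: logits are scaled by 1/temperature, only the k largest are kept, and one token is sampled from the renormalized truncated distribution. The values and function name are illustrative.

```cpp
// Illustrative sketch of top-k decoding with temperature: logits are scaled
// by 1/temperature, the k largest are kept, softmax-normalized, and one
// token is sampled from the truncated distribution.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int sample_top_k(const std::vector<float>& logits, int k, float temperature, std::mt19937& rng) {
    // Sort token indices by logit, descending.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });
    k = std::min<int>(k, (int)idx.size());

    // Softmax over the k best, with temperature and max-subtraction for stability.
    float maxl = logits[idx[0]] / temperature;
    std::vector<float> prob(k);
    float denom = 0.0f;
    for (int i = 0; i < k; ++i) {
        prob[i] = std::exp(logits[idx[i]] / temperature - maxl);
        denom += prob[i];
    }

    // Sample from the truncated, renormalized distribution.
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    float r = uni(rng) * denom, cum = 0.0f;
    for (int i = 0; i < k; ++i) {
        cum += prob[i];
        if (r <= cum) return idx[i];
    }
    return idx[k - 1];
}

int main() {
    std::mt19937 rng(42);
    std::vector<float> logits = {2.0f, 0.5f, -1.0f, 1.5f, 0.0f};
    std::printf("sampled token = %d\n", sample_top_k(logits, 3, 0.8f, rng));
}
```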
U
- Unnatural instructions (data sets)
- Unstructured pruning (see also Magnitude pruning)
V
- Vanilla Transformer
- Vector dot product caching
- Vector dot product optimization
- Vector hashing
- Vectorization (see the sketch below)
- Vision neural network
- Vocabulary size
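To illustrate vectorization, here is a minimal C++ sketch of a dot product written with AVX2/FMA intrinsics, processing eight floats per iteration; it assumes an x86-64 CPU with AVX2 and FMA support (e.g., compile with -mavx2 -mfma). Compilers can often auto-vectorize the equivalent scalar loop, so explicit intrinsics are only one way to get SIMD.

```cpp
// Illustrative sketch of vectorization: an AVX2/FMA dot product processing
// 8 floats per iteration, with a scalar loop for the remainder.
// Assumes an x86-64 CPU with AVX2 and FMA (compile with -mavx2 -mfma).
#include <immintrin.h>
#include <cstdio>
#include <vector>

float dot_avx2(const float* a, const float* b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);     // acc += va * vb (8 lanes at once)
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float x : lanes) sum += x;             // horizontal reduction of the lanes
    for (; i < n; ++i) sum += a[i] * b[i];      // scalar remainder loop
    return sum;
}

int main() {
    std::vector<float> a(20, 1.0f), b(20, 2.0f);
    std::printf("%f\n", dot_avx2(a.data(), b.data(), a.size()));  // 40
}
```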
W
- Weight clustering (see the sketch below)
- Weight precomputations
- Weight pruning (see Magnitude pruning and Model pruning)
- Weight sharing (see Parameter sharing)
- Weightless Neural Networks (WNNs)
- Width pruning
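To illustrate weight clustering, here is a minimal C++ sketch: each weight is replaced by an index into a small codebook of shared values, and inference reconstructs weights by table lookup. The codebook here is given directly; in practice it would typically come from k-means over the weights. Structure and function names are illustrative.

```cpp
// Illustrative sketch of weight clustering: each weight is stored as an
// index into a small codebook of shared centroid values, and inference
// reconstructs the weight with a table lookup.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct ClusteredWeights {               // hypothetical clustered format
    std::vector<float> codebook;        // shared centroid values
    std::vector<uint8_t> index;         // one small index per weight
};

// Assign each weight to its nearest centroid (codebook assumed given;
// in practice the codebook would come from k-means over the weights).
ClusteredWeights cluster(const std::vector<float>& w, const std::vector<float>& codebook) {
    ClusteredWeights cw;
    cw.codebook = codebook;
    for (float x : w) {
        size_t best = 0;
        for (size_t c = 1; c < codebook.size(); ++c)
            if (std::fabs(x - codebook[c]) < std::fabs(x - codebook[best])) best = c;
        cw.index.push_back((uint8_t)best);
    }
    return cw;
}

float weight_at(const ClusteredWeights& cw, size_t i) {
    return cw.codebook[cw.index[i]];    // lookup instead of storing the full float
}

int main() {
    std::vector<float> w = {0.11f, -0.48f, 0.02f, 0.52f};
    ClusteredWeights cw = cluster(w, {-0.5f, 0.0f, 0.5f});
    std::printf("%f -> %f\n", w[3], weight_at(cw, 3));   // 0.52 -> 0.5
}
```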
X
Y
- Yield operation
Z