Aussie AI
Inference Optimization Techniques
Last Updated 24th August, 2024
by David Spuler, Ph.D.
This is a list of neural network and Transformer optimizations, with a specific focus on speeding up Transformer inference. Resources include:
- 500+ LLM Inference Optimization Techniques
- Inference optimizations
- Research Blog
- What's Hot in Inference Optimization?
- Patents in inference optimization
Hot New Research Areas
Areas of inference efficiency research that have recently been getting attention:
- On-device inference (native phone and PC AI)
- Generalized speculative decoding
- Consensus decoding
- Multi-LoRA inference
- KV Cache Compression/Quantization
- KV Cache Layer Fusion
- Prefix KV Caching
- Prefill optimizations (decoder-only engines)
- KV cache recomputation with early exit
- Deep prefill, shallow decoder architecture
- Fixed-point quantization (integer)
- Fixed-point arithmetic
- Block floating-point arithmetic
- FFN sublayer pruning
Hot Old Research Areas
Long-standing research areas that still see a continual stream of papers:
Model Compression
- Model compression overview (static model optimizations)
- Pruning
- Quantization
- Knowledge Distillation (KD)
- Parameter sharing
- Low-rank matrices
- Neural Architecture Search (NAS)
Pruning
- Pruning overview
- Unstructured pruning (see the sketch after this list)
- Depthwise structural pruning (vertical):
- Depth pruning (overview)
- Layer pruning
- Early exit (dynamic layer pruning)
- Layer skipping
- Layer approximation
- Widthwise structural pruning (horizontal):
- Lengthwise structural pruning (longitudinal/end-to-end):
- Pruning by dimension:
- Transformer-specific pruning (component removal):
- Shallow decoder architecture (layer pruning)
- Attention head pruning (width pruning)
- FFN pruning
- Normalization pruning
- Positional embeddings pruning
- Softmax pruning
- Dynamic pruning
- Hybrid pruning
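As a concrete taste of the simplest technique above, here is a minimal C++ sketch of magnitude-based unstructured pruning (all names are illustrative, not from any particular library): weights below a threshold are zeroed, and real pipelines usually fine-tune afterwards to recover accuracy.

```cpp
#include <cstdio>
#include <cmath>

// Unstructured magnitude pruning (illustrative sketch): zero out any weight
// whose absolute value falls below a threshold, yielding a sparse model.
int prune_by_magnitude(float* weights, int n, float threshold) {
    int pruned = 0;
    for (int i = 0; i < n; ++i) {
        if (std::fabs(weights[i]) < threshold) {
            weights[i] = 0.0f;  // pruned weights become zeros
            ++pruned;
        }
    }
    return pruned;  // number of weights removed
}

int main() {
    float w[] = { 0.01f, -0.50f, 0.002f, 0.75f, -0.03f };
    int n = prune_by_magnitude(w, 5, 0.05f);
    std::printf("pruned %d of 5 weights\n", n);  // expect 3
}
```

Structured variants (layer, head, or FFN pruning) instead remove whole blocks of weights, so that dense hardware kernels still see the speedup.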
Quantization
- Quantization overview
- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Low-bit integer quantization:
- Integer quantization
- 4-bit quantization (INT4)
- 5-bit quantization (INT5)
- 6-bit quantization (INT6)
- 7-bit quantization (INT7)
- 8-bit quantization (INT8; sketched in code after this list)
- INT16 quantization
- INT32 quantization
- Quantization hybrids:
- Floating-point quantization
- FP16 quantization
- FP8 quantization
- FP4 quantization
- Other types of quantization:
- Mixed precision quantization
- Fixed-point quantization (integer quantization)
- Logarithmic power-of-two quantization (bitshift quantization)
- Double bitshift power-of-two quantization
- Division quantization
- Cluster-based quantization (Weight clustering)
- Dyadic quantization
- Fake quantization
- Simulated quantization
- Stochastic quantization (probabilistic)
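To make the core idea concrete, here is a minimal C++ sketch of symmetric post-training INT8 quantization (illustrative names; it assumes at least one nonzero weight): a single per-tensor scale maps floats onto [-127, 127], and dequantization multiplies back by the scale.

```cpp
#include <cstdint>
#include <cmath>
#include <cstdio>

// Symmetric INT8 quantization sketch: one per-tensor scale factor maps
// [-max|w|, +max|w|] onto the integer range [-127, 127].
float compute_scale(const float* w, int n) {
    float maxabs = 0.0f;
    for (int i = 0; i < n; ++i) maxabs = std::fmax(maxabs, std::fabs(w[i]));
    return maxabs / 127.0f;  // assumes maxabs > 0
}

int8_t quantize(float x, float scale) {
    int q = (int)std::lround(x / scale);
    if (q > 127) q = 127;      // clamp to the INT8 range
    if (q < -127) q = -127;
    return (int8_t)q;
}

float dequantize(int8_t q, float scale) { return q * scale; }

int main() {
    float w[] = { 0.9f, -0.45f, 0.1f };
    float s = compute_scale(w, 3);
    for (float x : w) {
        int8_t q = quantize(x, s);
        std::printf("%.3f -> %d -> %.3f\n", x, q, dequantize(q, s));
    }
}
```

Lower bit widths (INT4 and below) follow the same pattern, but generally need per-channel or per-group scales to keep the rounding error tolerable.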
Distillation
- Knowledge Distillation (KD) overview
- Ensemble Distillation
- Unnatural instructions (data sets)
- Dataset Distillation
Parameter Sharing
- Parameter sharing (overview)
- Weight sharing
- Activation sharing
- Layer fusion
- Clustering (Weights)
- KV cache layer fusion
Attention Optimization
- Attention optimizations (overview)
- Multi-Head Attention (MHA)
- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Sparse attention
- Local attention
- Flash Attention
- Paged Attention
- Linear attention
- Cross attention
- Tree attention
- Sliding window attention (see the sketch after this list)
- Approximate attention heads
- Additive attention
- Multiplicative attention
- Attention alternatives/replacements
- Fused MHA
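As one example from this list, here is a hedged C++ sketch of sliding window attention for a single head (row-major matrices, illustrative names): each query attends only to the last W positions, cutting the cost from O(n²·d) to O(n·W·d).

```cpp
#include <vector>
#include <cmath>

// Sliding-window (local) attention sketch: query position i attends only
// to keys in [i-W+1, i]. q, k, v are n x d row-major; out gets the result.
void sliding_window_attention(const std::vector<float>& q,
                              const std::vector<float>& k,
                              const std::vector<float>& v,
                              std::vector<float>& out,
                              int n, int d, int W) {
    const float inv_sqrt_d = 1.0f / std::sqrt((float)d);
    out.assign((size_t)n * d, 0.0f);
    std::vector<float> scores(W);
    for (int i = 0; i < n; ++i) {
        int lo = (i - W + 1 < 0) ? 0 : i - W + 1;  // causal window start
        float maxs = -1e30f;
        for (int j = lo; j <= i; ++j) {            // scores only inside window
            float s = 0.0f;
            for (int t = 0; t < d; ++t) s += q[i*d + t] * k[j*d + t];
            scores[j - lo] = s * inv_sqrt_d;
            if (scores[j - lo] > maxs) maxs = scores[j - lo];
        }
        float denom = 0.0f;                        // softmax over the window
        for (int j = lo; j <= i; ++j) {
            scores[j - lo] = std::exp(scores[j - lo] - maxs);
            denom += scores[j - lo];
        }
        for (int j = lo; j <= i; ++j) {
            float wgt = scores[j - lo] / denom;
            for (int t = 0; t < d; ++t) out[i*d + t] += wgt * v[j*d + t];
        }
    }
}

int main() {
    int n = 4, d = 2, W = 2;
    std::vector<float> q(n*d, 1.0f), k(n*d, 1.0f), v(n*d, 0.5f), out;
    sliding_window_attention(q, k, v, out, n, d, W);
    return out[0] == 0.5f ? 0 : 1;  // first token attends only to itself
}
```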
Transformer Component Optimizations
- Transformer architectures (overview)
- Layers:
- Activations:
- Activation function optimizations
- Activation function approximation
- Integer-only activation functions
- Fused activation functions (kernel fusion)
- Fused ReLU
- Fused GELU
- Fused SwiGLU
- Activation alternatives/replacements
- Activation function pruning/removal (bilinear layers)
- Activation function reordering
- Normalization:
- Softmax:
- Softmax optimizations
- Approximate Softmax
- Softmax alternatives/replacements
- Integer-only Softmax
- Fused Softmax (kernel fusion)
- Feed-Forward Network (FFN):
- FFN pruning
- FFN approximation
- FFN alternatives/replacements
- Integer-only FFN
- Bias optimizations
- Fused add-bias (see kernel fusion)
- Bias vector pruning
- MatMul/GEMM operations:
- Positional Encoding (PE):
- Positional encoding optimization
- Pruning positional encoding (removal)
- Positional encoding approximation
- Integer-only positional encoding
- Decoding algorithms:
- Decoder algorithm optimizations
- Greedy decoding (see the decoding sketch after this list)
- Top-k decoding
- Top-p decoding
- Flash decoding
- Parallel decoding
- Beam search decoding
- Aggressive decoding
- Edit decoding
- Contrastive decoding
- Lookahead decoding
- Lookup decoding
- Retrieval lookup decoding
- Prompt lookup decoding
- Self speculative decoding
- Tree speculative decoding
- Superposed decoding
- Blockwise parallel decoding
- n-gram parallel decoding
- Other:
- Approximate top-k algorithms
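The decoding algorithms are the easiest of these to sketch in code. Below is a minimal C++ illustration of greedy and top-k decoding over one logits vector (illustrative names); top-p and beam search elaborate on the same skeleton.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Greedy decoding: always emit the highest-scoring token (argmax of logits).
int greedy_decode(const std::vector<float>& logits) {
    return (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Top-k decoding: restrict sampling to the k highest-scoring tokens,
// weighted by their softmax probabilities.
int topk_decode(const std::vector<float>& logits, int k, std::mt19937& rng) {
    std::vector<int> idx(logits.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    std::vector<double> probs(k);
    double denom = 0.0;
    for (int i = 0; i < k; ++i) {   // softmax over the top k only
        probs[i] = std::exp(logits[idx[i]] - logits[idx[0]]);
        denom += probs[i];
    }
    for (int i = 0; i < k; ++i) probs[i] /= denom;
    std::discrete_distribution<int> pick(probs.begin(), probs.end());
    return idx[pick(rng)];
}

int main() {
    std::vector<float> logits = { 1.0f, 3.5f, 0.2f, 2.9f };
    std::mt19937 rng(42);
    std::printf("greedy: %d\n", greedy_decode(logits));       // token 1
    std::printf("top-2:  %d\n", topk_decode(logits, 2, rng)); // token 1 or 3
}
```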
Transformer General Optimizations
- Transformer architectural optimizations (overview)
- Transformer low-level optimizations (overview)
- Approximate Transformer
- Caching
- Inference Cache
- KV Cache Compression/Quantization
- Prefill optimizations
- Integer-only Transformers
- Non-autoregressive methods
- Context length optimizations
- Zero-padding avoidance
KV Caching Optimizations
- KV caching (see the sketch after this list)
- KV caching in early exit
- KV cache compression
- KV cache sparsity
- KV cache token pruning
- KV cache eviction policies
- KV cache quantization
- KV cache layer fusion
- KV cache layer pruning
- KV cache reuse
- Global KV cache (multi-query KV caching)
- Prefix KV cache
- Session KV cache (multi-turn KV caching)
- Substring KV cache (lengthwise-fused KV caching)
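The underlying data structure is simple. Here is a minimal single-head C++ sketch (illustrative names): each decoding step appends one new row of keys and values rather than recomputing K and V for the whole prefix, which turns O(n²) total recomputation into O(n) appends.

```cpp
#include <vector>

// Minimal KV cache sketch for one attention layer/head: keys and values for
// previously processed tokens are stored, so each decoding step computes
// K and V only for the single new token and reuses the rest.
struct KVCache {
    int d;                    // head dimension
    std::vector<float> keys;  // n x d, grows by one row per token
    std::vector<float> values;

    explicit KVCache(int dim) : d(dim) {}

    void append(const float* k_new, const float* v_new) {
        keys.insert(keys.end(), k_new, k_new + d);
        values.insert(values.end(), v_new, v_new + d);
    }
    int tokens() const { return (int)(keys.size() / d); }
};

int main() {
    KVCache cache(4);                        // head dimension 4
    float k[4] = {0, 1, 2, 3}, v[4] = {4, 5, 6, 7};
    cache.append(k, v);                      // one decoding step stores its K/V
    return cache.tokens() == 1 ? 0 : 1;
}
```

Most of the list above is about shrinking or sharing this structure, since the cache grows linearly with context length and dominates memory at long contexts.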
Non-Multiplication AI Models
- Zero-Multiplication Models (overview)
- Binary quantization
- Ternary quantization
- 2-bit quantization (INT2)
- Adder networks
- Bitshift-add networks
- Bitshift power-of-2 quantization (see the sketch after this list)
- Double bitshift quantization
- Add-as-integer networks
- Logarithmic Models
- Bitwise neural networks
- Diff-squared networks
- Log-sum-exp (LSE) networks
- Max-Plus networks
- Min-Max-Plus networks
- Morphological networks
- Trigonometric approximate inference
- Weightless Neural Networks (WNNs)
- XNOR networks (see also Binary quantization)
- Hadamard elementwise matrix multiplication models
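As a concrete example from this family, here is a hedged C++ sketch of power-of-two (bitshift) quantization, assuming nonzero weights: each weight is rounded to the nearest ±2^e, so multiplying an integer activation by a weight reduces to a shift plus an optional negation.

```cpp
#include <cmath>
#include <cstdio>

// Power-of-two (logarithmic) quantization sketch: each weight becomes
// +/- 2^e, so a weight multiply on integer activations is just a bitshift.
// Zero weights are not handled here (log2(0) is undefined).
struct Pow2Weight {
    int  exponent;  // weight magnitude is 2^exponent
    bool negative;
};

Pow2Weight quantize_pow2(float w) {
    Pow2Weight q;
    q.negative = (w < 0.0f);
    q.exponent = (int)std::lround(std::log2(std::fabs(w)));  // nearest power of 2
    return q;
}

// Multiply an integer activation by the quantized weight using shifts only.
int mul_by_pow2(int activation, Pow2Weight w) {
    int m = (w.exponent >= 0) ? (activation << w.exponent)
                              : (activation >> -w.exponent);
    return w.negative ? -m : m;
}

int main() {
    Pow2Weight w = quantize_pow2(0.26f);      // rounds to 2^-2 = 0.25
    std::printf("%d\n", mul_by_pow2(64, w));  // 64 * 0.25 = 16
}
```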
Prefill Phase Optimizations
- Prefill optimizations generally (overview)
- Chunked prefill
- Disaggregated prefill scheduling
- Context cache (global KV caching)
- Prefix KV cache
Computation Optimizations
- Kernel operator fusion (merging)
- Advanced AI Mathematics
- Approximate activation functions
- Caching / memoization
- Computation reuse
- Precomputation
- Conditional computation
- Approximations
- Integer-only arithmetic quantization
- Layer reordering
- Negative skipping
- Weight precomputations
- Zero-skipping (see the sketch after this list)
- Operator reordering
- Sparse matrix multiplication algorithms
- Approximate caching
- End-to-End integer inference
- Kernel fission (splitting)
- Padding
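Several of these are one-liners in code. For instance, a minimal zero-skipping dot product (illustrative, not from any library) tests each weight and skips the multiply-add when it is zero, which pays off on highly sparse pruned weights but can lose to branch costs on vectorized hardware:

```cpp
// Zero-skipping sketch: skip multiply-accumulate work for zero weights.
// Profitable when the weight vector is highly sparse (e.g. after magnitude
// pruning); on SIMD hardware the branch itself may cost more than it saves.
float dot_zero_skipping(const float* w, const float* x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (w[i] != 0.0f) {       // skip zero weights entirely
            sum += w[i] * x[i];
        }
    }
    return sum;
}
```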
General Coding Efficiency
- Constant folding
- Common subexpression elimination
- Strength reduction
- Type consistency
- Reciprocal multiplication (see the sketch after this list)
- References vs pointers
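As a small C++ illustration of two of these at once (strength reduction via reciprocal multiplication), compare a division inside a loop with a single hoisted reciprocal; note the results can differ in the last bit of precision:

```cpp
// Reciprocal multiplication sketch (a strength reduction): division is much
// slower than multiplication on most CPUs, so compute one reciprocal up
// front and multiply inside the loop instead.
void scale_slow(float* v, int n, float denom) {
    for (int i = 0; i < n; ++i) v[i] = v[i] / denom;   // divide every iteration
}

void scale_fast(float* v, int n, float denom) {
    const float recip = 1.0f / denom;                  // one division, hoisted
    for (int i = 0; i < n; ++i) v[i] = v[i] * recip;   // multiplies only
}
```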
Loop Optimizations
- Loop optimizations (overview)
- Loop fusion (merging loops)
- Loop unrolling (see the sketch after this list)
- Loop perforation
- Loop reordering
- Loop tiling
- Loop reversal
- Loop fission (splitting a loop)
- Loop interleave
- Loop interchange
- Loop coalescing
- Loop-invariant code motion ("hoisting")
- Loop distribution
- Pointer arithmetic
- Loop peeling (unrolling first iterations)
- Loop splitting
- Loop sentinel
- Loop collapsing
- Loop normalization
- Loop strip mining (Loop sectioning)
- Loop skewing
- Loop spreading
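Here is a minimal C++ sketch of one of the most common of these, 4-way loop unrolling of a vector add, with a cleanup loop for the leftover elements:

```cpp
// Loop unrolling sketch: process four elements per iteration to cut loop
// overhead and expose instruction-level parallelism; a short cleanup loop
// handles the remainder (cf. loop peeling).
void vec_add_unrolled(float* a, const float* b, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {       // 4-way unrolled body
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; ++i) a[i] += b[i];  // remainder (n not divisible by 4)
}
```

Loop fusion, fission, tiling, and the rest similarly restructure loops without changing the results, trading code size or clarity for fewer iterations and better cache behavior.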
Memory Utilization Optimizations
- Memory optimization techniques
- Parameter sharing
- Model compression
- Low-bit quantization
- Binary quantization
- Ternary quantization
- Layer fusion
- Sparsification techniques
- Recomputation optimizations
- Memory cache management algorithms
- Kernel operator fusion
Numeric Representation Optimizations
- Fixed point number system (FXP) optimizations
- Floating point number system (FLP) optimizations
- Floating point bitwise arithmetic
- IEEE 754 floating point optimizations
- Binary quantization
- Ternary quantization
Advanced Number Systems
- Posit number system (PNS)
- Residue number system (RNS)
- Logarithmic number system (LNS; sketched after this list)
- Dyadic numbers
- Double-base number system (DBNS)
- Dynamic number systems
- Hybrid number systems
- Tropical algebra (max-plus)
- MiniMax algebra
- Multi-dimensional logarithmic number system (MDLNS)
- Multiple-Base Number System (MBNS)
- Semi-Logarithmic Number System (SLNS)
- Lattice algebra
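The appeal of several of these systems is that multiplication becomes addition. A minimal C++ sketch of the logarithmic number system (LNS), storing log2 of the magnitude plus a sign bit (zero is not representable and is ignored here for brevity):

```cpp
#include <cmath>
#include <cstdio>

// Minimal logarithmic number system (LNS) sketch: a value is stored as
// log2 of its magnitude plus a sign flag; zero has no representation.
struct LNS {
    float log2mag;  // log2 of the magnitude
    bool  negative;
};

LNS to_lns(float x) { return { std::log2(std::fabs(x)), x < 0.0f }; }

float from_lns(LNS v) {
    float m = std::exp2(v.log2mag);
    return v.negative ? -m : m;
}

// Multiplication in LNS is just an addition of the log fields.
LNS lns_mul(LNS a, LNS b) {
    return { a.log2mag + b.log2mag, a.negative != b.negative };
}

int main() {
    LNS a = to_lns(-3.0f), b = to_lns(5.0f);
    std::printf("%g\n", from_lns(lns_mul(a, b)));  // -15 (up to float rounding)
}
```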
Faster Arithmetic
- Integer operations
- Addition optimizations
- Approximate addition
- Multiplication algorithms
- Approximate division
- Approximate multiplication
- Bitwise operator inference
- Bitserial operations
- Division optimizations
- Logarithmic approximate multiplication (see the sketch below)
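As an illustration of the last item, here is a hedged C++ sketch in the style of Mitchell's classic approximate multiplier: approximate log2(x) as the MSB position plus a linear fraction, add the two approximate logarithms, then take an approximate antilog. Worst-case error is around 11%, and the GCC/Clang __builtin_clz intrinsic is assumed.

```cpp
#include <cstdint>
#include <cstdio>

// Mitchell-style approximate multiplication sketch: log2(x) ~ k + f where
// k is the MSB index and f a linear fraction; "multiply" = add the logs.
// For very large operands the 64-bit antilog below can overflow; production
// code would widen or rescale.
uint64_t approx_mul(uint32_t a, uint32_t b) {
    if (a == 0 || b == 0) return 0;
    int ka = 31 - __builtin_clz(a);            // integer part of log2(a)
    int kb = 31 - __builtin_clz(b);
    uint64_t fa = (((uint64_t)a - (1ull << ka)) << 16) >> ka;  // 16-bit fraction
    uint64_t fb = (((uint64_t)b - (1ull << kb)) << 16) >> kb;
    uint64_t f = fa + fb;                      // add the approximate logs
    int k = ka + kb;
    if (f >= (1ull << 16)) { f -= (1ull << 16); ++k; }  // carry to integer part
    return (((1ull << 16) + f) << k) >> 16;    // antilog: 2^k * (1 + f)
}

int main() {
    std::printf("%llu\n", (unsigned long long)approx_mul(3, 5));  // 14 (exact: 15)
}
```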
Low-Rank Matrices
- LoRA (Low-Rank Adaptation; see the sketch after this list)
- Low-rank matrices
- QLoRA (Quantized Low-Rank Adaptation)
- Matrix factorization/tensor decomposition
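A minimal C++ sketch of the low-rank idea at inference time (illustrative names): rather than materializing W' = W + B*A, compute y = W*x + B*(A*x). With rank r much smaller than d, the adapter adds only O(r*d) extra work, and many adapters can share one base matrix, which is the basis of multi-LoRA serving.

```cpp
#include <vector>

// LoRA-style low-rank inference sketch. W is d x d, A is r x d, B is d x r
// (all row-major). The low-rank update is applied as two skinny mat-vecs.
void lora_matvec(const std::vector<float>& W, const std::vector<float>& A,
                 const std::vector<float>& B, const std::vector<float>& x,
                 std::vector<float>& y, int d, int r) {
    y.assign(d, 0.0f);
    for (int i = 0; i < d; ++i)              // base path: y = W*x
        for (int j = 0; j < d; ++j)
            y[i] += W[i*d + j] * x[j];
    std::vector<float> ax(r, 0.0f);          // down-projection: ax = A*x
    for (int i = 0; i < r; ++i)
        for (int j = 0; j < d; ++j)
            ax[i] += A[i*d + j] * x[j];
    for (int i = 0; i < d; ++i)              // up-projection: y += B*ax
        for (int j = 0; j < r; ++j)
            y[i] += B[i*r + j] * ax[j];
}
```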
Advanced Matrices
- Matrix Algebra (factorization)
- Approximate matrix multiplication
- Butterfly matrices
- Monarch matrices
- Sparse matrices (sparsification)
Data Structures
- Hashing
- Locality-Sensitive Hashing (LSH)
- Look-up tables (LUTs; sketched after this list)
- Trees
- Bloom filters
- Vector dot product caching
- Bit vectors
- Permutation arrays
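As a sketch of the look-up table idea (illustrative sizes and ranges): precompute sigmoid over a clamped interval once, and each later activation call becomes a clamp plus an array index instead of a call to exp().

```cpp
#include <cmath>

// Look-up table (LUT) sketch for an activation function: precompute sigmoid
// over [-8, 8] once; table size trades memory for accuracy.
struct SigmoidLUT {
    static const int N = 4096;
    float table[N];
    SigmoidLUT() {
        for (int i = 0; i < N; ++i) {
            float x = -8.0f + 16.0f * i / (N - 1);
            table[i] = 1.0f / (1.0f + std::exp(-x));
        }
    }
    float operator()(float x) const {
        if (x <= -8.0f) return table[0];      // clamp below the range
        if (x >=  8.0f) return table[N - 1];  // clamp above the range
        int i = (int)((x + 8.0f) * (N - 1) / 16.0f);  // nearest-lower entry
        return table[i];
    }
};
```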
Multi-AI Architectures
- Model selection algorithms
- Ensemble inference (multi-model AI engines)
- Big-little architectures (see the sketch after this list)
- Cascades
- Cloud inference servers
- Collaborative inference
- Consensus decoding
- Mixture of Experts (MoE)
- Speculative decoding
- Generalized speculative decoding
- Swarm ensemble architectures
- Committee ensemble architectures
- Ensemble averaging
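The big-little pattern fits in a few lines. Here is a hedged C++ sketch in which every type and threshold is an illustrative placeholder: accept the small model's answer when its confidence clears a threshold, otherwise escalate to the large model.

```cpp
#include <string>
#include <functional>

// Big-little cascade sketch: try the small, fast model first and accept its
// answer when its confidence is high enough; otherwise fall back to the
// big, slow model. All names here are illustrative placeholders.
struct Result { std::string text; float confidence; };

Result cascade_infer(const std::string& prompt,
                     const std::function<Result(const std::string&)>& little_model,
                     const std::function<Result(const std::string&)>& big_model,
                     float threshold = 0.9f) {
    Result r = little_model(prompt);   // cheap first pass
    if (r.confidence >= threshold) return r;
    return big_model(prompt);          // escalate to the expensive model
}
```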
Device Architectures
- On-device inference (for phones/PCs)
- Hybrid cloud-on-device inference
- AI Phones
- AI PCs (desktops/laptops)
- Edge device inference (IoT/mobile/PC)
Parallel Programming Optimization Techniques
- Hardware acceleration
- Hardware-software co-design
- Parallelization
- Vectorization
- Pipelining
- Offloading
- Partitioning
- Dataflow optimization
General Classes of Optimization Techniques
- Dynamic inference (adaptive inference)
- Skipping
- Heuristics
- Data locality optimization
- Probabilistic optimizations
- Approximate computing
- Code optimizations
- Deep learning compilers
- Incremental algorithms
- Fuzzy logic
- Dynamic NAS