Aussie AI Blog
500+ LLM Inference Optimization Techniques
September 2nd, 2024
Updated: September 20, 2024
by David Spuler, Ph.D.
LLM Inference Optimization
We do a lot of research on inference optimization techniques, so here's a very long list of all the techniques for which we have research papers. There are more than 500 now, but see our earlier blog post if you only want to know about the latest LLM inference techniques.
LLM Inference Optimizations List
Here's the list! It's over 500 techniques and growing!
Model compression main subtypes:
- Model compression (overview)
- — Pruning (overview)
- — Quantization (overview)
- — Knowledge Distillation (KD)
- — Parameter sharing (weight sharing)
- — Low-rank matrices
- — Small Language Models (SLMs)
- — Data compression algorithms
Pruning main types:
- Dynamic pruning
- Hybrid pruning
- Unstructured pruning
- Semi-Structured Pruning
- Structured pruning
Layerwise structured pruning subtypes (depth dimension; an early-exit code sketch follows this list):
- Depthwise structural pruning (overview)
- — Static layer pruning
- — Layer pruning
- — Early exit
- — Dynamic layer pruning
- — Layer skipping
- — Layer approximation
- — Shallow decoder architecture
- — Layer reordering
- — Layer Importance
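To make "early exit" concrete, here's a minimal C++ sketch of per-token dynamic layer skipping. The layer and confidence functions are hypothetical placeholders standing in for a real engine's layer kernels and exit criterion (e.g., an entropy or probability threshold):

    // Minimal early-exit sketch: stop running decoder layers for this token
    // once a confidence estimate clears a threshold.
    #include <functional>
    #include <vector>

    using Vec = std::vector<float>;

    Vec run_layers_with_early_exit(
        Vec activations,
        const std::vector<std::function<Vec(const Vec&)>>& layers,
        const std::function<float(const Vec&)>& confidence,  // hypothetical exit criterion
        float threshold)
    {
        for (const auto& layer : layers) {
            activations = layer(activations);
            if (confidence(activations) >= threshold)
                break;  // skip all remaining layers for this token
        }
        return activations;
    }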
Width-wise structured pruning subtypes:
- Widthwise structural pruning (overview)
- — Attention head pruning
- — Slimmable networks (width pruning)
- — FFN pruning
- — Channel pruning
- — Filter pruning
Length-wise structured pruning subtypes:
- Lengthwise structural pruning (longitudinal/input/end-to-end)
- — Token pruning (input pruning)
- — Dynamic token pruning
- — Prompt compression
- — Context compression
- — Token merging
- — Token skipping
- — Token dropping
- — Zero padding removal
Model dimension embedding pruning subtypes:
- Embedding-dimension pruning
- — Embedding pruning
- — Embedding matrix compression (embedding pruning)
- — Embedding low-rank matrix factorization
- — Unembedding matrix (output embeddings)
Hybrid multi-dimensional pruning:
- Multi-dimensional pruning
- — Dual pruning
- — Triple pruning
- — Quadruple pruning
- — 3D CNN model pruning
Transformer component pruning:
- Normalization pruning
- Positional embeddings pruning
- Softmax pruning
- Skip connection pruning (residual connection removal)
Unstructured pruning subtypes (a magnitude pruning sketch follows this list):
- Unstructured pruning (overview)
- — Magnitude pruning
- — Movement pruning
- — Gradual pruning
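Magnitude pruning is the simplest of these to show in code. A minimal sketch, assuming the model's weights are stored as a flat float array:

    // Magnitude pruning: zero out weights with small absolute values,
    // producing unstructured sparsity that later kernels can exploit.
    #include <cmath>
    #include <vector>

    void magnitude_prune(std::vector<float>& weights, float threshold) {
        for (float& w : weights) {
            if (std::fabs(w) < threshold)
                w = 0.0f;  // pruned weight
        }
    }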
Quantization subtypes:
- Post-Training Quantization (PTQ)
- Quantization-Aware Training (QAT)
- Activation Quantization
- Outlier-aware quantization
Integer quantization subtypes (an INT8 code sketch follows this list):
- Integer quantization (overview)
- — Integer-only arithmetic quantization
- — Fixed-point quantization (integer)
- — Low-bit integer quantization (overview)
- — Binary quantization
- — Ternary quantization
- — 2-bit quantization (INT2)
- — 3-bit quantization (INT3)
- — 4-bit quantization (INT4)
- — 5-bit quantization (INT5)
- — 6-bit quantization (INT6)
- — 7-bit quantization (INT7)
- — 8-bit quantization (INT8)
- — 9-bit quantization (INT9)
- — 10-bit quantization (INT10)
- — 11-bit quantization (INT11)
- — 12-bit quantization (INT12)
- — 16-bit quantization (INT16)
- — 32-bit quantization (INT32)
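As an example of the most common case, here's a minimal sketch of symmetric post-training INT8 quantization with a single per-tensor scale taken from the maximum absolute weight (real engines typically use per-channel or per-block scales):

    // Symmetric INT8 quantization: w ~ q * scale, with q in [-127, 127].
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct QuantizedTensor {
        std::vector<int8_t> q;
        float scale;  // dequantize with w ~ q * scale
    };

    QuantizedTensor quantize_int8(const std::vector<float>& w) {
        float max_abs = 0.0f;
        for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
        QuantizedTensor t;
        t.scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
        t.q.reserve(w.size());
        for (float x : w) {
            int v = static_cast<int>(std::lround(x / t.scale));
            t.q.push_back(static_cast<int8_t>(std::clamp(v, -127, 127)));
        }
        return t;
    }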
Floating-point quantization subtypes:
- Floating-point quantization
- — FP4 quantization
- — FP6 quantization
- — FP8 quantization
- — FP16 quantization
- — FP32 quantization
Other quantization subtypes:
- Mixed-precision quantization
- Logarithmic power-of-two quantization (bitshift quantization)
- Double bitshift power-of-two quantization
- Division quantization
- Cluster-based quantization (Weight clustering)
- Hashing-based weight clustering
- Dyadic quantization
- Fake quantization
- Simulated quantization
- Stochastic quantization (probabilistic)
Granularity-level quantization subtypes:
- Granular quantization (overview)
- — Layerwise Quantization
- — Blockwise Quantization
- — Vector quantization
Knowledge distillation subtypes:
- Knowledge Distillation (overview)
- — Ensemble Distillation
- — Unnatural instructions (data sets)
- — Dataset Distillation
Parameter/weight sharing subtypes:
- Parameter/Weight sharing (overview)
- — Activation sharing
- — Layer fusion
- — Clustering (Weights)
- — Attention head fusion
- — FFN fusion
- — KV cache layer fusion (depthwise)
- — KV cache head fusion (widthwise)
Activation function optimizations (a fused-RELU sketch follows this list):
- Activation function optimizations (overview)
- — Activation function approximation
- — Integer-only activation functions
- — Fused activation functions (kernel fusion)
- — Fused RELU
- — Fused GELU
- — Fused SwiGLU
- — Activation alternatives/replacements
- — Activation function pruning/removal (bilinear layers)
- — Activation function reordering
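To show what "fused RELU" means at the source level, here's a minimal sketch that folds the add-bias step and the RELU into a single pass over the activations, rather than two loops and two trips through memory:

    // Fused add-bias + RELU: one pass instead of two.
    #include <vector>

    void fused_bias_relu(std::vector<float>& y, const std::vector<float>& bias) {
        for (size_t i = 0; i < y.size(); ++i) {
            float v = y[i] + bias[i];
            y[i] = (v > 0.0f) ? v : 0.0f;  // RELU applied in the same loop
        }
    }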
Normalization optimization types:
- Normalization algorithm optimizations (overview)
- — Approximate normalization
- — Norm reordering (pre-norm/post-norm)
- — Integer-only normalization
- — Normalization alternatives/replacements
- — Fused normalization (e.g. "fused LayerNorm" in kernel fusion)
Softmax optimization types (a baseline Softmax sketch follows this list):
- Softmax optimizations (overview)
- — Softmax pruning
- — Approximate Softmax
- — Softmax alternatives/replacements
- — Integer-only Softmax
- — Fused Softmax
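For reference, here's the baseline Softmax with the standard max-subtraction trick for numerical stability; the fused, approximate, and integer-only variants above all start from this:

    // Numerically-stable Softmax (assumes a non-empty logits vector).
    #include <algorithm>
    #include <cmath>
    #include <vector>

    void softmax(std::vector<float>& logits) {
        float max_logit = *std::max_element(logits.begin(), logits.end());
        float sum = 0.0f;
        for (float& x : logits) {
            x = std::exp(x - max_logit);  // shifting by the max avoids overflow
            sum += x;
        }
        for (float& x : logits)
            x /= sum;  // normalize to probabilities
    }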
Feed-Forward Network (FFN) optimization types:
- FFN optimizations (overview)
- — FFN pruning
- — FFN approximation
- — Fused add-bias
- — Bias vector pruning
- — FFN sparsity
- — FFN alternatives/replacements
- — Integer-only FFN
- — Bias optimizations
MatMul/GEMM optimization types (a tiled MatMul sketch follows this list):
- MatMul/GEMM kernel optimizations (overview)
- — Faster matrix multiplication (e.g. Winograd, Strassen)
- — Approximate matrix multiplication
- — Transpose cache
- — Fused multiply-add (FMA)
- — Fused transpose
- — Vector dot product optimization
- — Sparse MatMul/GEMM
- — Tiled MatMul
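Tiled MatMul is easy to sketch: block the classic triple loop into TILE-sized chunks so the working set stays in cache (C is assumed zero-initialized; production kernels add vectorization and unrolling on top):

    // Tiled (blocked) matrix multiply for row-major n x n matrices.
    #include <algorithm>
    #include <vector>

    constexpr int TILE = 32;

    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n) {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    for (int i = ii; i < std::min(ii + TILE, n); ++i)
                        for (int k = kk; k < std::min(kk + TILE, n); ++k) {
                            float a = A[i * n + k];  // reused across the j loop
                            for (int j = jj; j < std::min(jj + TILE, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }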
Positional Encoding optimizations:
- Positional encoding optimization (overview)
- — RoPE (Rotary Positional Encoding)
- — Pruning positional encoding (removal/NoPE)
- — Positional encoding approximation
- — Integer-only positional encoding
NAS subtypes:
- Neural Architecture Search (NAS)
- — Dynamic NAS
- — Embedding Size Optimization (embeddings NAS)
Platform-specific optimization subtypes:
- On-device inference (native phone and PC AI)
- AI Phones
- AI PCs (desktops/laptops)
- Edge device inference (IoT/mobile/PC)
- Hybrid cloud-on-device inference
Decoding algorithm subtypes (greedy and top-k sketches follow this list):
- Decoding algorithms (overview)
- — Non-autoregressive decoding
- — Greedy decoding
- — Top-k decoding
- — Top-p decoding
- — Min-P Sampling
- — Flash decoding
- — Beam search decoding
- — Edit decoding
- — Contrastive decoding
- — Approximate top-k algorithms
- — Bidirectional decoding
- — Constrained decoding
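Here's a minimal sketch of two of these strategies, greedy decoding and top-k sampling, over a vector of token probabilities (assumes k is at most the vocabulary size):

    // Greedy picks the argmax token; top-k samples among the k best only.
    #include <algorithm>
    #include <iterator>
    #include <numeric>
    #include <random>
    #include <vector>

    int greedy_decode(const std::vector<float>& probs) {
        return static_cast<int>(std::distance(
            probs.begin(), std::max_element(probs.begin(), probs.end())));
    }

    int top_k_decode(const std::vector<float>& probs, int k, std::mt19937& rng) {
        std::vector<int> idx(probs.size());
        std::iota(idx.begin(), idx.end(), 0);
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                          [&](int a, int b) { return probs[a] > probs[b]; });
        std::vector<float> top(k);
        for (int i = 0; i < k; ++i) top[i] = probs[idx[i]];
        std::discrete_distribution<int> dist(top.begin(), top.end());
        return idx[dist(rng)];  // the distribution renormalizes the top-k mass
    }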
Parallel Decoding algorithms:
- Parallel decoding
- — Blockwise parallel decoding
- — n-gram parallel decoding
- — Lookahead decoding
- — Medusa decoding
- — Consensus decoding
- — Mutually-guided decoding
- — Multi-token generation
Speculative decoding subtypes (a draft-and-verify sketch follows this list):
- Speculative decoding (overview)
- — Generalized speculative decoding
- — Aggressive decoding
- — Lookup decoding
- — Retrieval lookup decoding
- — Prompt lookup decoding
- — Self speculative decoding
- — Tree speculative decoding
- — Superposed decoding
- — Hierarchical speculative decoding
- — Heuristic speculative decoding
- — Multi-token speculative decoding
- — Sequential speculative decoding
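The core draft-and-verify loop of speculative decoding fits in a few lines. In this minimal sketch, draft_next and target_verify are hypothetical placeholders: the small draft model proposes k tokens autoregressively, and the large target model checks them all in one batched call, keeping the longest agreeing prefix plus one corrected token:

    // One speculative decoding step (greedy-acceptance variant).
    #include <functional>
    #include <vector>

    using Tokens = std::vector<int>;

    Tokens speculative_step(
        Tokens context, int k,
        const std::function<int(const Tokens&)>& draft_next,  // cheap draft model
        const std::function<Tokens(const Tokens&, const Tokens&)>& target_verify)
    {
        Tokens proposal;
        Tokens extended = context;
        for (int i = 0; i < k; ++i) {  // draft k tokens, one at a time
            int t = draft_next(extended);
            proposal.push_back(t);
            extended.push_back(t);
        }
        // The target model scores all k positions in parallel and returns
        // the accepted prefix plus one corrected token, so every step
        // emits at least one token.
        Tokens accepted = target_verify(context, proposal);
        context.insert(context.end(), accepted.begin(), accepted.end());
        return context;
    }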
Parameter Efficient Fine-Tuning (PEFT) subtypes:
- PEFT (overview)
- — LoRA
- — Multi-LoRA inference
- — QLoRA (Quantized Low-Rank Adapters)
- — LoRA inference optimizations (load/unload)
- — Prompt Tuning (Extended Vocabulary PEFT)
Ensemble multi-LLM subtypes:
- Ensemble inference (overview of multi-model AI engines)
- — Mixture of Experts (MoE)
- — Model selection algorithms
- — Big-little architectures
- — Cascades
- — Collaborative inference
- — Consensus decoding
- — Swarm ensemble architectures
- — Committee ensemble architectures
- — Ensemble averaging
- — Easy-hard queries
- — Submodels (Many-Models-in-One)
- — Distributed Inference
Orchestration, Deployment and Serving:
- Cloud inference servers
- Orchestration frameworks
- Scheduling optimizations
- Serving
- Load balancing
- Batching
- Continuous batching
- Deployment
- Serverless
- Networking optimizations
Attention optimization subtypes:
- Attention optimizations (overview)
- — Multi-Head Attention (MHA)
- — Group Query Attention (GQA)
- — Multi-Query Attention (MQA)
- — Sparse attention
- — Local attention
- — Memory-efficient attention algorithms
- — Flash Attention
- — Paged Attention
- — Linear attention
- — Cross attention
- — Tree attention
- — Sliding window attention
- — Approximate attention heads
- — Attention alternatives/replacements
- — Fused MHA
- — Low-rank matrix attention
- — Medusa attention
- — Block attention
- — Fused head attention
- — Hybrid local-global attention
- — FFT attention
- — QKV computation optimizations
- — Additive attention
- — Multiplicative attention
- — Graph attention
- — Chunked attention
- — Attention sink
- — Attention steering
- — Bilinear attention
- — Attention-free methods
- — Mixture-of-Heads (MOH) Attention (MoE+MHA)
- — Star attention
- — Flex attention
- — Razor attention
Long context optimizations (attention):
- — Long context models
- — Length generalization
- — Quadratic attention complexity
- — Long RAG
Caching optimizations (an inference cache sketch follows this list):
- Caching (overview)
- — Inference Cache (text-to-text)
- — Inference cache (global KV caching)
- — Prompt caching
- — Input Similarity-Based Caching (frame skipping in video)
- — Semantic caching (text-to-text)
- — Semantic KV caching
- — Vector database caching
- — Chatbot caching
- — Vector Caching (Vector hashing)
- — Caching vector dot products
- — Caching general theory
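The simplest of these, an exact-match text-to-text inference cache, is just a hash map in front of the engine. A minimal sketch, where run_model is a hypothetical placeholder for a full inference call (note this only pays off with deterministic decoding, such as greedy):

    // Exact-match inference cache: identical prompts skip the model entirely.
    #include <functional>
    #include <string>
    #include <unordered_map>

    class InferenceCache {
    public:
        std::string generate(
            const std::string& prompt,
            const std::function<std::string(const std::string&)>& run_model) {
            auto it = cache_.find(prompt);
            if (it != cache_.end()) return it->second;   // hit: zero inference cost
            std::string completion = run_model(prompt);  // miss: full inference
            cache_.emplace(prompt, completion);
            return completion;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };

Real deployments normalize prompts and add an eviction policy (e.g., LRU); semantic caching replaces the exact-match lookup with a vector similarity search.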
KV cache optimizations (a KV cache data structure sketch follows this list):
- KV Caching (overview)
- — KV cache global (multi-query KV caching)
- — KV cache reuse
- — Global semantic KV caching (difficult!)
- — Context cache (global KV caching)
- — Prefix KV Caching
- — KV cache recomputation with early exit
- — Session KV cache (multi-turn KV caching)
- — Substring/fused KV cache (Lengthwise-fused KV caching)
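At its core, a KV cache is per-layer storage that grows by one key vector and one value vector per generated token, so attention over past tokens is never recomputed. A minimal sketch of the data structure:

    // Per-layer KV cache: append once per token, reuse on every later step.
    #include <utility>
    #include <vector>

    struct LayerKVCache {
        std::vector<std::vector<float>> keys;    // one K vector per past token
        std::vector<std::vector<float>> values;  // one V vector per past token

        void append(std::vector<float> k, std::vector<float> v) {
            keys.push_back(std::move(k));
            values.push_back(std::move(v));
        }
        size_t tokens_cached() const { return keys.size(); }
    };

The memory-reduction techniques in the next list all target this structure, since it grows linearly with context length times layers times heads.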
KV cache memory size reduction:
- KV cache compression
- — KV cache quantization
- — KV cache sparsity
- — KV cache token pruning
- — KV cache eviction policies
- — KV cache layer fusion
- — KV cache layer pruning
- — KV Cache low-rank matrix factorization
Non-Multiplication AI Models (a bitshift-multiply sketch follows this list):
- Zero-Multiplication Models (overview)
- — Binary quantization
- — Ternary quantization
- — 2-bit quantization (INT2)
- — Adder networks
- — Bitshift-add networks
- — Bitshift power-of-2 quantization (logarithmic quantization)
- — Double bitshift quantization
- — Add-as-integer networks
- — Logarithmic Models
- — Bitwise neural networks
- — Diff-squared networks
- — Log-sum-exp (LSE) networks
- — Max-Plus networks
- — Min-Max-Plus networks
- — Morphological networks
- — Trigonometric approximate inference
- — Weightless Neural Networks (WNNs)
- — XNOR networks
- — Hadamard elementwise matrix multiplication models
- — Other addition-related zero-multiplication networks
- — Table lookups replace multiplication
- — Other multiplication-free neural networks
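As one example from this family, here's a minimal sketch of bitshift power-of-two (logarithmic) quantization: each weight is stored as a sign plus an exponent, so multiplying an activation by a weight becomes an integer shift (assumes a non-negative, e.g. post-RELU, activation and an exponent within shift range):

    // Power-of-two weight: magnitude ~ 2^exponent, so multiply == shift.
    #include <cmath>
    #include <cstdint>

    struct Pow2Weight {
        int8_t sign;      // +1 or -1
        int8_t exponent;  // weight magnitude ~ 2^exponent
    };

    Pow2Weight quantize_pow2(float w) {
        Pow2Weight p;
        p.sign = (w < 0.0f) ? -1 : 1;
        p.exponent = static_cast<int8_t>(
            std::lround(std::log2(std::fabs(w) + 1e-20f)));  // guard against log2(0)
        return p;
    }

    int32_t shift_multiply(int32_t activation, Pow2Weight w) {
        int32_t r = (w.exponent >= 0) ? (activation << w.exponent)
                                      : (activation >> -w.exponent);
        return (w.sign < 0) ? -r : r;
    }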
Advanced Number System optimizations:
- Advanced Number Systems (overview)
- — Posit number system (PNS)
- — Residue number system (RNS)
- — Dyadic numbers
- — Double-base number system (DBNS)
- — Dynamic number systems
- — Hybrid number systems
- — Tropical algebra (max-plus)
- — MiniMax algebra
- — Multi-dimensional logarithmic number system (MDLNS)
- — Multiple-Base Number System (MBNS)
- — Semi-Logarithmic Number System (SLNS)
- — Lattice algebra
Logarithmic Number System optimizations:
- Logarithmic number system (LNS) (overview)
- — End-to-end LNS logarithmic model
- — LNS addition and subtraction
- — LNS in AI models
- — LNS Hardware Acceleration
- — LNS mathematical and algorithmic theory
- — LNS algebra
- — LNS extensions
Prefill phase optimizations:
- Prefill optimizations (overview)
- — Chunked prefill
- — Disaggregated prefill scheduling (Phase splitting)
- — Deep prefill, shallow decoder architecture
- — Mini-prefill recomputation
Parallel Programming Optimization Techniques:
- Parallelization techniques (overview)
- — Hardware acceleration
- — Hardware-software co-design
- — Vectorization
- — Pipelining (pipeline parallelism)
- — Overlapping (new)
- — Overlapping communications and computation (new)
- — Overlapping rematerialization (new)
- — Overlapping memory access & computation (new)
- — Offloading
- — Partitioning
- — Dataflow optimizations
- — Sharding
- — Data parallelism
- — Query parallelism
- — Tensor parallelism
- — Model parallelism
- — Prefetching
- — Speculative execution
- — Sequence Parallelism
- — Skeleton-of-Thought (Query Parallelism)
Hardware Optimizations:
- Hardware Acceleration (overview)
- — Software accelerations
- — Hardware-software co-design
- — GPU
- — GPU software platforms
- — Multi-GPU
- — CPU Execution
- — Single Instruction Multiple Data (SIMD)
- — AVX (AVX/AVX-2/AVX-512)
- — ARM NEON
- — Neural Processing Unit (NPU)
- — Overclocking CPU
- — Overclocking GPU
- — Assembly language
RAG Architecture Optimizations:
- RAG architectures (overview)
- — RAG cache
- — RAG optimizations
- — RAG retriever datastore indexing
- — Advanced RAG
- — Speculative RAG
- — Reranker in RAG
- — Chunk-specific global KV caching
- — Chunk-specific prefix KV caching
- — RAG Knowledge Graph
Sparsity Optimizations:
- Sparsification techniques (overview)
- — Activation Sparsity
- — Dynamic Sparsity
- — Block sparsity
- — Vector sparsity
- — Tensor sparsity
- — Sparse matrix kernels
- — Outlier-aware sparsification
Memory Utilization Optimizations:
- Memory optimization techniques (overview)
- — Parameter sharing
- — Model compression
- — Low-bit integer quantization
- — Binary quantization
- — Ternary quantization
- — Layer fusion
- — Recomputation: trading time for space
- — Memory-bound versus CPU-bound
- — Data locality optimization
- — Compute-in-Memory (CIM) architectures
- — Memory cache management algorithms
- — Kernel operator fusion
- — Flash Inference (FlashInfer)
- — Checkpointing
- — Offloading
Numerical representation subtypes:
- Floating-point representations (overview)
- — Floating Point Bit Tricks
- — Block floating-point arithmetic
- — Fixed point number system (FXP) optimizations
- — Floating point number system (FLP) optimizations
- — Floating point bitwise arithmetic
- — FTZ/DAZ floating point CPU settings
Kernel optimizations:
- Kernel optimizations (overview)
- — Kernel operator fusion (merging)
- — Kernel fission (splitting)
- — Kernel tiling
- — Operator reordering
- — Graph operator fusion (Deep learning compilers)
Computation optimizations:
- Advanced AI Mathematics
- Approximate activation functions
- Caching / memoization
- Computation reuse
- Precomputation
- Source code precomputation
- Conditional computation
- Approximations
- Integer-only arithmetic quantization
- Weight precomputations
- Zero-skipping
- — Low-Level Zero Skipping
- — High-Level Zero Skipping
- Negative skipping
- Approximate caching
- End-to-End integer inference
- Padding usage
- Incremental inference (new)
Arithmetic optimizations (an integer dot product sketch follows this list):
- Integer operations
- Addition optimizations
- Bitwise operation tricks
- Approximate addition
- Multiplication algorithms
- Approximate division
- Approximate multiplication
- Bitwise operator inference
- Bitserial operations
- Division optimizations
- Logarithmic approximate multiplication
- Integer Dot Product
- Vector dot product optimization
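The integer dot product is the workhorse underneath most of these. A minimal sketch with INT8 operands accumulated into INT32 to avoid overflow, which is exactly the pattern that hardware dot-product instructions accelerate:

    // INT8 dot product with 32-bit accumulation (assumes equal-length vectors).
    #include <cstdint>
    #include <vector>

    int32_t dot_int8(const std::vector<int8_t>& a, const std::vector<int8_t>& b) {
        int32_t acc = 0;
        for (size_t i = 0; i < a.size(); ++i)
            acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
        return acc;  // multiply by the two quantization scales to recover a float
    }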
Advanced matrix algebra optimizations:
- Matrix Algebra (overview)
- — Approximate matrix multiplication
- — Butterfly matrices
- — Monarch matrices
- — Sparse matrices (sparsification)
Low-rank matrix optimizations:
- Low-rank matrix factorization (overview)
- — Tensor decomposition
- — Tucker decomposition
- — Embedding low-rank matrix factorization
- — KV Cache low-rank matrix factorization
Transformer architectural optimizations:
- Transformer architectures (overview)
- — Transformer low-level optimizations (overview)
- — Adaptive Inference
- — Integer-only Transformers
- — Approximate Transformers
- — Decoder-Only Architectures
- — Encoder-Only Architectures
- — Encoder-Decoder Architectures
Transformers and LLMs:
- Open source models
- Inference frameworks
- Open source frameworks
Next-Generation Transformer architectures:
- Next-generation architectures (overview)
- — Hybrid Transformer architectures
- — Newer Transformer architectures
- — BERT (encoder)
- — State Space Models (SSMs)
- — Mamba
- — RWKV
- — Knowledge graph AI architectures
- — Compound AI architectures
General Classes of Optimization Techniques:
- Dynamic inference (adaptive inference)
- Skipping
- Heuristics
- Probabilistic optimizations
- Approximate computing
- Code optimizations
- Deep learning compilers
- Incremental algorithms
- Fuzzy logic
Loop Optimizations (a loop unrolling sketch follows this list):
- Loop optimizations (overview)
- — Inference loop optimizations
- — Loop fusion (merging loops)
- — Loop unrolling
- — Loop perforation
- — Loop reordering
- — Loop tiling
- — Loop reversal
- — Loop fission (splitting a loop)
- — Loop interleave
- — Loop interchange
- — Loop coalescing
- — Loop-invariant code motion ("hoisting")
- — Loop distribution
- — Pointer arithmetic
- — Loop peeling (unrolling first iterations)
- — Loop splitting
- — Loop sentinel
- — Loop collapsing
- — Loop normalization
- — Loop strip mining (Loop sectioning)
- — Loop skewing
- — Loop spreading
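As a taste of these, here's a minimal sketch of 4-way loop unrolling on a vector sum, with a peeled remainder loop; the four independent accumulators also expose more instruction-level parallelism:

    // 4-way unrolled vector sum with a scalar cleanup loop.
    #include <vector>

    float sum_unrolled(const std::vector<float>& v) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        size_t i = 0;
        size_t n4 = v.size() & ~static_cast<size_t>(3);  // round down to multiple of 4
        for (; i < n4; i += 4) {  // unrolled main loop
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        for (; i < v.size(); ++i) s0 += v[i];  // leftover elements
        return (s0 + s1) + (s2 + s3);
    }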
Low-Level Coding Efficiency (a strength reduction sketch follows this list):
- Code optimizations (overview)
- — Constant folding
- — Common subexpression elimination
- — Algebraic identities
- — Strength reduction
- — Type consistency
- — Reciprocal multiplication
- — References vs pointers
- — Compile-time optimizations
- — Pointer arithmetic
- — Algorithm-level optimizations
- — Lazy evaluation
- — Memory reduction heuristics
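Two of these are easy to show together: reciprocal multiplication (divide once, multiply many times) and strength reduction (replace the multiply inside a loop with an addition). A minimal sketch:

    #include <vector>

    // Reciprocal multiplication: one division hoisted out of the loop.
    void scale_by_inverse(std::vector<float>& v, float divisor) {
        float reciprocal = 1.0f / divisor;
        for (float& x : v) x *= reciprocal;  // multiply is cheaper than divide
    }

    // Strength reduction: compute v[i] = i * step with additions only.
    void fill_multiples(std::vector<int>& v, int step) {
        int value = 0;
        for (int& x : v) {
            x = value;
            value += step;
        }
    }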
Data Structures for AI optimization (a look-up table sketch follows this list):
- — Hashing
- — Perfect hashing
- — Look-up tables (LUTs)
- — Bloom filters
- — Trees
- — Tries
- — Bitserial operations
- — Permutation arrays
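Look-up tables pair naturally with quantization: an INT8 input has only 256 possible values, so any scalar activation function can be precomputed once and replaced by an array index at inference time. A minimal sketch for a sigmoid LUT:

    // Precomputed sigmoid over all 256 INT8 inputs (scale dequantizes q).
    #include <cmath>
    #include <cstdint>

    struct SigmoidLUT {
        float table[256];
        explicit SigmoidLUT(float scale) {
            for (int q = -128; q <= 127; ++q) {
                float x = static_cast<float>(q) * scale;
                table[static_cast<uint8_t>(q)] = 1.0f / (1.0f + std::exp(-x));
            }
        }
        float operator()(int8_t q) const {
            return table[static_cast<uint8_t>(q)];  // one array read per call
        }
    };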
Vector Data Structures:
- — Parallel data structures
- — Bit vectors
- — Vector hashing
- — Locality-Sensitive Hashing (LSH)
- — Vector dot product caching
- — Bit signatures (vector algorithm)
- — K-means clustering (vector algorithm)
- — Hyper-Cube (vector algorithm)
Convolution Optimizations in CNNs:
- Convolution optimizations (overview)
- — Grouped convolutions
- — Depth-wise separable convolutions
Tokenization and Vocabulary Optimizations:
- Tokenization (overview)
- — Tokenizer and model inference latency
- — Semantic tokenization
- — Tokenization for Machine Vision
- — Tokenization of non-English languages
- Vocabulary optimizations:
- — Vocabulary size
- — Lexical shortlisting
- — Vocabulary trimming
- — Vocabulary expansion
- — Dynamic vocabulary pruning
Overall summaries of AI optimizations:
- — Deslugging AI engines
- — Accuracy-degrading optimizations
- — Accuracy-retaining optimizations
- — Uncommon inference optimizations
Not Enough?
More inference optimization resources:
- What's Hot in Inference Optimization?
- Inference optimization research overview
- Research Blog
- Patents in inference optimization
More AI Research Topics
Read more about: