Aussie AI Blog

500+ LLM Inference Optimization Techniques

  • September 2nd, 2024
  • Updated: September 20, 2024
  • by David Spuler, Ph.D.

LLM Inference Optimization

We do a lot of research on inference optimization techniques, so here's a very long list of all the techniques for which we have research papers. There are more than 500 now, but see our earlier blog post if you only want to know about the latest LLM inference techniques.

LLM Inference Optimizations List

Here's the list! It's over 500 and growing!

    Model compression main subtypes:
  1. Model compression (overview)
  2. Pruning (overview)
  3. Quantization (overview)
  4. Knowledge Distillation (KD)
  5. Parameter sharing (weight sharing)
  6. Low-rank matrices
  7. Small Language Models (SLMs)
  8. Data compression algorithms

    Pruning main types:
  9. Dynamic pruning
  10. Hybrid pruning
  11. Unstructured pruning
  12. Semi-Structured Pruning
  13. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  14. Depthwise structural pruning (overview)
  15. Static layer pruning
  16. Layer pruning
  17. Early exit
  18. Dynamic layer pruning
  19. Layer skipping
  20. Layer approximation
  21. Shallow decoder architecture
  22. Layer reordering
  23. Layer Importance
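
As a flavor of how the dynamic techniques above work, here's a toy sketch of confidence-based early exit: inference stops once an intermediate prediction is confident enough. The per-layer confidences are faked here; a real engine would compute them from intermediate classifier heads:

    // Toy sketch of dynamic early exit (all values illustrative).
    #include <cstdio>
    #include <vector>

    int main() {
        // Pretend confidences from per-layer classifier heads.
        std::vector<float> confidence = {0.31f, 0.55f, 0.83f, 0.97f, 0.99f};
        const float kExitThreshold = 0.9f;  // tunable speed/accuracy knob
        size_t layers_run = 0;
        for (size_t i = 0; i < confidence.size(); ++i) {
            ++layers_run;  // ... run transformer layer i here ...
            if (confidence[i] >= kExitThreshold) {
                std::printf("Early exit after layer %zu\n", i + 1);
                break;  // skip the remaining layers entirely
            }
        }
        std::printf("Ran %zu of %zu layers\n", layers_run, confidence.size());
    }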

    Width-wise structured pruning subtypes:
  24. Widthwise structural pruning (overview)
  25. Attention head pruning
  26. Slimmable networks (width pruning)
  27. FFN pruning
  28. Channel pruning
  29. Filter pruning

    Length-wise structured pruning subtypes:
  30. Lengthwise structural pruning (longitudinal/input/end-to-end)
  31. Token pruning (input pruning)
  32. Dynamic token pruning
  33. Prompt compression
  34. Context compression
  35. Token merging
  36. Token skipping
  37. Token dropping
  38. Zero padding removal

    Model dimension embedding pruning subtypes:
  39. Embedding-dimension pruning
  40. Embedding pruning
  41. Embedding matrix compression (embedding pruning)
  42. Embedding low-rank matrix factorization
  43. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  44. Multi-dimensional pruning
  45. Dual pruning
  46. Triple pruning
  47. Quadruple pruning
  48. 3D CNN model pruning

    Transformer component pruning:
  49. Normalization pruning
  50. Positional embeddings pruning
  51. Softmax pruning
  52. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  53. Unstructured pruning (overview)
  54. Magnitude pruning
  55. Movement pruning
  56. — Gradual pruning

    Quantization subtypes:
  57. Post-Training Quantization (PTQ)
  58. Quantization-Aware Training (QAT)
  59. Activation Quantization
  60. Outlier-aware quantization

    Integer quantization subtypes:
  61. Integer quantization (overview)
  62. Integer-only arithmetic quantization
  63. Fixed-point quantization (integer)
  64. Low-bit integer quantization (overview)
  65. Binary quantization
  66. Ternary quantization
  67. 2-bit quantization (INT2)
  68. 3-bit quantization (INT3)
  69. 4-bit quantization (INT4)
  70. 5-bit quantization (INT5)
  71. 6-bit quantization (INT6)
  72. 7-bit quantization (INT7)
  73. 8-bit quantization (INT8)
  74. 9-bit quantization (INT9)
  75. 10-bit quantization (INT10)
  76. 11-bit quantization (INT11)
  77. 12-bit quantization (INT12)
  78. 16-bit quantization (INT16)
  79. 32-bit quantization (INT32)
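
To make this concrete, here's a minimal sketch of symmetric INT8 post-training quantization, with a single scale factor for the whole tensor derived from the largest absolute weight. All names and values are illustrative, not from any particular engine:

    // Minimal sketch of symmetric INT8 quantization (illustrative only).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One scale for the whole tensor: the largest |w| maps to 127.
    float compute_scale(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        return maxabs / 127.0f;
    }

    int8_t quantize(float x, float scale) {
        int q = (int)std::lround(x / scale);
        if (q > 127) q = 127;    // clamp to the symmetric INT8 range
        if (q < -127) q = -127;
        return (int8_t)q;
    }

    float dequantize(int8_t q, float scale) { return q * scale; }

    int main() {
        std::vector<float> w = {0.12f, -1.5f, 0.03f, 2.2f};
        float scale = compute_scale(w);
        for (float x : w) {
            int8_t q = quantize(x, scale);
            std::printf("%g -> %d -> %g\n", x, q, dequantize(q, scale));
        }
    }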

    Floating-point quantization subtypes:
  80. Floating-point quantization
  81. FP4 quantization
  82. FP6 quantization
  83. FP8 quantization
  84. FP16 quantization
  85. FP32 quantization

    Other quantization subtypes:
  86. Mixed-precision quantization
  87. Logarithmic power-of-two quantization (bitshift quantization)
  88. Double bitshift power-of-two quantization
  89. Division quantization
  90. Cluster-based quantization (Weight clustering)
  91. Hashing-based weight clustering
  92. Dyadic quantization
  93. Fake quantization
  94. Simulated quantization
  95. Stochastic quantization (probabilistic)

    Granularity-level quantization subtypes:
  96. Granular quantization (overview)
  97. Layerwise Quantization
  98. Blockwise Quantization
  99. Vector quantization

    Knowledge distillation subtypes:
  100. Knowledge Distillation (overview)
  101. Ensemble Distillation
  102. Unnatural instructions (data sets)
  103. Dataset Distillation

    Parameter/weight sharing subtypes:
  104. Parameter/Weight sharing (overview)
  105. Activation sharing
  106. Layer fusion
  107. Clustering (Weights)
  108. Attention head fusion
  109. FFN fusion
  110. KV cache layer fusion (depthwise)
  111. KV cache head fusion (widthwise)

    Activation function optimizations:
  112. Activation function optimizations (overview)
  113. Activation function approximation
  114. Integer-only activation functions
  115. Fused activation functions (kernel fusion)
  116. Fused RELU
  117. Fused GELU
  118. Fused SwiGLU
  119. Activation alternatives/replacements
  120. Activation function pruning/removal (bilinear layers)
  121. Activation function reordering

    Normalization optimization types:
  122. Normalization algorithm optimizations (overview)
  123. Approximate normalization
  124. Norm reordering (pre-norm/post-norm)
  125. Integer-only normalization
  126. Normalization alternatives/replacements
  127. Fused normalization (e.g. "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  128. Softmax optimizations (overview)
  129. Softmax pruning
  130. Approximate Softmax
  131. Softmax alternatives/replacements
  132. Integer-only Softmax
  133. Fused Softmax
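
For reference, this is the baseline numerically-stable softmax that these techniques prune, approximate, fuse, or replace. It's plain C++, not any particular engine's kernel:

    // Baseline softmax with the max-subtraction trick for numerical stability.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    void softmax(std::vector<float>& v) {
        float mx = v[0];
        for (float x : v) if (x > mx) mx = x;  // subtract the max to avoid overflow
        float sum = 0.0f;
        for (float& x : v) { x = std::exp(x - mx); sum += x; }
        for (float& x : v) x /= sum;           // normalize to probabilities
    }

    int main() {
        std::vector<float> logits = {2.0f, 1.0f, 0.1f};
        softmax(logits);
        for (float p : logits) std::printf("%g\n", p);
    }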

    Feed-Forward Network (FFN) optimization types:
  134. FFN optimizations (overview)
  135. FFN pruning
  136. FFN approximation
  137. Fused add-bias
  138. Bias vector pruning
  139. FFN sparsity
  140. FFN alternatives/replacements
  141. Integer-only FFN
  142. — Bias optimizations

    MatMul/GEMM optimization types:
  143. MatMul/GEMM kernel optimizations (overview)
  144. Faster matrix multiplication (e.g. Winograd, Strassen)
  145. Approximate matrix multiplication
  146. Transpose cache
  147. Fused multiply-add (FMA)
  148. Fused transpose
  149. Vector dot product optimization
  150. Sparse MatMul/GEMM
  151. — Tiled MatMul
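
As an example from this group, here's a sketch of tiled (blocked) matrix multiplication, which works on small cache-friendly blocks instead of streaming entire rows. Matrix and tile sizes are illustrative:

    // Sketch of loop tiling for MatMul; real kernels add SIMD and threading.
    #include <cstdio>
    #include <vector>

    constexpr int N = 64, TILE = 8;  // TILE divides N, so no edge cases

    void matmul_tiled(const float* A, const float* B, float* C) {
        for (int i0 = 0; i0 < N; i0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                for (int k0 = 0; k0 < N; k0 += TILE)
                    // Multiply one TILE x TILE block while it's hot in cache.
                    for (int i = i0; i < i0 + TILE; ++i)
                        for (int j = j0; j < j0 + TILE; ++j) {
                            float sum = C[i * N + j];
                            for (int k = k0; k < k0 + TILE; ++k)
                                sum += A[i * N + k] * B[k * N + j];
                            C[i * N + j] = sum;
                        }
    }

    int main() {
        std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);
        matmul_tiled(A.data(), B.data(), C.data());
        std::printf("C[0][0] = %g (expect %d)\n", C[0], N);
    }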

    Positional Encoding optimizations:
  152. Positional encoding optimization (overview)
  153. RoPE (Rotary Positional Encoding)
  154. Pruning positional encoding (removal/NoPE)
  155. — Positional encoding approximation
  156. — Integer-only positional encoding

    NAS subtypes:
  157. Neural Architecture Search (NAS)
  158. Dynamic NAS
  159. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  160. On-device inference (native phone and PC AI)
  161. AI Phones
  162. AI PCs (desktops/laptops)
  163. Edge device inference (IoT/mobile/PC)
  164. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  165. Decoding algorithms (overview)
  166. Non-autoregressive decoding
  167. Greedy decoding
  168. Top-k decoding
  169. Top-p decoding
  170. Min-P Sampling
  171. Flash decoding
  172. Beam search decoding
  173. Edit decoding
  174. Contrastive decoding
  175. — Approximate top-k algorithms
  176. — Bidirectional decoding
  177. Constrained decoding
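
Here's a toy contrast of greedy decoding versus top-k sampling on a single probability vector. The probabilities are made up, and a real decoder would re-run the model for every generated token:

    // Greedy vs. top-k sampling over one token distribution (illustrative).
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <random>
    #include <vector>

    int main() {
        std::vector<float> probs = {0.50f, 0.25f, 0.15f, 0.07f, 0.03f};
        // Greedy decoding: always pick the argmax token.
        int greedy = (int)(std::max_element(probs.begin(), probs.end()) - probs.begin());
        // Top-k (k=3): keep only the k most likely tokens, then sample.
        const int k = 3;
        std::vector<float> sorted = probs;
        std::nth_element(sorted.begin(), sorted.begin() + (k - 1), sorted.end(),
                         std::greater<float>());
        float cutoff = sorted[k - 1];  // probability of the k-th best token
        std::vector<float> topk = probs;
        for (float& p : topk) if (p < cutoff) p = 0.0f;
        std::mt19937 rng(42);  // discrete_distribution renormalizes for us
        std::discrete_distribution<int> dist(topk.begin(), topk.end());
        std::printf("greedy token = %d, top-k sampled token = %d\n", greedy, dist(rng));
    }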

    Parallel Decoding algorithms:
  178. Parallel decoding
  179. Blockwise parallel decoding
  180. n-gram parallel decoding
  181. Lookahead decoding
  182. Medusa decoding
  183. Consensus decoding
  184. — Mutually-guided decoding
  185. — Multi-token generation

    Speculative decoding subtypes:
  186. Speculative decoding (overview)
  187. Generalized speculative decoding
  188. Aggressive decoding
  189. Lookup decoding
  190. Retrieval lookup decoding
  191. Prompt lookup decoding
  192. Self speculative decoding
  193. Tree speculative decoding
  194. Superposed decoding
  195. Hierarchical speculative decoding
  196. Heuristic speculative decoding
  197. Multi-token speculative decoding
  198. Sequential speculative decoding
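
Most of these variants share the same control flow, sketched below as a toy: a cheap draft model proposes several tokens autoregressively, the large model verifies them in one pass, and tokens are accepted up to the first disagreement. Both "models" here are arithmetic stand-ins:

    // Toy control flow of speculative decoding (models are fakes).
    #include <cstdio>
    #include <vector>

    int draft_model(int prev) { return (prev * 7 + 1) % 10; }   // fast drafter
    int target_model(int prev) { return (prev * 7 + 1) % 10; }  // big verifier (greedy)

    int main() {
        const int kDraftLen = 4;
        int token = 3;
        std::vector<int> out{token};
        while (out.size() < 12) {
            // 1. Draft kDraftLen tokens with the small model.
            std::vector<int> draft;
            for (int i = 0, t = token; i < kDraftLen; ++i) {
                t = draft_model(t);
                draft.push_back(t);
            }
            // 2. Verify with the big model; accept until the first mismatch.
            int accepted = 0, prev = token;
            for (int d : draft) {
                int want = target_model(prev);
                if (want != d) { out.push_back(want); prev = want; break; }
                out.push_back(d); prev = d; ++accepted;
            }
            token = prev;
            std::printf("accepted %d of %d drafted tokens\n", accepted, kDraftLen);
        }
    }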

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  199. PEFT (overview)
  200. LoRA
  201. Multi-LoRA inference
  202. QLoRA (Quantized Low-Rank Adapters)
  203. LoRA inference optimizations (load/unload)
  204. Prompt Tuning (Extended Vocabulary PEFT)

    Ensemble multi-LLM subtypes:
  205. Ensemble inference (overview of multi-model AI engines)
  206. Mixture of Experts (MoE)
  207. Model selection algorithms
  208. Big-little architectures
  209. Cascades
  210. Collaborative inference
  211. Consensus decoding
  212. — Swarm ensemble architectures
  213. — Committee ensemble architectures
  214. — Ensemble averaging
  215. Easy-hard queries
  216. Submodels (Many-Models-in-One)
  217. Distributed Inference

    Orchestration, Deployment and Serving:
  218. Cloud inference servers
  219. Orchestration frameworks
  220. Scheduling optimizations
  221. Serving
  222. Load balancing
  223. Batching
  224. Continuous batching
  225. Deployment
  226. Serverless
  227. Networking optimizations

    Attention optimization subtypes:
  228. Attention optimizations (overview)
  229. Multi-Head Attention (MHA)
  230. Group Query Attention (GQA)
  231. Multi-Query Attention (MQA)
  232. Sparse attention
  233. Local attention
  234. Memory-efficient attention algorithms
  235. Flash Attention
  236. Paged Attention
  237. Linear attention
  238. Cross attention
  239. Tree attention
  240. Sliding window attention
  241. Approximate attention heads
  242. Attention alternatives/replacements
  243. Fused MHA
  244. Low-rank matrix attention
  245. Medusa attention
  246. Block attention
  248. Fused head attention
  249. Hybrid local-global attention
  250. FFT attention
  251. QKV computation optimizations
  252. Additive attention
  253. Multiplicative attention
  254. Graph attention
  255. Chunked attention
  256. Attention sink
  257. Attention steering
  258. Bilinear attention
  259. Attention-free methods
  260. Mixture-of-Heads (MOH) Attention (MoE+MHA)
  261. Star attention
  262. Flex attention
  263. Razor attention

    Long context optimizations (attention):
  264. Long context models
  265. Length generalization
  266. Quadratic attention complexity
  267. Long RAG

    Caching optimizations:
  268. Caching (overview)
  269. Inference Cache (text-to-text)
  270. Inference cache (global KV caching)
  271. Prompt caching
  272. Input Similarity-Based Caching (frame skipping in video)
  273. Semantic caching (text-to-text)
  274. Semantic KV caching
  275. Vector database caching
  276. Chatbot caching
  277. Vector Caching (Vector hashing)
  278. Caching vector dot products
  279. Caching general theory

    KV cache optimizations:
  280. KV Caching (overview)
  281. KV cache global (multi-query KV caching)
  282. KV cache reuse
  283. Global semantic KV caching (difficult!)
  284. Context cache (global KV caching)
  285. Prefix KV Caching
  286. KV cache recomputation with early exit
  287. Session KV cache (multi-turn KV caching)
  288. Substring/fused KV cache (Lengthwise-fused KV caching)
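
Underneath all of these is the basic KV cache data structure, sketched minimally below: keys and values of past tokens are stored so that each new token attends against cached entries instead of recomputing them. The layout and sizes are illustrative:

    // Minimal per-layer KV cache sketch (single head, illustrative layout).
    #include <cstdio>
    #include <vector>

    struct KVCache {
        int head_dim;
        std::vector<float> keys;    // [n_tokens * head_dim], grows each token
        std::vector<float> values;  // same layout as keys
        void append(const std::vector<float>& k, const std::vector<float>& v) {
            keys.insert(keys.end(), k.begin(), k.end());
            values.insert(values.end(), v.begin(), v.end());
        }
        int tokens() const { return (int)keys.size() / head_dim; }
    };

    int main() {
        KVCache cache{4};  // head dimension 4
        for (int t = 0; t < 3; ++t)  // prefill: cache K and V for 3 tokens
            cache.append(std::vector<float>(4, (float)t), std::vector<float>(4, (float)t));
        // Decode step: dot the new token's query against every cached key.
        std::vector<float> q(4, 1.0f);
        for (int t = 0; t < cache.tokens(); ++t) {
            float score = 0.0f;
            for (int d = 0; d < 4; ++d) score += q[d] * cache.keys[t * 4 + d];
            std::printf("attention score vs token %d: %g\n", t, score);
        }
    }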

    KV cache memory size reduction:
  289. KV cache compression
  290. KV cache quantization
  291. KV cache sparsity
  292. KV cache token pruning
  293. KV cache eviction policies
  294. KV cache layer fusion
  295. KV cache layer pruning
  296. KV Cache low-rank matrix factorization

    Non-Multiplication AI Models:
  297. Zero-Multiplication Models (overview)
  298. Binary quantization
  299. Ternary quantization
  300. 2-bit quantization (INT2)
  301. Adder networks
  302. Bitshift-add networks
  303. Bitshift power-of-2 quantization (logarithmic quantization)
  304. Double bitshift quantization
  305. Add-as-integer networks
  306. Logarithmic Models
  307. Bitwise neural networks
  308. Diff-squared networks
  309. Log-sum-exp (LSE) networks
  310. Max-Plus networks
  311. Min-Max-Plus networks
  312. Morphological networks
  313. Trigonometric approximate inference
  314. Weightless Neural Networks (WNNs)
  315. XNOR networks
  316. Hadamard elementwise matrix multiplication models
  317. Other addition-related zero-multiplication networks
  318. Table lookups replace multiplication
  319. Other multiplication-free neural networks
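
For a taste of the bitshift family, here's a minimal sketch of power-of-two (logarithmic) quantization: a weight is rounded to its nearest power of two, so multiplication by that weight becomes an integer shift. Sign handling is omitted for brevity:

    // Power-of-two weight quantization: multiply becomes a bitshift.
    #include <cmath>
    #include <cstdio>

    int main() {
        float w = 0.23f;  // weight to quantize (positive, for simplicity)
        // Round log2(w) to the nearest integer exponent: 0.23 ~= 2^-2.
        int expo = (int)std::lround(std::log2(w));
        int activation = 100;
        // x * w  ~=  x * 2^expo, computed as a shift.
        int product = (expo >= 0) ? (activation << expo) : (activation >> -expo);
        std::printf("approx 100 * %g = %d (exact %g)\n", w, product, 100 * w);
    }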

    Advanced Number System optimizations:
  320. Advanced Number Systems (overview)
  321. Posit number system (PNS)
  322. Residue number system (RNS)
  323. Dyadic numbers
  324. Double-base number system (DBNS)
  325. Dynamic number systems
  326. Hybrid number systems
  327. Tropical algebra (max-plus)
  328. MiniMax algebra
  329. Multi-dimensional logarithmic number system (MDLNS)
  330. Multiple-Base Number System (MBNS)
  331. — Semi-Logarithmic Number System (SLNS)
  332. — Lattice algebra

    Logarithmic Number System optimizations:
  333. Logarithmic number system (LNS) (overview)
  334. End-to-end LNS logarithmic model
  335. LNS addition and subtraction
  336. LNS in AI models
  337. LNS Hardware Acceleration
  338. LNS mathematical and algorithmic theory
  339. LNS algebra
  340. LNS extensions
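
In a nutshell, the core LNS trick is to store log2 of every value so that multiplication becomes addition, while addition acquires a correction term. This tiny sketch shows the textbook identities, not any particular LNS implementation:

    // LNS basics: multiply = add in the log domain; add needs a correction.
    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 6.0, b = 7.0;
        double la = std::log2(a), lb = std::log2(b);
        // Multiplication: log2(a*b) = la + lb.
        double log_product = la + lb;
        // Addition: log2(a+b) = la + log2(1 + 2^(lb - la)).
        double log_sum = la + std::log2(1.0 + std::exp2(lb - la));
        std::printf("a*b = %g, a+b = %g\n", std::exp2(log_product), std::exp2(log_sum));
    }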

    Prefill phase optimizations:
  341. Prefill optimizations (overview)
  342. Chunked prefill
  343. Disaggregated prefill scheduling (Phase splitting)
  344. Deep prefill, shallow decoder architecture
  345. Mini-prefill recomputation

    Parallel Programming Optimization Techniques:
  346. Parallelization techniques (overview)
  347. Hardware acceleration
  348. Hardware-software co-design
  349. Vectorization
  350. Pipelining (pipeline parallelism)
  351. Overlapping (new)
  352. Overlapping communications and computation (new)
  353. Overlapping rematerialization (new)
  354. Overlapping memory access & computation (new)
  355. Offloading
  356. Partitioning
  357. Dataflow optimizations
  358. — Sharding
  359. — Overlapping
  360. Data parallelism
  361. Query parallelism
  362. Tensor parallelism
  363. Model parallelism
  364. — Prefetching
  365. — Speculative execution
  366. Sequence Parallelism
  367. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  368. Hardware Acceleration (overview)
  369. Software accelerations
  370. Hardware-software co-design
  371. GPU
  372. GPU software platforms
  373. Multi-GPU
  374. CPU Execution
  375. Single Instruction Multiple Data (SIMD)
  376. AVX (AVX/AVX-2/AVX-512)
  377. — ARM NEON
  378. Neural Processing Unit (NPU)
  379. — Overclocking CPU
  380. — Overclocking GPU
  381. Assembly language

    RAG Architecture Optimizations:
  382. RAG architectures (overview)
  383. RAG cache
  384. RAG optimizations
  385. — RAG retriever datastore indexing
  386. Advanced RAG
  387. — Speculative RAG
  388. Reranker in RAG
  389. — Chunk-specific global KV caching
  390. — Chunk-specific prefix KV caching
  391. RAG Knowledge Graph

    Sparsity Optimizations:
  392. Sparsification techniques (overview)
  393. Activation Sparsity
  394. Dynamic Sparsity
  395. Block sparsity
  396. Vector sparsity
  397. Tensor sparsity
  398. Sparse matrix kernels
  399. Outlier-aware sparsification
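
The payoff from sparsity comes from skipping zeros entirely, as in this minimal sketch of a dot product against a sparse weight vector stored as (index, value) pairs. The format is a simplified stand-in for real CSR-style kernels:

    // Zero-skipping dot product over a sparse weight vector (illustrative).
    #include <cstdio>
    #include <vector>

    struct SparseVec {
        std::vector<int> idx;    // positions of the nonzero weights
        std::vector<float> val;  // the nonzero weights themselves
    };

    float sparse_dot(const SparseVec& w, const float* x) {
        float sum = 0.0f;
        for (size_t i = 0; i < w.idx.size(); ++i)
            sum += w.val[i] * x[w.idx[i]];  // zero weights never touched
        return sum;
    }

    int main() {
        float x[6] = {1, 2, 3, 4, 5, 6};
        SparseVec w{{0, 3, 5}, {0.5f, -1.0f, 2.0f}};  // 3 of 6 weights nonzero
        std::printf("dot = %g\n", sparse_dot(w, x));  // 0.5 - 4 + 12 = 8.5
    }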

    Memory Utilization Optimizations:
  400. Memory optimization techniques (overview)
  401. Parameter sharing
  402. Model compression
  403. Low-bit integer quantization
  404. Binary quantization
  405. Ternary quantization
  406. Layer fusion
  407. Recomputation: trading time for space
  408. Memory-bound versus CPU-bound
  409. — Data locality optimization
  410. — Compute-in-Memory (CIM) architectures
  411. — Memory cache management algorithms
  412. Kernel operator fusion
  413. — Flash Inference (FlashInfer)
  414. — Checkpointing
  415. Offloading

    Numerical representation subtypes:
  416. Floating-point representations (overview)
  417. Floating Point Bit Tricks
  418. Block floating-point arithmetic
  419. Fixed point number system (FXP) optimizations
  420. Floating point number system (FLP) optimizations
  421. Floating point bitwise arithmetic
  422. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  423. Kernel optimizations (overview)
  424. Kernel operator fusion (merging)
  425. Kernel fission (splitting)
  426. Kernel tiling
  427. — Operator reordering
  428. Graph operator fusion (Deep learning compilers)

    Computation optimizations:
  429. Advanced AI Mathematics
  430. Approximate activation functions
  431. Caching / memoization
  432. Computation reuse
  433. Precomputation
  434. Source code precomputation
  435. Conditional computation
  436. Approximations
  437. Integer-only arithmetic quantization
  438. Weight precomputations
  439. Zero-skipping
  440. Low-Level Zero Skipping
  441. High-Level Zero Skipping
  442. Negative skipping
  443. Approximate caching
  444. End-to-End integer inference
  445. Padding usage
  446. Incremental inference (new)

    Arithmetic optimizations:
  447. Integer operations
  448. Addition optimizations
  449. Bitwise operation tricks
  450. Approximate addition
  451. Multiplication algorithms
  452. Approximate division
  453. Approximate multiplication
  454. Bitwise operator inference
  455. Bitserial operations
  456. Division optimizations
  457. Logarithmic approximate multiplication
  458. Integer Dot Product
  459. Vector dot product optimization

    Advanced matrix algebra optimizations:
  460. Matrix Algebra (overview)
  461. Approximate matrix multiplication
  462. Butterfly matrices
  463. Monarch matrices
  464. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  465. Low-rank matrix factorization (overview)
  466. — Tensor decomposition
  467. — Tucker decomposition
  468. Embedding low-rank matrix factorization
  469. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  470. Transformer architectures (overview)
  471. Transformer low-level optimizations (overview)
  472. — Adaptive Inference
  473. Integer-only Transformers
  474. Approximate Transformers
  475. Decoder-Only Architectures
  476. Encoder-Only Architectures
  477. Encoder-Decoder Architectures

    Transformers and LLMs:
  478. Open source models
  479. Inference frameworks
  480. Open source frameworks

    Next-Generation Transformer architectures:
  481. Next-generation architectures (overview)
  482. Hybrid Transformer architectures
  483. Newer Transformer architectures
  484. BERT (encoder)
  485. — State Space Models (SSMs)
  486. Mamba
  487. RWKV
  488. Knowledge graph AI architectures
  489. Compound AI architectures

    General Classes of Optimization Techniques:
  490. Dynamic inference (adaptive inference)
  491. Skipping
  492. Heuristics
  493. Probabilistic optimizations
  494. Approximate computing
  495. Code optimizations
  496. Deep learning compilers
  497. Incremental algorithms
  498. Fuzzy logic

    Loop Optimizations:
  499. Loop optimizations (overview)
  500. Inference loop optimizations
  501. Loop fusion (merging loops)
  502. Loop unrolling
  503. Loop perforation
  504. Loop reordering
  505. Loop tiling
  506. Loop reversal
  507. Loop fission (splitting a loop)
  508. — Loop interleave
  509. Loop interchange
  510. Loop coalescing
  511. Loop-invariant code motion ("hoisting")
  512. Loop distribution
  513. Pointer arithmetic
  514. Loop peeling (unrolling first iterations)
  515. Loop splitting; Loop sentinel
  516. Loop collapsing
  517. Loop normalization
  518. Loop strip mining (Loop sectioning)
  519. Loop skewing
  520. Loop spreading
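
Here's a toy example combining two of the transformations above, loop fusion and 4-way loop unrolling, on a simple vector workload. The array size is chosen as a multiple of 4 so no cleanup loop is needed:

    // Loop fusion (one pass instead of two) plus 4-way unrolling.
    #include <cstdio>

    constexpr int N = 1024;  // multiple of 4: no leftover iterations
    float a[N], b[N], c[N];

    void fused_unrolled() {
        // Fused: scale and add in the same sweep over memory.
        // Unrolled: four elements per iteration, fewer loop-condition tests.
        for (int i = 0; i < N; i += 4) {
            c[i]     = a[i]     * 2.0f + b[i];
            c[i + 1] = a[i + 1] * 2.0f + b[i + 1];
            c[i + 2] = a[i + 2] * 2.0f + b[i + 2];
            c[i + 3] = a[i + 3] * 2.0f + b[i + 3];
        }
    }

    int main() {
        for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
        fused_unrolled();
        std::printf("c[0] = %g\n", c[0]);  // 4.0
    }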

    Low-Level Coding Efficiency:
  521. Code optimizations (overview)
  522. Constant folding
  523. Common subexpression elimination
  524. Algebraic identities
  525. Strength reduction
  526. Type consistency
  527. Reciprocal multiplication
  528. References vs pointers
  529. Compile-time optimizations
  530. Pointer arithmetic
  531. Algorithm-level optimizations
  532. Lazy evaluation
  533. Memory reduction heuristics
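
Two of these in one small sketch: strength reduction (replacing a division by a constant with multiplication by its reciprocal), combined with hoisting the loop-invariant reciprocal out of the loop:

    // Strength reduction + loop-invariant code motion (illustrative).
    #include <cstdio>

    int main() {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float divisor = 3.0f;
        const float recip = 1.0f / divisor;  // hoisted: computed once, not per iteration
        float sum = 0.0f;
        for (float x : data)
            sum += x * recip;  // multiply is much cheaper than divide
        std::printf("sum/3 = %g\n", sum);
    }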

    Data Structures for AI optimization:
  534. Hashing
  535. Perfect hashing
  536. Look-up tables (LUTs)
  537. Bloom filters
  538. — Trees
  539. — Tries
  541. Bitserial operations
  542. Permutation arrays
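
As an example, here's a sketch of a look-up table (LUT) standing in for repeated calls to a slow activation function, in this case a tanh-based GELU approximation over a clamped input range. The table size and range are arbitrary choices:

    // Precomputed LUT replacing per-call GELU evaluation (illustrative).
    #include <cmath>
    #include <cstdio>

    constexpr int kEntries = 256;
    constexpr float kLo = -8.0f, kHi = 8.0f;
    float lut[kEntries];

    float gelu(float x) {  // tanh approximation of GELU
        return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
    }

    void build_lut() {  // done once at startup
        for (int i = 0; i < kEntries; ++i)
            lut[i] = gelu(kLo + (kHi - kLo) * i / (kEntries - 1));
    }

    float gelu_lut(float x) {  // cheap lookup, no transcendental math
        if (x <= kLo) return lut[0];
        if (x >= kHi) return lut[kEntries - 1];
        int i = (int)((x - kLo) / (kHi - kLo) * (kEntries - 1));
        return lut[i];
    }

    int main() {
        build_lut();
        std::printf("gelu(1.3): exact=%g lut=%g\n", gelu(1.3f), gelu_lut(1.3f));
    }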

    Vector Data Structures:
  543. Parallel data structures
  544. Bit vectors
  545. Vector hashing
  546. Locality-Sensitive Hashing (LSH)
  547. Vector dot product caching
  548. — Bit signatures (vector algorithm)
  549. — K-means clustering (vector algorithm)
  550. — Hyper-Cube (vector algorithm)

    Convolution Optimizations in CNNs:
  551. Convolution optimizations (overview)
  552. Grouped convolutions
  553. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  554. Tokenization (overview)
  555. Tokenizer and model inference latency
  556. Semantic tokenization
  557. Tokenization for Machine Vision
  558. Tokenization of non-English languages
  559. Vocabulary optimizations (overview)
  560. Vocabulary size
  561. Lexical shortlisting
  562. Vocabulary trimming
  563. Vocabulary expansion
  564. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  565. Deslugging AI engines
  566. Accuracy-degrading optimizations
  567. Accuracy-retaining optimizations
  568. Uncommon inference optimizations

Not Enough?

More inference optimization resources are available on the Aussie AI website.
