Aussie AI Blog

500+ LLM Inference Optimization Techniques

  • Updated: March 3rd, 2025
  • by David Spuler, Ph.D.

LLM Inference Optimization

We do a lot of research on inference optimization techniques, so here's a very long list of all the techniques for which we have research papers. There are more than 500 now, but see our earlier blog post if you only want to know about the latest LLM inference techniques.

Update in March 2025: well, now we're into 2025 and this list has outgrown its title. There are over 600 items on the list below, all of which relate to LLM efficiency. The main change in 2025 is that the recent release of "reasoning models" has spawned a new area of research: optimizing the efficiency of LLM reasoning algorithms such as Chain-of-Thought.

LLM Inference Optimizations List

Here's the list! It's over 600 and growing!

    Reasoning Efficiency Optimization (REO): it's the latest hot research area in 2025!
  1. Reasoning inference optimization (RIO) (blog article)
  2. Chain-of-Thought (CoT) optimization
  3. CoT token reduction
  4. CoT step skipping
  5. CoT path reduction
  6. CoT early stopping
  7. CoT reasoning decoding
  8. Constrained CoT
  9. Coconut
  10. Concise CoT
  11. Hidden CoT (interim steps in latent space)
  12. CoT prompt sequence optimizations
  13. CoT sparsity
  14. CoT distillation
  15. Long context CoT
  16. Small reasoning models
  17. Reasoning tokens
  18. Adaptive inference time compute
  19. One-step reasoning models (e.g. DeepSeek R1's long answers)

    Model compression main subtypes:
  20. Model compression (overview)
  21. Pruning (overview)
  22. Quantization (overview)
  23. Knowledge Distillation (KD)
  24. Parameter sharing (weight sharing)
  25. Low-rank matrices
  26. Small Language Models (SLMs)
  27. Data compression algorithms

    Pruning main types:
  28. Dynamic pruning
  29. Hybrid pruning
  30. Unstructured pruning
  31. Semi-Structured Pruning
  32. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  33. Depthwise structural pruning (overview)
  34. Static layer pruning
  35. Layer pruning
  36. Early exit
  37. Dynamic layer pruning
  38. Layer skipping
  39. Layer approximation
  40. Shallow decoder architecture
  41. Layer reordering
  42. Layer Importance

    Width-wise structured pruning subtypes:
  43. Widthwise structural pruning (overview)
  44. Attention head pruning
  45. Slimmable networks (width pruning)
  46. FFN pruning
  47. Channel pruning
  48. Filter pruning

    Length-wise structured pruning subtypes:
  49. Lengthwise structural pruning (longitudinal/input/end-to-end)
  50. Token pruning (input pruning)
  51. Dynamic token pruning
  52. Prompt compression
  53. Context compression
  54. Token merging
  55. Token skipping
  56. Token dropping
  57. Zero padding removal
  58. Token reduction
  59. Token compression
  60. Input text compression

    Model dimension embedding pruning subtypes:
  61. Embedding-dimension pruning
  62. Embedding pruning
  63. Embedding matrix compression (embedding pruning)
  64. Embedding low-rank matrix factorization
  65. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  66. Multi-dimensional pruning
  67. Dual pruning
  68. Triple pruning
  69. Quadruple pruning
  70. 3D CNN model pruning
  71. Pyramid inference

    Transformer component pruning:
  72. Normalization pruning
  73. Positional embeddings pruning
  74. Softmax pruning
  75. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  76. Unstructured pruning (overview)
  77. Magnitude pruning
  78. Movement pruning
  79. — Gradual pruning

    Quantization theory and major subtypes:
  80. Post-Training Quantization (PTQ)
  81. Quantization-Aware Training (QAT)
  82. Activation Quantization
  83. Outlier-aware quantization
  84. Dequantization

    Integer quantization subtypes:
  85. Integer quantization (overview)
  86. Integer-only arithmetic quantization
  87. Fixed-point quantization (integer)
  88. Low-bit integer quantization (overview)
  89. Binary quantization
  90. Ternary quantization
  91. 2-bit quantization (INT2)
  92. 3-bit quantization (INT3)
  93. 4-bit quantization (INT4)
  94. 5-bit quantization (INT5)
  95. 6-bit quantization (INT6)
  96. 7-bit quantization (INT7)
  97. 8-bit quantization (INT8) (see the sketch after this list)
  98. 9-bit quantization (INT9)
  99. 10-bit quantization (INT10)
  100. 11-bit quantization (INT11)
  101. 12-bit quantization (INT12)
  102. 16-bit quantization (INT16)
  103. 32-bit quantization (INT32)
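
As a concrete illustration, here's a minimal C++ sketch of item 97 (INT8): symmetric per-tensor quantization using a simple absolute-maximum scale. The function names are illustrative, and production kernels usually add per-channel scales and outlier handling:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor INT8 quantization: one scale maps the range
    // [-absmax, +absmax] onto [-127, +127]; dequantization multiplies back.
    struct QuantizedTensor {
        std::vector<int8_t> data;
        float scale;  // per-tensor granularity (item 120 covers finer levels)
    };

    QuantizedTensor quantize_int8(const std::vector<float>& w) {
        float absmax = 0.0f;
        for (float x : w) absmax = std::max(absmax, std::fabs(x));
        QuantizedTensor q;
        q.scale = (absmax > 0.0f) ? absmax / 127.0f : 1.0f;
        q.data.reserve(w.size());
        for (float x : w) {
            int v = static_cast<int>(std::lround(x / q.scale));
            q.data.push_back(static_cast<int8_t>(std::clamp(v, -127, 127)));
        }
        return q;
    }

    float dequantize_at(const QuantizedTensor& q, size_t i) {
        return q.data[i] * q.scale;  // approximate reconstruction
    }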

    Floating-point quantization subtypes:
  104. Floating-point quantization
  105. FP4 quantization
  106. FP6 quantization
  107. FP8 quantization
  108. FP16 quantization
  109. FP32 quantization

    Other quantization subtypes:
  110. Mixed-precision quantization
  111. Logarithmic power-of-two quantization (bitshift quantization; see the sketch after this list)
  112. Double bitshift power-of-two quantization
  113. Division quantization
  114. Cluster-based quantization (Weight clustering)
  115. Hashing-based weight clustering
  116. Dyadic quantization
  117. Fake quantization
  118. Simulated quantization
  119. Stochastic quantization (probabilistic)
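
Item 111 is worth a sketch because it removes multiplications entirely: each weight is rounded to a signed power of two, so multiplying by it becomes a bitshift. A minimal C++ sketch, assuming integer activations and arithmetic shift semantics (guaranteed only since C++20); the names are illustrative:

    #include <cmath>
    #include <cstdint>

    // Power-of-two (logarithmic) quantization: store a sign and an exponent,
    // so weight * x becomes a shift of the integer activation x.
    struct Pow2Weight {
        int8_t sign;      // +1 or -1
        int8_t exponent;  // weight magnitude rounded to 2^exponent
    };

    Pow2Weight quantize_pow2(float w) {
        Pow2Weight q;
        q.sign = (w < 0.0f) ? -1 : +1;
        float mag = std::fabs(w);
        q.exponent = (mag > 0.0f)
            ? static_cast<int8_t>(std::lround(std::log2(mag)))
            : static_cast<int8_t>(-127);  // sentinel: treat as zero weight
        return q;
    }

    int32_t pow2_multiply(int32_t x, Pow2Weight q) {
        if (q.exponent == -127) return 0;  // zero weight contributes nothing
        int32_t shifted = (q.exponent >= 0) ? (x << q.exponent)
                                            : (x >> -q.exponent);
        return q.sign * shifted;  // a shift and a sign flip, no multiply
    }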

    Granularity-level quantization subtypes:
  120. Granular quantization (overview)
  121. Layerwise Quantization
  122. Blockwise Quantization
  123. Vector quantization

    Knowledge distillation subtypes:
  124. Knowledge Distillation (overview)
  125. Ensemble Distillation
  126. Unnatural instructions (data sets)
  127. Dataset Distillation
  128. Black Box Distillation
  129. White Box Distillation

    Parameter/weight sharing subtypes:
  130. Parameter/Weight sharing (overview)
  131. Activation sharing
  132. Layer fusion
  133. Clustering (Weights)
  134. Attention head fusion
  135. FFN fusion
  136. KV cache layer fusion (depthwise)
  137. KV cache head fusion (widthwise)

    Activation function optimizations:
  138. Activation function optimizations (overview)
  139. Activation function approximation
  140. Integer-only activation functions
  141. Fused activation functions (kernel fusion)
  142. Fused RELU
  143. Fused GELU
  144. Fused SwiGLU
  145. Activation alternatives/replacements
  146. Activation function pruning/removal (bilinear layers)
  147. Activation function reordering

    Normalization optimization types:
  148. Normalization algorithm optimizations (overview)
  149. Approximate normalization
  150. Norm reordering (pre-norm/post-norm)
  151. Integer-only normalization
  152. Normalization alternatives/replacements
  153. Fused normalization (e.g. "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  154. Softmax optimizations (overview; baseline sketch after this list)
  155. Softmax pruning
  156. Approximate Softmax
  157. Softmax alternatives/replacements
  158. Integer-only Softmax
  159. Fused Softmax
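
For orientation, the baseline that these softmax optimizations start from is the numerically stable softmax, which subtracts the maximum logit before exponentiating so that exp() cannot overflow. A minimal C++ sketch:

    #include <cmath>
    #include <vector>

    // Numerically stable softmax: three passes (max, exp/sum, normalize).
    // Fused softmax kernels merge these passes; approximate versions
    // replace exp() with something cheaper.
    std::vector<float> softmax(const std::vector<float>& logits) {
        float maxv = logits[0];
        for (float x : logits) if (x > maxv) maxv = x;
        std::vector<float> out(logits.size());
        float sum = 0.0f;
        for (size_t i = 0; i < logits.size(); ++i) {
            out[i] = std::exp(logits[i] - maxv);  // in (0, 1], no overflow
            sum += out[i];
        }
        for (float& x : out) x /= sum;  // normalize to probabilities
        return out;
    }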

    Feed-Forward Network (FFN) optimization types:
  160. FFN optimizations (overview)
  161. FFN pruning
  162. FFN approximation
  163. Fused add-bias
  164. Bias vector pruning
  165. FFN sparsity
  166. FFN alternatives/replacements
  167. Integer-only FFN
  168. FFN fusion (shared parameters)
  169. — Bias optimizations
  170. — FFN matrix merging

    MatMul/GEMM optimization types:
  171. MatMul/GEMM kernel optimizations (overview)
  172. Faster matrix multiplication (e.g. Winograd, Strassen)
  173. Approximate matrix multiplication
  174. Transpose cache
  175. Fused multiply-add (FMA)
  176. Fused transpose
  177. Vector dot product optimization
  178. Sparse MatMul/GEMM
  179. — Tiled MatMul (see the sketch after this list)
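
A minimal C++ sketch of tiled MatMul (item 179): the classic triple loop is blocked so that each tile of A and B is reused while it is still cache-resident. The tile size is a placeholder to be tuned per platform:

    #include <algorithm>
    #include <vector>

    constexpr int TILE = 32;  // illustrative; tune to the cache hierarchy

    // C += A * B for square n x n row-major matrices, processed in
    // TILE x TILE blocks for cache locality.
    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n) {
        for (int i0 = 0; i0 < n; i0 += TILE)
            for (int k0 = 0; k0 < n; k0 += TILE)
                for (int j0 = 0; j0 < n; j0 += TILE)
                    for (int i = i0; i < std::min(i0 + TILE, n); ++i)
                        for (int k = k0; k < std::min(k0 + TILE, n); ++k) {
                            float a = A[i * n + k];  // reused across the j loop
                            for (int j = j0; j < std::min(j0 + TILE, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }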

    Positional Encoding optimizations:
  180. Positional encoding optimization (overview)
  181. RoPE (Rotary Positional Encoding)
  182. Pruning positional encoding (removal/NoPE)
  183. — Positional encoding approximation
  184. — Integer-only positional encoding

    NAS subtypes:
  185. Neural Architecture Search (NAS)
  186. Dynamic NAS
  187. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  188. On-device inference (native phone and PC AI)
  189. AI Phones
  190. AI PCs (desktops/laptops)
  191. Edge device inference (IoT/mobile/PC)
  192. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  193. Decoding algorithms (overview)
  194. Non-autoregressive decoding
  195. Greedy decoding
  196. Top-k decoding (see the sketch after this list)
  197. Top-p decoding
  198. Min-P Sampling
  199. Flash decoding
  200. Beam search decoding
  201. Edit decoding
  202. Contrastive decoding
  203. — Approximate top-k algorithms
  204. — Bidirectional decoding
  205. Constrained decoding
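
Greedy and top-k decoding (items 195 and 196) are the simplest entries here, and a short C++ sketch shows the contrast. Names are illustrative; real decoders layer temperature and top-p on top of this:

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Greedy decoding: always emit the single most probable token.
    int greedy_decode(const std::vector<float>& probs) {
        return static_cast<int>(
            std::max_element(probs.begin(), probs.end()) - probs.begin());
    }

    // Top-k decoding: sample among only the k most probable tokens.
    int topk_decode(const std::vector<float>& probs, int k, std::mt19937& rng) {
        std::vector<int> idx(probs.size());
        std::iota(idx.begin(), idx.end(), 0);
        // Partial sort beats fully sorting a 100k-token vocabulary.
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                          [&](int a, int b) { return probs[a] > probs[b]; });
        std::vector<float> topk(k);
        for (int i = 0; i < k; ++i) topk[i] = probs[idx[i]];
        std::discrete_distribution<int> dist(topk.begin(), topk.end());
        return idx[dist(rng)];  // distribution renormalizes the k weights
    }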

    Parallel Decoding algorithms:
  206. Parallel decoding
  207. Blockwise parallel decoding
  208. n-gram parallel decoding
  209. Lookahead decoding
  210. Medusa decoding
  211. Consensus decoding
  212. — Mutually-guided decoding
  213. — Multi-token generation
  214. — Eagle decoding

    Speculative decoding subtypes:
  215. Speculative decoding (overview)
  216. Generalized speculative decoding
  217. Aggressive decoding
  218. Lookup decoding
  219. Retrieval lookup decoding
  220. Prompt lookup decoding
  221. — Multi-query prompt lookup decoding (across entire LLM history)
  222. Self speculative decoding
  223. Tree speculative decoding
  224. Superposed decoding
  225. Hierarchical speculative decoding
  226. Heuristic speculative decoding
  227. Multi-token speculative decoding
  228. Sequential speculative decoding
  229. Eagle speculative decoding
  230. — Redrafting

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  231. PEFT (overview)
  232. LoRA
  233. Multi-LoRA inference
  234. QLoRa (Quantized Low-Rank Adapters)
  235. LoRA inference optimizations (load/unload)
  236. Prompt Tuning (Extended Vocabulary PEFT)
  237. Prefix Tuning

    Ensemble multi-LLM subtypes:
  238. Ensemble inference (overview of multi-model AI engines)
  239. Mixture of Experts (MoE)
  240. Model selection algorithms
  241. Big-little architectures
  242. Cascades
  243. Collaborative inference
  244. Consensus decoding
  245. — Swarm ensemble architectures
  246. — Committee ensemble architectures
  247. — Ensemble averaging
  248. Easy-hard queries
  249. Submodels (Many-Models-in-One)
  250. Distributed Inference

    Orchestration, Deployment and Serving:
  251. Cloud inference servers
  252. Orchestration frameworks
  253. Scheduling optimizations
  254. Serving
  255. Load balancing
  256. Batching
  257. Continuous batching
  258. Deployment
  259. Serverless
  260. Networking optimizations
  261. In-flight batching

    Attention optimization subtypes:
  262. Attention optimizations (overview)
  263. Multi-Head Attention (MHA)
  264. Group Query Attention (GQA)
  265. Multi-Query Attention (MQA)
  266. Sparse attention
  267. Local attention
  268. Memory-efficient attention algorithms
  269. Flash Attention
  270. Paged Attention
  271. Linear attention
  272. Cross attention
  273. Tree attention
  274. Sliding window attention (see the sketch after this list)
  275. Approximate attention heads
  276. Attention alternatives/replacements
  277. Fused MHA
  278. Low-rank matrix attention
  279. Medusa attention
  280. Block attention
  282. Fused head attention
  283. Hybrid local-global attention
  284. FFT attention
  285. QKV computation optimizations
  286. Additive attention
  287. Multiplicative attention
  288. Graph attention
  289. Chunked attention
  290. Attention sink
  291. Attention steering
  292. Bilinear attention
  293. Attention-free methods
  294. Mixture-of-Heads (MOH) Attention (MoE+MHA)
  295. Star attention
  296. Ring attention
  297. — Flex attention
  298. — Razor attention
  299. — Contiguous QKV tensor
  300. — Relative Attention Bias (RAB)
  301. Lightning attention
  302. Multi-head Latent Attention (MLA) (DeepSeek)
  304. — Round attention
  305. Delta attention
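
Of the reduced-cost variants above, sliding window attention (item 274) is easy to show concretely: each query attends only to the previous `window` positions, so cost is O(n * window) rather than O(n^2). A minimal single-head C++ sketch; the row-major n x d layout and the names are my assumptions:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Single-head sliding window attention. Q, K, V, out are n x d
    // row-major; out must be preallocated to n * d floats.
    void sliding_window_attention(const std::vector<float>& Q,
                                  const std::vector<float>& K,
                                  const std::vector<float>& V,
                                  std::vector<float>& out,
                                  int n, int d, int window) {
        const float inv_sqrt_d = 1.0f / std::sqrt(static_cast<float>(d));
        for (int q = 0; q < n; ++q) {
            int lo = std::max(0, q + 1 - window);  // oldest visible position
            std::vector<float> score(q - lo + 1);
            float maxv = -1e30f;
            for (int k = lo; k <= q; ++k) {  // scores over the window only
                float s = 0.0f;
                for (int j = 0; j < d; ++j) s += Q[q * d + j] * K[k * d + j];
                score[k - lo] = s * inv_sqrt_d;
                maxv = std::max(maxv, score[k - lo]);
            }
            float sum = 0.0f;
            for (float& s : score) { s = std::exp(s - maxv); sum += s; }
            for (int j = 0; j < d; ++j) {  // weighted sum of windowed values
                float acc = 0.0f;
                for (int k = lo; k <= q; ++k)
                    acc += (score[k - lo] / sum) * V[k * d + j];
                out[q * d + j] = acc;
            }
        }
    }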

    Long context optimizations (attention):
  306. Long context models
  307. Length generalization
  308. Quadratic attention complexity
  309. Long RAG

    Caching optimizations:
  310. Caching (overview)
  311. Inference Cache (text-to-text; see the sketch after this list)
  312. Inference cache (global KV caching)
  313. Prompt caching
  314. Input Similarity-Based Caching (frame skipping in video)
  315. Semantic caching (text-to-text)
  316. Semantic KV caching
  317. Vector database caching
  318. Chatbot caching
  319. Vector Caching (Vector hashing)
  320. Caching vector dot products
  321. Caching general theory
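
The simplest of these is an exact-match inference cache (item 311): if the identical prompt arrives twice, return the stored response with no model call at all. A minimal C++ sketch; a production cache adds eviction and size limits, and semantic caching replaces the exact-string key with embedding similarity:

    #include <optional>
    #include <string>
    #include <unordered_map>

    // Exact-match text-to-text inference cache.
    class InferenceCache {
    public:
        std::optional<std::string> lookup(const std::string& prompt) const {
            auto it = cache_.find(prompt);
            if (it == cache_.end()) return std::nullopt;  // cache miss
            return it->second;                            // hit: skip the LLM
        }
        void store(const std::string& prompt, const std::string& response) {
            cache_[prompt] = response;  // real servers add LRU-style eviction
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };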

    KV cache optimizations:
  322. KV Caching (overview; see the sketch after this list)
  323. KV cache global (multi-query KV caching)
  324. KV cache reuse
  325. Global semantic KV caching (difficult!)
  326. Context cache (global KV caching)
  327. Prefix KV Caching
  328. KV cache recomputation with early exit
  329. Session KV cache (multi-turn KV caching)
  330. Substring/fused/concatenated KV cache (Lengthwise-fused KV caching)
  331. — Paged KV caching (related to paged attention)
  332. — KV cache offloading (to CPU)
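
The core idea behind all of these is small enough to sketch: during autoregressive decoding, the keys and values of already-processed tokens are computed once, appended to a cache, and reused at every later step. A minimal single-head C++ sketch with an illustrative layout:

    #include <vector>

    // Per-head KV cache: one K and one V vector of length head_dim is
    // appended per token, so each decoding step computes K/V only for
    // the newest token instead of recomputing the whole prefix.
    struct KVCache {
        int head_dim = 0;
        std::vector<float> keys;    // num_tokens * head_dim, row-major
        std::vector<float> values;  // num_tokens * head_dim, row-major

        int num_tokens() const {
            return static_cast<int>(keys.size()) / head_dim;
        }
        void append(const std::vector<float>& k, const std::vector<float>& v) {
            keys.insert(keys.end(), k.begin(), k.end());
            values.insert(values.end(), v.begin(), v.end());
        }
    };
    // Prefix KV caching (item 327) extends this across requests: prompts
    // that share a prefix can share the cached K/V entries for that prefix.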

    KV cache memory size reduction:
  333. KV cache compression
  334. KV cache quantization
  335. KV cache sparsity
  336. KV cache token pruning
  337. KV cache eviction policies
  338. KV cache layer fusion
  339. KV cache layer pruning
  340. KV Cache low-rank matrix factorization
  341. — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)
  342. — KV cache token merging
  343. — KV head fusion
  344. — KV head pruning
  345. — KV mixed-precision quantization
  346. — KV context compression
  347. — KV block pruning

    Non-Multiplication AI Models:
  348. Zero-Multiplication Models (overview)
  349. Binary quantization
  350. Ternary quantization
  351. 2-bit quantization (INT2)
  352. Adder networks
  353. Bitshift-add networks
  354. Bitshift power-of-2 quantization (logarithmic quantization)
  355. Double bitshift quantization
  356. Add-as-integer networks
  357. Logarithmic Models
  358. Bitwise neural networks
  359. Diff-squared networks
  360. Log-sum-exp (LSE) networks
  361. Max-Plus networks
  362. Min-Max-Plus networks
  363. Morphological networks
  364. Trigonometric approximate inference
  365. Weightless Neural Networks (WNNs)
  366. XNOR networks
  367. Hadamard elementwise matrix multiplication models
  368. Other addition-related zero-multiplication networks
  369. Table lookups replace multiplication (see the sketch after this list)
  370. Other multiplication-free neural networks
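
Item 369 is easy to make concrete: two 8-bit quantized operands have only 65,536 possible products, so every one of them can be precomputed. A minimal C++ sketch (128 KB of table at 16 bits per product); whether a table load actually beats a hardware multiply is highly platform-dependent:

    #include <array>
    #include <cstdint>

    // Precomputed 8-bit x 8-bit multiplication table: one load replaces
    // one integer multiply.
    struct MulLUT {
        std::array<int16_t, 256 * 256> table;

        MulLUT() {
            for (int a = -128; a < 128; ++a)
                for (int b = -128; b < 128; ++b)
                    table[(a & 0xFF) * 256 + (b & 0xFF)] =
                        static_cast<int16_t>(a * b);
        }
        int16_t mul(int8_t a, int8_t b) const {
            return table[static_cast<uint8_t>(a) * 256 +
                         static_cast<uint8_t>(b)];
        }
    };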

    Advanced Number System optimizations:
  371. Advanced Number Systems (overview)
  372. Posit number system (PNS)
  373. Residue number system (RNS)
  374. Dyadic numbers
  375. Double-base number system (DBNS)
  376. Dynamic number systems
  377. Hybrid number systems
  378. Tropical algebra (max-plus)
  379. MiniMax algebra
  380. Multi-dimensional logarithmic number system (MDLNS)
  381. Multiple-Base Number System (MBNS)
  382. — Semi-Logarithmic Number System (SLNS)
  383. — Lattice algebra

    Logarithmic Number System optimizations:
  384. Logarithmic number system (LNS) (overview)
  385. End-to-end LNS logarithmic model
  386. LNS addition and subtraction
  387. LNS in AI models
  388. LNS Hardware Acceleration
  389. LNS mathematical and algorithmic theory
  390. LNS algebra
  391. LNS extensions

    Prefill phase optimizations:
  392. Prefill optimizations (overview)
  393. Chunked prefill
  394. Disaggregated prefill scheduling (Phase splitting)
  395. Deep prefill, shallow decoder architecture
  396. Mini-prefill recomputation

    Parallel Programming Optimization Techniques:
  397. Parallelization techniques (overview)
  398. Hardware acceleration
  399. Hardware-software co-design
  400. Vectorization
  401. Pipelining (pipeline parallelism)
  402. Overlapping (new)
  403. Overlapping communications and computation (new)
  404. Overlapping rematerialization (new)
  405. Overlapping memory access & computation (new)
  406. Offloading
  407. Partitioning
  408. Dataflow optimizations
  409. — Sharding
  411. Data parallelism
  412. Query parallelism
  413. Tensor parallelism
  414. Model parallelism
  415. — Prefetching
  416. — Speculative execution
  417. Sequence Parallelism
  418. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  419. Hardware Acceleration (overview)
  420. Software accelerations
  421. Hardware-software co-design
  422. GPU
  423. GPU software platforms
  424. Multi-GPU
  425. CPU Execution
  426. Single Instruction Multiple Data (SIMD)
  427. AVX (AVX/AVX-2/AVX-512)
  428. — ARM NEON
  429. Neural Processing Unit (NPU)
  430. — Overclocking CPU
  431. — Overclocking GPU
  432. Assembly language

    RAG Architecture Optimizations:
  433. RAG architectures (overview)
  434. RAG cache
  435. RAG optimizations
  436. — RAG retriever datastore indexing
  437. Advanced RAG
  438. — Speculative RAG
  439. Reranker in RAG
  440. — Chunk-specific global KV caching
  441. — Chunk-specific prefix KV caching
  442. RAG Knowledge Graph
  443. RAG Ontologies/Taxonomies
  444. RAG fusion
  445. Mini-RAG (single-document RAG)

    Sparsity Optimizations:
  446. Sparsification techniques (overview)
  447. Activation Sparsity
  448. Dynamic Sparsity
  449. Block sparsity
  450. Vector sparsity
  451. Tensor sparsity
  452. Sparse matrix kernels
  453. Outlier-aware sparsification

    Memory Utilization Optimizations:
  454. Memory optimization techniques (overview)
  455. Parameter sharing
  456. Model compression
  457. Low-bit integer quantization
  458. Binary quantization
  459. Ternary quantization
  460. Layer fusion
  461. Recomputation: trading time for space
  462. Memory-bound versus CPU-bound
  463. Data locality optimization
  464. Compute-in-Memory (CIM) architectures (also called PIM)
  465. — Memory cache management algorithms
  466. Kernel operator fusion
  467. — Flash Inference (FlashInfer)
  468. — Checkpointing
  469. Offloading
  470. SSD storage

    Numerical representation subtypes:
  471. Floating-point representations (overview)
  472. Floating Point Bit Tricks
  473. Block floating-point arithmetic
  474. Fixed point number system (FXP) optimizations
  475. Floating point number system (FLP) optimizations
  476. Floating point bitwise arithmetic
  477. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  478. Kernel optimizations (overview)
  479. Kernel operator fusion (merging)
  480. Kernel fission (splitting)
  481. Kernel tiling
  482. — Operator reordering
  483. Graph operator fusion (Deep learning compilers)

    Computation optimizations:
  484. Advanced AI Mathematics
  485. Approximate activation functions
  486. Caching / memoization
  487. Computation reuse
  488. Precomputation
  489. Source code precomputation
  490. Conditional computation
  491. Approximations
  492. Integer-only arithmetic quantization
  493. Weight precomputations
  494. Zero-skipping (see the sketch after this list)
  495. Low-Level Zero Skipping
  496. High-Level Zero Skipping
  497. Negative skipping
  498. Approximate caching
  499. End-to-End integer inference
  500. Padding usage
  501. Incremental inference (new)
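
Zero-skipping (item 494) fits in a few lines: when activations are highly sparse, testing for zero is cheaper than multiplying by it. A minimal C++ sketch; whether the branch wins depends on the sparsity level and branch prediction, which is why low-level variants pack the nonzeros instead:

    #include <vector>

    // Dot product with zero-skipping on the activation vector.
    float dot_zero_skip(const std::vector<float>& weights,
                        const std::vector<float>& activations) {
        float sum = 0.0f;
        for (size_t i = 0; i < weights.size(); ++i) {
            if (activations[i] == 0.0f) continue;  // skip: adds nothing
            sum += weights[i] * activations[i];
        }
        return sum;
    }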

    Arithmetic optimizations:
  502. Integer operations
  503. Addition optimizations
  504. Bitwise operation tricks
  505. Approximate addition
  506. Multiplication algorithms
  507. Approximate division
  508. Approximate multiplication
  509. Bitwise operator inference
  510. Bitserial operations
  511. Division optimizations
  512. Logarithmic approximate multiplication
  513. Integer Dot Product
  514. Vector dot product optimization

    Advanced matrix algebra optimizations:
  515. Matrix Algebra (overview)
  516. Approximate matrix multiplication
  517. Butterfly matrices
  518. Monarch matrices
  519. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  520. Low-rank matrix factorization (overview)
  521. — Tensor decomposition
  522. — Tucker decomposition
  523. Embedding low-rank matrix factorization
  524. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  525. Transformer architectures (overview)
  526. Transformer low-level optimizations (overview)
  527. — Adaptive Inference (dynamic inference)
  528. Integer-only Transformers
  529. Approximate Transformers
  530. Decoder-Only Architectures
  531. Encoder-Only Architectures
  532. Encoder-Decoder Architectures

    Transformers and LLMs:
  533. Open source models
  534. Inference frameworks
  535. Open source frameworks

    Next-Generation Transformer architectures:
  536. Next-generation architectures (overview)
  537. Hybrid Transformer architectures
  538. Newer Transformer architectures
  539. BERT (encoder)
  540. — State Space Models (SSMs)
  541. Mamba
  542. RWKV
  543. Knowledge graph AI architectures
  544. Compound AI architectures
  545. Large Concept Model (LCM)

    General Classes of Optimization Techniques:
  546. Dynamic inference (adaptive inference)
  547. Skipping
  548. Heuristics
  549. Probabilistic optimizations
  550. Approximate computing
  551. Code optimizations
  552. Deep learning compilers
  553. Incremental algorithms
  554. Fuzzy logic
  555. Inference budget (with adaptive inference)

    Loop Optimizations:
  556. Loop optimizations (overview)
  557. Inference loop optimizations
  558. Loop fusion (merging loops)
  559. Loop unrolling (see the sketch after this list)
  560. Loop perforation
  561. Loop reordering
  562. Loop tiling
  563. Loop reversal
  564. Loop fission (splitting a loop)
  565. — Loop interleave
  566. Loop interchange
  567. Loop coalescing
  568. Loop-invariant code motion ("hoisting")
  569. Loop distribution
  570. Pointer arithmetic
  571. Loop peeling (unrolling first iterations)
  572. Loop sentinel
  573. Loop collapsing
  574. Loop normalization
  575. Loop strip mining (Loop sectioning)
  576. Loop skewing
  577. Loop spreading
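
A minimal C++ sketch of loop unrolling (item 559) applied to a dot product, with a peeled scalar loop for the leftover tail; the four independent accumulators also break the floating-point addition dependency chain:

    #include <cstddef>

    // Dot product unrolled by a factor of four.
    float dot_unrolled(const float* a, const float* b, size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {  // main loop: 4 elements per iteration
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) s0 += a[i] * b[i];  // tail (peeled remainder)
        return (s0 + s1) + (s2 + s3);
    }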

    Low-Level Coding Efficiency:
  578. Code optimizations (overview)
  579. Constant folding
  580. Common subexpression elimination
  581. Algebraic identities
  582. Strength reduction
  583. Type consistency
  584. Reciprocal multiplication (see the sketch after this list)
  585. References vs pointers
  586. Compile-time optimizations
  587. Pointer arithmetic
  588. Algorithm-level optimizations
  589. Lazy evaluation
  590. Memory reduction heuristics
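
Strength reduction and reciprocal multiplication (items 582 and 584) combine into one classic idiom: hoist a loop-invariant division out of the loop and replace it with a multiply. A minimal C++ sketch; compilers won't do this by default because it can change floating-point rounding:

    // Divide every element by a loop-invariant divisor: one division
    // outside the loop, multiplications inside.
    void scale_by_divisor(float* x, int n, float divisor) {
        const float reciprocal = 1.0f / divisor;  // hoisted invariant
        for (int i = 0; i < n; ++i)
            x[i] *= reciprocal;  // multiply is much cheaper than divide
    }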

    Data Structures for AI optimization:
  591. Hashing
  592. Perfect hashing
  593. Look-up tables (LUTs)
  594. Bloom filters
  595. — Trees
  596. — Tries
  598. Bitserial operations
  599. Permutation arrays

    Vector Data Structures:
  600. Parallel data structures
  601. Bit vectors
  602. Vector hashing
  603. Locality-Sensitive Hashing (LSH)
  604. Vector dot product caching
  605. — Bit signatures (vector algorithm)
  606. — K-means clustering (vector algorithm)
  607. — Hyper-Cube (vector algorithm)

    Convolution Optimizations in CNNs:
  608. Convolution optimizations (overview)
  609. Grouped convolutions
  610. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  611. Tokenization (overview)
  612. Tokenizer and model inference latency
  613. Semantic tokenization
  614. Tokenization for Machine Vision
  615. Tokenization of non-English languages
  616. Vocabulary optimizations
  617. Vocabulary size
  618. Lexical shortlisting
  619. Vocabulary trimming
  620. Vocabulary expansion
  621. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  622. Deslugging AI engines
  623. Accuracy-degrading optimizations
  624. Accuracy-retaining optimizations
  625. Uncommon inference optimizations
