Aussie AI Blog

500+ LLM Inference Optimization Techniques

  • Updated: March 3rd, 2025
  • by David Spuler, Ph.D.

LLM Inference Optimization

We do a lot of research on inference optimization techniques, so here's a very long list of all the techniques for which we have research papers. There are more than 500 now, but see our earlier blog post if you only want to know about the latest LLM inference techniques.

Update in March 2025: well, now we're into 2025 and this list has outgrown its title. There are over 600 items on the list below, all of which are related to LLM efficiency. The main change in 2025 is that the recent releases of "reasoning models" have spawned a new area of research in optimizing the efficiency of LLM reasoning algorithms such as Chain-of-Thought.

LLM Inference Optimizations List

Here's the list! It's over 600 items and growing!

    Reasoning Efficiency Optimization (REO): it's the latest hot research area in 2025!
  1. Reasoning inference optimization (RIO) (blog article)
  2. Chain-of-Thought (CoT) optimization
  3. CoT token reduction
  4. CoT step skipping
  5. CoT path reduction
  6. CoT early stopping
  7. CoT reasoning decoding
  8. Constrained CoT
  9. Coconut
  10. Concise CoT
  11. Hidden CoT (interim steps in latent space)
  12. CoT prompt sequence optimizations
  13. CoT sparsity
  14. CoT distillation
  15. Long context CoT
  16. Small reasoning models
  17. Reasoning tokens
  18. Adaptive inference time compute
  19. One-step reasoning models (avoiding multi-step long answers like DeepSeek R1's)

    Model compression main subtypes:
  20. Model compression (overview)
  21. Pruning (overview)
  22. Quantization (overview)
  23. Knowledge Distillation (KD)
  24. Parameter sharing (weight sharing)
  25. Low-rank matrices
  26. Small Language Models (SLMs)
  27. Data compression algorithms

    Pruning main types:
  28. Dynamic pruning
  29. Hybrid pruning
  30. Unstructured pruning
  31. Semi-Structured Pruning
  32. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  33. Depthwise structural pruning (overview)
  34. Static layer pruning
  35. Layer pruning
  36. Early exit
  37. Dynamic layer pruning
  38. Layer skipping
  39. Layer approximation
  40. Shallow decoder architecture
  41. Layer reordering
  42. Layer Importance

    Width-wise structured pruning subtypes:
  43. Widthwise structural pruning (overview)
  44. Attention head pruning
  45. Slimmable networks (width pruning)
  46. FFN pruning
  47. Channel pruning
  48. Filter pruning

    Length-wise structured pruning subtypes:
  49. Lengthwise structural pruning (longitudinal/input/end-to-end)
  50. Token pruning (input pruning)
  51. Dynamic token pruning
  52. Prompt compression
  53. Context compression
  54. Token merging
  55. Token skipping
  56. Token dropping
  57. Zero padding removal
  58. Token reduction
  59. Token compression
  60. Input text compression

    Model dimension embedding pruning subtypes:
  61. Embedding-dimension pruning
  62. Embedding pruning
  63. Embedding matrix compression (embedding pruning)
  64. Embedding low-rank matrix factorization
  65. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  66. Multi-dimensional pruning
  67. Dual pruning
  68. Triple pruning
  69. Quadruple pruning
  70. 3D CNN model pruning
  71. Pyramid inference

    Transformer component pruning:
  72. Normalization pruning
  73. Positional embeddings pruning
  74. Softmax pruning
  75. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  76. Unstructured pruning (overview)
  77. Magnitude pruning
  78. Movement pruning
  79. — Gradual pruning

    Quantization theory and major subtypes:
  80. Post-Training Quantization (PTQ)
  81. Quantization-Aware Training (QAT)
  82. Activation Quantization
  83. Outlier-aware quantization
  84. Dequantization

    Integer quantization subtypes:
  85. Integer quantization (overview)
  86. Integer-only arithmetic quantization
  87. Fixed-point quantization (integer)
  88. Low-bit integer quantization (overview)
  89. Binary quantization
  90. Ternary quantization
  91. 2-bit quantization (INT2)
  92. 3-bit quantization (INT3)
  93. 4-bit quantization (INT4)
  94. 5-bit quantization (INT5)
  95. 6-bit quantization (INT6)
  96. 7-bit quantization (INT7)
  97. 8-bit quantization (INT8)
  98. 9-bit quantization (INT9)
  99. 10-bit quantization (INT10)
  100. 11-bit quantization (INT11)
  101. 12-bit quantization (INT12)
  102. 16-bit INT16 quantization
  103. 32-bit INT32 quantization
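
To make the integer quantization idea concrete, here's a minimal sketch of symmetric per-tensor INT8 post-training quantization in NumPy (the function names are ours, purely illustrative; real quantizers add per-channel scales, zero points, and outlier handling):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor PTQ: map float weights onto [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.75], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)   # within one quantization step of w
```

The same round-and-clip recipe generalizes to the other bit widths above by replacing 127 with the relevant integer range.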

    Floating-point quantization subtypes:
  104. Floating-point quantization
  105. FP4 quantization
  106. FP6 quantization
  107. FP8 quantization
  108. FP16 quantization
  109. FP32 quantization

    Other quantization subtypes:
  110. Mixed-precision quantization
  111. Logarithmic power-of-two quantization (bitshift quantization)
  112. Double bitshift power-of-two quantization
  113. Division quantization
  114. Cluster-based quantization (Weight clustering)
  115. Hashing-based weight clustering
  116. Dyadic quantization
  117. Fake quantization
  118. Simulated quantization
  119. Stochastic quantization (probabilistic)
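
Power-of-two (bitshift) quantization deserves a sketch of its own, because it's the trick that turns multiplications into integer shifts. A toy NumPy version (illustrative only; hardware implementations work on fixed-point activations):

```python
import numpy as np

def quantize_pow2(w):
    """Round each weight to the nearest power of two, keeping the sign,
    so multiplying by a weight becomes an integer bitshift."""
    sign = np.sign(w).astype(int)
    exp = np.round(np.log2(np.abs(w) + 1e-30)).astype(int)   # shift counts
    return sign, exp

def pow2_multiply(x_int, sign, exp):
    """Apply the quantized weights to an integer activation via shifts only."""
    left = np.left_shift(x_int, np.maximum(exp, 0))     # exp >= 0: shift left
    right = np.right_shift(x_int, np.maximum(-exp, 0))  # exp < 0: shift right
    return sign * np.where(exp >= 0, left, right)
```

For example, a weight of 0.5 becomes "shift right by 1", and a weight of -4.0 becomes "shift left by 2, negate".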

    Granularity-level quantization subtypes:
  120. Granular quantization (overview)
  121. Layerwise Quantization
  122. Blockwise Quantization
  123. Vector quantization

    Knowledge distillation subtypes:
  124. Knowledge Distillation (overview)
  125. Ensemble Distillation
  126. Unnatural instructions (data sets)
  127. Dataset Distillation
  128. Black Box Distillation
  129. White Box Distillation

    Parameter/weight sharing subtypes:
  130. Parameter/Weight sharing (overview)
  131. Activation sharing
  132. Layer fusion
  133. Clustering (Weights)
  134. Attention head fusion
  135. FFN fusion
  136. KV cache layer fusion (depthwise)
  137. KV cache head fusion (widthwise)

    Activation function optimizations:
  138. Activation function optimizations (overview)
  139. Activation function approximation
  140. Integer-only activation functions
  141. Fused activation functions (kernel fusion)
  142. Fused RELU
  143. Fused GELU
  144. Fused SwiGLU
  145. Activation alternatives/replacements
  146. Activation function pruning/removal (bilinear layers)
  147. Activation function reordering

    Normalization optimization types:
  148. Normalization algorithm optimizations (overview)
  149. Approximate normalization
  150. Norm reordering (pre-norm/post-norm)
  151. Integer-only normalization
  152. Normalization alternatives/replacements
  153. Fused normalization (e.g. "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  154. Softmax optimizations (overview)
  155. Softmax pruning
  156. Approximate Softmax
  157. Softmax alternatives/replacements
  158. Integer-only Softmax
  159. Fused Softmax

    Feed-Forward Network (FFN) optimization types:
  160. FFN optimizations (overview)
  161. FFN pruning
  162. FFN approximation
  163. Fused add-bias
  164. Bias vector pruning
  165. FFN sparsity
  166. FFN alternatives/replacements
  167. Integer-only FFN
  168. FFN fusion (shared parameters)
  169. — Bias optimizations
  170. — FFN matrix merging

    MatMul/GEMM optimization types:
  171. MatMul/GEMM kernel optimizations (overview)
  172. Faster matrix multiplication (e.g. Winograd, Strassen)
  173. Approximate matrix multiplication
  174. Transpose cache
  175. Fused multiply-add (FMA)
  176. Fused transpose
  177. Vector dot product optimization
  178. Sparse MatMul/GEMM
  179. — Tiled MatMul

    Positional Encoding optimizations:
  180. Positional encoding optimization (overview)
  181. RoPE (Rotary Positional Encoding)
  182. Pruning positional encoding (removal/NoPE)
  183. — Positional encoding approximation
  184. — Integer-only positional encoding

    NAS subtypes:
  185. Neural Architecture Search (NAS)
  186. Dynamic NAS
  187. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  188. On-device inference (native phone and PC AI)
  189. AI Phones
  190. AI PCs (desktops/laptops)
  191. Edge device inference (IoT/mobile/PC)
  192. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  193. Decoding algorithms (overview)
  194. Non-autoregressive decoding
  195. Greedy decoding
  196. Top-k decoding
  197. Top-p decoding
  198. Min-P Sampling
  199. Flash decoding
  200. Beam search decoding
  201. Edit decoding
  202. Contrastive decoding
  203. — Approximate top-k algorithms
  204. — Bidirectional decoding
  205. Constrained decoding

    Parallel Decoding algorithms:
  206. Parallel decoding
  207. Blockwise parallel decoding
  208. n-gram parallel decoding
  209. Lookahead decoding
  210. Medusa decoding
  211. Consensus decoding
  212. — Mutually-guided decoding
  213. — Multi-token generation
  214. — Eagle decoding

    Speculative decoding subtypes:
  215. Speculative decoding (overview)
  216. Generalized speculative decoding
  217. Aggressive decoding
  218. Lookup decoding
  219. Retrieval lookup decoding
  220. Prompt lookup decoding
  221. — Multi-query prompt lookup decoding (across entire LLM history)
  222. Self speculative decoding
  223. Tree speculative decoding
  224. Superposed decoding
  225. Hierarchical speculative decoding
  226. Heuristic speculative decoding
  227. Multi-token speculative decoding
  228. Sequential speculative decoding
  229. Eagle speculative decoding
  230. — Redrafting
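
The core control flow of speculative decoding can be sketched with toy next-token functions standing in for the draft and target models (greedy acceptance shown; a real engine verifies all k draft positions in one batched target pass, and stochastic variants accept/reject probabilistically):

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One round of speculative decoding with greedy acceptance:
    the cheap draft model proposes k tokens; the expensive target
    model checks them, keeps the longest agreeing prefix, and emits
    its own correction token at the first mismatch."""
    ctx = list(prefix)
    draft = []
    for _ in range(k):                # cheap sequential drafting
        draft.append(draft_model(ctx + draft))
    accepted = []
    ctx = list(prefix)
    for t in draft:
        want = target_model(ctx)      # toy per-position check; a real
        if want == t:                 # engine scores all k in one pass
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)     # target's correction; stop here
            break
    return accepted
```

When the draft agrees often, each expensive target pass yields several tokens instead of one.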

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  231. PEFT (overview)
  232. LoRA
  233. Multi-LoRA inference
  234. QLoRA (Quantized Low-Rank Adapters)
  235. LoRA inference optimizations (load/unload)
  236. Prompt Tuning (Extended Vocabulary PEFT)
  237. Prefix Tuning

    Ensemble multi-LLM subtypes:
  238. Ensemble inference (overview of multi-model AI engines)
  239. Mixture of Experts (MoE)
  240. Model selection algorithms
  241. Big-little architectures
  242. Cascades
  243. Collaborative inference
  244. Consensus decoding
  245. — Swarm ensemble architectures
  246. — Committee ensemble architectures
  247. — Ensemble averaging
  248. Easy-hard queries
  249. Submodels (Many-Models-in-One)
  250. Distributed Inference

    Orchestration, Deployment and Serving:
  251. Cloud inference servers
  252. Orchestration frameworks
  253. Scheduling optimizations
  254. Serving
  255. Load balancing
  256. Batching
  257. Continuous batching
  258. Deployment
  259. Serverless
  260. Networking optimizations
  261. In-flight batching

    Attention optimization subtypes:
  262. Attention optimizations (overview)
  263. Multi-Head Attention (MHA)
  264. Group Query Attention (GQA)
  265. Multi-Query Attention (MQA)
  266. Sparse attention
  267. Local attention
  268. Memory-efficient attention algorithms
  269. Flash Attention
  270. Paged Attention
  271. Linear attention
  272. Cross attention
  273. Tree attention
  274. Sliding window attention
  275. Approximate attention heads
  276. Attention alternatives/replacements
  277. Fused MHA
  278. Low-rank matrix attention
  279. Medusa attention
  280. Block attention
  282. Fused head attention
  283. Hybrid local-global attention
  284. FFT attention
  285. QKV computation optimizations
  286. Additive attention
  287. Multiplicative attention
  288. Graph attention
  289. Chunked attention
  290. Attention sink
  291. Attention steering
  292. Bilinear attention
  293. Attention-free methods
  294. Mixture-of-Heads (MOH) Attention (MoE+MHA)
  295. Star attention
  296. Ring attention
  297. — Flex attention
  298. — Razor attention
  299. — Contiguous QKV tensor
  300. — Relative Attention Bias (RAB)
  301. Lightning attention
  302. Multi-Head Latent Attention (MLA) (DeepSeek)
  304. — Round attention

    Long context optimizations (attention):
  305. Long context models
  306. Length generalization
  307. Quadratic attention complexity
  308. Long RAG

    Caching optimizations:
  309. Caching (overview)
  310. Inference Cache (text-to-text)
  311. Inference cache (global KV caching)
  312. Prompt caching
  313. Input Similarity-Based Caching (frame skipping in video)
  314. Semantic caching (text-to-text)
  315. Semantic KV caching
  316. Vector database caching
  317. Chatbot caching
  318. Vector Caching (Vector hashing)
  319. Caching vector dot products
  320. Caching general theory

    KV cache optimizations:
  321. KV Caching (overview)
  322. KV cache global (multi-query KV caching)
  323. KV cache reuse
  324. Global semantic KV caching (difficult!)
  325. Context cache (global KV caching)
  326. Prefix KV Caching
  327. KV cache recomputation with early exit
  328. Session KV cache (multi-turn KV caching)
  329. Substring/fused/concatenated KV cache (Lengthwise-fused KV caching)
  330. — Paged KV caching (related to paged attention)
  331. — KV cache offloading (to CPU)
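
The basic KV caching win — compute each token's key and value once, then reuse them at every later decode step — looks like this in a toy single-head NumPy sketch (illustrative shapes and names; real engines batch this across heads and layers):

```python
import numpy as np

def attend_with_cache(x_t, Wq, Wk, Wv, cache):
    """One decode step of toy single-head attention with a KV cache:
    only the new token's key and value are computed; all past keys
    and values are reused from the cache rather than recomputed."""
    q = x_t @ Wq
    cache["K"].append(x_t @ Wk)   # this step's K and V, computed once
    cache["V"].append(x_t @ Wv)
    K = np.stack(cache["K"])      # (t, d): all keys so far
    V = np.stack(cache["V"])
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                  # attention output for the new token

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(3):                # three decode steps share one cache
    y = attend_with_cache(rng.standard_normal(d), Wq, Wk, Wv, cache)
```

The memory-reduction items below all attack the cost of storing those growing K and V tensors.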

    KV cache memory size reduction:
  332. KV cache compression
  333. KV cache quantization
  334. KV cache sparsity
  335. KV cache token pruning
  336. KV cache eviction policies
  337. KV cache layer fusion
  338. KV cache layer pruning
  339. KV Cache low-rank matrix factorization
  340. — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)
  341. — KV cache token merging
  342. — KV head fusion
  343. — KV head pruning
  344. — KV mixed-precision quantization
  345. — KV context compression
  346. — KV block pruning

    Non-Multiplication AI Models:
  347. Zero-Multiplication Models (overview)
  348. Binary quantization
  349. Ternary quantization
  350. 2-bit quantization (INT2)
  351. Adder networks
  352. Bitshift-add networks
  353. Bitshift power-of-2 quantization (logarithmic quantization)
  354. Double bitshift quantization
  355. Add-as-integer networks
  356. Logarithmic Models
  357. Bitwise neural networks
  358. Diff-squared networks
  359. Log-sum-exp (LSE) networks
  360. Max-Plus networks
  361. Min-Max-Plus networks
  362. Morphological networks
  363. Trigonometric approximate inference
  364. Weightless Neural Networks (WNNs)
  365. XNOR networks
  366. Hadamard elementwise matrix multiplication models
  367. Other addition-related zero-multiplication networks
  368. Table lookups replace multiplication
  369. Other multiplication-free neural networks

    Advanced Number System optimizations:
  370. Advanced Number Systems (overview)
  371. Posit number system (PNS)
  372. Residue number system (RNS)
  373. Dyadic numbers
  374. Double-base number system (DBNS)
  375. Dynamic number systems
  376. Hybrid number systems
  377. Tropical algebra (max-plus)
  378. MiniMax algebra
  379. Multi-dimensional logarithmic number system (MDLNS)
  380. Multiple-Base Number System (MBNS)
  381. — Semi-Logarithmic Number System (SLNS)
  382. — Lattice algebra

    Logarithmic Number System optimizations:
  383. Logarithmic number system (LNS) (overview)
  384. End-to-end LNS logarithmic model
  385. LNS addition and subtraction
  386. LNS in AI models
  387. LNS Hardware Acceleration
  388. LNS mathematical and algorithmic theory
  389. LNS algebra
  390. LNS extensions
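
The appeal of LNS is that multiplication becomes plain addition of exponents, while addition becomes the expensive operation (the Gaussian-logarithm correction). A tiny illustrative sketch in Python floats:

```python
import math

def lns_encode(x):
    """Store a positive value as its base-2 logarithm."""
    return math.log2(x)

def lns_decode(lx):
    return 2.0 ** lx

def lns_multiply(lx, ly):
    """Multiplication in LNS is just addition: log2(x*y) = log2(x) + log2(y)."""
    return lx + ly

def lns_add(lx, ly):
    """Addition is the hard case, via the Gaussian logarithm:
    log2(x + y) = lx + log2(1 + 2**(ly - lx)), with lx the larger."""
    if ly > lx:
        lx, ly = ly, lx
    return lx + math.log2(1.0 + 2.0 ** (ly - lx))
```

Hardware LNS designs approximate that correction term with lookup tables or piecewise-linear fits.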

    Prefill phase optimizations:
  391. Prefill optimizations (overview)
  392. Chunked prefill
  393. Disaggregated prefill scheduling (Phase splitting)
  394. Deep prefill, shallow decoder architecture
  395. Mini-prefill recomputation

    Parallel Programming Optimization Techniques:
  396. Parallelization techniques (overview)
  397. Hardware acceleration
  398. Hardware-software co-design
  399. Vectorization
  400. Pipelining (pipeline parallelism)
  401. Overlapping (new)
  402. Overlapping communications and computation (new)
  403. Overlapping rematerialization (new)
  404. Overlapping memory access & computation (new)
  405. Offloading
  406. Partitioning
  407. Dataflow optimizations
  408. — Sharding
  409. — Overlapping
  410. Data parallelism
  411. Query parallelism
  412. Tensor parallelism
  413. Model parallelism
  414. — Prefetching
  415. — Speculative execution
  416. Sequence Parallelism
  417. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  418. Hardware Acceleration (overview)
  419. Software accelerations
  420. Hardware-software co-design
  421. GPU
  422. GPU software platforms
  423. Multi-GPU
  424. CPU Execution
  425. Single Instruction Multiple Data (SIMD)
  426. AVX (AVX/AVX-2/AVX-512)
  427. — ARM NEON
  428. Neural Processing Unit (NPU)
  429. — Overclocking CPU
  430. — Overclocking GPU
  431. Assembly language

    RAG Architecture Optimizations:
  432. RAG architectures (overview)
  433. RAG cache
  434. RAG optimizations
  435. — RAG retriever datastore indexing
  436. Advanced RAG
  437. — Speculative RAG
  438. Reranker in RAG
  439. — Chunk-specific global KV caching
  440. — Chunk-specific prefix KV caching
  441. RAG Knowledge Graph
  442. RAG Ontologies/Taxonomies
  443. RAG fusion
  444. Mini-RAG (single-document RAG)

    Sparsity Optimizations:
  445. Sparsification techniques (overview)
  446. Activation Sparsity
  447. Dynamic Sparsity
  448. Block sparsity
  449. Vector sparsity
  450. Tensor sparsity
  451. Sparse matrix kernels
  452. Outlier-aware sparsification

    Memory Utilization Optimizations:
  453. Memory optimization techniques (overview)
  454. Parameter sharing
  455. Model compression
  456. Low-bit integer quantization
  457. Binary quantization
  458. Ternary quantization
  459. Layer fusion
  460. Recomputation: trading time for space
  461. Memory-bound versus CPU-bound
  462. Data locality optimization
  463. Compute-in-Memory (CIM) architectures (also called PIM)
  464. — Memory cache management algorithms
  465. Kernel operator fusion
  466. — Flash Inference (FlashInfer)
  467. — Checkpointing
  468. Offloading
  469. SSD storage

    Numerical representation subtypes:
  470. Floating-point representations (overview)
  471. Floating Point Bit Tricks
  472. Block floating-point arithmetic
  473. Fixed point number system (FXP) optimizations
  474. Floating point number system (FLP) optimizations
  475. Floating point bitwise arithmetic
  476. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  477. Kernel optimizations (overview)
  478. Kernel operator fusion (merging)
  479. Kernel fission (splitting)
  480. Kernel tiling
  481. — Operator reordering
  482. Graph operator fusion (Deep learning compilers)

    Computation optimizations:
  483. Advanced AI Mathematics
  484. Approximate activation functions
  485. Caching / memoization
  486. Computation reuse
  487. Precomputation
  488. Source code precomputation
  489. Conditional computation
  490. Approximations
  491. Integer-only arithmetic quantization
  492. Weight precomputations
  493. Zero-skipping
  494. Low-Level Zero Skipping
  495. High-Level Zero Skipping
  496. Negative skipping
  497. Approximate caching
  498. End-to-End integer inference
  499. Padding usage
  500. Incremental inference (new)
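
Zero-skipping is one of the simplest of these to illustrate: test for a zero weight before paying for the multiply-add. A toy scalar sketch (real kernels get the same effect from sparse storage formats rather than a per-element branch):

```python
def sparse_dot(weights, activations):
    """Zero-skipping dot product: skip the multiply-add entirely
    whenever the weight is exactly zero. Pays off only when the
    weight vector is highly sparse."""
    total = 0.0
    for w, a in zip(weights, activations):
        if w != 0.0:              # high-level zero skip
            total += w * a
    return total
```
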

    Arithmetic optimizations:
  501. Integer operations
  502. Addition optimizations
  503. Bitwise operation tricks
  504. Approximate addition
  505. Multiplication algorithms
  506. Approximate division
  507. Approximate multiplication
  508. Bitwise operator inference
  509. Bitserial operations
  510. Division optimizations
  511. Logarithmic approximate multiplication
  512. Integer Dot Product
  513. Vector dot product optimization

    Advanced matrix algebra optimizations:
  514. Matrix Algebra (overview)
  515. Approximate matrix multiplication
  516. Butterfly matrices
  517. Monarch matrices
  518. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  519. Low-rank matrix factorization (overview)
  520. — Tensor decomposition
  521. — Tucker decomposition
  522. Embedding low-rank matrix factorization
  523. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  524. Transformer architectures (overview)
  525. Transformer low-level optimizations (overview)
  526. — Adaptive Inference (dynamic inference)
  527. Integer-only Transformers
  528. Approximate Transformers
  529. Decoder-Only Architectures
  530. Encoder-Only Architectures
  531. Encoder-Decoder Architectures

    Transformers and LLMs:
  532. Open source models
  533. Inference frameworks
  534. Open source frameworks

    Next-Generation Transformer architectures:
  535. Next-generation architectures (overview)
  536. Hybrid Transformer architectures
  537. Newer Transformer architectures
  538. BERT (encoder)
  539. — State Space Models (SSMs)
  540. Mamba
  541. RWKV
  542. Knowledge graph AI architectures
  543. Compound AI architectures
  544. Large Concept Model (LCM)

    General Classes of Optimization Techniques:
  545. Dynamic inference (adaptive inference)
  546. Skipping
  547. Heuristics
  548. Probabilistic optimizations
  549. Approximate computing
  550. Code optimizations
  551. Deep learning compilers
  552. Incremental algorithms
  553. Fuzzy logic
  554. Inference budget (with adaptive inference)

    Loop Optimizations:
  555. Loop optimizations (overview)
  556. Inference loop optimizations
  557. Loop fusion (merging loops)
  558. Loop unrolling
  559. Loop perforation
  560. Loop reordering
  561. Loop tiling
  562. Loop reversal
  563. Loop fission (splitting a loop)
  564. — Loop interleave
  565. Loop interchange
  566. Loop coalescing
  567. Loop-invariant code motion ("hoisting")
  568. Loop distribution
  569. Pointer arithmetic
  570. Loop peeling (unrolling first iterations)
  571. Loop splitting / loop sentinel
  572. Loop collapsing
  573. Loop normalization
  574. Loop strip mining (Loop sectioning)
  575. Loop skewing
  576. Loop spreading
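
Two of these loop transformations sketched in Python purely for readability (in practice they matter in low-level C/C++ kernels, where you, or the compiler, apply them to hot inner loops):

```python
def scale_hoisted(xs, a, b):
    """Loop-invariant code motion: a / b never changes inside the loop,
    so hoist it out and divide once instead of len(xs) times."""
    factor = a / b                # hoisted invariant
    return [x * factor for x in xs]

def dot_unrolled(xs, ys):
    """Loop unrolling by 4: one loop test per four multiply-adds,
    with a scalar cleanup loop for the leftover elements."""
    total, i, n = 0.0, 0, len(xs)
    while i + 4 <= n:
        total += (xs[i] * ys[i] + xs[i + 1] * ys[i + 1]
                  + xs[i + 2] * ys[i + 2] + xs[i + 3] * ys[i + 3])
        i += 4
    for j in range(i, n):         # cleanup loop for n % 4 leftovers
        total += xs[j] * ys[j]
    return total
```
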

    Low-Level Coding Efficiency:
  577. Code optimizations (overview)
  578. Constant folding
  579. Common subexpression elimination
  580. Algebraic identities
  581. Strength reduction
  582. Type consistency
  583. Reciprocal multiplication
  584. References vs pointers
  585. Compile-time optimizations
  586. Pointer arithmetic
  587. Algorithm-level optimizations
  588. Lazy evaluation
  589. Memory reduction heuristics

    Data Structures for AI optimization:
  590. Hashing
  591. Perfect hashing
  592. Look-up tables (LUTs)
  593. Bloom filters
  594. — Trees
  595. — Tries
  597. Bitserial operations
  598. Permutation arrays

    Vector Data Structures:
  599. Parallel data structures
  600. Bit vectors
  601. Vector hashing
  602. Locality-Sensitive Hashing (LSH)
  603. Vector dot product caching
  604. — Bit signatures (vector algorithm)
  605. — K-means clustering (vector algorithm)
  606. — Hyper-Cube (vector algorithm)
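
Locality-Sensitive Hashing is easy to sketch with random hyperplanes: each signature bit records which side of a hyperplane the vector lies on, so vectors at a small angle share most bits (illustrative code, not a production index):

```python
import numpy as np

def lsh_signature(v, planes):
    """Random-hyperplane LSH: bit i records the side of hyperplane i.
    Signatures depend only on direction, so they are scale-invariant."""
    return tuple(bool(b) for b in (planes @ v) >= 0)

rng = np.random.default_rng(42)
planes = rng.standard_normal((16, 4))   # 16-bit signatures over 4-d vectors
```

Bucketing vectors by signature turns approximate nearest-neighbor search into a hash lookup plus a small rescoring pass.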

    Convolution Optimizations in CNNs:
  607. Convolution optimizations (overview)
  608. Grouped convolutions
  609. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  610. Tokenization (overview)
  611. Tokenizer and model inference latency
  612. Semantic tokenization
  613. Tokenization for Machine Vision
  614. Tokenization of non-English languages
  615. Vocabulary optimizations
  616. Vocabulary size
  617. Lexical shortlisting
  618. Vocabulary trimming
  619. Vocabulary expansion
  620. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  621. Deslugging AI engines
  622. Accuracy-degrading optimizations
  623. Accuracy-retaining optimizations
  624. Uncommon inference optimizations
