Aussie AI Blog

500+ LLM Inference Optimization Techniques

  • Updated: Feb 6th, 2026
  • by David Spuler, Ph.D.

LLM Inference Optimization

We do a lot of research on inference optimization, so here's a very long list of all the techniques for which we have research papers. There are more than 500 (600+ now!), but see the blog post links below if you only want to know about the latest LLM inference techniques.

Update in Feb 2026: As we head into 2026, some of the more recent areas of attention include:

Research areas that remain as hot as always include:

Areas of new research relevance include:

And no doubt much more to come in 2026!

Update in March 2025: well, now we're into 2025 and this list has outgrown its title. There are over 600 items on the list below, all of which are related to LLM efficiency. The main change in 2025 is that the recent release of "reasoning models" has spawned a new area of research in optimizing the efficiency of LLM reasoning algorithms such as Chain-of-Thought.

Free AI C++ books: for more about LLM optimization, read books online or download a PDF:

Popular articles: additional research articles on faster LLM inference:

More lists: lots of general efficiency optimization information:

LLM Inference Optimizations List

Here's the list! It's over 600 items and growing!

    Reasoning Efficiency Optimization (REO): it's the latest hot research area in 2025!
  1. Reasoning inference optimization (RIO) (blog article)
  2. Chain-of-Thought (CoT) optimization
  3. CoT token reduction
  4. CoT step skipping
  5. CoT path reduction
  6. CoT early stopping
  7. CoT reasoning decoding
  8. Constrained CoT
  9. Coconut
  10. Concise CoT
  11. Hidden CoT (interim steps in latent space)
  12. CoT prompt sequence optimizations
  13. CoT sparsity
  14. CoT distillation
  15. Long context CoT
  16. Small reasoning models
  17. Reasoning tokens
  18. Adaptive inference time compute
  19. One-step reasoning models (e.g. DeepSeek R1's long answers)

    Inference Modes and Token API optimizations:
  20. "Fast mode" inference (e.g. from OpenAI or Anthropic)
  21. Cached tokens
  22. Batched tokens
  23. Low batch size inference
  24. Priority batching
  25. API model routing features

    Model compression main subtypes:
  26. Model compression (overview)
  27. Pruning (overview)
  28. Quantization (overview)
  29. Knowledge Distillation (KD)
  30. Parameter sharing (weight sharing)
  31. Low-rank matrices
  32. Small Language Models (SLMs)
  33. Data compression algorithms

    Pruning main types:
  34. Dynamic pruning
  35. Hybrid pruning
  36. Unstructured pruning
  37. Semi-Structured Pruning
  38. Structured pruning

    Layerwise structured pruning subtypes (depth dimension):
  39. Depthwise structural pruning (overview)
  40. Static layer pruning
  41. Layer pruning
  42. Early exit
  43. Dynamic layer pruning
  44. Layer skipping
  45. Layer approximation
  46. Shallow decoder architecture
  47. Layer reordering
  48. Layer Importance
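
Early exit is one of the simplest dynamic layer-pruning ideas to sketch in code. Below is a minimal toy version (all names illustrative, not from any framework): after each layer, a lightweight classifier estimates confidence, and once confidence passes a threshold the remaining layers are skipped.

```python
def forward_with_early_exit(x, layers, classify, threshold=0.9):
    """Run layers in order, exiting early when `classify` is confident."""
    for layer in layers:
        x = layer(x)
        if max(classify(x)) >= threshold:
            break  # confident enough: skip the rest of the stack
    return x

# Toy usage: five increment "layers", confidence grows with the value,
# so only two of the five layers actually run.
layers = [lambda v: v + 1 for _ in range(5)]
result = forward_with_early_exit(7, layers, lambda v: [v / 10.0])
```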

    Width-wise structured pruning subtypes:
  49. Widthwise structural pruning (overview)
  50. Attention head pruning
  51. Slimmable networks (width pruning)
  52. FFN pruning
  53. Channel pruning
  54. Filter pruning

    Length-wise structured pruning subtypes:
  55. Lengthwise structural pruning (longitudinal/input/end-to-end)
  56. Token pruning (input pruning)
  57. Dynamic token pruning
  58. Prompt compression
  59. Context compression
  60. Token merging
  61. Token skipping
  62. Token dropping
  63. Zero padding removal
  64. Token reduction
  65. Token compression
  66. Input text compression
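
To make the lengthwise idea concrete, here is a deliberately naive prompt-compression sketch. Real systems (e.g. LLMLingua-style methods) score token importance with a small model; this toy version just drops stop words, and the STOP_WORDS set is illustrative.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}  # illustrative

def compress_prompt(tokens):
    """Drop low-information tokens to shorten the prompt."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

Fewer input tokens means less prefill computation, at some risk to answer quality.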

    Model dimension embedding pruning subtypes:
  67. Embedding-dimension pruning
  68. Embedding pruning
  69. Embedding matrix compression (embedding pruning)
  70. Embedding low-rank matrix factorization
  71. Unembedding matrix (output embeddings)

    Hybrid multi-dimensional pruning:
  72. Multi-dimensional pruning
  73. Dual pruning
  74. Triple pruning
  75. Quadruple pruning
  76. 3D CNN model pruning
  77. Pyramid inference

    Transformer component pruning:
  78. Normalization pruning
  79. Positional embeddings pruning
  80. Softmax pruning
  81. Skip connection pruning (residual connection removal)

    Unstructured pruning subtypes:
  82. Unstructured pruning (overview)
  83. Magnitude pruning
  84. Movement pruning
  85. — Gradual pruning
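
As a concrete sketch of magnitude pruning (the function name is illustrative, not from any library): zero out the fraction of weights with the smallest absolute values.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest |w|."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    # Note: ties at the threshold may prune slightly more than k weights.
    return [0.0 if abs(w) <= threshold else w for w in weights]
```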

    Quantization theory and major subtypes:
  86. Post-Training Quantization (PTQ)
  87. Quantization-Aware Training (QAT)
  88. Activation Quantization
  89. Outlier-aware quantization
  90. Dequantization

    Integer quantization subtypes:
  91. Integer quantization (overview)
  92. Integer-only arithmetic quantization
  93. Fixed-point quantization (integer)
  94. Low-bit integer quantization (overview)
  95. Binary quantization
  96. Ternary quantization
  97. 2-bit quantization (INT2)
  98. 3-bit quantization (INT3)
  99. 4-bit quantization (INT4)
  100. 5-bit quantization (INT5)
  101. 6-bit quantization (INT6)
  102. 7-bit quantization (INT7)
  103. 8-bit quantization (INT8)
  104. 9-bit quantization (INT9)
  105. 10-bit quantization (INT10)
  106. 11-bit quantization (INT11)
  107. 12-bit quantization (INT12)
  108. 16-bit INT16 quantization
  109. 32-bit INT32 quantization
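
A minimal sketch of symmetric INT8 post-training quantization, using a single per-tensor scale (real engines add per-channel scales, zero points, and calibration):

```python
def quantize_int8(values):
    """Map floats to [-127, 127] integers with one symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [qi * scale for qi in q]
```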

    Floating-point quantization subtypes:
  110. Floating-point quantization
  111. FP4 quantization
  112. FP6 quantization
  113. FP8 quantization
  114. FP16 quantization
  115. FP32 quantization

    Other quantization subtypes:
  116. Mixed-precision quantization
  117. Logarithmic power-of-two quantization (bitshift quantization)
  118. Double bitshift power-of-two quantization
  119. Division quantization
  120. Cluster-based quantization (Weight clustering)
  121. Hashing-based weight clustering
  122. Dyadic quantization
  123. Fake quantization
  124. Simulated quantization
  125. Stochastic quantization (probabilistic)
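
A sketch of logarithmic power-of-two ("bitshift") quantization: each weight becomes a signed power of two, so multiplying an integer activation reduces to a shift. Zero weights are not handled here (a real scheme reserves a code for them), and the function names are illustrative.

```python
import math

def pow2_quantize(w):
    """Return (sign, exponent) with w ~= sign * 2**exponent, for w != 0."""
    sign = 1 if w > 0 else -1
    return sign, round(math.log2(abs(w)))

def pow2_multiply(x_int, sign, exponent):
    """Integer multiply by the quantized weight using only shifts."""
    shifted = x_int << exponent if exponent >= 0 else x_int >> -exponent
    return sign * shifted
```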

    Granularity-level quantization subtypes:
  126. Granular quantization (overview)
  127. Layerwise Quantization
  128. Blockwise Quantization
  129. Vector quantization

    Knowledge distillation subtypes:
  130. Knowledge Distillation (overview)
  131. Ensemble Distillation
  132. Unnatural instructions (data sets)
  133. Dataset Distillation
  134. Black Box Distillation
  135. White Box Distillation

    Parameter/weight sharing subtypes:
  136. Parameter/Weight sharing (overview)
  137. Activation sharing
  138. Layer fusion
  139. Clustering (Weights)
  140. Attention head fusion
  141. FFN fusion (sharing parameters)
  142. KV cache layer fusion (depthwise)
  143. KV cache head fusion (widthwise)

    Activation function optimizations:
  144. Activation function optimizations (overview)
  145. Activation function approximation
  146. Integer-only activation functions
  147. Fused activation functions (kernel fusion)
  148. Fused RELU
  149. Fused GELU
  150. Fused SwiGLU
  151. Activation alternatives/replacements
  152. Activation function pruning/removal (bilinear layers)
  153. Activation function reordering
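
The kernel-fusion idea in miniature: compute the linear projection, bias add, and ReLU in one loop, with no intermediate buffers between the steps. In a real engine this would be a single GPU kernel; this Python sketch just shows the structure.

```python
def linear_bias_relu_fused(x, w, b):
    """Matrix-vector product, add-bias, and ReLU fused into one pass."""
    out = []
    for row, bias in zip(w, b):
        acc = sum(wi * xi for wi, xi in zip(row, x)) + bias  # fused add-bias
        out.append(acc if acc > 0.0 else 0.0)                # fused ReLU
    return out
```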

    Normalization optimization types:
  154. Normalization algorithm optimizations (overview)
  155. Approximate normalization
  156. Norm reordering (pre-norm/post-norm)
  157. Integer-only normalization
  158. Normalization alternatives/replacements
  159. Fused normalization (e.g. "fused LayerNorm" in kernel fusion)

    Softmax optimization types:
  160. Softmax optimizations (overview)
  161. Softmax pruning
  162. Approximate Softmax
  163. Softmax alternatives/replacements
  164. Integer-only Softmax
  165. Fused Softmax
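
One well-known Softmax optimization is the "online" formulation: track the running max and a rescaled running sum together, merging the usual separate max pass and sum pass over the input. This is the trick underlying Flash Attention; the sketch below is a minimal version.

```python
import math

def online_softmax(xs):
    """Numerically stable softmax with a single accumulation pass."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = m if m > x else x
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```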

    Feed-Forward Network (FFN) optimization types:
  166. FFN optimizations (overview)
  167. FFN pruning
  168. FFN approximation
  169. Fused add-bias
  170. Bias vector pruning
  171. FFN sparsity
  172. FFN alternatives/replacements
  173. Integer-only FFN
  174. FFN fusion (shared parameters)
  175. Inter-FFN fusion (merging two FFNs)
  176. Intra-FFN fusion (merging two linear projections in one FFN)
  177. — Bias optimizations
  178. — FFN matrix merging (similar to "intra-FFN fusion")

    MatMul/GEMM optimization types:
  179. MatMul/GEMM kernel optimizations (overview)
  180. Faster matrix multiplication (e.g. Winograd, Strassen)
  181. Approximate matrix multiplication
  182. Transpose cache
  183. Fused multiply-add (FMA)
  184. Fused transpose
  185. Vector dot product optimization
  186. Sparse MatMul/GEMM
  187. — Tiled MatMul
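
A sketch of tiled MatMul: iterate over fixed-size blocks so the working set stays cache-resident. Pure Python for clarity; the payoff comes in C/CUDA, where tiling is a standard GEMM technique.

```python
def matmul_tiled(a, b, n, tile=2):
    """n x n matrix multiply, processed in `tile`-sized blocks."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i][k]  # hoisted: invariant in the j loop
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```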

    Positional Encoding optimizations:
  188. Positional encoding optimization (overview)
  189. RoPE (Rotary Positional Encoding)
  190. Pruning positional encoding (removal/NoPE)
  191. — Positional encoding approximation
  192. — Integer-only positional encoding

    NAS subtypes:
  193. Neural Architecture Search (NAS)
  194. Dynamic NAS
  195. Embedding Size Optimization (embeddings NAS)

    Platform-specific optimization subtypes:
  196. On-device inference (native phone and PC AI)
  197. AI Phones
  198. AI PCs (desktops/laptops)
  199. Edge device inference (IoT/mobile/PC)
  200. Hybrid cloud-on-device inference

    Decoding algorithm subtypes:
  201. Decoding algorithms (overview)
  202. Non-autoregressive decoding
  203. Greedy decoding
  204. Top-k decoding
  205. Top-p decoding
  206. Min-P Sampling
  207. Flash decoding
  208. Beam search decoding
  209. Edit decoding
  210. Contrastive decoding
  211. — Approximate top-k algorithms
  212. — Bidirectional decoding
  213. Constrained decoding
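
A sketch of top-k decoding: keep only the k highest logits, softmax over just those, and sample. With k=1 it degenerates to greedy decoding. The function name and seeded RNG are illustrative choices.

```python
import math
import random

def top_k_sample(logits, k, rng=None):
    """Sample a token index from the k highest logits."""
    rng = rng or random.Random(0)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mx = logits[top[0]]
    weights = [math.exp(logits[i] - mx) for i in top]  # stable softmax
    r = rng.random() * sum(weights)
    for idx, w in zip(top, weights):
        r -= w
        if r <= 0.0:
            return idx
    return top[-1]
```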

    Parallel Decoding algorithms:
  214. Parallel decoding
  215. Blockwise parallel decoding
  216. n-gram parallel decoding
  217. Lookahead decoding
  218. Medusa decoding
  219. Consensus decoding
  220. — Mutually-guided decoding
  221. — Multi-token generation
  222. — Eagle decoding

    Speculative decoding subtypes:
  223. Speculative decoding (overview)
  224. Generalized speculative decoding
  225. Aggressive decoding
  226. Lookup decoding
  227. Retrieval lookup decoding
  228. Prompt lookup decoding
  229. — Multi-query prompt lookup decoding (across entire LLM history)
  230. Self speculative decoding
  231. Tree speculative decoding
  232. Superposed decoding
  233. Hierarchical speculative decoding
  234. Heuristic speculative decoding
  235. Multi-token speculative decoding
  236. Sequential speculative decoding
  237. Eagle speculative decoding
  238. — Redrafting
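
The draft-then-verify loop at the heart of speculative decoding can be sketched with toy models. Here `draft_next` and `target_next` are illustrative stand-ins for greedy next-token calls on a small and a large model; a real implementation verifies all k draft tokens in one batched target pass rather than one call per token.

```python
def speculative_decode(draft_next, target_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding: accept the draft's tokens while the
    target model agrees; on the first mismatch, take the target's token."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        guess = []
        for _ in range(k):                    # draft k tokens cheaply
            guess.append(draft_next(out + guess))
        for g in guess:                       # verify against the target
            if target_next(out) == g:
                out.append(g)                 # accepted draft token
            else:
                out.append(target_next(out))  # rejected: use target's token
                break
    return out[len(prompt):len(prompt) + n_tokens]
```

Whatever the draft model does, the output matches the target model's greedy decoding; a good draft just reaches it in fewer target passes.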

    Parameter Efficient Fine-Tuning (PEFT) subtypes:
  239. PEFT (overview)
  240. LoRA
  241. Multi-LoRA inference
  242. QLoRA (Quantized Low-Rank Adapters)
  243. LoRA inference optimizations (load/unload)
  244. Prompt Tuning (Extended Vocabulary PEFT)
  245. Prefix Tuning

    Ensemble multi-LLM subtypes:
  246. Ensemble inference (overview of multi-model AI engines)
  247. Mixture of Experts (MoE)
  248. Model selection algorithms
  249. Big-little architectures
  250. Cascades
  251. Collaborative inference
  252. Consensus decoding
  253. — Swarm ensemble architectures
  254. — Committee ensemble architectures
  255. — Ensemble averaging
  256. Easy-hard queries
  257. Submodels (Many-Models-in-One)
  258. Distributed Inference

    Orchestration, Deployment and Serving:
  259. Cloud inference servers
  260. Orchestration frameworks
  261. Scheduling optimizations
  262. Serving
  263. Load balancing
  264. Batching
  265. Continuous batching
  266. Deployment
  267. Serverless
  268. Networking optimizations
  269. In-flight batching

    Attention optimization subtypes:
  270. Attention optimizations (overview)
  271. Multi-Head Attention (MHA)
  272. Group Query Attention (GQA)
  273. Multi-Query Attention (MQA)
  274. Sparse attention
  275. Local attention
  276. Memory-efficient attention algorithms
  277. Flash Attention
  278. Paged Attention
  279. Linear attention
  280. Cross attention
  281. Tree attention
  282. Sliding window attention
  283. Approximate attention heads
  284. Attention alternatives/replacements
  285. Fused MHA
  286. Low-rank matrix attention
  287. Medusa attention
  288. Block attention
  290. Fused head attention
  291. Hybrid local-global attention
  292. FFT attention
  293. QKV computation optimizations
  294. Additive attention
  295. Multiplicative attention
  296. Graph attention
  297. Chunked attention
  298. Attention sink
  299. Attention steering
  300. Bilinear attention
  301. Attention-free methods
  302. Mixture-of-Heads (MOH) Attention (MoE+MHA)
  303. Star attention
  304. Ring attention
  305. — Flex attention
  306. — Razor attention
  307. — Contiguous QKV tensor
  308. — Relative Attention Bias (RAB)
  309. Lightning attention
  310. Multihead Latent Attention (MLA) (DeepSeek)
  312. — Round attention
  313. Delta attention

    Long context optimizations (attention):
  314. Long context models
  315. Length generalization
  316. Quadratic attention complexity
  317. Long RAG

    Caching optimizations:
  318. Caching (overview)
  319. Inference Cache (text-to-text)
  320. Inference cache (global KV caching)
  321. Prompt caching
  322. Input Similarity-Based Caching (frame skipping in video)
  323. Semantic caching (text-to-text)
  324. Semantic KV caching
  325. Vector database caching
  326. Chatbot caching
  327. Vector Caching (Vector hashing)
  328. Caching vector dot products
  329. Caching general theory

    KV cache optimizations:
  330. KV Caching (overview)
  331. KV cache global (multi-query KV caching)
  332. KV cache reuse
  333. Global semantic KV caching (difficult!)
  334. Context cache (global KV caching)
  335. Prefix KV Caching
  336. KV cache recomputation with early exit
  337. Session KV cache (multi-turn KV caching)
  338. Substring/fused/concatenated KV cache (Lengthwise-fused KV caching)
  339. — Paged KV caching (related to paged attention)
  340. — KV cache offloading (to CPU)

    KV cache memory size reduction:
  341. KV cache compression
  342. KV cache quantization
  343. KV cache sparsity
  344. KV cache token pruning
  345. — Salient token-based KV cache token pruning
  346. KV cache eviction policies
  347. KV cache layer fusion
  348. KV cache layer pruning
  349. KV Cache low-rank matrix factorization
  350. — Cyclic KV cache (Rolling buffer KV cache or circular KV cache)
  351. — KV cache token merging
  352. — KV head fusion
  353. — KV head pruning
  354. — KV mixed-precision quantization
  355. — KV context compression
  356. — KV block pruning
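
The cyclic (rolling buffer) KV cache is easy to sketch: a fixed-size circular buffer keeps only the most recent `capacity` tokens' key/value entries, as in sliding-window attention. The class and method names here are illustrative, and entries are opaque objects.

```python
class RollingKVCache:
    """Fixed-size circular buffer of the most recent KV entries."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.count = 0  # total tokens appended so far

    def append(self, kv):
        self.buf[self.count % self.capacity] = kv  # overwrite oldest slot
        self.count += 1

    def window(self):
        """Cached entries, oldest to newest."""
        n = min(self.count, self.capacity)
        return [self.buf[i % self.capacity]
                for i in range(self.count - n, self.count)]
```

Memory stays constant no matter how long the sequence grows, at the cost of forgetting older context.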

    Non-Multiplication AI Models:
  357. Zero-Multiplication Models (overview)
  358. Binary quantization
  359. Ternary quantization
  360. 2-bit quantization (INT2)
  361. Adder networks
  362. Bitshift-add networks
  363. Bitshift power-of-2 quantization (logarithmic quantization)
  364. Double bitshift quantization
  365. Add-as-integer networks
  366. Logarithmic Models
  367. Bitwise neural networks
  368. Diff-squared networks
  369. Log-sum-exp (LSE) networks
  370. Max-Plus networks
  371. Min-Max-Plus networks
  372. Morphological networks
  373. Trigonometric approximate inference
  374. Weightless Neural Networks (WNNs)
  375. XNOR networks
  376. Hadamard elementwise matrix multiplication models
  377. Other addition-related zero-multiplication networks
  378. Table lookups replace multiplication
  379. Other multiplication-free neural networks

    Advanced Number System optimizations:
  380. Advanced Number Systems (overview)
  381. Posit number system (PNS)
  382. Residue number system (RNS)
  383. Dyadic numbers
  384. Double-base number system (DBNS)
  385. Dynamic number systems
  386. Hybrid number systems
  387. Tropical algebra (max-plus)
  388. MiniMax algebra
  389. Multi-dimensional logarithmic number system (MDLNS)
  390. Multiple-Base Number System (MBNS)
  391. — Semi-Logarithmic Number System (SLNS)
  392. — Lattice algebra

    Logarithmic Number System optimizations:
  393. Logarithmic number system (LNS) (overview)
  394. End-to-end LNS logarithmic model
  395. LNS addition and subtraction
  396. LNS in AI models
  397. LNS Hardware Acceleration
  398. LNS mathematical and algorithmic theory
  399. LNS algebra
  400. LNS extensions
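
A minimal LNS sketch for positive values: store log2(x), so multiplication becomes addition and division becomes subtraction. Addition is the hard case, and hardware approximates the Gaussian-log term below with lookup tables; sign handling and zero are omitted here for brevity.

```python
import math

def to_lns(x):
    return math.log2(x)        # x > 0 assumed

def from_lns(lx):
    return 2.0 ** lx

def lns_mul(lx, ly):
    return lx + ly             # multiplication is just addition

def lns_add(lx, ly):
    """Addition in LNS needs the Gaussian-log correction term."""
    hi, lo = (lx, ly) if lx >= ly else (ly, lx)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```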

    Prefill phase optimizations:
  401. Prefill optimizations (overview)
  402. Chunked prefill
  403. Disaggregated prefill scheduling (Phase splitting)
  404. Deep prefill, shallow decoder architecture
  405. Mini-prefill recomputation

    Parallel Programming Optimization Techniques:
  406. Parallelization techniques (overview)
  407. Hardware acceleration
  408. Hardware-software co-design
  409. Vectorization
  410. Pipelining (pipeline parallelism)
  411. Overlapping (new)
  412. Overlapping communications and computation (new)
  413. Overlapping rematerialization (new)
  414. Overlapping memory access & computation (new)
  415. Offloading
  416. Partitioning
  417. Dataflow optimizations
  418. — Sharding
  419. — Overlapping
  420. Data parallelism
  421. Query parallelism
  422. Tensor parallelism
  423. Model parallelism
  424. — Prefetching
  425. — Speculative execution
  426. Sequence Parallelism
  427. Skeleton-of-Thought (Query Parallelism)

    Hardware Optimizations:
  428. Hardware Acceleration (overview)
  429. Software accelerations
  430. Hardware-software co-design
  431. GPU
  432. GPU software platforms
  433. Multi-GPU
  434. CPU Execution
  435. Single Instruction Multiple Data (SIMD)
  436. AVX (AVX/AVX-2/AVX-512)
  437. — ARM NEON
  438. Neural Processing Unit (NPU)
  439. — Overclocking CPU
  440. — Overclocking GPU
  441. Assembly language

    RAG Architecture Optimizations:
  442. RAG architectures (overview)
  443. RAG cache
  444. RAG optimizations
  445. — RAG retriever datastore indexing
  446. Advanced RAG
  447. — Speculative RAG
  448. Reranker in RAG
  449. — Chunk-specific global KV caching
  450. — Chunk-specific prefix KV caching
  451. RAG Knowledge Graph
  452. RAG Ontologies/Taxonomies
  453. RAG fusion
  454. Mini-RAG (single-document RAG)

    Sparsity Optimizations:
  455. Sparsification techniques (overview)
  456. Activation Sparsity
  457. Dynamic Sparsity
  458. Block sparsity
  459. Vector sparsity
  460. Tensor sparsity
  461. Sparse matrix kernels
  462. Outlier-aware sparsification

    Memory Utilization Optimizations:
  463. Memory optimization techniques (overview)
  464. Parameter sharing
  465. Model compression
  466. Low-bit integer quantization
  467. Binary quantization
  468. Ternary quantization
  469. Layer fusion
  470. Recomputation: trading time for space
  471. Memory-bound versus CPU-bound
  472. Data locality optimization
  473. Compute-in-Memory (CIM) architectures (also called PIM)
  474. — Memory cache management algorithms
  475. Kernel operator fusion
  476. — Flash Inference (FlashInfer)
  477. — Checkpointing
  478. Offloading
  479. SSD storage

    Numerical representation subtypes:
  480. Floating-point representations (overview)
  481. Floating Point Bit Tricks
  482. Block floating-point arithmetic
  483. Fixed point number system (FXP) optimizations
  484. Floating point number system (FLP) optimizations
  485. Floating point bitwise arithmetic
  486. FTZ/DAZ floating point CPU settings

    Kernel optimizations:
  487. Kernel optimizations (overview)
  488. Kernel operator fusion (merging, aka "kernel fusion" or "fusion")
  489. — Fused epilogues (post-MatMul fusion: fused MatMul then activation/normalization)
  490. — Fused prologues (pre-MatMul fusion: fused activation/normalization then MatMul)
  491. Kernel fission (splitting one kernel apart)
  492. Kernel tiling
  493. — Operator reordering
  494. Graph operator fusion (Deep learning compilers)

    Computation optimizations:
  495. Advanced AI Mathematics
  496. Approximate activation functions
  497. Caching / memoization
  498. Computation reuse
  499. Precomputation
  500. Source code precomputation
  501. Conditional computation
  502. Approximations
  503. Integer-only arithmetic quantization
  504. Weight precomputations
  505. Zero-skipping
  506. Low-Level Zero Skipping
  507. High-Level Zero Skipping
  508. Negative skipping
  509. Approximate caching
  510. End-to-End integer inference
  511. Padding usage
  512. Incremental inference (new)
  513. BF16x9 emulation of FP32 computations (on Blackwell GPU)
  514. FP64 arithmetic emulation using 8-bit/16-bit/32-bit computations
  515. Thread block clusters (Blackwell/Rubin)
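
Two of the techniques above combine naturally: precompute the nonzero weight positions once, then every inference-time dot product zero-skips by touching only those entries. A minimal sketch (names illustrative):

```python
def precompute_nonzero(w):
    """One-off precomputation: indices and values of nonzero weights."""
    return [(i, wi) for i, wi in enumerate(w) if wi != 0.0]

def sparse_dot(x, nonzero):
    """Dot product that skips all the zero weights."""
    return sum(wi * x[i] for i, wi in nonzero)
```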

    Arithmetic optimizations:
  516. Integer operations
  517. Addition optimizations
  518. Bitwise operation tricks
  519. Approximate addition
  520. Multiplication algorithms
  521. Approximate division
  522. Approximate multiplication
  523. Bitwise operator inference
  524. Bitserial operations
  525. Division optimizations
  526. Logarithmic approximate multiplication
  527. Integer Dot Product
  528. Vector dot product optimization

    Advanced matrix algebra optimizations:
  529. Matrix Algebra (overview)
  530. Approximate matrix multiplication
  531. Butterfly matrices
  532. Monarch matrices
  533. Sparse matrices (sparsification)

    Low-rank matrix optimizations:
  534. Low-rank matrix factorization (overview)
  535. — Tensor decomposition
  536. — Tucker decomposition
  537. Embedding low-rank matrix factorization
  538. KV Cache low-rank matrix factorization

    Transformer architectural optimizations:
  539. Transformer architectures (overview)
  540. Transformer low-level optimizations (overview)
  541. — Adaptive Inference (dynamic inference)
  542. Integer-only Transformers
  543. Approximate Transformers
  544. Decoder-Only Architectures
  545. Encoder-Only Architectures
  546. Encoder-Decoder Architectures

    Transformers and LLMs:
  547. Open source models
  548. Inference frameworks
  549. Open source frameworks

    Next-Generation Transformer architectures:
  550. Next-generation architectures (overview)
  551. Hybrid Transformer architectures
  552. Newer Transformer architectures
  553. BERT (encoder)
  554. — State Space Models (SSMs)
  555. Mamba
  556. RWKV
  557. Knowledge graph AI architectures
  558. Compound AI architectures
  559. Large Concept Model (LCM)

    General Classes of Optimization Techniques:
  560. Dynamic inference (adaptive inference)
  561. Skipping
  562. Heuristics
  563. Probabilistic optimizations
  564. Approximate computing
  565. Code optimizations
  566. Deep learning compilers
  567. Incremental algorithms
  568. Fuzzy logic
  569. Inference budget (with adaptive inference)

    Loop Optimizations:
  570. Loop optimizations (overview)
  571. Inference loop optimizations
  572. Loop fusion (merging loops)
  573. Loop unrolling
  574. Loop perforation
  575. Loop reordering
  576. Loop tiling
  577. Loop reversal
  578. Loop fission (splitting a loop)
  579. — Loop interleave
  580. Loop interchange
  581. Loop coalescing
  582. Loop-invariant code motion ("hoisting")
  583. Loop distribution
  584. Pointer arithmetic
  585. Loop peeling (unrolling first iterations)
  586. Loop splitting
  — Loop sentinel
  587. Loop collapsing
  588. Loop normalization
  589. Loop strip mining (Loop sectioning)
  590. Loop skewing
  591. Loop spreading
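
As an example of the unrolling pattern: a 4-way unrolled dot product with independent accumulators, plus a remainder loop for the leftover iterations. In Python this gains little; the pattern pays off in C/C++, where it helps instruction pipelining and vectorization.

```python
def dot_unrolled(a, b):
    """Dot product with the main loop unrolled by a factor of four."""
    s0 = s1 = s2 = s3 = 0.0
    i, n = 0, len(a)
    while i + 4 <= n:          # unrolled main loop
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    for j in range(i, n):      # remainder iterations
        s0 += a[j] * b[j]
    return s0 + s1 + s2 + s3
```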

    Low-Level Coding Efficiency:
  592. Code optimizations (overview)
  593. Constant folding
  594. Common subexpression elimination
  595. Algebraic identities
  596. Strength reduction
  597. Type consistency
  598. Reciprocal multiplication
  599. References vs pointers
  600. Compile-time optimizations
  601. Pointer arithmetic
  602. Algorithm-level optimizations
  603. Lazy evaluation
  604. Memory reduction heuristics

    Data Structures for AI optimization:
  605. Hashing
  606. Perfect hashing
  607. Look-up tables (LUTs)
  608. Bloom filters
  609. — Trees
  610. — Tries
  612. Bitserial operations
  613. Permutation arrays
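
A look-up table sketch: precompute sigmoid over a clamped range so inference replaces exp() with an array index. The range and table size here are illustrative, and real kernels often add linear interpolation between entries.

```python
import math

def build_sigmoid_lut(lo=-8.0, hi=8.0, n=1024):
    """Precompute sigmoid values over [lo, hi] once, ahead of time."""
    step = (hi - lo) / (n - 1)
    table = [1.0 / (1.0 + math.exp(-(lo + i * step))) for i in range(n)]
    return table, lo, step

def sigmoid_lut(x, table, lo, step):
    """Approximate sigmoid(x) by nearest-entry table lookup."""
    i = int((x - lo) / step + 0.5)       # nearest table entry
    i = max(0, min(len(table) - 1, i))   # clamp out-of-range inputs
    return table[i]
```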

    Vector Data Structures:
  614. Parallel data structures
  615. Bit vectors
  616. Vector hashing
  617. Locality-Sensitive Hashing (LSH)
  618. Vector dot product caching
  619. — Bit signatures (vector algorithm)
  620. — K-means clustering (vector algorithm)
  621. — Hyper-Cube (vector algorithm)

    Convolution Optimizations in CNNs:
  622. Convolution optimizations (overview)
  623. Grouped convolutions
  624. Depth-wise separable convolutions

    Tokenization and Vocabulary Optimizations:
  625. Tokenization (overview)
  626. Tokenizer and model inference latency
  627. Semantic tokenization
  628. Tokenization for Machine Vision
  629. Tokenization of non-English languages
  630. Vocabulary optimizations:
  631. Vocabulary size
  632. Lexical shortlisting
  633. Vocabulary trimming
  634. Vocabulary expansion
  635. Dynamic vocabulary pruning

    Overall summaries of AI optimizations:
  636. Deslugging AI engines
  637. Accuracy-degrading optimizations
  638. Accuracy-retaining optimizations
  639. Uncommon inference optimizations

Not Enough?

More inference optimization resources:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

Free AI and C++ Books

Generative AI programming books:

  1. The Sweetest Lesson: Your Brain Versus AI, November 2025: full text online, free PDF available
  2. RAG Optimization: Accurate and Efficient LLM Applications, June 2025: full text online, free PDF available
  3. Generative AI Applications: Planning, Design and Implementation, November 2024: full text online, free PDF available
  4. Generative AI in C++ (Spuler, March 2024): full text online, free PDF available, table of contents, bonus materials, reference lists, source code

CUDA C++ GPU Programming Books:

  1. CUDA C++ Optimization: Coding Faster GPU Kernels, July 2024: full text online, bonus materials, free PDF available
  2. CUDA C++ Debugging: Safer GPU Kernel Programming, July 2024: full text online, free PDF available

Modern C++ Programming Books

  1. C++ AVX Optimization: CPU SIMD Vectorization, 2025: full text online, free PDF available
  2. C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations, 2025: full text online, free PDF available
  3. Advanced C++ Memory Techniques: Efficiency and Safety, 2025: full text online, free PDF available
  4. Efficient C++ Multithreading: Modern Concurrency Optimization, 2025: free PDF available
  5. Efficient Modern C++ Data Structures: Container and Algorithm Optimizations, 2025: free PDF available
  6. C++ Low Latency: Multithreading and Hotpath Optimizations, 2025: free PDF available
  7. Safe C++: Fixing Memory Safety Issues, Oct 2024: full text online, free PDF available

More AI Research Topics

Read more about: