Aussie AI

Inference Optimization Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

What's Hot in Inference Optimization Research?

Inference optimization has become a hot area of research as the industry evolves to the point where inference costs are about 95% of overall compute. This trend is driven by:

    (a) More user adoption of AI (in both consumer and enterprise),

    (b) Open source pre-trained models,

    (c) Faster training and fine-tuning methods (e.g. LoRA),

    (d) RAG architectures replacing fine-tuning, and

    (e) Multi-step inference-based reasoning (e.g., chain-of-thought in OpenAI's o1 model).

Some of the hottest areas for speeding up inference:

  • Hardware. The biggest opportunity for inference speedup is probably in hardware rather than software. There's the upcoming NVIDIA Blackwell architecture, which is apparently delayed as I write this, along with several AI-specific hardware startups such as Groq and Etched receiving large funding rounds. I'm not an expert on the hardware opportunities, so I'll leave it there.
  • KV cache compression. The KV cache was initially a speedup for inference, but it's become a memory hog, so there are numerous research papers on making it use less memory (see KV cache compression research). In particular, KV cache quantization is becoming standard in industry framework implementations, such as 4-bit quantized KV cache data used by Character.AI and Apple Intelligence. There are several fancier types of KV cache compression in the research. Notably, an implementation of KV cache layer fusion is used by Character.AI's inference backend for companionbots.
  • Context caching. The simplest cache is a text-to-text full "inference cache," and there's also semantic caching based on embedding vector similarity. However, the idea of saving the KV cache but re-running the decoding phase has various advantages, and it is gaining attention in both research and industry. Google has recently released "context caching" features, and this is also appearing in other frameworks, such as vLLM and DeepSeek. Expect many more to follow! See: context caching research.
  • Prefix KV caching. There are many cases where Transformers re-process the same prefix of tokens, such as multi-turn chatbot conversational context, global system instructions (prepended), RAG chunks (prepended), and re-used documents. Instead, you can load the KV cache data for that prefix from a prefix KV cache, so the prefill latency is minimal and only the remaining suffix tokens need processing (a minimal sketch appears after this list). Prefix KV caching is also being implemented in frameworks, including vLLM, DeepSeek, and Character.AI's backend. Interestingly, DeepSeek offers lower pricing for "cached tokens," which reflects the lower cost.
  • Multi-LoRA. The idea of using multiple LoRA adapters for efficiently supporting multiple fine-tuned models got a massive boost from Apple Intelligence. There are many research papers now focused on further optimizing the load-time and inference characteristics of multi-LoRA architectures and other types of Parameter-Efficient Fine-Tuning (PEFT).
  • Memory-efficient attention algorithms. The two leading contenders for optimizing attention via its memory access patterns are Flash Attention and Paged Attention, and you can even combine them! There are also their precursors, Multi-Query Attention (MQA) and Grouped Query Attention (GQA), which are still in use and still being researched in papers. See memory-efficient attention optimization.
  • Linear attention. Another way to reduce memory cost is to simply access it less! Algorithms like this include local attention and other types of linear attention. As a recent example in industry, Character.AI's inference backend uses a hybrid layerwise attention scheme that alternates between local and global attention across different layers. There's a lot of research happening on optimizing the attention mechanism, because of its annoying quadratic complexity. See research on attention optimization.
  • Zero-multiplication models. MIT researchers released a model architecture that replaces matrix multiplication with element-wise multiplication, known as the "Hadamard product." Basic matrix multiplication is O(n^3) whereas Hadamard computations are O(n^2), so it's potentially a factor-of-n reduction in multiplications, and it's also a simpler algorithm that's more amenable to follow-up kernel optimizations like kernel fusion. See Hadamard multiplication models. There are actually at least ten other types of zero-multiplication models in the literature (e.g., adder models, shift-add, logarithmic, power-of-two, max-plus, min-max, weightless neural networks, etc.). There's also the well-known method of avoiding multiplication with low-bit quantization: both binary quantization and ternary quantization can be implemented via addition, albeit with accuracy loss.
  • Speculative decoding. Increased parallelization of the decoding algorithm via speculative decoding is a perennially hot area of research. It's a speedup that has long been used in production backends (a minimal sketch appears after this list). Various generalizations have been discovered, such as generalized speculative decoding, heuristic speculative decoding, self-speculative decoding, retrieval lookup decoding, prompt lookup decoding, and several other methods.
  • Multi-token generation. Generalizing the decoding algorithm to output multiple tokens in parallel is a clear gain in efficiency, and some research is starting to show promise. These require an entirely different type of model architecture for both training and inference. There are also some multi-token drafting methods starting to be used to optimize speculative decoding algorithms. See: parallel decoding research.
  • Prefill optimizations. There has been a burst of new research examining the cost of the prefill phase, which creates the KV cache and is the reason for the initial latency before the first token is output. Hence, prefill time is important for user responsiveness in any interactive use case. In particular, research has found that prefill is compute-bound, whereas the decoding phase is memory-bound, so there is much research on prefill phase optimizations, chunked prefill, and disaggregated scheduling of the prefill and decoding phases on GPU platforms. Note also that the KV caching methods discussed above can optimize prefill by avoiding it completely!
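
To make the prefix KV caching idea concrete, here is a minimal sketch in C++ of a cache keyed on a hash of the token prefix: a hit means the stored key/value activations can be reloaded, and only the unseen suffix of tokens needs prefill. The KVCacheEntry layout, the hashing scheme, and the class names are assumptions for illustration, not the API of any particular framework.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct KVCacheEntry {
        // Stored key/value activations for the cached prefix, one pair per layer.
        std::vector<std::vector<float>> keys;
        std::vector<std::vector<float>> values;
        std::size_t prefix_len = 0;   // number of tokens covered by this entry
    };

    // Simple FNV-1a hash over the first 'len' token IDs, identifying a prefix.
    static uint64_t prefix_hash(const std::vector<int>& tokens, std::size_t len) {
        uint64_t h = 1469598103934665603ull;
        for (std::size_t i = 0; i < len; ++i) {
            h ^= static_cast<uint64_t>(tokens[i]);
            h *= 1099511628211ull;
        }
        return h;
    }

    class PrefixKVCache {
    public:
        // Returns the longest cached prefix of 'tokens', or nullptr on a miss.
        const KVCacheEntry* lookup(const std::vector<int>& tokens) const {
            for (std::size_t len = tokens.size(); len > 0; --len) {
                auto it = cache_.find(prefix_hash(tokens, len));
                if (it != cache_.end() && it->second.prefix_len == len)
                    return &it->second;   // hit: prefill only tokens[len..]
            }
            return nullptr;               // miss: full prefill required
        }

        void store(const std::vector<int>& tokens, KVCacheEntry entry) {
            entry.prefix_len = tokens.size();
            cache_[prefix_hash(tokens, tokens.size())] = std::move(entry);
        }

    private:
        std::unordered_map<uint64_t, KVCacheEntry> cache_;
    };

A production engine would keep the cached tensors in GPU memory (or a paged pool, as in vLLM), verify actual token equality rather than trusting the hash, and evict stale entries, but the control flow is the same: a cache hit skips most of the prefill work.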
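
Similarly, the core loop of speculative decoding, in its simplest greedy-verification form, can be sketched in a few lines. The draft and verify callbacks below are placeholders standing in for real model calls; this is an illustrative sketch of the acceptance logic under those assumptions, not any production implementation.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One step of greedy speculative decoding. The callbacks stand in for model calls:
    //   draft(context, k)       -> k tokens proposed by the small draft model
    //   verify(context, tokens) -> the large target model's greedy token at each
    //                              draft position, computed in one batched pass.
    std::vector<int> speculative_step(
            const std::vector<int>& context, int k,
            const std::function<std::vector<int>(const std::vector<int>&, int)>& draft,
            const std::function<std::vector<int>(const std::vector<int>&,
                                                 const std::vector<int>&)>& verify) {
        std::vector<int> proposed = draft(context, k);
        std::vector<int> checked = verify(context, proposed);

        std::vector<int> accepted;
        for (std::size_t i = 0; i < proposed.size(); ++i) {
            if (proposed[i] == checked[i]) {
                accepted.push_back(proposed[i]);   // draft agreed with the target model
            } else {
                accepted.push_back(checked[i]);    // take the target's token and stop
                break;
            }
        }
        // Best case: all k draft tokens are accepted, so several output tokens are
        // produced for the price of roughly one large-model forward pass.
        return accepted;
    }

Sampling-based variants replace the exact token match with an accept/reject test on the draft and target probabilities, but the structure of the loop is the same.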

Blog Articles on Inference Optimization: See also the Aussie AI blog articles:

What is Inference Optimization?

Inference is the process of "running" a request on a Large Language Model (LLM). Our research focus is on optimizing these algorithms so that AI models respond quickly to users (low "latency") and have a high overall throughput so as to scale efficiently. They need to be much faster, not only to reduce data center GPU costs, but also to run efficiently on your smartphone or AI laptop.

Overview of Inference Optimization Techniques

Running an AI model to get a response is called "inference". Optimization of the inference algorithms for AI models is the primary mechanism to provide fast response times and scalable throughput of AI requests from users. There is an extensive volume of literature on various techniques to optimize model inference. Some of the main techniques include:

For more, see the Long list of Transformer optimization methods.

Hardware Acceleration

Modern AI algorithms make use of sophisticated hardware acceleration created by makers of silicon chips. The best-known acceleration involves the use of a Graphics Processing Unit (GPU). GPU chips were originally designed for faster graphics, but have generalized over time into sophisticated general-purpose mathematical computation machines. The primary optimization from GPUs is parallel execution of arithmetic operations, such as floating-point multiplications on the vectors and matrices that are the basis of AI algorithms.

Non-GPU hardware is another area of acceleration. The GPU is only one of the chips in the box, and other chips can do grunt work as well. The CPU has been getting askance looks from AI researchers for years, but there are many sophisticated accelerations available from a variety of hardware vendors, usually at much lower price points. Also available are various special-purpose ASICs that focus on AI computations.

Parallelization

As a general strategy, AI requests can be parallelized across multiple machines, virtual machines, multi-GPU machines, multi-core CPUs, or multi-threaded execution. Obviously, multiple user requests can be farmed out across multiple endpoints, but the underlying algorithms can also be parallelized.

GPU processing and other hardware acceleration is mostly about parallelizing the multiplication algorithms involving weights, which is a low-level parallel optimization. However, parallelization can also be invoked at a higher level to achieve speedups. Many of the steps in AI inference algorithms can be performed in parallel, by splitting vector and matrix arithmetic across the available processing power.
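
As a concrete illustration of this higher-level parallelization, here is a minimal sketch in C++ that splits a matrix-vector multiply across hardware threads, with each thread computing a disjoint block of output rows. Real inference engines use GPU kernels or vendor BLAS libraries rather than std::thread, but the partitioning idea is the same; the function name and data layout are choices made just for this example.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Row-parallel matrix-vector multiply: y = W * x, with W stored row-major.
    // Each thread computes a disjoint block of output rows, so no locking is needed.
    void matvec_parallel(const std::vector<float>& W, const std::vector<float>& x,
                         std::vector<float>& y, std::size_t rows, std::size_t cols,
                         unsigned num_threads) {
        y.assign(rows, 0.0f);
        std::vector<std::thread> workers;
        const std::size_t block = (rows + num_threads - 1) / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            const std::size_t begin = t * block;
            const std::size_t end = std::min(rows, begin + block);
            if (begin >= end) break;
            workers.emplace_back([&, begin, end] {
                for (std::size_t r = begin; r < end; ++r) {
                    float sum = 0.0f;
                    for (std::size_t c = 0; c < cols; ++c)
                        sum += W[r * cols + c] * x[c];
                    y[r] = sum;
                }
            });
        }
        for (auto& w : workers) w.join();
    }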

Software Acceleration

There are many areas of active research into methods for accelerating AI inference engines in software. Some of the methods include:

Most of the techniques in "model compression" are also a form of software acceleration, although they tend to get their own category these days. There are also many more software techniques on the list of optimization methods.

Model Compression

Model compression is the general technique of optimizing AI inference by changing the model to be smaller, usually at the cost of some degree of accuracy. This can mean using fewer weights (e.g. pruning) or using weights that are smaller numbers (e.g. quantization). The main established techniques with a longstanding body of research include:

  • Quantization of models. This is a well-known method whereby high-precision 32-bit floating-point weights (and their multiplications) are replaced with lower-precision data, such as 16-bit floating-point numbers, or often integers to allow faster integer multiplication. Integer quantization has been researched from 8-bit integers all the way down to 4-bit, 3-bit, 2-bit and even 1-bit (binary) representations. Quantizing a pre-trained model for faster inference is called Post-Training Quantization (PTQ). It is also possible to apply quantization during model training using Quantization-Aware Training (QAT) algorithms. Quantization not only improves speed, but also reduces the model size for storage or transmission (e.g., an 8-bit quantization of a 32-bit floating-point model reduces storage size by a factor of four); a minimal sketch appears after this list. Read more about research on LLM quantization.
  • Pruning (static and dynamic). This technique optimizes the weighted links in LLM models, such as "magnitude pruning," which cuts links with very low values (indicating a feature with low probability). If there are fewer model links, fewer multiplications are required. See research on model pruning and Transformer-specific pruning ideas.
  • Layer pruning / layer compaction ("pancaking"). This is conceptually a generalization of pruning, whereby whole model layers are pruned. Models typically involve multiple hidden layers of weights and nodes, which perform the same underlying inference algorithms with different sets of weights. The layers can sometimes be removed without significant loss of model accuracy, resulting in proportional speedups. Also called "layer fusion" or "module fusion" in some papers. Read more about Dynamic Layer Pruning and optimizing Transformer layers (including the "shallow decoder" architecture).
  • Knowledge distillation. Training a smaller model to be similar to a larger model. This is a method where a large pre-trained model is used to train a smaller, more efficient model. See knowledge distillation research papers.
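
To make the quantization idea from the list above concrete, here is a minimal sketch of symmetric per-tensor 8-bit post-training quantization: a single scale maps the floating-point weights into the int8 range, and dequantization multiplies back by the scale. Real PTQ schemes add per-channel scales, zero points, outlier handling, and calibration data, all omitted here; the type and function names are just for this example.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor int8 quantization: scale = max|w| / 127, w_q = round(w / scale).
    struct QuantizedTensor {
        std::vector<int8_t> data;   // 1 byte per weight instead of 4 (a 4x size reduction)
        float scale = 1.0f;
    };

    QuantizedTensor quantize_int8(const std::vector<float>& weights) {
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
        QuantizedTensor q;
        q.scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
        q.data.reserve(weights.size());
        for (float w : weights)
            q.data.push_back(static_cast<int8_t>(std::lround(w / q.scale)));
        return q;
    }

    // Dequantize a single weight back to float for use in arithmetic.
    float dequantize_at(const QuantizedTensor& q, std::size_t i) {
        return q.scale * static_cast<float>(q.data[i]);
    }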

Static Inference Optimization

Static inference optimization methods are those where the model is changed during or after training, but is not modified during inference; the inference engine runs the resulting model in full. Some examples of this include:

The key point of these static inference optimizations is that a trained model is taken, and a new model is created that is smaller and faster, which is then used for inference.

Dynamic Inference Optimization (Adaptive inference)

Dynamic inference optimization, also called "adaptive inference", is where the inference engine adapts its behavior according to its input. Some examples of dynamic strategies include:

Hybrid optimizations. Combined strategies are possible in many ways. For example, a model can be quantized to lower precision, and then the inference engine could employ various dynamic pruning strategies. And some strategies apply across both training and inference phases, thereby combining approaches, such as using different approximation algorithms or advanced matrix decompositions. The co-design of hardware and software architectures also typically crosses both training and inference execution.
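
As a concrete example of a dynamic strategy, here is a minimal sketch of early exit: after each layer, a lightweight exit classifier estimates confidence, and if it exceeds a threshold, the remaining layers are skipped for that input. The callbacks are placeholders for a real engine's per-layer computation and intermediate exit heads; this is an illustrative sketch under those assumptions, not a production design.

    #include <functional>
    #include <vector>

    // Early-exit inference: stop as soon as an intermediate exit head is confident.
    //   run_layer(i, hidden)       -> the hidden state after layer i
    //   exit_confidence(i, hidden) -> e.g., max softmax probability of an exit head
    std::vector<float> infer_with_early_exit(
            std::vector<float> hidden, int num_layers, float threshold,
            const std::function<std::vector<float>(int, const std::vector<float>&)>& run_layer,
            const std::function<float(int, const std::vector<float>&)>& exit_confidence) {
        for (int layer = 0; layer < num_layers; ++layer) {
            hidden = run_layer(layer, hidden);
            // "Easy" inputs become confidently classifiable in early layers,
            // so they exit with only a fraction of the full compute cost.
            if (layer + 1 < num_layers && exit_confidence(layer, hidden) >= threshold)
                break;
        }
        return hidden;
    }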

That's not the full list. For more, see the full list of neural network optimization methods.

General Research on Dynamic Inference

These papers examine the general theory of dynamic inference (or "input-adaptive inference" or "conditional computation"), such as running faster on "easy" requests but slower on "hard" requests. See also the lists of papers on the many specific dynamic inference optimization techniques (e.g., early exit, dynamic pruning, width pruning, length pruning, big-little, cascades, etc.).

  • Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, "Input Hardness Adaptive Models" for methods of running faster on easy image classification problems.)
  • Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html
  • Nan, F. and Saligrama, V. 2017. Adaptive classification for prediction under a budget. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30, pages 4727–4737. Curran Associates, https://arxiv.org/abs/1705.10194 PDF: https://proceedings.neurips.cc/paper_files/paper/2017/file/d9ff90f4000eacd3a6c9cb27f78994cf-Paper.pdf
  • Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget. arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
  • Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. 2022. Dynamic neural networks: A survey. volume 44, pages 7436–7456, Los Alamitos, CA, USA. IEEE Computer Society. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
  • Graves, A. (2016). Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983. http://arxiv.org/abs/1603.08983
  • Jernite, Y., Grave, E., Joulin, A., and Mikolov, T. (2017). Variable computation in recurrent neural networks. In International Conference on Learning Representations. https://openreview.net/forum?id=S1LVSrcge, https://arxiv.org/abs/1611.06188
  • D. Stamoulis, T.-W. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu, “Designing adaptive neural networks for energyconstrained image classification,” in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8. https://ieeexplore.ieee.org/document/8587753
  • Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez, 2018, “Skipnet: Learning dynamic routing in convolutional networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 409–424. https://arxiv.org/abs/1711.09485
  • Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, Yingyan Lin, 2020, “Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference,” IEEE Journal of Selected Topics in Signal Processing, 2020. https://arxiv.org/abs/1907.04523
  • P. Panda, A. Sengupta, and K. Roy, “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
  • Berestizshevsky, K., Even, G.: Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
  • Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P.: Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
  • Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
  • Maedeh Hemmat; Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on input.)
  • B Wójcik, M Przewiȩźlikowski, F Szatkowski, Oct 2023, Zero time waste in pre-trained early exit neural networks, Neural Networks, https://www.sciencedirect.com/science/article/pii/S0893608023005555, PDF: https://papers.nips.cc/paper/2021/file/149ef6419512be56a93169cd5e6fa8fd-Paper.pdf (Attempts to quickly handle easy inputs by caching classifications from prior layers for early exit decisions.)
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Ting-Wu Chin, Ruizhou Ding, and Diana Marculescu. 2019. AdaScale: Towards Real-time Video Object Detection using Adaptive Scaling. Proceedings of Machine Learning and Systems 2019. 431–441. https://arxiv.org/abs/1902.02910 (Adaptive inference that dynamically choose the image scale to analyze.)
  • Shihao Zhang, Weiyao Lin, Ping Lu, Weihua Li, and Shuo Deng. 2017. Kill two birds with one stone: Boosting both object detection accuracy and speed with adaptive patch-of-interest composition. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 447–452. https://arxiv.org/abs/1708.03795 Code: http://min.sjtu.edu.cn/lwydemo/Dete/demo/detection.html (Adaptive inference by focusing on regions of images.)
  • Bruno Korbar, Du Tran, and Lorenzo Torresani. 2019. Scsampler: Sampling salient clips from video for efficient action recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 6232–6242. https://arxiv.org/abs/1904.04289 (Dynamically selects clips from videos to analyze.)
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017 https://arxiv.org/abs/1605.07648 (Uses both shallow and deep networks with dropout, but not fully layer-wise; similar to cascades.)
  • Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; and Weinberger, K. Q., 2017. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, https://arxiv.org/abs/1703.09844 (Doing dynamic inference, early-exit & changing the features.)
  • Liu, L.; and Deng, J. 2018. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. In Thirty-Second AAAI Conference on Artificial Intelligence, https://arxiv.org/abs/1701.00299
  • M Piórczyński, F Szatkowski, K Bałazy, B Wójcik, 2023, Exploiting Transformer Activation Sparsity with Dynamic Inference https://arxiv.org/pdf/2310.04361.pdf
  • Francesco Ratto, Ángela Porras Máinez, Carlo Sau, Paolo Meloni, Gianfranco Deriu, Stefano Delucchi, Massimo Massa, Luigi Raffo, Francesca Palumbo, April 2023, An Automated Design Flow for Adaptive Neural Network Hardware Accelerators. Journal of Signal Processing Systems (2023): 1-23. https://link.springer.com/article/10.1007/s11265-023-01855-x (Adaptable inference for a CNN by dynamic modification of FPGA-accelerated hardware integrations.)
  • David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Li Yang, Zhezhi He, Yu Cao, Deliang Fan, Sep 2020, A Progressive Sub-Network Searching Framework for Dynamic Inference, https://arxiv.org/abs/2009.05681
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • J. Yu, T. Huang, Universally slimmable networks and improved training techniques, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1803–1811. doi:10.1109/ICCV.2019.00189. http://dx.doi.org/10.1109/ICCV.2019.00189
  • J. Yu, L. Yang, N. Xu, J. Yang, T. Huang, Slimmable neural networks, in: International Conference on Learning Representations, 2019. https://openreview.net/forum?id=H1gMCsAqY7
  • Z. Chen, Y. Li, S. Bengio, S. Si, You look twice: GaterNet for dynamic filter selection in CNNs, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9164–9172. doi:10.1109/CVPR.2019.00939. http://dx.doi.org/10.1109/CVPR.2019.00939
  • A. Kouris, S. Venieris, C.-S. Bouganis, A throughput-latency co-optimised cascade of convolutional neural network classifiers, IEEE, 2019. http://hdl.handle.net/10044/1/75445
  • E. S. Marquez, J. S. Hare, M. Niranjan, Deep cascade learning, IEEE Transactions on Neural Networks and Learning Systems 29 (11) (2018) 5475–5485. doi:10.1109/TNNLS.2018.2805098. http://dx.doi.org/10.1109/TNNLS.2018.2805098
  • A. Krizhevsky, V. Nair, G. Hinton, CIFAR-10 and CIFAR-100 (Canadian Institute for Advanced Research), http://www.cs.toronto.edu/~kriz/cifar.html, last access: 02/2020 (2020).
  • K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9) (2015) 1904–1916. doi:10.1109/TPAMI.2015.2389824. http://dx.doi.org/10.1109/TPAMI.2015.2389824
  • S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6) (2017) 1137–1149. doi:10.1109/TPAMI.2016.2577031. http://dx.doi.org/10.1109/TPAMI.2016.2577031
  • E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4) (2017) 640–651. doi:10.1109/TPAMI.2016.2572683. http://dx.doi.org/10.1109/TPAMI.2016.2572683

Uncommon Optimization Techniques

Some of the more theoretical and lesser known techniques include:

  • Weight clustering. This technique involves merging weights that have similar magnitude into "clusters" that use exactly the same weight instead. It is similar to a combination of quantization and pruning, and can augment either technique. See weight clustering research.
  • Approximate matrix multiplication. There are various algorithms for performing mathematical multiplication of matrices without actually using numeric multiplication. This is an area of active research that involves a crossover between high-end mathematics and computer algorithms. Several techniques show promise of fast calculations with an acceptable loss of accuracy. Read more about AI matrix algebra optimizations.
  • Logarithmic bitshift quantization (power-of-two). The coding optimization of using bit-shift operators to replace multiplication is well-known. Conceptually, this is a second-order quantization method: first convert floating-point model weights to integers (for integer multiplications), and then further convert those integer weights logarithmically to the nearest power of two. This allows the integer multiplications in vector and matrix operations to be replaced with integer bit-shift operations (see the sketch after this list). However, the trade-off is a greater loss of model accuracy than basic integer quantization. Read more about bitshift inference optimizations.
  • Additive and zero-multiplication neural networks. Various approaches to remove the multiplication bottleneck by replacing it with other arithmetic operators, including adder networks, bit shifting, and other low-level optimizations. See "zero-multiplication models".
  • Low-rank optimization. Optimize high-degree tensors to be lower-degree tensors using matrix factorization/decomposition. This is conceptually a type of large-scale pruning. Read more about low-rank decomposition and matrix algebra.
  • Faster multiplication algorithms. There are ways to do multiplication faster, although this research is mainly used by chip designers nowadays. Read more about multiplication mathematics.
  • Approximate multiplication. This is a low-level optimization of the multiplication operation itself, usually a more mathematical method than using bitshifts. Some of these methods have been used in quantization. Read more about advanced mathematical optimizations.
  • Matrix multiplication algorithms. Read more about optimizations involving matrix multiplication.
  • Advanced Numeric Representations. Various non-standard alternative methods to store numeric weights, making use of all the bits in a byte, going beyond the standard floating point or integer bit patterns. Read more about number systems.
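
As a concrete example of the bitshift idea in the list above, here is a minimal sketch of logarithmic (power-of-two) quantization: each integer weight is reduced to a sign plus the exponent of its nearest power of two, so the multiply in a dot product becomes an integer shift. This is illustrative only; real schemes also handle zero weights, scaling factors, and accuracy recovery, and the names are just for this example.

    #include <cmath>
    #include <cstdint>
    #include <cstdlib>

    // A nonzero integer weight is stored as a sign and the exponent of its
    // nearest power of two, so w is approximated as sign * (1 << exponent).
    struct Pow2Weight {
        int8_t sign;       // +1 or -1
        uint8_t exponent;  // log2 of the weight's magnitude, rounded
    };

    Pow2Weight quantize_pow2(int32_t w) {
        Pow2Weight p;
        p.sign = (w < 0) ? int8_t(-1) : int8_t(1);
        const int32_t mag = std::abs(w);
        p.exponent = (mag <= 1)   // note: a zero weight needs special handling, omitted here
            ? uint8_t(0)
            : static_cast<uint8_t>(std::lround(std::log2(static_cast<double>(mag))));
        return p;
    }

    // The "multiplication" of an activation by a power-of-two weight is a bit shift.
    int32_t mul_by_pow2(int32_t activation, Pow2Weight w) {
        return w.sign * (activation << w.exponent);
    }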

Survey Papers on Inference Optimization

Survey papers on inference optimization:

Inference Optimization (Generally)

Other general research papers on inference optimization:

Adaptive Inference

Adaptive inference is where the engine adapts its execution to the input, or to constraints such as its allowed computing budget. Some research papers on adaptive inference include:

  • Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
  • Jiawen Zhu, Xin Chen, Haiwen Diao, Shuai Li, Jun-Yan He, Chenyang Li, Bin Luo, Dong Wang, Huchuan Lu, 26 Mar 2024, Exploring Dynamic Transformer for Efficient Object Tracking, https://arxiv.org/abs/2403.17651 (Different reasoning routes for inputs in machine vision.)
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Describes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference in the width dimension.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters; done statically this is similar to a form of model compression, while elastic inference done dynamically is a type of adaptive inference.)
  • Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193 (Modifies its computation depending on the difficulty of each input token.)
  • F Ilhan, KH Chow, S Hu, T Huang, S Tekin, W Wei, 2024, Adaptive Deep Neural Network Inference Optimization with EENet, https://openaccess.thecvf.com/content/WACV2024/papers/Ilhan_Adaptive_Deep_Neural_Network_Inference_Optimization_With_EENet_WACV_2024_paper.pdf
  • Lu Hou, Lifeng Shang, Xin Jiang, and Qun Liu. 2020. Dynabert: Dynamic BERT with adaptive width and depth. arXiv preprint arXiv:2004.04037 https://arxiv.org/abs/2004.04037 Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
  • Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop performance, stochastic algorithms, approximate arithmetic, quantization, pruning and low-rank.)
  • Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang, Dec 2023, Green Edge AI: A Contemporary Survey, https://arxiv.org/abs/2312.00333
  • David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • M Omer Mohammed Elamin Elshaigi, 2023 Adaptive Deep Neural Networks for Human Pose Estimation on Autonomous Nano-Drones, Masters Thesis, PDF: https://webthesis.biblio.polito.it/secure/27689/1/tesi.pdf
  • Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
  • Max Sponner, Bernd Waschneck, Akash Kumar, 14 May 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, Volume 56, Issue 10, Article No.: 248, Pages 1 - 40, https://doi.org/10.1145/3657283 https://dl.acm.org/doi/abs/10.1145/3657283
  • Keivan Alizadeh, Iman Mirzadeh, Hooman Shahrokhi, Dmitry Belenko, Frank Sun, Minsik Cho, Mohammad Hossein Sekhavat, Moin Nabi, Mehrdad Farajtabar, 1 Oct 2024, Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models, https://arxiv.org/abs/2410.10846
  • Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
  • Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
  • Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
  • Rei Barjami, Antonio Miele, and Luca Mottola. 2024. Intermittent Inference: Trading a 1% Accuracy Loss for a 1.9x Throughput Speedup. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 647–660. https://doi.org/10.1145/3666025.3699364 https://dl.acm.org/doi/abs/10.1145/3666025.3699364 https://dl.acm.org/doi/pdf/10.1145/3666025.3699364
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
  • Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Sep 2024, ELMS: Elasticized Large Language Models On Mobile Devices, https://arxiv.org/abs/2409.09071
  • Yucheng Xing, Xin Wang, 19 Nov 2024, Puppet-CNN: Input-Adaptive Convolutional Neural Networks with Model Compression using Ordinary Differential Equation, https://arxiv.org/abs/2411.12876

Distributed Inference

Computational Complexity Analysis of LLM Inference

More Deep AI Research Areas

Read more about: