Aussie AI

Inference Optimization Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

What's Hot in Inference Optimization Research?

Inference optimization has become a hot area of research as the industry evolves to the point where inference costs are about 95% of overall compute. This trend is driven by:

    (a) More user adoption of AI (in both consumer and enterprise),

    (b) Open source pre-trained models,

    (c) Faster training and fine-tuning methods (e.g. LoRA),

    (d) RAG architectures replacing fine-tuning, and

    (e) Multi-step inference-based reasoning (e.g., chain-of-thought in OpenAI's o1 model).

Some of the hottest areas for speeding up inference:

  • Hardware. The biggest opportunity for inference speedup is probably in hardware rather than software. There's the upcoming NVIDIA Blackwell architecture, which is apparently delayed as I write this, along with several AI-specific hardware startups such as Groq and Etched receiving large funding rounds. I'm not an expert on the hardware opportunities, so I'll leave it there.
  • KV cache compression. The KV cache was initially a speedup for inference, but it's become a memory hog, so there are numerous research papers on making it use less memory (see KV cache compression research). In particular, KV cache quantization is becoming standard in industry framework implementations, such as 4-bit quantized KV cache data used by Character.AI and Apple Intelligence. There are several fancier types of KV cache compression in the research. Notably, an implementation of KV cache layer fusion is used by Character.AI's inference backend for companionbots.
  • Context caching. The simplest cache is a text-to-text full "inference cache," and there's also semantic caching based on embedding vector similarity. However, the idea of saving the KV cache but re-running the decoding phase has various advantages, and it is gaining attention in both research and industry. Google has recently released "context caching" features, and this is also appearing in other frameworks, such as vLLM and DeepSeek. Expect many more to follow! See: context caching research.
  • Prefix KV caching. There are many cases where Transformers re-process the same prefix of tokens, such as multi-turn chatbot conversational context, global system instructions (prepended), RAG chunks (prepended), and re-used documents. Instead, you can load the KV cache data for that prefix from a prefix KV cache, so the prefill latency is minimal and only the remaining suffix tokens need processing (a minimal sketch appears after this list). Prefix KV caching is also being implemented in frameworks, including vLLM, DeepSeek, and Character.AI's backend. Interestingly, DeepSeek offers lower pricing for "cached tokens," which reflects the lower cost.
  • Multi-LoRA. The idea of using multiple LoRA adapters for efficiently supporting multiple fine-tuned models got a massive boost from Apple Intelligence. There are many research papers now focused on further optimizing the load-time and inference characteristics of multi-LoRA architectures and other types of Parameter-Efficient Fine-Tuning (PEFT).
  • Memory-efficient attention algorithms. The two leading contenders for optimizing attention via its memory access patterns are Flash Attention and Paged Attention, and you can even combine them! There are also their precursors, Multi-Query Attention (MQA) and Grouped Query Attention (GQA), which are still in use and still being researched in papers. See memory-efficient attention optimization.
  • Linear attention. Another way to reduce memory cost is to simply access it less! Algorithms like this include local attention and other types of linear attention. As a recent example in industry, Character.AI's inference backend uses a hybrid layerwise attention scheme that alternates between local and global attention across different layers. There's a lot of research happening on optimizing the attention mechanism, because of its annoying quadratic complexity. See research on attention optimization.
  • Zero-multiplication models. MIT researchers released a model architecture that replaces matrix multiplication with element-wise multiplication, known as the "Hadamard product." Basic matrix multiplication is O(n^3) whereas Hadamard computations are O(n^2), so it's potentially a factor-of-n reduction in multiplications, and it's also a simpler algorithm that's more amenable to follow-up kernel optimizations like kernel fusion. See Hadamard multiplication models. There are actually at least ten other types of zero-multiplication models in the literature (e.g., adder models, shift-add, logarithmic, power-of-two, max-plus, min-max, weightless neural networks, etc.). There's also the well-known method of avoiding multiplication with low-bit quantization: both binary quantization and ternary quantization can be implemented via addition, albeit with accuracy loss.
  • Speculative decoding. Increased parallelization of the decoding algorithm via speculative decoding is a perennially hot area of research. It's a speedup that has long been used in production backends (a minimal sketch appears after this list). Various generalizations have been discovered, such as generalized speculative decoding, heuristic speculative decoding, self-speculative decoding, retrieval lookup decoding, prompt lookup decoding, and several other methods.
  • Multi-token generation. Generalizing the decoding algorithm to output multiple tokens in parallel is a clear gain in efficiency, and some research is starting to show promise. These require an entirely different type of model architecture for both training and inference. There are also some multi-token drafting methods starting to be used to optimize speculative decoding algorithms. See: parallel decoding research.
  • Prefill optimizations. There has been a burst of new research examining the cost of the prefill phase, which creates the KV cache and is the reason for the initial latency before the first token is output. Hence, prefill time is important for user responsiveness in any interactive use case. In particular, research has found that prefill is compute-bound, whereas the decoding phase is memory-bound, so there is much research on prefill phase optimizations, chunked prefill, and disaggregated scheduling of the prefill and decoding phases on GPU platforms. Note also that the KV caching methods discussed above can optimize prefill by avoiding it completely!
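
To make the prefix KV caching idea concrete, here is a minimal sketch in C++ of a cache keyed on a hash of the token prefix: a hit means the stored key/value activations can be reloaded, and only the unseen suffix of tokens needs prefill. The KVCacheEntry layout, the hashing scheme, and the class names are assumptions for illustration, not the API of any particular framework.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct KVCacheEntry {
        // Stored key/value activations for the cached prefix, one pair per layer.
        std::vector<std::vector<float>> keys;
        std::vector<std::vector<float>> values;
        std::size_t prefix_len = 0;   // number of tokens covered by this entry
    };

    // Simple FNV-1a hash over the first 'len' token IDs, identifying a prefix.
    static uint64_t prefix_hash(const std::vector<int>& tokens, std::size_t len) {
        uint64_t h = 1469598103934665603ull;
        for (std::size_t i = 0; i < len; ++i) {
            h ^= static_cast<uint64_t>(tokens[i]);
            h *= 1099511628211ull;
        }
        return h;
    }

    class PrefixKVCache {
    public:
        // Returns the longest cached prefix of 'tokens', or nullptr on a miss.
        const KVCacheEntry* lookup(const std::vector<int>& tokens) const {
            for (std::size_t len = tokens.size(); len > 0; --len) {
                auto it = cache_.find(prefix_hash(tokens, len));
                if (it != cache_.end() && it->second.prefix_len == len)
                    return &it->second;   // hit: prefill only tokens[len..]
            }
            return nullptr;               // miss: full prefill required
        }

        void store(const std::vector<int>& tokens, KVCacheEntry entry) {
            entry.prefix_len = tokens.size();
            cache_[prefix_hash(tokens, tokens.size())] = std::move(entry);
        }

    private:
        std::unordered_map<uint64_t, KVCacheEntry> cache_;
    };

A production engine would keep the cached tensors in GPU memory (or a paged pool, as in vLLM), verify actual token equality rather than trusting the hash, and evict stale entries, but the control flow is the same: a cache hit skips most of the prefill work.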
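
Similarly, the core loop of speculative decoding, in its simplest greedy-verification form, can be sketched in a few lines. The draft and verify callbacks below are placeholders standing in for real model calls; this is an illustrative sketch of the acceptance logic under those assumptions, not any production implementation.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One step of greedy speculative decoding. The callbacks stand in for model calls:
    //   draft(context, k)       -> k tokens proposed by the small draft model
    //   verify(context, tokens) -> the large target model's greedy token at each
    //                              draft position, computed in one batched pass.
    std::vector<int> speculative_step(
            const std::vector<int>& context, int k,
            const std::function<std::vector<int>(const std::vector<int>&, int)>& draft,
            const std::function<std::vector<int>(const std::vector<int>&,
                                                 const std::vector<int>&)>& verify) {
        std::vector<int> proposed = draft(context, k);
        std::vector<int> checked = verify(context, proposed);

        std::vector<int> accepted;
        for (std::size_t i = 0; i < proposed.size(); ++i) {
            if (proposed[i] == checked[i]) {
                accepted.push_back(proposed[i]);   // draft agreed with the target model
            } else {
                accepted.push_back(checked[i]);    // take the target's token and stop
                break;
            }
        }
        // Best case: all k draft tokens are accepted, so several output tokens are
        // produced for the price of roughly one large-model forward pass.
        return accepted;
    }

Sampling-based variants replace the exact token match with an accept/reject test on the draft and target probabilities, but the structure of the loop is the same.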

Blog Articles on Inference Optimization: See also the Aussie AI blog articles:

What is Inference Optimization?

Inference is the process of "running" a request on a Large Language Model (LLM). Our research focus is on optimizing these algorithms so that AI models respond quickly to users (low "latency") and have a high overall throughput so as to scale efficiently. They need to be much faster, not only to reduce data center GPU costs, but also to run efficiently on your smartphone or AI laptop.

Overview of Inference Optimization Techniques

Running an AI model to get a response is called "inference". Optimization of the inference algorithms for AI models is the primary mechanism to provide fast response times and scalable throughput of AI requests from users. There is an extensive volume of literature on various techniques to optimize model inference. Some of the main techniques include:

For more, see the Long list of Transformer optimization methods.

Hardware Acceleration

Modern AI algorithms make use of sophisticated hardware acceleration created by makers of silicon chips. The best-known acceleration involves the use of a Graphics Processing Unit (GPU). GPU chips were originally designed for faster graphics, but have generalized over time into sophisticated general-purpose mathematical computation machines. The primary optimization from GPUs is parallel execution of arithmetic operations, such as floating-point multiplications on the vectors and matrices that are the basis of AI algorithms.

Non-GPU hardware is another area of acceleration. The GPU is only one of the chips in the box, and other chips can do grunt work as well. The CPU has been getting askance looks from AI researchers for years, but there are many sophisticated accelerations available from a variety of hardware vendors, usually at much lower price points. Also available are various special-purpose ASICs that focus on AI computations.

Parallelization

As a general strategy, AI requests can be parallelized across multiple machines, virtual machines, multi-GPU machines, multi-core CPUs, or multi-threaded execution. Obviously, multiple user requests can be farmed out across multiple endpoints, but the underlying algorithms can also be parallelized.

GPU processing and other hardware acceleration is mostly about parallelizing the multiplication algorithms involving weights, which is a low-level parallel optimization. However, parallelization can also be invoked at a higher level to achieve speedups. Many of the steps in AI inference algorithms can be performed in parallel, by splitting vector and matrix arithmetic across the available processing power.
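
As a concrete illustration of this higher-level parallelization, here is a minimal sketch in C++ that splits a matrix-vector multiply across hardware threads, with each thread computing a disjoint block of output rows. Real inference engines use GPU kernels or vendor BLAS libraries rather than std::thread, but the partitioning idea is the same; the function name and data layout are choices made just for this example.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Row-parallel matrix-vector multiply: y = W * x, with W stored row-major.
    // Each thread computes a disjoint block of output rows, so no locking is needed.
    void matvec_parallel(const std::vector<float>& W, const std::vector<float>& x,
                         std::vector<float>& y, std::size_t rows, std::size_t cols,
                         unsigned num_threads) {
        y.assign(rows, 0.0f);
        std::vector<std::thread> workers;
        const std::size_t block = (rows + num_threads - 1) / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            const std::size_t begin = t * block;
            const std::size_t end = std::min(rows, begin + block);
            if (begin >= end) break;
            workers.emplace_back([&, begin, end] {
                for (std::size_t r = begin; r < end; ++r) {
                    float sum = 0.0f;
                    for (std::size_t c = 0; c < cols; ++c)
                        sum += W[r * cols + c] * x[c];
                    y[r] = sum;
                }
            });
        }
        for (auto& w : workers) w.join();
    }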

Software Acceleration

There are many areas of active research into methods for accelerating AI inference engines in software. Some of the methods include:

Most of the techniques in "model compression" are also a form of software acceleration, although they tend to get their own category these days. There are also many more software techniques on the list of optimization methods.

Model Compression

Model compression is the general technique of optimizing AI inference by changing the model to be smaller, usually at the cost of some degree of accuracy. This can mean using fewer weights (e.g. pruning) or using weights that are smaller numbers (e.g. quantization). The main established techniques with a longstanding body of research include:

  • Quantization of models. This is a well-known method whereby high-precision 32-bit floating-point weights (and their multiplications) are replaced with lower-precision data, such as 16-bit floating-point numbers, or often integers to allow faster integer multiplication. Integer quantization has been researched from 8-bit integers all the way down to 4-bit, 3-bit, 2-bit and even 1-bit (binary) representations. Quantizing a pre-trained model for faster inference is called Post-Training Quantization (PTQ). It is also possible to apply quantization during model training using Quantization-Aware Training (QAT) algorithms. Quantization not only improves speed, but also reduces the model size for storage or transmission (e.g., an 8-bit quantization of a 32-bit floating-point model reduces storage size by a factor of four); a minimal sketch appears after this list. Read more about research on LLM quantization.
  • Pruning (static and dynamic). This technique optimizes the weighted links in LLM models, such as "magnitude pruning," which cuts links with very low values (indicating a feature with low probability). If there are fewer model links, fewer multiplications are required. See research on model pruning and Transformer-specific pruning ideas.
  • Layer pruning / layer compaction ("pancaking"). This is conceptually a generalization of pruning, whereby whole model layers are pruned. Models typically involve multiple hidden layers of weights and nodes, which perform the same underlying inference algorithms with different sets of weights. The layers can sometimes be removed without significant loss of model accuracy, resulting in proportional speedups. Also called "layer fusion" or "module fusion" in some papers. Read more about Dynamic Layer Pruning and optimizing Transformer layers (including the "shallow decoder" architecture).
  • Knowledge distillation. Training a smaller model to be similar to a larger model. This is a method where a large pre-trained model is used to train a smaller, more efficient model. See knowledge distillation research papers.
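
To make the quantization idea from the list above concrete, here is a minimal sketch of symmetric per-tensor 8-bit post-training quantization: a single scale maps the floating-point weights into the int8 range, and dequantization multiplies back by the scale. Real PTQ schemes add per-channel scales, zero points, outlier handling, and calibration data, all omitted here; the type and function names are just for this example.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor int8 quantization: scale = max|w| / 127, w_q = round(w / scale).
    struct QuantizedTensor {
        std::vector<int8_t> data;   // 1 byte per weight instead of 4 (a 4x size reduction)
        float scale = 1.0f;
    };

    QuantizedTensor quantize_int8(const std::vector<float>& weights) {
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
        QuantizedTensor q;
        q.scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
        q.data.reserve(weights.size());
        for (float w : weights)
            q.data.push_back(static_cast<int8_t>(std::lround(w / q.scale)));
        return q;
    }

    // Dequantize a single weight back to float for use in arithmetic.
    float dequantize_at(const QuantizedTensor& q, std::size_t i) {
        return q.scale * static_cast<float>(q.data[i]);
    }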

Static Inference Optimization

Static inference optimization methods are those where the model is changed during or after training, but is not modified during inference; the inference engine runs the resulting model in full. Some examples of this include:

The key point of these static inference optimizations is that a trained model is taken, and a new model is created that is smaller and faster, which is then used for inference.

Dynamic Inference Optimization (Adaptive inference)

Dynamic inference optimization, also called "adaptive inference", is where the inference engine adapts its behavior according to its input. Some examples of dynamic strategies include:

Hybrid optimizations. Combined strategies are possible in many ways. For example, a model can be quantized to lower precision, and then the inference engine could employ various dynamic pruning strategies. And some strategies apply across both training and inference phases, thereby combining approaches, such as using different approximation algorithms or advanced matrix decompositions. The co-design of hardware and software architectures also typically crosses both training and inference execution.
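
As a concrete example of a dynamic strategy, here is a minimal sketch of early exit: after each layer, a lightweight exit classifier estimates confidence, and if it exceeds a threshold, the remaining layers are skipped for that input. The callbacks are placeholders for a real engine's per-layer computation and intermediate exit heads; this is an illustrative sketch under those assumptions, not a production design.

    #include <functional>
    #include <vector>

    // Early-exit inference: stop as soon as an intermediate exit head is confident.
    //   run_layer(i, hidden)       -> the hidden state after layer i
    //   exit_confidence(i, hidden) -> e.g., max softmax probability of an exit head
    std::vector<float> infer_with_early_exit(
            std::vector<float> hidden, int num_layers, float threshold,
            const std::function<std::vector<float>(int, const std::vector<float>&)>& run_layer,
            const std::function<float(int, const std::vector<float>&)>& exit_confidence) {
        for (int layer = 0; layer < num_layers; ++layer) {
            hidden = run_layer(layer, hidden);
            // "Easy" inputs become confidently classifiable in early layers,
            // so they exit with only a fraction of the full compute cost.
            if (layer + 1 < num_layers && exit_confidence(layer, hidden) >= threshold)
                break;
        }
        return hidden;
    }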

That's not the full list. For more, see the full list of neural network optimization methods.

General Research on Dynamic Inference

These papers examine the general theory of dynamic inference (or "input-adaptive inference" or "conditional computation"), such as running faster on "easy" requests but slower on "hard" requests. See also the lists of papers on the many specific dynamic inference optimization techniques (e.g., early exit, dynamic pruning, width pruning, length pruning, big-little, cascades, etc.).

  • Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, "Input Hardness Adaptive Models" for methods of running faster on easy image classification problems.)
  • Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html
  • Nan, F. and Saligrama, V. 2017. Adaptive classification for prediction under a budget. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30, pages 4727–4737. Curran Associates, https://arxiv.org/abs/1705.10194 PDF: https://proceedings.neurips.cc/paper_files/paper/2017/file/d9ff90f4000eacd3a6c9cb27f78994cf-Paper.pdf
  • Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget. arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
  • Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. 2022. Dynamic neural networks: A survey. volume 44, pages 7436–7456, Los Alamitos, CA, USA. IEEE Computer Society. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
  • Graves, A. (2016). Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983. http://arxiv.org/abs/1603.08983
  • Jernite, Y., Grave, E., Joulin, A., and Mikolov, T. (2017). Variable computation in recurrent neural networks. In International Conference on Learning Representations. https://openreview.net/forum?id=S1LVSrcge, https://arxiv.org/abs/1611.06188
  • D. Stamoulis, T.-W. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu, “Designing adaptive neural networks for energyconstrained image classification,” in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8. https://ieeexplore.ieee.org/document/8587753
  • Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez, 2018, “Skipnet: Learning dynamic routing in convolutional networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 409–424. https://arxiv.org/abs/1711.09485
  • Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, Yingyan Lin, 2020, “Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference,” IEEE Journal of Selected Topics in Signal Processing, 2020. https://arxiv.org/abs/1907.04523
  • P. Panda, A. Sengupta, and K. Roy, “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
  • Berestizshevsky, K., Even, G.: Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
  • Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P.: Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
  • Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
  • Maedeh Hemmat; Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on input.)
  • B Wójcik, M Przewiȩźlikowski, F Szatkowski, Oct 2023, Zero time waste in pre-trained early exit neural networks, Neural Networks, https://www.sciencedirect.com/science/article/pii/S0893608023005555, PDF: https://papers.nips.cc/paper/2021/file/149ef6419512be56a93169cd5e6fa8fd-Paper.pdf (Attempts to quickly handle easy inputs by caching classifications from prior layers for early exit decisions.)
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Ting-Wu Chin, Ruizhou Ding, and Diana Marculescu. 2019. AdaScale: Towards Real-time Video Object Detection using Adaptive Scaling. Proceedings of Machine Learning and Systems 2019. 431–441. https://arxiv.org/abs/1902.02910 (Adaptive inference that dynamically choose the image scale to analyze.)
  • Shihao Zhang, Weiyao Lin, Ping Lu, Weihua Li, and Shuo Deng. 2017. Kill two birds with one stone: Boosting both object detection accuracy and speed with adaptive patch-of-interest composition. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 447–452. https://arxiv.org/abs/1708.03795 Code: http://min.sjtu.edu.cn/lwydemo/Dete/demo/detection.html (Adaptive inference by focusing on regions of images.)
  • Bruno Korbar, Du Tran, and Lorenzo Torresani. 2019. Scsampler: Sampling salient clips from video for efficient action recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 6232–6242. https://arxiv.org/abs/1904.04289 (Dynamically selects clips from videos to analyze.)
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017 https://arxiv.org/abs/1605.07648 (Uses both shallow and deep networks with dropout, but not fully layer-wise; similar to cascades.)
  • Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; and Weinberger, K. Q., 2017. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, https://arxiv.org/abs/1703.09844 (Doing dynamic inference, early-exit & changing the features.)
  • Liu, L.; and Deng, J. 2018. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. In Thirty-Second AAAI Conference on Artificial Intelligence, https://arxiv.org/abs/1701.00299
  • M Piórczyński, F Szatkowski, K Bałazy, B Wójcik, 2023, Exploiting Transformer Activation Sparsity with Dynamic Inference https://arxiv.org/pdf/2310.04361.pdf
  • Francesco Ratto, Ángela Porras Máinez, Carlo Sau, Paolo Meloni, Gianfranco Deriu, Stefano Delucchi, Massimo Massa, Luigi Raffo, Francesca Palumbo, April 2023, An Automated Design Flow for Adaptive Neural Network Hardware Accelerators. Journal of Signal Processing Systems (2023): 1-23. https://link.springer.com/article/10.1007/s11265-023-01855-x (Adaptable inference for a CNN by dynamic modification of FPGA-accelerated hardware integrations.)
  • David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Li Yang, Zhezhi He, Yu Cao, Deliang Fan, Sep 2020, A Progressive Sub-Network Searching Framework for Dynamic Inference, https://arxiv.org/abs/2009.05681
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • J. Yu, T. Huang, Universally slimmable networks and improved training techniques, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1803–1811. doi:10.1109/ICCV.2019.00189. http://dx.doi.org/10.1109/ICCV.2019.00189
  • J. Yu, L. Yang, N. Xu, J. Yang, T. Huang, Slimmable neural networks, in: International Conference on Learning Representations, 2019. https://openreview.net/forum?id=H1gMCsAqY7
  • Z. Chen, Y. Li, S. Bengio, S. Si, You look twice: GaterNet for dynamic filter selection in CNNs, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9164–9172. doi:10.1109/CVPR.2019.00939. http://dx.doi.org/10.1109/CVPR.2019.00939
  • A. Kouris, S. Venieris, C.-S. Bouganis, A throughput-latency co-optimised cascade of convolutional neural network classifiers, IEEE, 2019. http://hdl.handle.net/10044/1/75445
  • E. S. Marquez, J. S. Hare, M. Niranjan, Deep cascade learning, IEEE Transactions on Neural Networks and Learning Systems 29 (11) (2018) 5475–5485. doi:10.1109/TNNLS.2018.2805098. http://dx.doi.org/10.1109/TNNLS.2018.2805098
  • A. Krizhevsky, V. Nair, G. Hinton, CIFAR-10 and CIFAR-100 (Canadian Institute for Advanced Research), http://www.cs.toronto.edu/~kriz/cifar.html, last access: 02/2020 (2020).
  • K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9) (2015) 1904–1916. doi:10.1109/TPAMI.2015.2389824. http://dx.doi.org/10.1109/TPAMI.2015.2389824
  • S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6) (2017) 1137–1149. doi:10.1109/TPAMI.2016.2577031. http://dx.doi.org/10.1109/TPAMI.2016.2577031
  • E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4) (2017) 640–651. doi:10.1109/TPAMI.2016.2572683. http://dx.doi.org/10.1109/TPAMI.2016.2572683

Uncommon Optimization Techniques

Some of the more theoretical and lesser known techniques include:

  • Weight clustering. This technique involves merging weights that have similar magnitude into "clusters" that use exactly the same weight instead. It is similar to a combination of quantization and pruning, and can augment either technique. See weight clustering research.
  • Approximate matrix multiplication. There are various algorithms for performing mathematical multiplication of matrices without actually using numeric multiplication. This is an area of active research that involves a crossover between high-end mathematics and computer algorithms. Several techniques show promise of fast calculations with an acceptable loss of accuracy. Read more about AI matrix algebra optimizations.
  • Logarithmic bitshift quantization (power-of-two). The coding optimization of using bit-shift operators to replace multiplication is well-known. Conceptually, this is a second-order quantization method: first convert floating-point model weights to integers (for integer multiplications), and then further convert those integer weights logarithmically to the nearest power of two. This allows the integer multiplications in vector and matrix operations to be replaced with integer bit-shift operations (see the sketch after this list). However, the trade-off is a greater loss of model accuracy than basic integer quantization. Read more about bitshift inference optimizations.
  • Additive and zero-multiplication neural networks. Various approaches to remove the multiplication bottleneck by replacing it with other arithmetic operators, including adder networks, bit shifting, and other low-level optimizations. See "zero-multiplication models".
  • Low-rank optimization. Optimize high-degree tensors to be lower-degree tensors using matrix factorization/decomposition. This is conceptually a type of large-scale pruning. Read more about low-rank decomposition and matrix algebra.
  • Faster multiplication algorithms. There are ways to do multiplication faster, although this research is mainly used by chip designers nowadays. Read more about multiplication mathematics.
  • Approximate multiplication. This is a low-level optimization of the multiplication operation itself, usually a more mathematical method than using bitshifts. Some of these methods have been used in quantization. Read more about advanced mathematical optimizations.
  • Matrix multiplication algorithms. Read more about optimizations involving matrix multiplication.
  • Advanced Numeric Representations. Various non-standard alternative methods to store numeric weights, making use of all the bits in a byte, going beyond the standard floating point or integer bit patterns. Read more about number systems.
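
As a concrete example of the bitshift idea in the list above, here is a minimal sketch of logarithmic (power-of-two) quantization: each integer weight is reduced to a sign plus the exponent of its nearest power of two, so the multiply in a dot product becomes an integer shift. This is illustrative only; real schemes also handle zero weights, scaling factors, and accuracy recovery, and the names are just for this example.

    #include <cmath>
    #include <cstdint>
    #include <cstdlib>

    // A nonzero integer weight is stored as a sign and the exponent of its
    // nearest power of two, so w is approximated as sign * (1 << exponent).
    struct Pow2Weight {
        int8_t sign;       // +1 or -1
        uint8_t exponent;  // log2 of the weight's magnitude, rounded
    };

    Pow2Weight quantize_pow2(int32_t w) {
        Pow2Weight p;
        p.sign = (w < 0) ? int8_t(-1) : int8_t(1);
        const int32_t mag = std::abs(w);
        p.exponent = (mag <= 1)   // note: a zero weight needs special handling, omitted here
            ? uint8_t(0)
            : static_cast<uint8_t>(std::lround(std::log2(static_cast<double>(mag))));
        return p;
    }

    // The "multiplication" of an activation by a power-of-two weight is a bit shift.
    int32_t mul_by_pow2(int32_t activation, Pow2Weight w) {
        return w.sign * (activation << w.exponent);
    }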

Survey Papers on Inference Optimization

Survey papers on inference optimization:

Inference Optimization (Generally)

Other general research papers on inference optimization:

Adaptive Inference

Adaptive inference is where the engine adapts its execution to the input, or to constraints such as its allowed computing budget. Some research papers on adaptive inference include:

  • Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
  • Jiawen Zhu, Xin Chen, Haiwen Diao, Shuai Li, Jun-Yan He, Chenyang Li, Bin Luo, Dong Wang, Huchuan Lu, 26 Mar 2024, Exploring Dynamic Transformer for Efficient Object Tracking, https://arxiv.org/abs/2403.17651 (Different reasoning routes for inputs in machine vision.)
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Describes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference in the width dimension.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters; done statically this is similar to a form of model compression, while elastic inference done dynamically is a type of adaptive inference.)
  • Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193 (Modifies its computation depending on the difficulty of each input token.)
  • F Ilhan, KH Chow, S Hu, T Huang, S Tekin, W Wei, 2024, Adaptive Deep Neural Network Inference Optimization with EENet, https://openaccess.thecvf.com/content/WACV2024/papers/Ilhan_Adaptive_Deep_Neural_Network_Inference_Optimization_With_EENet_WACV_2024_paper.pdf
  • Lu Hou, Lifeng Shang, Xin Jiang, and Qun Liu. 2020. Dynabert: Dynamic BERT with adaptive width and depth. arXiv preprint arXiv:2004.04037 https://arxiv.org/abs/2004.04037 Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
  • Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop performance, stochastic algorithms, approximate arithmetic, quantization, pruning and low-rank.)
  • Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang, Dec 2023, Green Edge AI: A Contemporary Survey, https://arxiv.org/abs/2312.00333
  • David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • M Omer Mohammed Elamin Elshaigi, 2023 Adaptive Deep Neural Networks for Human Pose Estimation on Autonomous Nano-Drones, Masters Thesis, PDF: https://webthesis.biblio.polito.it/secure/27689/1/tesi.pdf
  • Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
  • Max Sponner, Bernd Waschneck, Akash Kumar, 14 May 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, Volume 56, Issue 10, Article No.: 248, Pages 1 - 40, https://doi.org/10.1145/3657283 https://dl.acm.org/doi/abs/10.1145/3657283
  • Keivan Alizadeh, Iman Mirzadeh, Hooman Shahrokhi, Dmitry Belenko, Frank Sun, Minsik Cho, Mohammad Hossein Sekhavat, Moin Nabi, Mehrdad Farajtabar, 1 Oct 2024, Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models, https://arxiv.org/abs/2410.10846
  • Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
  • Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
  • Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
  • Rei Barjami, Antonio Miele, and Luca Mottola. 2024. Intermittent Inference: Trading a 1% Accuracy Loss for a 1.9x Throughput Speedup. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 647–660. https://doi.org/10.1145/3666025.3699364 https://dl.acm.org/doi/abs/10.1145/3666025.3699364 https://dl.acm.org/doi/pdf/10.1145/3666025.3699364
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
  • Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Sep 2024, ELMS: Elasticized Large Language Models On Mobile Devices, https://arxiv.org/abs/2409.09071
  • Yucheng Xing, Xin Wang, 19 Nov 2024, Puppet-CNN: Input-Adaptive Convolutional Neural Networks with Model Compression using Ordinary Differential Equation, https://arxiv.org/abs/2411.12876

Distributed Inference

Computational Complexity Analysis of LLM Inference

More Deep AI Research Areas

Read more about: