Aussie AI
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.

Inference Optimization
Optimizing the inference algorithms of AI models is the primary mechanism for providing fast response times and scalable throughput of AI requests from users. There is an extensive body of research literature on techniques for optimizing model inference. Some of the main techniques include:
- Hardware acceleration (GPU and non-GPU)
- Parallelization (vectorization)
- Software acceleration (e.g. pipelining, marshaling, kernel fusion)
- Memory optimizations (reducing dataflow and memory-to-cache transfers)
- Model compilation (graph compilers / deep learning compilers)
- Transformer-specific optimization techniques (e.g. shallow decoder architectures)
- Model compression (e.g. quantization, pruning, distillation)
- Advanced mathematical algorithms and matrix algebra
- General code optimizations including caching, precomputation, approximations, and inference loop optimizations
For more techniques, see the remaining chapters of this section and our “Long list of Transformer Optimization Methods” at https://www.aussieai.com/research/list
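As a concrete illustration of one item on this list, here is a minimal C++ sketch of symmetric INT8 weight quantization, one form of model compression. The function names (quantize_int8, dot_int8) are illustrative inventions for this sketch rather than any particular library's API, and a production kernel would typically quantize the activations as well and use vectorized integer arithmetic throughout.

    // Minimal sketch of symmetric INT8 weight quantization (model compression).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Quantize FP32 weights to INT8 using a single symmetric scale factor.
    std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float& scale_out) {
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
        scale_out = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
        std::vector<int8_t> q(weights.size());
        for (size_t i = 0; i < weights.size(); ++i) {
            int v = static_cast<int>(std::round(weights[i] / scale_out));
            q[i] = static_cast<int8_t>(std::clamp(v, -127, 127));
        }
        return q;
    }

    // Dot product of quantized weights with FP32 activations, rescaling at the end.
    // The INT8 weights use one quarter of the memory of the original FP32 weights.
    float dot_int8(const std::vector<int8_t>& qw, float scale, const std::vector<float>& x) {
        float sum = 0.0f;
        for (size_t i = 0; i < qw.size(); ++i) {
            sum += static_cast<float>(qw[i]) * x[i];
        }
        return sum * scale;
    }

    int main() {
        std::vector<float> weights = {0.12f, -0.5f, 0.33f, 0.9f};
        std::vector<float> activations = {1.0f, 2.0f, -1.0f, 0.5f};
        float scale = 1.0f;
        std::vector<int8_t> qw = quantize_int8(weights, scale);
        std::cout << "Quantized dot product: " << dot_int8(qw, scale, activations) << "\n";
        return 0;
    }

Even this naive version stores the weights in a quarter of the memory required for FP32, which directly reduces the memory-to-cache transfers noted in the list above, at the cost of a small rounding error bounded by the scale factor.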
Survey research papers. Research papers that survey the many types of inference optimizations include:
- Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami (2023), Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
- Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer (2021), A Survey of Quantization Methods for Efficient Neural Network Inference, June 2021, arXiv:2103.13630 [cs], https://arxiv.org/abs/2103.13630
- Nebuly (2023), Full Stack Optimization of Transformer Inference: a Survey, Part 2 on Transformer Optimization, A Paper Overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li (2021), A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper.)
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani (2023), A Survey of Techniques for Optimizing Transformer Inference, July 2023, https://arxiv.org/abs/2307.07982
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler (2022), Efficient transformers: A survey (v2), arXiv preprint arXiv:2009.06732, https://arxiv.org/abs/2009.06732
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang (2023), A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
- Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud (2021), Accelerating deep neural networks implementation: A survey, March 2021, https://doi.org/10.1049/cdt2.12016, PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016 (Survey of various techniques including hardware acceleration and pruning.)
- Md. Maruf Hossain Shuvo, Syed Kamrul Islam, Jianlin Cheng, Bashir I. Morshed (2023), Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review, Proceedings of the IEEE (Volume 111, Issue 1, January 2023), https://ieeexplore.ieee.org/document/9985008, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008 (Extensive 2023 survey of inference optimization in general and specifically on edge platforms.)
- Y Wang, Y Han, C Wang, S Song, Q Tian, G Huang (2023), Computation-efficient Deep Learning for Computer Vision: A Survey, arXiv preprint arXiv:2308.13998, https://arxiv.org/abs/2308.13998
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel (2022), Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp. 1–36, https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737
- Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun (2023), A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT, May 2023, https://arxiv.org/abs/2302.09419
- Q Fournier, GM Caron, D Aloise (2023), A practical survey on faster and lighter transformers, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3586074, https://arxiv.org/abs/2103.14636
- V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017), Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proc. IEEE 105, 12 (2017), 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740 (Good paper from 2017.)
- Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang (2020), Pre-trained Models for Natural Language Processing: A Survey, SCIENCE CHINA Technological Sciences 63, 10 (2020), 1872–1897. https://doi.org/10.1007/s11431-020-1647-3, https://arxiv.org/abs/2003.08271 (Good survey of Transformer architectures in 2020.)
- Kah Phooi Seng, Li-Minn Ang (2022), Embedded Intelligence: State-of-the-Art and Research Challenges, IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683, PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
- Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz (2023), Efficient methods for natural language processing: A survey, 2023, https://arxiv.org/abs/2209.00099, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00577/116725 (Extensive survey from 2023 covering many optimization techniques.)
- G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri (2023), Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (Good survey, but specific to number systems.)
- Canwen Xu, Julian McAuley (2022), A Survey on Model Compression and Acceleration for Pretrained Language Models, Nov 2022, https://arxiv.org/abs/2202.07105
- G Menghani (2023), Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
- Anouar Nechi, Lukas Groth, Saleh Mulhem, Farhad Merchant, Rainer Buchty, Mladen Berekovic (2023), FPGA-based Deep Learning Inference Accelerators: Where Are We Standing? ACM Transactions on Reconfigurable Technology and Systems, July 2023, https://doi.org/10.1145/3613963, https://dl.acm.org/doi/10.1145/3613963, PDF: https://dl.acm.org/doi/pdf/10.1145/3613963
- L Papa, P Russo, I Amerini, L Zhou (2023), A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, Sep 2023, arXiv preprint arXiv:2309.02031, https://arxiv.org/abs/2309.02031
- Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. (2022), Dynamic neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, pages 7436–7456, IEEE Computer Society, Los Alamitos, CA, USA, https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
- Li, Z., Liu, F., Yang, W., Peng, S., and Zhou, J. (2022), A survey of convolutional neural networks: Analysis, applications, and prospects, IEEE Transactions on Neural Networks and Learning Systems, 33(12):6999–7019. https://arxiv.org/abs/2004.02806 (2020 version), https://ieeexplore.ieee.org/document/9451544 (A survey of CNNs.)
- Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully (2023), Enabling All In-Edge Deep Learning: A Literature Review, IEEE Access, vol.11, pp.3431-3460, 2023, https://ieeexplore.ieee.org/document/10007810, https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
- Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015), A critical review of recurrent neural networks for sequence learning, https://arxiv.org/abs/1506 (This 2015 survey of RNNs and LSTMs has some interesting perspectives.)
- W Li, H Hacid, E Almazrouei, M Debbah (2023), A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing inference for running on edge servers; also training on edge servers.)
- JA Chen, W Niu, B Ren, Y Wang, X Shen (2023), Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, 2023, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Various optimizations to skip or reuse computations or similar data.)
- M Capra, B Bussolino, A Marchisio, M Shafique (2020), An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, 2020, https://www.mdpi.com/1999-5903/12/7/113/pdf
- J Zhong, Z Liu, X Chen (2023), Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, Apr 2023, https://arxiv.org/abs/2304.10891
For more general research on inference optimizations, refer to https://www.aussieai.com/research/inference-optimization.