Inference Optimization

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Optimizing the inference algorithms of AI models is the primary way to achieve fast response times and scalable throughput for user requests. There is an extensive research literature on techniques for optimizing model inference. Some of the main techniques include (two are sketched in code below):

  • Hardware acceleration (GPU and non-GPU)
  • Parallelization (vectorization)
  • Software acceleration (e.g. pipelining, marshaling, kernel fusion)
  • Memory optimizations (reducing dataflow and memory-to-cache transfers)
  • Model compilation (graph compilers / deep learning compilers)
  • Transformer-specific optimization techniques (e.g. shallow decoder architectures)
  • Model compression (e.g. quantization, pruning, distillation)
  • Advanced mathematical algorithms and matrix algebra
  • General code optimizations including caching, precomputation, approximations, and inference loop optimizations
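
To make one entry in this list concrete, below is a minimal sketch of weight quantization, one of the model compression techniques listed above. It assumes symmetric per-tensor INT8 quantization with a single shared scale; the function names and constants are illustrative only, not code from a real inference engine, which would typically use per-channel scales, zero-points, and vectorized integer kernels.

    // Minimal sketch of post-training INT8 weight quantization (illustrative only).
    #include <algorithm>  // std::max
    #include <cmath>      // std::fabs, std::lround
    #include <cstdint>    // int8_t
    #include <cstdio>
    #include <vector>

    // Choose a scale so that the largest-magnitude weight maps to +/-127.
    float compute_scale(const std::vector<float>& w) {
        float max_abs = 0.0f;
        for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
        return (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
    }

    // Quantize FP32 weights to INT8 using the shared scale.
    std::vector<int8_t> quantize(const std::vector<float>& w, float scale) {
        std::vector<int8_t> q(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            q[i] = static_cast<int8_t>(std::lround(w[i] / scale));
        return q;
    }

    // Dot product with quantized weights: multiply each INT8 weight by the FP32
    // activation and dequantize once at the end. (A real INT8 kernel would also
    // quantize the activations and accumulate in int32.)
    float dot_quantized(const std::vector<int8_t>& qw, const std::vector<float>& x,
                        float scale) {
        float sum = 0.0f;
        for (size_t i = 0; i < qw.size(); ++i)
            sum += static_cast<float>(qw[i]) * x[i];
        return sum * scale;
    }

    int main() {
        std::vector<float> w = {0.12f, -0.5f, 0.33f, 0.9f};
        std::vector<float> x = {1.0f, 2.0f, -1.0f, 0.5f};
        float scale = compute_scale(w);
        std::vector<int8_t> qw = quantize(w, scale);
        std::printf("approx dot product = %f\n", dot_quantized(qw, x, scale));
        return 0;
    }

The guaranteed win here is memory: the weights shrink to a quarter of their FP32 size. Any speedup depends on using integer SIMD instructions, which this scalar sketch deliberately omits.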

For more techniques, see the remaining chapters of this section and our “Long list of Transformer Optimization Methods” at https://www.aussieai.com/research/list
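
Similarly, the "caching, precomputation, approximations" item above can be as simple as building a lookup table for an expensive activation function once at load time, then replacing every later call with a table read. The sketch below is a hedged example for GELU; the table size, clamping range, and nearest-sample lookup (no interpolation) are illustrative assumptions, not recommendations.

    // Minimal sketch of precomputation: replace repeated GELU evaluations with a
    // lookup table built once at startup (illustrative table size and range).
    #include <cmath>
    #include <cstdio>
    #include <vector>

    constexpr int   kTableSize = 4096;   // number of precomputed samples (assumed)
    constexpr float kMinX = -8.0f;       // inputs below this clamp to the table edge
    constexpr float kMaxX = +8.0f;       // inputs above this clamp to the table edge

    // Reference GELU (tanh approximation): the "slow" function being precomputed.
    float gelu(float x) {
        return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
    }

    // Build the table once; the cost is paid at model load time, not per token.
    std::vector<float> build_gelu_table() {
        std::vector<float> table(kTableSize);
        for (int i = 0; i < kTableSize; ++i) {
            float x = kMinX + (kMaxX - kMinX) * i / (kTableSize - 1);
            table[i] = gelu(x);
        }
        return table;
    }

    // Fast approximate GELU: clamp, scale to an index, and read the table.
    float gelu_lookup(const std::vector<float>& table, float x) {
        if (x <= kMinX) return table.front();
        if (x >= kMaxX) return table.back();
        int i = static_cast<int>((x - kMinX) / (kMaxX - kMinX) * (kTableSize - 1));
        return table[i];
    }

    int main() {
        std::vector<float> table = build_gelu_table();
        std::printf("gelu(1.0): exact=%f table=%f\n", gelu(1.0f), gelu_lookup(table, 1.0f));
        return 0;
    }

Whether the table actually beats computing tanh directly is hardware-dependent (a vectorized tanh may win on modern CPUs), so this kind of approximation needs benchmarking and an accuracy check before use.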

Survey research papers. Research papers that survey the many types of inference optimizations include:

  1. Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami (2023), Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
  2. Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer (2021), A Survey of Quantization Methods for Efficient Neural Network Inference, June 2021, arXiv:2103.13630 [cs], https://arxiv.org/abs/2103.13630
  3. Nebuly (2023), Full Stack Optimization of Transformer Inference: a Survey, Part 2 on Transformer Optimization, A Paper Overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
  4. Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li (2021), A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper.)
  5. Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani (2023), A Survey of Techniques for Optimizing Transformer Inference, July 2023, https://arxiv.org/abs/2307.07982
  6. Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler (2022), Efficient transformers: A survey (v2), arXiv preprint arXiv:2009.06732, https://arxiv.org/abs/2009.06732
  7. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang (2023), A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
  8. Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud (2021), Accelerating deep neural networks implementation: A survey, March 2021, https://doi.org/10.1049/cdt2.12016, PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016 (Survey of various techniques including hardware acceleration and pruning.)
  9. Md. Maruf Hossain Shuvo, Syed Kamrul Islam, Jianlin Cheng, Bashir I. Morshed (2023), Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review, Proceedings of the IEEE (Volume 111, Issue 1, January 2023), https://ieeexplore.ieee.org/document/9985008, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008 (Extensive 2023 survey of inference optimization in general and specifically on edge platforms.)
  10. Y Wang, Y Han, C Wang, S Song, Q Tian, G Huang (2023), Computation-efficient Deep Learning for Computer Vision: A Survey, arXiv preprint arXiv:2308.13998, https://arxiv.org/abs/2308.13998
  11. Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel (2022), Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp. 1–36, https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737
  12. Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun (2023), A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT, May 2023, https://arxiv.org/abs/2302.09419
  13. Q Fournier, GM Caron, D Aloise (2023), A practical survey on faster and lighter transformers, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3586074, https://arxiv.org/abs/2103.14636
  14. V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017), Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proc. IEEE 105, 12 (2017), 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740 (Good paper from 2017.)
  15. Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang (2020), Pre-trained Models for Natural Language Processing: A Survey, SCIENCE CHINA Technological Sciences 63, 10 (2020), 1872–1897. https://doi.org/10.1007/s11431-020-1647-3, https://arxiv.org/abs/2003.08271 (Good survey of Transformer architectures in 2020.)
  16. Kah Phooi Seng, Li-Minn Ang (2022), Embedded Intelligence: State-of-the-Art and Research Challenges, IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683, PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
  17. Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz (2023), Efficient methods for natural language processing: A survey, 2023, https://arxiv.org/abs/2209.00099, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00577/116725 (Extensive survey from 2023 covering many optimization techniques.)
  18. G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri (2023), Number Systems for Deep Neural Network Architectures: A Survey, https://arxiv.org/abs/2307.05035 (Good survey, but specific to number systems.)
  19. Canwen Xu, Julian McAuley (2022), A Survey on Model Compression and Acceleration for Pretrained Language Models, Nov 2022, https://arxiv.org/abs/2202.07105
  20. G Menghani (2023), Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
  21. Anouar Nechi, Lukas Groth, Saleh Mulhem, Farhad Merchant, Rainer Buchty, Mladen Berekovic (2023), FPGA-based Deep Learning Inference Accelerators: Where Are We Standing? ACM Transactions on Reconfigurable Technology and Systems, July 2023, https://doi.org/10.1145/3613963, https://dl.acm.org/doi/10.1145/3613963, PDF: https://dl.acm.org/doi/pdf/10.1145/3613963
  22. L Papa, P Russo, I Amerini, L Zhou (2023), A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, Sep 2023, arXiv preprint arXiv:2309.02031, https://arxiv.org/abs/2309.02031
  23. Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. (2022), Dynamic neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, pages 7436–7456, IEEE Computer Society, Los Alamitos, CA, USA. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
  24. Li, Z., Liu, F., Yang, W., Peng, S., and Zhou, J. (2022), A survey of convolutional neural networks: Analysis, applications, and prospects, IEEE Transactions on Neural Networks and Learning Systems, 33(12):6999–7019. https://arxiv.org/abs/2004.02806 (2020 version), https://ieeexplore.ieee.org/document/9451544 (A survey of CNNs.)
  25. Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully (2023), Enabling All In-Edge Deep Learning: A Literature Review, IEEE Access, vol.11, pp.3431-3460, 2023. https://ieeexplore.ieee.org/document/10007810, https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
  26. Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015), A critical review of recurrent neural networks for sequence learning, https://arxiv.org/abs/1506 (This 2015 survey of RNNs and LSTMs has some interesting perspectives.)
  27. W Li, H Hacid, E Almazrouei, M Debbah (2023), A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing inference for running on edge servers; also training on edge servers.)
  28. JA Chen, W Niu, B Ren, Y Wang, X Shen (2023), Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, 2023, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Various optimizations to skip or reuse computations or similar data.)
  29. M Capra, B Bussolino, A Marchisio, M Shafique (2020), An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, 2020, https://www.mdpi.com/1999-5903/12/7/113/pdf
  30. J Zhong, Z Liu, X Chen (2023), Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, Apr 2023, https://arxiv.org/abs/2304.10891

For more general research on inference optimizations, refer to https://www.aussieai.com/research/inference-optimization.

 
