Aussie AI
Multi-Model Deployment
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
When running AI engines in production, a server typically has multiple models performing inference, and it must decide how to allocate incoming queries to those models efficiently. This problem has been solved many times in industry. If the architecture is simply multiple copies of the same model running on many servers, then it isn't really an AI problem at all: it's the well-known scheduling problem of dispatching user requests across multiple servers, as sketched in the example below. Nevertheless, AI engines are a quirky type of server application, and there are several research papers on the practical aspects of deploying and managing multiple AI models on cloud servers.
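As a purely illustrative C++ sketch of that dispatching problem (the endpoint names are hypothetical and not from any particular serving framework), a minimal least-loaded dispatcher for identical model replicas might look like this:

    // Purely illustrative sketch: a least-loaded dispatcher for identical
    // model replicas. Endpoint names are hypothetical, and a production
    // scheduler would add batching, timeouts, health checks, and failover.
    #include <atomic>
    #include <memory>
    #include <string>
    #include <vector>

    // One inference server running a copy of the model.
    struct ModelReplica {
        std::string endpoint;          // e.g. "gpu-host-3:8080" (hypothetical)
        std::atomic<int> in_flight{0}; // queries currently being served
        explicit ModelReplica(std::string ep) : endpoint(std::move(ep)) {}
    };

    // Routes each query to the replica with the fewest in-flight queries.
    class LeastLoadedDispatcher {
    public:
        explicit LeastLoadedDispatcher(const std::vector<std::string>& endpoints) {
            for (const auto& ep : endpoints)
                replicas_.push_back(std::make_unique<ModelReplica>(ep));
        }

        // Assumes at least one replica; the load comparison is a heuristic
        // and tolerates benign races between concurrent callers.
        ModelReplica& choose() {
            ModelReplica* best = replicas_.front().get();
            for (const auto& r : replicas_)
                if (r->in_flight.load() < best->in_flight.load())
                    best = r.get();
            return *best;
        }

    private:
        std::vector<std::unique_ptr<ModelReplica>> replicas_;
    };

    int main() {
        LeastLoadedDispatcher dispatcher({"gpu-host-1:8080", "gpu-host-2:8080"});
        ModelReplica& r = dispatcher.choose();
        r.in_flight++;   // send the query to r.endpoint here (not shown)
        // ... wait for the inference result, then:
        r.in_flight--;
        return 0;
    }

Real serving systems layer request batching, autoscaling, and failure handling on top of a basic placement policy like this one.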
Research papers on AI multi-model deployment optimization:
- Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Mahmut Taylan Kandemir, and Chita R Das. Cocktail: A multidimensional optimization for model serving in cloud, In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), April 2022. PDF: https://www.usenix.org/system/files/nsdi22spring_prepub_gunasekaran.pdf, Code: https://github.com/jashwantraj92/cocktail (Serving framework for scheduling and serving queries from multiple ensemble models.)
- Francisco Romero, Qian Li, Neeraja J Yadwadkar, and Christos Kozyrakis, INFaaS: Automated model-less inference serving, In 2021 USENIX Annual Technical Conference (USENIX ATC 21), July 2021, https://www.usenix.org/conference/atc21/presentation/romero (Choosing models when serving queries from multiple ensemble models.)
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Training a new model by using another model to automatically create the data set on which to train it.)
- Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel, March 2023, PETALS: Collaborative Inference and Fine-tuning of Large Models, https://arxiv.org/abs/2209.01188, Code: https://petals.ml/ (Swarm deployment that shares the load to multiple servers.)
- Y Liu, C Wang, Y Xiao, Z Lv, L Xiao, X Ji, 2023, Collaborative Inference for MEC Services Based on Multimodal Deep Neural Network, 2023 IEEE/CIC International Conference on Communications in China (ICCC) https://ieeexplore.ieee.org/abstract/document/10233276
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 11, “Efficient Edge Inference by Selective Query (Hybrid Models)”.)
- Li, M., Li, Y., Tian, Y., Jiang, L., and Xu, Q. (2021). AppealNet: An efficient and highly-accurate edge/cloud collaborative architecture for DNN inference, In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 409–414. URL: https://ieeexplore.ieee.org/document/9586176, PDF: https://arxiv.org/pdf/2105.04104v2.pdf
- Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. (2017). Neurosurgeon: Collaborative intelligence between the cloud and mobile edge, SIGPLAN Notices, 52(4):615–629. https://doi.org/10.1145/3093336.3037698
- Letian Zhang, Lixing Chen, Jie Xu, Feb 2021, Autodidactic Neurosurgeon: Collaborative Deep Inference for Mobile Edge Intelligence via Online Learning, https://arxiv.org/abs/2102.02638
- Y. Long, I. Chakraborty, and K. Roy, 2020, Conditionally deep hybrid neural networks across edge and cloud, arXiv:2005.10851, https://arxiv.org/abs/2005.10851
- Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully, 2023, Enabling All In-Edge Deep Learning: A Literature Review, IEEE Access, vol.11, pp.3431-3460, 2023. https://ieeexplore.ieee.org/document/10007810 https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
- E Samikwa, A Di Maio, T Braun, 2023, DISNET: Distributed Micro-Split Deep Learning in Heterogeneous Dynamic IoT, IEEE internet of things journal, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10243578 (Partitioning methods for a model split over multiple distributed servers.)
- Daniele Jahier Pagliari, Roberta Chiaro, Enrico Macii, Massimo Poncino, 2021, CRIME: Input-Dependent Collaborative Inference for Recurrent Neural Networks, IEEE Transactions on Computers, vol.70, no.10, pp.1626-1639, 2021. https://ieeexplore.ieee.org/document/9184963 (Collaborative inference by sharing workload to multiple devices.)
- Y Zhang, Z Zhang, W Bao, D Yuan, 2023, ITIF: Integrated Transformers Inference Framework for Multiple Tenants on GPU, ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 112–121, https://doi.org/10.1145/3605573.3605585, https://dl.acm.org/doi/abs/10.1145/3605573.3605585
For more research papers on the deployment of multi-model architectures, see https://www.aussieai.com/research/ensemble#deployment.
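Several of the papers above (e.g., Cocktail and INFaaS) go further and choose between model variants with different accuracy and latency trade-offs. The following C++ sketch illustrates that general idea only; it is not the actual algorithm from any cited paper, and the variant names and profiled numbers are hypothetical. Given a profiled latency and accuracy for each variant, it picks the most accurate variant that fits a per-query latency budget:

    // Purely illustrative sketch: choose the most accurate model variant
    // that fits a per-query latency budget. Variant names and the profiled
    // numbers below are hypothetical, not taken from any cited paper.
    #include <optional>
    #include <string>
    #include <vector>

    struct ModelVariant {
        std::string name;   // e.g. "medium-fp16" (hypothetical)
        double latency_ms;  // profiled average inference latency
        double accuracy;    // profiled accuracy on a validation set
    };

    // Returns the most accurate variant within the latency budget,
    // or an empty optional if none fits.
    std::optional<ModelVariant> select_variant(
            const std::vector<ModelVariant>& variants, double budget_ms) {
        std::optional<ModelVariant> best;
        for (const auto& v : variants) {
            if (v.latency_ms <= budget_ms && (!best || v.accuracy > best->accuracy))
                best = v;
        }
        return best;
    }

    int main() {
        std::vector<ModelVariant> zoo = {
            {"small-int8", 40.0, 0.71},
            {"medium-fp16", 120.0, 0.78},
            {"large-fp16", 450.0, 0.83},
        };
        // A 150 ms budget selects "medium-fp16", the most accurate fit.
        auto chosen = select_variant(zoo, 150.0);
        return chosen ? 0 : 1;
    }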