Aussie AI
Ensemble Multi-Model AI
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
If one AI engine is amazing, imagine what two could do. Or ten. Or a hundred.
The idea of using two or more AI engines together to complete a task is not new. This area of research is called "ensemble learning," and the resulting systems are often called "multi-model" engines.
There are many ways that AI engines could cooperate to achieve more than one would alone. This is an area ripe for exploration, where we have only scratched the surface of possibilities. On the other hand, with today's high cost of GPUs limiting what can be done in both AI inference and training, the full realization of ensemble AI algorithms is still in the distant future.
One way in which two AI models work together has become common in practice: using the output of one model as input text for the training data set of a new model. This has been an effective technique for improving downstream models, but it isn't usually classed as an ensemble algorithm; see Honovich et al. (2022) for a paper on this approach. The idea is similar to Knowledge Distillation, but differs in that its goal isn't to create a cut-down smaller model, but usually to improve the accuracy of a large model.
Existing Ensemble AI Algorithms
Despite the costs, there are a number of areas of ensemble modeling with existing research, and even a significant number of practical use cases.
- Generative Adversarial Networks (GANs). These use two AI models against each other, one being critical of the output of the other, to improve overall results. This is an architecture that has proven very successful in practice.
- Knowledge Distillation (KD). This is an optimization method whereby a large model is built first, and then it is used to "distill" its knowledge into a smaller model, as a "teacher" model to a "student" model. Two models are used in training, but only the smaller model is used for inference. Note that there are also various more advanced forms of "ensemble distillation" (multi-teacher) that involve two or more teachers, plus the student model. See knowledge distillation.
- Cascade inference optimizations. Cascade optimizations involve the selection of different models, or paths through multiple models, as a type of inference optimization. Two or more models are used in the inference phase, and there are various methods for deciding at runtime which to choose. See cascades.
- Big-small models. In a specific subtype of the cascade method called big-small models, there are two differently-trained models. During inference, a heuristic decides which model to invoke: a faster "small" model handles the common cases, while the slower "big" model is only needed for the rarer, harder cases. This can improve both inference latency and total throughput.
- Speculative decoding. This method is similar to the big-little architecture, but differs because all queries initially go to the small model. The small, faster model does its best to suggest output tokens (i.e., it "speculates"), and then a bigger model is used to "verify" their correctness. When the bigger model has to override the smaller model, the process is slower, but usually the smaller model is accurate enough that the whole process is faster on average, with accuracy close to using the larger model alone (a sketch of the draft-and-verify loop appears after this list). Read more about: speculative decoding.
- Collaborative inference. This is where two or more models combine to complete inference. They could be on separate servers, or together on one server. See collaborative inference research.
- Multi-model training. There are various other methods of using an ensemble technique to train better models. Some of the techniques are called: bagging, boosting, stacking, and many variants. There is also the simple method of training a model on a data set generated from the outputs of another model.
- Multiple parallel models inference. In some architectures, multiple models can process the same input data in parallel. Each model produces its own result, and the resulting ensemble model then has to choose an overall result. The algorithm to decide amongst the multiple options could use the maximum (or minimum), a majority vote, weighted averages, or many other combinations (see the majority-vote sketch after this list).
- Hybrid dual transformer architectures. Rather than duplicating entire Transformer models, there has been research on adding extra components to the basic Transformer architecture, such as two heads or two encoders merged together. See Transformer architectures.
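To make the draft-and-verify control flow of speculative decoding concrete, below is a minimal C++ sketch of the greedy variant. The Model interface, the draft length k, and the greedy acceptance rule are illustrative assumptions, not any particular library's API; real implementations verify all drafted tokens in a single batched forward pass of the big model, which is where the speedup comes from.

    #include <vector>

    // Hypothetical minimal model interface (an assumption for illustration).
    struct Model {
        // Returns the greedy next token for the given context.
        virtual int next_token(const std::vector<int>& context) = 0;
        virtual ~Model() = default;
    };

    // Greedy speculative decoding sketch: the small model drafts k tokens,
    // the big model verifies them, and the longest matching prefix is kept.
    std::vector<int> speculative_decode(Model& small_model, Model& big_model,
                                        std::vector<int> context,
                                        int max_new_tokens, int k = 4) {
        int generated = 0;
        while (generated < max_new_tokens) {
            // 1. Draft: the small model speculates k tokens autoregressively.
            std::vector<int> draft;
            std::vector<int> draft_context = context;
            for (int i = 0; i < k; ++i) {
                int t = small_model.next_token(draft_context);
                draft.push_back(t);
                draft_context.push_back(t);
            }
            // 2. Verify: the big model checks each drafted token in turn.
            // (Shown sequentially here; batching this step is the speedup.)
            for (int i = 0; i < k && generated < max_new_tokens; ++i) {
                int big_choice = big_model.next_token(context);
                if (big_choice == draft[i]) {
                    context.push_back(draft[i]);    // accept the drafted token
                    ++generated;
                } else {
                    context.push_back(big_choice);  // override with big model's token
                    ++generated;
                    break;                          // discard the rest of the draft
                }
            }
        }
        return context;
    }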
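Similarly, here is a minimal sketch of combining parallel model outputs by majority vote, one of the combination rules mentioned above; the string result type and the tie-breaking rule (earliest result wins) are assumptions for illustration.

    #include <map>
    #include <string>
    #include <vector>

    // Majority-vote combiner for the results of an ensemble of parallel models.
    // Ties are broken in favor of the earliest result (an arbitrary choice).
    std::string majority_vote(const std::vector<std::string>& results) {
        std::map<std::string, int> counts;
        for (const auto& r : results) ++counts[r];
        std::string best;
        int best_count = 0;
        for (const auto& r : results) {  // iterate in arrival order
            if (counts[r] > best_count) {
                best = r;
                best_count = counts[r];
            }
        }
        return best;
    }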
Model Selection Algorithms
Model selection algorithms are dynamic inference optimizations where a choice is made between two or more models for execution. One example is "big-little" architectures (see below), where a heuristic attempts to send "easy" queries to a faster "little" model. Various other ensemble architectures are possible with multiple models. Another practical example is model selection at the deployment level, where the architecture decides which server to send a request to, and each server may host different models or instances of the same model. Other areas of research with similar aims include cascades and collaborative inference.
Model selection architectures are a general class of ensemble architectures where one of two or more models is "selected" to process a query. The general idea is that only one model actually processes the query, with a decision mechanism applied beforehand (which can be model-based or heuristic-based). Example sub-classes of model selection include big-little architectures, model routing, and cascades.
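As a concrete illustration of a heuristic-based decision mechanism, the C++ sketch below routes a query to one of two models using a crude difficulty proxy (query length plus a keyword check); the threshold and the heuristic itself are assumptions for illustration only, and real systems often use a trained classifier or reward model instead.

    #include <string>

    enum class ModelChoice { Small, Large };

    // Heuristic-based model selection: a cheap difficulty estimate decides
    // which model will process the query before any inference is run.
    ModelChoice select_model(const std::string& query) {
        const size_t kLengthThreshold = 200;  // assumed tuning parameter
        bool looks_hard = query.size() > kLengthThreshold
                          || query.find("explain") != std::string::npos;
        return looks_hard ? ModelChoice::Large : ModelChoice::Small;
    }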
Research on model selection architectures:
- Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J. Yadwadkar, Aditya Akella, 27 Oct 2023, MOSEL: Inference Serving Using Dynamic Modality Selection, https://arxiv.org/abs/2310.18481 (Multi-modal model with dynamic selection of modality.)
- M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
- Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
- Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang, Dec 2023, Green Edge AI: A Contemporary Survey, https://arxiv.org/abs/2312.00333
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
- Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
- Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
- Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
- Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
- Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
- Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
- Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
- Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
- Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
- Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
- Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
- Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
- Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
- Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
- Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
- KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
- David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
- Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
- Grant Wilkins, Srinivasan Keshav, Richard Mortier, 4 Jul 2024, Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems, https://arxiv.org/abs/2407.04014
Model Routing
Model routing is a generalization of model selection that also includes evaluation of issues such as serving and network costs. The idea is to select a model from a set (i.e., model selection), and then route the query over the network to wherever that model is being served. This could potentially include choosing between commercial and open source models, and many variations thereof.
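Below is a minimal sketch of one cost-aware routing policy, assuming a hypothetical registry of served models, each with a per-token cost and a benchmark-derived quality score; the scoring rule is illustrative, not a production policy.

    #include <string>
    #include <vector>

    // Hypothetical descriptor of a served model endpoint (illustrative only).
    struct ServedModel {
        std::string name;
        std::string endpoint_url;   // where the query is routed over the network
        double cost_per_1k_tokens;  // serving cost
        double quality_score;       // e.g., benchmark-derived accuracy estimate
    };

    // Route to the cheapest model whose quality clears a minimum bar.
    const ServedModel* route_query(const std::vector<ServedModel>& models,
                                   double min_quality) {
        const ServedModel* best = nullptr;
        for (const auto& m : models) {
            if (m.quality_score < min_quality) continue;
            if (!best || m.cost_per_1k_tokens < best->cost_per_1k_tokens)
                best = &m;
        }
        return best;  // nullptr if no model meets the quality bar
    }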
Research papers on model routing algorithms:
- M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
- Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
- Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, 21 Jul 2024 (v3), RouteLLM: Learning to Route LLMs with Preference Data, https://arxiv.org/abs/2406.18665
- Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
- Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 12 May 2024, MARS: A Benchmark for Multi-LLM Algorithmic Routing System, ICLR 2024, https://openreview.net/forum?id=C0rs3wM0N8 PDF: https://openreview.net/pdf?id=C0rs3wM0N8
- Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay, 28 Mar 2024 (v2), RouterBench: A Benchmark for Multi-LLM Routing System, https://arxiv.org/abs/2403.12031 https://github.com/withmartian/routerbench
- Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik, 2 Oct 2024, ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation, https://arxiv.org/abs/2410.01731 https://comfygen-paper.github.io/
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
- Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He, 23 Oct 2024 (v3), TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, https://arxiv.org/abs/2408.12320
- Zesen Zhao, Shuowei Jin, Z. Morley Mao, 23 Sep 2024, Eagle: Efficient Training-Free Router for Multi-LLM Inference, https://arxiv.org/abs/2409.15518
- Tao Feng, Yanzhen Shen, Jiaxuan You, 4 Oct 2024, GraphRouter: A Graph-based Router for LLM Selections, https://arxiv.org/abs/2410.03834 https://github.com/ulab-uiuc/GraphRouter
- Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar, 16 Aug 2024, SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models, https://arxiv.org/abs/2408.08545
- Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan, 24 Jul 2024 (v2), MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs, https://arxiv.org/abs/2407.10834
- Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou, 15 Nov 2023, Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, https://arxiv.org/abs/2311.08692
- Małgorzata Łazuka, Andreea Anghel, Thomas Parnell, 3 Oct 2024, LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services, https://arxiv.org/abs/2410.02425
- Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam, 28 Jun 2024 (v4), AutoMix: Automatically Mixing Language Models, https://arxiv.org/abs/2310.12963
- Josef Pichlmeier, Philipp Ross, Andre Luckow, 8 Oct 2024 (v2), Performance Characterization of Expert Router for Scalable LLM Inference, https://arxiv.org/abs/2404.15153
- Ou, Anthony C., Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
- KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar, 1 May 2024, Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing, https://arxiv.org/abs/2405.00467
- David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West, 16 Oct 2024, LLM Chain Ensembles for Scalable and Accurate Data Annotation, https://arxiv.org/abs/2410.13006
- Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C.S. Lui, 2 Oct 2024 (v2), Cost-Effective Online Multi-LLM Selection with Versatile Reward Models, https://arxiv.org/abs/2405.16587
- Arun Shankar, Oct 2024, Designing Cognitive Architectures: Agentic Workflow Patterns from Scratch, https://medium.com/google-cloud/designing-cognitive-architectures-agentic-workflow-patterns-from-scratch-63baa74c54bc
- Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
- Kirill Vasilevski, Dayi Lin, Ahmed Hassan, 14 Nov 2024, Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models, https://arxiv.org/abs/2411.09837
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu, 5 Dec 2024, Bench-CoE: a Framework for Collaboration of Experts from Benchmark, https://arxiv.org/abs/2412.04167 https://github.com/ZhangXJ199/Bench-CoE
Big-Little Transformer Models
Although many ensemble architectures spend extra computation to achieve more advanced capabilities, the idea of big-little or big-small architectures is to improve inference speed and throughput by sending common queries to a smaller model. The larger model is reserved for more difficult or rarer queries, which take longer. As such, it's an AI version of the "common case first" code optimization technique.
Note that "collaborative inference" (e.g. "parallel decoding" or "speculative decoding") is also conceptually a similar architecture, but differs because multiple models work together for inference, whereas pure big-little architectures choose the model at the start, and only one model does the inference. Also related are the various non-autoregressive architectures.
Research papers on big-little architectures:
- Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K., Big little transformer decoder, arXiv preprint arXiv:2302.07863, May 2023, https://arxiv.org/abs/2302.07863
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
- Leviathan, Y., Kalman, M., and Matias, Y., Fast inference from transformers via speculative decoding, May 2023, https://arxiv.org/abs/2211.17192
- Stern, M., Shazeer, N., and Uszkoreit, J., Nov 2018, Blockwise parallel decoding for deep autoregressive models, Advances in Neural Information Processing Systems, 31, https://arxiv.org/abs/1811.03115
- Z. Peng et al. 2018. AXNet: ApproXimate computing using an end-to-end trainable neural network. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) https://ieeexplore.ieee.org/document/8605388 (Ensemble dual-model method where one model is a fast approximation of the other.)
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, "Input Hardness Adaptive Models" for methods of running faster on easy image classification problems.)
- Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget. arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
- Park, E., Kim, D., Kim, S., Kim, Y.-D., Kim, G., Yoon, S., and Yoo, S. (2015). Big/little deep neural network for ultra low power inference. In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 124–132. https://ieeexplore.ieee.org/document/7331375
- D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
- Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
- H Malard, S Zaiem, R Algayres, 2023, Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences, arXiv preprint arXiv:2309.12712, https://arxiv.org/pdf/2309.12712.pdf (Big-little architecture for audio models.)
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
- Kaya Y., Hong S., Dumitras T., Shallow-deep networks: Understanding and mitigating network overthinking Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)
- Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Choose tiny or small models after an initial layer of the larger model, combining early exit with easy-hard queries for multi-model inference.)
- Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844 (Using a large model to train parallel decoding for a small language model.)
- Chia-Hsuan Lee, Hao Cheng, Mari Ostendorf, Nov 2023, OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking, https://arxiv.org/abs/2311.09758
- Zichao Shen, Neil Howard and Jose Nunez-Yanez, 2022, Big–Little Adaptive Neural Networks on Low-Power Near-Subthreshold Processors, J. Low Power Electron. Appl. 2022, 12(2), 28, https://doi.org/10.3390/jlpea12020028 https://www.mdpi.com/2079-9268/12/2/28 Code: https://github.com/DarkSZChao/Big-Little_NN_Strategies
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
- Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao, June 2024, Hybrid SLM and LLM for Edge-Cloud Collaborative Inference, EdgeFM ’24, June 3–7, 2024, Minato-ku, Tokyo, Japan, https://dl.acm.org/doi/pdf/10.1145/3662006.3662067 (Small model on edge devices with large model in the cloud, performing collaborative inference.)
- Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis, 10 Jul 2023 (v2), Contrastive Decoding: Open-ended Text Generation as Optimization, https://arxiv.org/abs/2210.15097
- Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines the idea of not always using the larger model to verify, and when to trust either the smaller or larger model, an idea that generalizes beyond speculative decoding.)
- Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, July 2024, Tandem Transformers for Inference Efficient LLMs, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42906-42917, 2024, https://proceedings.mlr.press/v235/s24a.html
- Ziheng Wang, Pedro Reviriego, Farzad Niknia, Javier Conde, Shanshan Liu, Fabrizio Lombardi, 26 Aug 2024, Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things, https://arxiv.org/abs/2408.14528 (Running a small quantized model and then determining whether to run the full non-quantized model.)
- Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah, 22 Apr 2024, Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, ICLR 2024, https://arxiv.org/abs/2404.14618
- J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
- Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang, 4 Oct 2024, Mixture of Attentions For Speculative Decoding, https://arxiv.org/abs/2410.03804
- He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong, 14 Oct 2024, big.LITTLE Vision Transformer for Efficient Visual Recognition, https://arxiv.org/abs/2410.10267
- Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
General Research on Ensemble Models
- Yoshitomo Matsubara, Luca Soldaini, Eric Lind, Alessandro Moschitti, Dec 2022, Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems, https://arxiv.org/abs/2201.05767
- Yungeng Zhang, Yuru Pei & Hongbin Zha, Learning Dual Transformer Network for Diffeomorphic Registration, Sep 2021, Medical Image Computing and Computer Assisted Intervention, MICCAI 2021, https://link.springer.com/chapter/10.1007/978-3-030-87202-1_13
- Xian-Feng Han, Yi-Fei Jin, Hui-Xian Cheng, Guo-Qiang Xiao, Dual Transformer for Point Cloud Analysis, Apr 2021, https://arxiv.org/abs/2104.13044
- Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei, 2023, Dual Vision Transformer, https://arxiv.org/pdf/2207.04976, Code: https://github.com/YehLi/ImageNetModel
- Mohammed Alhamid, Ensemble Models, March 2021, https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c
- Oliver R. A. Dunbar, Andrew B. Duncan, Andrew M. Stuart, Marie-Therese Wolfram, Jan 2022, Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods, https://arxiv.org/abs/2104.03384
- Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548, PDF: https://arxiv.org/pdf/2308.06548.pdf (Treating the internal components of a Transformer as if they are an ensemble model.)
- T. G. Dietterich. Ensemble methods in machine learning. In Multiple classifier systems, pages 1–15. Springer, 2000, Lecture Notes in Computer Science book series LNCS, volume 1857, https://link.springer.com/chapter/10.1007/3-540-45014-9_1, PDF: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e3b09a777c71a4b88888509ab9bfa12d8bf295ba (Early paper with ensemble idea applied to classifiers, rather than multi-model.)
- Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
- Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505 (Split computing is splitting the inference between server and edge machines.)
- L. Li, K. Ota and M. Dong, "Deep learning for smart industry: Efficient manufacture inspection system with fog computing", IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665-4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640 ("Fog computing" is like cloud computing but on servers "nearer" to the ground.)
- C. Lo, Y.-Y. Su, C.-Y. Lee and S.-C. Chang, "A dynamic deep neural network design for efficient workload allocation in edge computing", Proc. IEEE Int. Conf. Comput. Design (ICCD), pp. 273-280, Nov. 2017. https://ieeexplore.ieee.org/document/8119222
- G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
- Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A. The right tool for the job: Matching model and instance complexities. In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with "wisdom of committees" decisions.)
- Naftaly, U., N. Intrator, and D. Horn. "Optimal ensemble averaging of neural networks." Network: Computation in Neural Systems 8, no. 3 (1997): 283–296. https://www.tau.ac.il/~horn/publications/optimal.pdf
- Y. Liu and X. Yao, Ensemble Learning via Negative Correlation, Neural Networks, Volume 12, Issue 10, December 1999, pp. 1399-1404. doi:10.1016/S0893-6080(99)00073-8, https://www.sciencedirect.com/science/article/abs/pii/S0893608099000738
- Z.S.H. Chan; N. Kasabov, 2005, Fast neural network ensemble learning via negative-correlation data correction, IEEE Transactions on Neural Networks, Volume 16, Issue 6, November 2005, https://ieeexplore.ieee.org/document/1528547
- E Diao, 2023, Efficient and Collaborative Methods for Distributed Machine Learning, Ph.D. thesis, Department of Electrical and Computer Engineering Duke University, https://www.proquest.com/openview/410ea5eb4275fded25890f04c96a902e/1?pq-origsite=gscholar&cbl=18750&diss=y
- X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707 (Multiple submodels inside a large model.)
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from that model, using FFNs with a subset of parameters; if done statically this is similar to a form of model compression, and elastic inference done dynamically is a type of adaptive inference.)
- NVIDIA, Aug 2023, Triton Architecture, NVIDIA Triton Inference Server user guide documentation, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html
- Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, and Haizhou Li, “LightHuBERT: Lightweight and configurable speech representation learning with once-for-all hidden-unit BERT,” in Interspeech, June 2022, https://arxiv.org/abs/2203.15610 2022.
- Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà, May 2023, Accelerating Transformer Inference for Translation via Parallel Decoding, https://arxiv.org/abs/2305.10427
- Meng Wang; Liang Qian; Na Meng; Yusong Cheng; Weiwei Fang, Nov 2023, Model Parallelism Optimization for Distributed DNN Inference on Edge Devices, 2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), https://ieeexplore.ieee.org/abstract/document/10391646 (Distributes inference across multiple edge devices at the layer level, with further optimization using layer fusion.)
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
- Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, “Mobile-Former: Bridging mobilenet and transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279. https://arxiv.org/abs/2108.05895
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
- Li Yang, Zhezhi He, Yu Cao, Deliang Fan, Sep 2020, A Progressive Sub-Network Searching Framework for Dynamic Inference, https://arxiv.org/abs/2009.05681
- Kah Phooi Seng, Li-Minn Ang, 2022, "Embedded Intelligence: State-of-the-Art and Research Challenges", IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683 PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
- Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou, 7 Jun 2024, Mixture-of-Agents Enhances Large Language Model Capabilities, https://arxiv.org/abs/2406.04692
- Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex: Pushing the boundaries of complex reasoning through multi model collaboration,” arXiv preprint arXiv:2310.00280, 2023. https://arxiv.org/abs/2310.00280
- Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
- Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
- Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
- Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang, 6 Oct 2024, EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM? https://arxiv.org/abs/2410.04571
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
- Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
- Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin, 7 Nov 2024, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. https://arxiv.org/abs/2411.04996
- Yingxuan Yang, Qiuying Peng, Jun Wang, Weinan Zhang, 21 Nov 2024, Multi-LLM-Agent Systems: Techniques and Business Perspectives, https://arxiv.org/abs/2411.14033
Deployment: Serving Multiple Cloud Models
When running AI engines in the cloud, there are typically multiple models running, and a server has to decide how to allocate queries to the models efficiently. There are some papers on the practical deployment aspects of managing multiple models on a cloud server.
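As a toy illustration of the allocation problem, the sketch below dispatches each incoming query to the least-loaded replica hosting the requested model; the data structures are assumptions, and the serving systems in the papers below use far richer scheduling policies.

    #include <string>
    #include <vector>

    // Hypothetical replica descriptor (illustrative only).
    struct Replica {
        std::string model_name;   // which model this server instance hosts
        int queued_requests = 0;  // current load on this replica
    };

    // Least-loaded dispatch: pick the replica of the requested model with the
    // shortest queue. Returns -1 if no replica hosts that model.
    int dispatch(std::vector<Replica>& replicas, const std::string& model_name) {
        int best = -1;
        for (int i = 0; i < (int)replicas.size(); ++i) {
            if (replicas[i].model_name != model_name) continue;
            if (best == -1 ||
                replicas[i].queued_requests < replicas[best].queued_requests)
                best = i;
        }
        if (best != -1) ++replicas[best].queued_requests;  // enqueue the query
        return best;
    }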
- Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Mahmut Taylan Kandemir, and Chita R Das. Cocktail: A multidimensional optimization for model serving in cloud. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), April 2022. PDF: https://www.usenix.org/system/files/nsdi22spring_prepub_gunasekaran.pdf, Code: https://github.com/jashwantraj92/cocktail (Serving framework for scheduling and serving queries from multiple ensemble models.)
- Francisco Romero, Qian Li, Neeraja J Yadwadkar, and Christos Kozyrakis. INFaaS: Automated model-less inference serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), July 2021, https://www.usenix.org/conference/atc21/presentation/romero (Choosing models when serving queries from multiple ensemble models.)
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Training a new model by using another model to automatically create the data set on which to train it.)
- Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel, March 2023, PETALS: Collaborative Inference and Fine-tuning of Large Models, https://arxiv.org/abs/2209.01188, Code: https://petals.ml/ (Swarm deployment that shares the load to multiple servers.)
- Y Liu, C Wang, Y Xiao, Z Lv, L Xiao, X Ji, 2023, Collaborative Inference for MEC Services Based on Multimodal Deep Neural Network, 2023 IEEE/CIC International Conference on Communications in China (ICCC) https://ieeexplore.ieee.org/abstract/document/10233276
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 11, "Efficient Edge Inference by Selective Query (Hybrid Models)".)
- Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. (2017). Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. SIGPLAN Notices, 52(4):615–629. https://doi.org/10.1145/3093336.3037698
- Letian Zhang, Lixing Chen, Jie Xu, Feb 2021, Autodidactic Neurosurgeon: Collaborative Deep Inference for Mobile Edge Intelligence via Online Learning, https://arxiv.org/abs/2102.02638
- Li, M., Li, Y., Tian, Y., Jiang, L., and Xu, Q. (2021). Appealnet: An efficient and highly-accurate edge/cloud collaborative architecture for DNN inference. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 409–414. URL: https://ieeexplore.ieee.org/document/9586176, PDF: https://arxiv.org/pdf/2105.04104v2.pdf
- Y. Long, I. Chakraborty, and K. Roy, 2020, “Conditionally deep hybrid neural networks across edge and cloud,” arXiv:2005.10851, https://arxiv.org/abs/2005.10851
- Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully, "Enabling All In-Edge Deep Learning: A Literature Review", IEEE Access, vol.11, pp.3431-3460, 2023. https://ieeexplore.ieee.org/document/10007810 https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
- E Samikwa, A Di Maio, T Braun, 2023, DISNET: Distributed Micro-Split Deep Learning in Heterogeneous Dynamic IoT, IEEE internet of things journal, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10243578 (Partitioning methods for a model split over multiple distributed servers.)
- Daniele Jahier Pagliari, Roberta Chiaro, Enrico Macii, Massimo Poncino, "CRIME: Input-Dependent Collaborative Inference for Recurrent Neural Networks", IEEE Transactions on Computers, vol.70, no.10, pp.1626-1639, 2021. https://ieeexplore.ieee.org/document/9184963 (Collaborative inference by sharing workload to multiple devices.)
- Y Zhang, Z Zhang, W Bao, D Yuan, 2023, ITIF: Integrated Transformers Inference Framework for Multiple Tenants on GPU, ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 112–121, https://doi.org/10.1145/3605573.3605585, https://dl.acm.org/doi/abs/10.1145/3605573.3605585
- Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman, 2024, Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling, https://guanh01.github.io/files/2024hpdc-loki.pdf
- Jeon, Byungsoo, May 2024, Automated and Portable Machine Learning Systems, Ph.D. Thesis, Carnegie Mellon University, https://doi.org/10.1184/R1/25746708.v1 https://kilthub.cmu.edu/articles/thesis/Automated_and_Portable_Machine_Learning_Systems/25746708/1 PDF: https://kilthub.cmu.edu/ndownloader/files/46074087 Code: https://github.com/cmu-catalyst/collage (Portability layer to integrate the various kernels and low-level backends more easily. Also covers pipeline parallelism in graph models, and KV cache parallelism similar to FlashDecode.)
- Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024, Preble: Efficient Distributed Prompt Scheduling for LLM Serving, University of California, San Diego, https://escholarship.org/content/qt1bm0k1w0/qt1bm0k1w0.pdf (Evaluates prompt sharing including full inference cache or a partial prefix-based computation of a global KV cache for the prefill phase. Also schedules GPUs based on prefill versus decoding phase requirements.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944
- Paula Rooney, 14 May 2024, Private cloud makes its comeback, thanks to AI, CIO, https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html
- Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu, 11 May 2024, Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving, https://arxiv.org/abs/2405.06856
- JH Jones, May 2024, A Quantitative Comparison of Pre-Trained Model Registries to Traditional Software Package Registries, Masters Thesis, Electrical and Computer Engineering, Purdue University, https://hammer.purdue.edu/articles/thesis/A_Quantitative_Comparison_of_Pre-Trained_Model_Registries_to_Traditional_Software_Package_Registries/25686447/1 PDF: https://hammer.purdue.edu/ndownloader/files/46096152
- Jiamin Li, Le Xu, Hong Xu, Aditya Akella, 28 Apr 2024, BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models, https://arxiv.org/abs/2404.18322 (Partitioning inference over blocks for GPU.)
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Cohere Toolkit, https://github.com/cohere-ai/cohere-toolkit (A set of open source components for RAG architectures.)
- Ahmed Menshawy, Zeeshan Nawaz, Mahmoud Fahmy, April 2024, Navigating Challenges and Technical Debt in Large Language Models Deployment, EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, Pages 192–199, https://doi.org/10.1145/3642970.3655840 https://dl.acm.org/doi/abs/10.1145/3642970.3655840 PDF Slides: https://www.cl.cam.ac.uk/research/srg/netos/euromlsys2024/slides/P_5_27.pdf
- Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury, 25 Apr 2024, Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services, https://arxiv.org/abs/2404.16283 (Scheduling GPU activity for multiple queries to ensure good UI experience for text-streaming outputs like chatbots.)
- Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li, 9 May 2024, From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks, https://arxiv.org/abs/2405.06038
- Josef Pichlmeier, Philipp Ross, Andre Luckow, 22 Apr 2024, Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification, https://arxiv.org/abs/2404.15153
- Konstantinos Papaioannou, Thaleia Dimitra Doudali, April 2024, The Importance of Workload Choice in Evaluating LLM Inference Systems, EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, April 2024, Pages 39–46, https://doi.org/10.1145/3642970.3655823 https://dl.acm.org/doi/abs/10.1145/3642970.3655823
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- Stan Gibson, 03 Jun 2024, Getting infrastructure right for generative AI, CIO, https://www.cio.com/article/2128440/getting-infrastructure-right-for-generative-ai.html
- Vinod Vijay Nigade, Latency-Critical Inference Serving for Deep Learning, Ph.D. Thesis, VRIJE UNIVERSITEIT, Netherlands, https://research.vu.nl/ws/portalfiles/portal/258499994/phdthesis-vinodvufinal+4+-+65043c3f62dc9.pdf
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
- Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
- Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo, T Zhang, 2023, Deep Learning Workload Scheduling in GPU Datacenters: A Survey, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3638757
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Meenu Mary John; Helena Holmström Olsson; Jan Bosch, 2020, AI Deployment Architecture: Multi-Case Study for Key Factor Identification, 2020 27th Asia-Pacific Software Engineering Conference (APSEC), https://ieeexplore.ieee.org/abstract/document/9359253
- Meenu Mary John, Helena Holmström Olsson, Jan Bosch, 2020, Architecting AI Deployment: A Systematic Review of State-of-the-Art and State-of-Practice Literature, ICSOB 2020: Software Business, pp 14–29, https://link.springer.com/chapter/10.1007/978-3-030-67292-8_2
- Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
- Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan, 2024, HEXGEN: Generative Inference of Large Language Model over Heterogeneous Environment. https://openreview.net/pdf?id=9ANyvRtFGa Code: https://github.com/Relaxed-System-Lab/HexGen
- Ali Rahmanian, Doctoral Thesis, April 2024, Edge Orchestration for Latency-Sensitive Applications, Department of Computing Science, Umea University, Sweden, https://www.diva-portal.org/smash/get/diva2:1849510/FULLTEXT02.pdf
- Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 19 Mar 2024 (v2), DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670 (Optimizing LLMs differently in the prefill and decoding phases.)
- Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136 (Deployment of LLMs on heterogenous GPUs and also differences between the two phases of decoder-only Transformers: prefill and decoding computations.)
- Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang, 2 Apr 2024, MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving, https://arxiv.org/abs/2404.02015
- Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 12 Mar 2024, Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648 (Analysis of deployment and LLMOps issues in a 6-month production deployment.)
- Apple, June 2022, Deploying Transformers on the Apple Neural Engine, https://machinelearning.apple.com/research/neural-engine-transformers Code: https://github.com/apple/ml-ane-transformers
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- Chang, Xiangyu; Miraj Ahmed, Sk; Krishnamurthy, Srikanth V.; Guler, Basak; Swami, Ananthram; Oymak, Samet; Roy-Chowdhury, Amit K., Jan 2024, Plug-and-Play Transformer Modules for Test-Time Adaptation, https://arxiv.org/abs/2401.04130 https://ui.adsabs.harvard.edu/abs/2024arXiv240104130C/abstract
- Andrew Starc, Feb 22 2024, Mantel Group survey reveals AI challenges of large Australian businesses, CRN, https://www.crn.com.au/news/mantel-group-survey-reveals-ai-challenges-of-large-australian-businesses-605376
- Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, Austin Z. Henley, 21 Dec 2023, Building Your Own Product Copilot: Challenges, Opportunities, and Needs, https://arxiv.org/abs/2312.14231
- Jacob Robbins, January 4, 2024, Why generative AI orchestration startups are poised for growth in 2024, Pitch Book, https://pitchbook.com/news/articles/generative-ai-orchestration-startups-venture-capital-unicorns
- Eberhard Hechler , Martin Oberhofer , Thomas Schaeck, 2020, Deploying AI in the Enterprise, Book, https://link.springer.com/book/10.1007/978-1-4842-6206-1
- Teresa Tung, June 2023, 7 architecture considerations for generative AI, Accenture, https://www.accenture.com/us-en/blogs/cloud-computing/7-generative-ai-architecture-considerations
- Hayden Wolff, Jun 02, 2024, A Simple Guide to Deploying Generative AI with NVIDIA NIM, NVIDIA Technical Blog, https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/
- Tal Peretz, 15 NOV 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
- Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong
- David Spuler, March 2024, Chapter 7. Deployment Architecture, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Kirill Kolodiazhnyi, May 15, 2020, Hands-On Machine Learning with C++: Build, train, and deploy end-to-end machine learning and deep learning pipelines, https://www.amazon.com/Hands-Machine-Learning-end-end/dp/1789955335/
- Deci Engineering Team, September 28, 2021, 5 Factors that Impact the Inference Pipeline in Production + Hardware Usage Metrics, https://deci.ai/blog/optimize-inference-pipeline-production/
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
- Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 5 Jun 2024, Llumnix: Dynamic Scheduling for Large Language Model Serving, https://arxiv.org/abs/2406.03243 Code: https://github.com/AlibabaPAI/llumnix
- Fabian Both, June 2024, Why we no longer use LangChain for building our AI agents, https://www.octomind.dev/blog/why-we-no-longer-use-langchain-for-building-our-ai-agents (Replaces LangChain with their own more-focused internal tool sets.)
- Waleed Kadous, August 23, 2023, Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper, https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper Code: https://github.com/anyscale/factuality-eval
- Louis-François Bouchard, Louie Peters, May 2024, Chapter 11: Deployment, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
- Aarushi Kansal, Chapter 7: Monitoring, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
- Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee, 21 Jun 2024 (v4), Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs, https://arxiv.org/abs/2402.10517 Code: https://github.com/SNU-ARC/any-precision-llm
- Guangxuan Xiao, May 2024, Efficient Deployment Algorithms for Large Language Models, Masters Thesis, MIT, https://dspace.mit.edu/bitstream/handle/1721.1/156332/xiao-xgx-sm-eecs-2024-thesis.pdf
- Intel, Jul 24, 2024, Generative AI Fundamentals: Deploying LLMs with OpenVINO™, OpenVINO™ toolkit, https://medium.com/openvino-toolkit/generative-ai-fundamentals-deploying-llms-with-openvino-3057861f6feb
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Abhinand, Aug 20, 2024, Self-Hosting LLaMA 3.1 70B (or any ~70B LLM) Affordably, https://abhinand05.medium.com/self-hosting-llama-3-1-70b-or-any-70b-llm-affordably-2bd323d72f8d
- Dom Couldwell, Sep 03, 2024 Dealing with ‘day two’ issues in generative AI deployments, https://www.infoworld.com/article/3493255/dealing-with-day-two-issues-in-generative-ai-deployments.html
- Lightning AI, 2024, Serve LLMs, https://lightning.ai/docs/litserve/features/serve-llms
- Evan Schuman, 01 May 2024, LLM deployment flaws that catch IT by surprise, https://www.computerworld.com/article/2095216/llm-deployment-flaws-that-catch-it-by-surprise.html
- Michael Nuñez, September 10, 2024, Is Anthropic’s new ‘Workspaces’ feature the future of enterprise AI management? https://venturebeat.com/ai/is-anthropics-new-workspaces-feature-the-future-of-enterprise-ai-management/
- Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, 19 May 2022 (v3), Challenges in Deploying Machine Learning: a Survey of Case Studies, ACM Comput. Surv., Vol. 55, No. 6, Article 114, December 2022. https://doi.org/10.1145/3533378 https://arxiv.org/abs/2011.09926 https://dl.acm.org/doi/fullHtml/10.1145/3533378#Bib0005
- Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, 24 Sep 2024 (v2), UELLM: A Unified and Efficient Approach for LLM Inference Serving, https://arxiv.org/abs/2409.14961
- Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
- Michael J. Zellinger, Matt Thomson, 3 Oct 2024, Efficiently Deploying LLMs with Controlled Risk, https://arxiv.org/abs/2410.02173
- Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
- Mastering LLM, Aug 17, 2024, How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? https://masteringllm.medium.com/how-much-gpu-memory-is-needed-to-serve-a-large-languagemodel-llm-b1899bb2ab5d
- Fan Yang, Zehao Wang, Haoyu Zhang, Zhenhua Zhu, Xinhao Yang, Guohao Dai, Yu Wang, Oct 2024, Efficient Deployment of Large Language Model across Cloud-Device Systems, https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/f06a14c1-4d6d-441d-b4e4-82545ac5781b.pdf
- Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh, 4 Nov 2024, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization, https://arxiv.org/abs/2411.02355
- Alina Mailach, Sebastian Simon, Johannes Dorn, Norbert Siegmund, 13 Nov 2024, Practitioners' Discussions on Building LLM-based Applications for Production, https://arxiv.org/abs/2411.08574
- Sonal Prabhune, Donald J. Berndt, 7 Nov 2024, Deploying Large Language Models With Retrieval Augmented Generation, https://arxiv.org/abs/2411.11895
- Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
- Yanyu Chen, Ganhong Huang, 6 Dec 2024, GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments, https://arxiv.org/abs/2412.04788
Submodels (Many-Models-in-One)
Although most ensemble architectures do have multiple distinct models, another approach is to have one model act as many models. This is called "submodels" or "many-models-in-one" or "many-in-one models."
Several methods have been tried, including training multiple submodels as part of a larger model, or using cut-down versions of a bigger model as multiple smaller submodels (e.g., early exit yields submodels along the depth dimension, and width pruning along the width dimension). In some such architectures, the same model is simply executed with different parameters, such as the meta-parameters controlling early exit or width pruning; a minimal sketch follows.
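To make this concrete, here is a minimal sketch of a many-in-one Transformer in Python/PyTorch, where a meta-parameter selects an early-exit submodel along the depth dimension. All names here are illustrative, and this is a toy untrained stack; real many-in-one systems (e.g., MatFormer, Flextron) train the nested submodels jointly so that the truncated variants stay accurate.

```python
# Toy many-in-one model: one weight set, many depth submodels.
import torch
import torch.nn as nn

class ManyInOneModel(nn.Module):
    def __init__(self, num_layers=12, d_model=512, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, exit_layer=None):
        # exit_layer is the meta-parameter that selects the submodel:
        # exit_layer=6 runs a shallow 6-layer submodel (depth dimension);
        # None runs the full stack. Width pruning would analogously slice
        # attention heads or FFN units inside each layer.
        x = self.embed(tokens)
        depth = exit_layer if exit_layer is not None else len(self.layers)
        for layer in self.layers[:depth]:
            x = layer(x)
        return self.head(x)

model = ManyInOneModel()
tokens = torch.randint(0, 32000, (1, 16))
small_logits = model(tokens, exit_layer=6)  # fast "small" submodel
big_logits = model(tokens)                  # full "big" model
```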
This idea also appears as a specialization of other optimizations. For example, the self-speculative decoding method has the smaller draft model simply be an early exit of the larger verifier model. This avoids the cost of training two models, and there are advantages in computation reuse, because half of the layers of the big model have already been computed by the small model.
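Continuing the toy model above, the sketch below approximates self-speculative decoding: the first half of the layer stack drafts a few tokens, and the full stack verifies them in one pass. For simplicity this version recomputes the shared layers during verification and greedily accepts the longest matching prefix; a real implementation caches the draft submodel's hidden states (which is where the computation reuse comes from) and emits the verifier's token on a mismatch.

```python
# Toy self-speculative decoding step, reusing ManyInOneModel above.
def self_speculative_step(model, tokens, draft_len=4, draft_depth=6):
    # 1. Draft tokens greedily with the shallow early-exit submodel.
    drafted = tokens
    for _ in range(draft_len):
        logits = model(drafted, exit_layer=draft_depth)
        next_tok = logits[:, -1:].argmax(dim=-1)
        drafted = torch.cat([drafted, next_tok], dim=1)

    # 2. Verify all drafted tokens with one full-model pass.
    full_logits = model(drafted[:, :-1])
    verified = full_logits[:, -draft_len:].argmax(dim=-1)
    draft_toks = drafted[:, -draft_len:]

    # 3. Accept the longest prefix where draft and verifier agree.
    match = (verified == draft_toks)[0].long()
    n_accept = int(match.cumprod(dim=0).sum())
    return torch.cat([tokens, draft_toks[:, :n_accept]], dim=1)

longer = self_speculative_step(model, tokens)
```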
Research papers on submodels and many-models-in-one architectures:
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference, https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from it, using FFNs with a subset of parameters; done statically this is similar to a form of model compression, and done dynamically this elastic inference is a type of adaptive inference.)
- Lei Xun, Jonathon Hare, Geoff V. Merrett, 17 Jan 2024, Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices, https://arxiv.org/abs/2401.08965
- Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
- Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 2 Jul 2024, S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models, https://arxiv.org/abs/2407.01955 (Creating, managing and integrating multiple draft models as submodels in speculative decoding.)
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Parsa Kavehzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 1 Jun 2024 (v3), SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks, https://arxiv.org/abs/2309.00255
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 11 Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707
- Janek Haberer, Ali Hojjat, Olaf Landsiedel, 26 Sep 2024, HydraViT: Stacking Heads for a Scalable ViT, https://arxiv.org/abs/2409.17978 https://github.com/ds-kiel/HydraViT
- Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong, Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu, Xue Lin, Yanzhi Wang, 25 Sep 2024, Search for Efficient Large Language Models, https://arxiv.org/abs/2409.17372 (Looking for subnets inside models as an alternative to NAS.)
- Shrenik Bhansali, Alwin Jin, Tyler Lizzo, Larry Heck, 23 Oct 2024, LEGO: Language Model Building Blocks, https://arxiv.org/abs/2410.18287 (Extract small models out of large models.)
- R Cai, Y Ro, GW Kim, P Wang, BE Bejnordi, A Akella, Oct 2024, Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://utns.cs.utexas.edu/assets/papers/neurips24-readme.pdf https://github.com/VITA-Group/READ-ME (Extract multiple smaller MoE expert models from a large LLM.)
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 23 Jan 2017, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, https://arxiv.org/abs/1701.06538
- Yan Zhuang, Zhenzhe Zheng, Fan Wu, and Guihai Chen. 2024. LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys '24). Association for Computing Machinery, New York, NY, USA, 521–534. https://doi.org/10.1145/3666025.3699355 https://dl.acm.org/doi/abs/10.1145/3666025.3699355
- Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
Distributed Inference
Distributed inference is the technique of spreading the inference computation for a single query across multiple servers in different locations. It is a generalization of multi-GPU architectures to multiple distributed servers, each with one or more computation engines that handle part of the inference processing stack; a sketch appears below.
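As a rough illustration, the sketch below partitions a layer stack into contiguous stages, one per server, in the style of pipeline-style model partitioning. The Stage class and the in-process loop are assumptions for illustration only; a real deployment would run each stage on a separate machine and ship the activations over the network (e.g., via torch.distributed or an RPC framework).

```python
# Toy model-distributed inference: split the layer stack across "servers".
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One server's shard: a contiguous slice of the layer stack."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

d_model = 512
all_layers = [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
              for _ in range(12)]

# Partition 12 layers into 3 stages of 4 layers each.
stages = [Stage(all_layers[i:i + 4]) for i in range(0, 12, 4)]

x = torch.randn(1, 16, d_model)  # already-embedded input activations
for stage in stages:             # in deployment: one network hop per stage
    x = stage(x)                 # each server computes its slice of the stack
```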
Research papers on distributed inference algorithms:
- B Wu, Y Zhong, Z Zhang, G Huang, X Liu, 2023, Fast Distributed Inference Serving for Large Language Models, https://arxiv.org/abs/2305.05920
- Davide Macario, 2024, A Model-Distributed Inference Approach for Large Language Models at the Edge, Masters Thesis, Master of Science in Electrical and Computer Engineering, Graduate College, University of Illinois at Chicago, https://webthesis.biblio.polito.it/secure/31718/1/tesi.pdf
- Marco Colocrese, Erdem Koyuncu, Hulya Seferoglu, 8 Aug 2024, Early-Exit meets Model-Distributed Inference at Edge Networks, https://arxiv.org/abs/2408.05247
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Mingjin Zhang, 2024, High-performance scheduling of deep learning tasks in collaborative edge computing, Ph.D. Thesis, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, https://theses.lib.polyu.edu.hk/bitstream/200/13080/3/7528.pdf (Scheduling of inference and training tasks on edge devices with techniques such as model splitting/partitioning.)
- Isaac Ong, May 16, 2024, Efficient Distributed LLM Inference with Dynamic Partitioning, Masters Thesis, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-108, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html https://digitalassets.lib.berkeley.edu/techreports/ucb/incoming/EECS-2024-108.pdf
- Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 6 Oct 2024, Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach, https://arxiv.org/abs/2410.05338
- Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang, 11 Nov 2024, WDMoE: Wireless Distributed Mixture of Experts for Large Language Models, https://arxiv.org/abs/2411.06681
More AI Research
Read more about: