Aussie AI
54. Ensemble Multi-Model Architectures
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Alone we can do so little;
together we can do so much.”
— Helen Keller.
What are Ensemble Architectures?
If one AI engine is amazing, imagine what two could do. Or ten. Or a hundred.
The idea of multi-model architectures recently received a large boost from the rumor that OpenAI's GPT-4 has an eight-model architecture. This unofficial leak of confidential information could be false, but it suggests that GPT-4 has a “Mixture-of-Experts” (MoE) architecture with 8 models, each of about 220 billion parameters, for a total of 1.76 trillion parameters. An MoE architecture uses some decision method or heuristic (or possibly a learned router) to send each query to one of the models, as discussed further below.
The idea of using two or more AI engines together to complete a task is far from new. There are many research papers on different types of multi-model architectures. This area of research is called “ensemble learning,” and the resulting systems are often called “multi-model” engines.
There are many ways that AI engines could cooperate to achieve more than one would alone. This is an area ripe for exploration, where we have only scratched the surface of possibilities. On the other hand, with today's high cost of GPUs limiting what can be done in both AI inference and training, the full realization of ensemble AI algorithms is still in the distant future.
One way in which two AI models work together has become common in practice: using the output of one model as input text for the training data set of a new model. This has been an effective technique for improving downstream models, but it isn't usually classed as an ensemble algorithm; see Honovich et al. (2022) for a paper on the idea. It is similar to knowledge distillation, but differs in that the goal isn't to create a cut-down smaller model, but usually to improve the accuracy of a large model.
Types of Ensemble Algorithms
Despite the costs, ensemble models are already the subject of significant research, and there are even a number of practical use cases. The two main categories are:
- Collaborative inference. This is where two or more models combine to complete inference. They could be on separate servers, or together on one server. There are two basic ways that multiple models can do inference: together or separately. One architecture is for both models to do inference and combine their results to complete the operation (e.g. big-small architectures, committee-based decoding, and speculative decoding). The other category is where one model is chosen from multiple options to do all the inference calculations, which is called a “model selection algorithm” (e.g. Mixture-of-Experts).
- Multi-model training. There are various other methods of using an ensemble technique to train up better models. Knowledge distillation technically uses two models for training (i.e. a teacher and a student model), but it's not usually classed as an ensemble architecture in itself because the larger model is not subsequently used for inference. Some of the more parallel training techniques include bagging, boosting, stacking, and many variants. There is also the simple training method of using input data sets to train models, where the data is based on the output of some other model (sometimes called “dataset distillation”).
Example Ensemble Architectures: Within the two major categories, there are multiple different areas of research. Some examples of types of multi-model architectures include:
- Generative Adversarial Networks (GANs). These use two AI models against each other, one being critical of the output of the other, to improve overall results. This is an architecture that has proven very successful in practice.
- Knowledge Distillation (KD). This is a well-known and widely used optimization method whereby a large model is built first, and then it is used to “distill” its knowledge into a smaller model, as a “teacher” model to a “student” model. Two models are used in training, but only the smaller model is used for inference. Note that there are also various more advanced forms of “ensemble distillation” (multi-teacher methods) that involve two or more teacher models, plus the student model. See Chapter 45 for more about knowledge distillation.
- Cascade inference optimizations. Cascade optimizations involve the selection of different models, or paths through multiple models, as a type of inference optimization. Two or more models are used in the inference phase, and there are various methods for deciding at runtime which to choose.
- Big-small models. In a specific subtype of the cascade method called big-small models, there are two models trained differently. The idea is that during inference, a heuristic decides which model to invoke: a faster “small” model is used in the common cases, and the slower “big” model is only needed to handle the rarer cases. This can improve inference latency and total throughput.
- Speculative decoding. This method is similar to the big-little architecture, but differs because all queries initially go to the small model. The small, faster model does its best to suggest output tokens (i.e. it “speculates”), and then a bigger model is used to “verify” the correctness. If the bigger model has to override the smaller model, it is then slower, but usually the smaller model is accurate enough for the whole process to be faster on average, with accuracy close to using a larger model.
- Parallel multi-model inference. In some architectures, multiple models can process the same input data in parallel. Each model produces its own results, and the resulting ensemble model then has to choose an overall result. The algorithm for deciding amongst the multiple options could use maximum (or minimum) scores, majority voting, weighted averages, or many other combinations (see the C++ sketch after this list).
- Hybrid dual transformer architectures. Rather than entire duplicate Transformer models, there has been research on adding extra components to the basic Transformer architecture, such as two decoders or two encoders merged together. This area is one of the more theoretical and less explored areas of multi-model research.
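To make the parallel multi-model idea concrete, here is a minimal C++ sketch of majority voting across an ensemble. The Model type and the toy lambda “models” are hypothetical placeholders for real inference engines; a production combiner would also run the models concurrently and handle ties more carefully.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a real inference engine: each "model"
    // maps an input string to a single output label.
    using Model = std::function<std::string(const std::string&)>;

    // Majority-vote combiner: run every model on the same input and
    // return the most common answer (ties broken alphabetically,
    // since std::map iterates in key order).
    std::string ensemble_majority(const std::vector<Model>& models,
                                  const std::string& input) {
        std::map<std::string, int> votes;
        for (const auto& m : models) votes[m(input)]++;
        std::string best;
        int best_count = -1;
        for (const auto& [answer, count] : votes) {
            if (count > best_count) { best = answer; best_count = count; }
        }
        return best;
    }

    int main() {
        // Three toy "models" that disagree on some inputs.
        std::vector<Model> models = {
            [](const std::string&) { return std::string("cat"); },
            [](const std::string&) { return std::string("cat"); },
            [](const std::string&) { return std::string("dog"); },
        };
        std::cout << ensemble_majority(models, "photo.png") << "\n";  // prints "cat"
        return 0;
    }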
Research papers on ensemble architectures generally:
- Yoshitomo Matsubara, Luca Soldaini, Eric Lind, Alessandro Moschitti, Dec 2022, Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems, https://arxiv.org/abs/2201.05767
- Yungeng Zhang, Yuru Pei & Hongbin Zha, Sep 2021, Learning Dual Transformer Network for Diffeomorphic Registration, Medical Image Computing and Computer Assisted Intervention, MICCAI 2021, https://link.springer.com/chapter/10.1007/978-3-030-87202-1_13
- Xian-Feng Han, Yi-Fei Jin, Hui-Xian Cheng, Guo-Qiang Xiao, Apr 2021, Dual Transformer for Point Cloud Analysis, https://arxiv.org/abs/2104.13044
- Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei, 2023, Dual Vision Transformer, https://arxiv.org/pdf/2207.04976, Code: https://github.com/YehLi/ImageNetModel
- Mohammed Alhamid, March 2021, Ensemble Models, https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c
- Oliver R. A. Dunbar, Andrew B. Duncan, Andrew M. Stuart, Marie-Therese Wolfram, Jan 2022, Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods, https://arxiv.org/abs/2104.03384
- Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, 2023, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548, PDF: https://arxiv.org/pdf/2308.06548.pdf (Treating the internal components of a Transformer as if they are an ensemble model.)
- T. G. Dietterich. 2000, Ensemble methods in machine learning, In Multiple classifier systems, pages 1–15. Springer, 2000, Lecture Notes in Computer Science book series LNCS, volume 1857, https://link.springer.com/chapter/10.1007/3-540-45014-9_1, PDF: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e3b09a777c71a4b88888509ab9bfa12d8bf295ba (Early paper with ensemble idea applied to classifiers, rather than multi-model.)
- Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q., 2018, Multi-scale dense networks for resource efficient image classification, In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
- Y. Matsubara, M. Levorato, and F. Restuccia, 2022, Split computing and early exiting for deep learning applications: Survey and research challenges, ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505 (Split computing is splitting the inference between server and edge machines.)
- L. Li, K. Ota and M. Dong, 2018, Deep learning for smart industry: Efficient manufacture inspection system with fog computing, IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665-4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640 (“Fog computing” is like cloud computing but on servers “nearer” to the ground.)
- C. Lo, Y.-Y. Su, C.-Y. Lee and S.-C. Chang, 2017, A dynamic deep neural network design for efficient workload allocation in edge computing, Proc. IEEE Int. Conf. Comput. Design (ICCD), pp. 273-280, Nov. 2017. https://ieeexplore.ieee.org/document/8119222
- G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
- Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A., 2020, The right tool for the job: Matching model and instance complexities, In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with “wisdom of committees” decisions.)
- Naftaly, U., N. Intrator, and D. Horn. 1997, Optimal ensemble averaging of neural networks, Network: Computation in Neural Systems 8, no. 3 (1997): 283–296. https://www.tau.ac.il/~horn/publications/optimal.pdf
- Y. Liu and X. Yao, 1999, Ensemble Learning via Negative Correlation, Neural Networks, Volume 12, Issue 10, December 1999, pp. 1399-1404. doi:10.1016/S0893-6080(99)00073-8, https://www.sciencedirect.com/science/article/abs/pii/S0893608099000738
- Z.S.H. Chan; N. Kasabov, 2005, Fast neural network ensemble learning via negative-correlation data correction, IEEE Transactions on Neural Networks, Volume 16, Issue 6, November 2005, https://ieeexplore.ieee.org/document/1528547
- E Diao, 2023, Efficient and Collaborative Methods for Distributed Machine Learning, Ph.D. thesis, Department of Electrical and Computer Engineering Duke University, https://www.proquest.com/openview/410ea5eb4275fded25890f04c96a902e/1?pq-origsite=gscholar&cbl=18750&diss=y
- X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks, IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707 (Multiple submodels inside a large model.)
For research papers on ensemble architectures in general, see https://www.aussieai.com/research/ensemble.
Model Selection Algorithms
Model selection algorithms are dynamic inference optimizations where a choice is made between two or more models for execution. The hottest area of such research is Mixture-of-Experts, because of GPT-4's rumored architecture. Another example is “big-little” architectures, where a heuristic attempts to send “easy” queries to a faster “little” model. Various other ensemble architectures are possible with multiple models.
Another practical example of a different type of model selection is the deployment architecture, which must decide which server to send each request to, where each server may host different models or multiple instances of the same model. Other areas of research with similar aims include cascades and collaborative inference.
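As a concrete illustration, here is a minimal C++ sketch of a model selection algorithm using a trivial heuristic. The is_easy_query rule and the run_small_model/run_big_model functions are hypothetical placeholders; a real router might use a trained classifier, query metadata, or server load instead.

    #include <iostream>
    #include <string>

    // Hypothetical heuristic: treat short queries as "easy".
    bool is_easy_query(const std::string& query) {
        return query.size() < 40;  // simplistic length-based rule
    }

    // Placeholder engines standing in for two real models.
    std::string run_small_model(const std::string& q) { return "[small] " + q; }
    std::string run_big_model(const std::string& q) { return "[big] " + q; }

    // Model selection: exactly ONE model executes per query,
    // unlike collaborative inference where several models run.
    std::string select_and_run(const std::string& query) {
        return is_easy_query(query) ? run_small_model(query)
                                    : run_big_model(query);
    }

    int main() {
        std::cout << select_and_run("Hi!") << "\n";
        std::cout << select_and_run("Summarize this very long document about ensemble models.") << "\n";
        return 0;
    }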
Mixture of Experts (MoE)
Mixture of Experts (MoE) is an ensemble inference optimization method where multiple sub-models (“experts”) are trained and used. The efficiency arises from sending a query to only one of the experts, so that only some of the weights are activated, depending on the input tokens. Each expert model is smaller than if all the models were merged into one.
This method is based on “divide and conquer” where a decision between experts “divides” a problem, and the chosen expert model “conquers” the sub-problem. Conceptually, the MoE architecture has some resemblance to cascades, big-little architectures, and knowledge distillation.
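Here is a minimal C++ sketch of the top-1 gating decision at the heart of an MoE layer, under the assumption that a small gating network has already scored each expert for the current input. The scores in main are made up for illustration; in a real Transformer MoE layer they would come from a learned routing layer over the token embedding.

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    // Top-1 routing: pick the expert with the highest gate score,
    // so only that expert's weights are activated for this input.
    int choose_expert(const std::vector<float>& gate_scores) {
        return static_cast<int>(std::distance(
            gate_scores.begin(),
            std::max_element(gate_scores.begin(), gate_scores.end())));
    }

    int main() {
        // Hypothetical gate scores for 4 experts.
        std::vector<float> scores = {0.1f, 2.3f, -0.7f, 1.9f};
        std::cout << "Route query to expert " << choose_expert(scores) << "\n";  // expert 1
        return 0;
    }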
Research papers on MoE multi-model architectures:
- William Fedus, Barret Zoph, and Noam Shazeer. 2021, Switch transformers: Scaling to trillion-parameter models with simple and efficient sparsity, J. Mach. Learn. Res, 23:1–40, 2021, https://arxiv.org/abs/2101.03961
- Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du, 2023, Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts, arXiv preprint, https://arxiv.org/abs/2309.04354 (This paper covers Sparse MoEs for vision transformers.)
- Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022, Designing effective sparse expert models, arXiv preprint arXiv:2202.08906, 2022. https://arxiv.org/abs/2202.08906v1
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538, https://arxiv.org/abs/1701.06538 (Early sparse MoE paper with thousands of expert mini-models.)
- IC Gormley, S Frühwirth-Schnatter, June 2018, Mixture of experts models, https://arxiv.org/abs/1806.08200
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020, Gshard: Scaling giant models with conditional computation and automatic sharding, arXiv preprint arXiv:2006.16668, 2020, https://arxiv.org/abs/2006.16668 (Sharding technique applied to an MoE model for further optimization.)
- Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui, 2022, Glam: Efficient scaling of language models with mixture-of-experts, ICML 2022, https://arxiv.org/abs/2112.06905, PDF: https://proceedings.mlr.press/v162/du22c/du22c.pdf
- Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, 2022, DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, ICML 2022, https://arxiv.org/abs/2201.05596, PDF: https://proceedings.mlr.press/v162/rajbhandari22a/rajbhandari22a.pdf
- Z Chen, Y Deng, Y Wu, Q Gu, Y Li, Aug 2022, Towards understanding mixture of experts in deep learning, arXiv preprint arXiv:2208.02813, https://arxiv.org/abs/2208.02813
- Y Krishnamurthy, C Watkins, T Gaertner, 2023, Improving Expert Specialization in Mixture of Experts, arXiv preprint arXiv:2302.14703, https://arxiv.org/pdf/2302.14703
- I Voroneckaja, 2023, Automatic architecture selection for hierarchical mixture of experts models, Ph.D. Thesis, School of Mathematics & Statistics, University of Glasgow, https://theses.gla.ac.uk/83492/1/2023VoroneckajaPhD.pdf
- SE Yuksel, JN Wilson, PD Gader, 2012, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems (Volume 23, Issue 8, August 2012), https://ieeexplore.ieee.org/document/6215056, PDF: https://www.researchgate.net/profile/Seniha-Yuksel/publication/260707711Twenty_Years_of_Mixture_of_Experts/links/568f68e508aeaa1481b077de/Twenty-Years-of-Mixture-of-Experts.pdf
- Saeed Masoudnia & Reza Ebrahimpour, 2014, Mixture of experts: a literature survey, Artificial Intelligence Review, volume 42, pages 275–293 (2014), https://link.springer.com/article/10.1007/s10462-012-9338-y
- Ran Avnimelech; Nathan Intrator, 1999, Boosted mixture of experts: an ensemble learning scheme, Neural Comput 11(2): 483–497, https://ieeexplore.ieee.org/abstract/document/6790707
- Chen K, Xu L, Chi H, 1999, Improved learning algorithms for mixture of experts in multiclass classification, Neural Netw 12(9): 1229–1252 https://pubmed.ncbi.nlm.nih.gov/12662629/
- Reza Ebrahimpour, Ehsanollah Kabir, Hossein Esteky, Mohammad Reza Yousefi, 2008, View-independent face recognition with mixture of experts, Neurocomputing Volume 71, Issues 4–6, January 2008, Pages 1103-1107, https://www.sciencedirect.com/science/article/abs/pii/S0925231207003074
- Goodband JH, Haas OCL, Mills JA, 2006, A mixture of experts committee machine to design compensators for intensity modulated radiation therapy, Pattern Recogn 39(9): 1704–1714. doi:10.1016/j.patcog.2006.03.018, https://doi.org/10.1016%2Fj.patcog.2006.03.018
- Hansen JV, 1999, Combining predictors: comparison of five meta machine learning methods, Inform Sci 119(1–2): 91–105, https://doi.org/10.1016/S0020-0255(99)00052-3, https://www.sciencedirect.com/science/article/abs/pii/S0020025599000523
- Hong X, Harris CJ, 2001, A mixture of experts network structure construction algorithm for modeling and control, Appl Intell 16(1): 59–69 https://link.springer.com/article/10.1023/A:1012869427428
- Islam MM, Yao X, Murase K, 2003, A constructive algorithm for training cooperative neural network ensembles, IEEE Trans Neural Netw 14(4): 820–834 https://doi.org/10.1109%2FTNN.2003.813832, https://pubmed.ncbi.nlm.nih.gov/18238062/
- R Csordás, K Irie, J Schmidhuber, Oct 2023, Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv preprint arXiv:2310.10837, https://arxiv.org/pdf/2310.10837.pdf
- Adrià Ruiz and Jakob Verbeek. 2019, Adaptative inference cost with convolutional neural mixture models, ICCV, pages 1872–1881, 2019, https://arxiv.org/abs/1908.06694
For research papers on MoE architectures, see https://www.aussieai.com/research/moe.
Big-Little Transformer Models
Although many ensemble architectures are about doing even more computations to achieve even more advanced capabilities, the idea of big-little or big-small architectures is to improve inference speed and throughput by sending common queries to a smaller model. The larger model is reserved for more difficult or rarer queries which take longer. As such, it's an AI version of the “common case first” code optimization technique.
Note that “collaborative inference” (e.g. “parallel decoding” or “speculative decoding”) is also conceptually a similar architecture, but differs because multiple models work together for inference, whereas pure big-little architectures choose the model at the start, and only one model does the inference. Also related are the various non-autoregressive architectures.
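The following is a minimal C++ sketch of one big-little variant: try the little model first and fall back to the big model when the little model's confidence is low. The little_model/big_model functions and the confidence rule are hypothetical stand-ins; the pure form described above would instead route the query before any inference runs.

    #include <iostream>
    #include <string>
    #include <utility>

    // Hypothetical engines: the little model also reports a
    // confidence score in [0,1] for its own answer.
    std::pair<std::string, float> little_model(const std::string& q) {
        return {"little: " + q, q.size() < 30 ? 0.95f : 0.40f};
    }
    std::string big_model(const std::string& q) { return "big: " + q; }

    // Big-little inference with a confidence fallback: the fast
    // little model handles the common cases, and the slow big model
    // is only invoked for the rarer, harder cases.
    std::string big_little_infer(const std::string& query, float threshold = 0.8f) {
        auto [answer, confidence] = little_model(query);
        if (confidence >= threshold) return answer;  // common fast path
        return big_model(query);                     // rare slow path
    }

    int main() {
        std::cout << big_little_infer("short query") << "\n";
        std::cout << big_little_infer("a much longer and harder query about ensembles") << "\n";
        return 0;
    }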
Research papers on big-little (two-model) architectures:
- Kim, S., Mangalam, K., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K., 2023, Big little transformer decoder, arXiv preprint arXiv:2302.07863, May 2023, https://arxiv.org/abs/2302.07863
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
- Leviathan, Y., Kalman, M., and Matias, Y., May 2023, Fast inference from transformers via speculative decoding, https://arxiv.org/abs/2211.17192
- Stern, M., Shazeer, N., and Uszkoreit, J., Nov 2018, Blockwise parallel decoding for deep autoregressive models, Advances in Neural Information Processing Systems, 31, https://arxiv.org/abs/1811.03115
- Z. Peng et al. 2018. AXNet: ApproXimate computing using an end-to-end trainable neural network, 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) https://ieeexplore.ieee.org/document/8605388 (Ensemble dual-model method where one model is a fast approximation of the other.)
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, “Input Hardness Adaptive Models” for methods of running faster on easy image classification problems.)
- Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget, arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
- Park, E., Kim, D., Kim, S., Kim, Y.-D., Kim, G., Yoon, S., and Yoo, S. (2015). Big/little deep neural network for ultra low power inference, In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 124–132. https://ieeexplore.ieee.org/document/7331375
- D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
- Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023, Tabi: An efficient multi-level inference system for large language models, In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
- H Malard, S Zaiem, R Algayres, 2023, Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences, arXiv preprint arXiv:2309.12712, https://arxiv.org/pdf/2309.12712.pdf (Big-little architecture for audio models.)
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a “shallow-deep module” and parallel decoding.)
- Kaya Y., Hong S., Dumitras T., 2019, Shallow-deep networks: Understanding and mitigating network overthinking, Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)
For research papers on big-little multi-model architectures, see https://www.aussieai.com/research/ensemble#biglittle.
Cascades
Cascades are a type of model inference optimization where execution flows down through a “cascade” of sub-structures, with the routing sequence depending on the inputs. This optimization mainly relates to early types of neural networks (i.e. DNNs and CNNs), rather than Transformer model architectures.
Cascade optimization is similar to “dynamic routing”, early exiting (especially “hierarchical early-exit”), and dynamic structural pruning (e.g. filter pruning, channel pruning, width pruning). The general class of algorithms is dynamic inference optimization (also called “adaptive inference”), where the model's execution path is changed dynamically, depending on the inputs.
The basic cascade architecture is not an ensemble architecture, but simply dynamic inference through a single model. However, this idea can be generalized to multiple paths through multiple models, where the routing decision can either be an AI heuristic, or simply a matter of job scheduling in a deployment architecture. The area of cascades for DNNs/CNNs has generally received less research attention with the rise of Transformers, but there are still many papers.
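As an illustration, here is a minimal C++ sketch of a confidence-thresholded cascade, assuming each stage is a model that scores its own confidence. The Stage struct and the toy scoring lambdas are hypothetical; real cascades would use trained classifiers of increasing size and cost.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // One cascade stage: a stand-in model plus its acceptance threshold.
    struct Stage {
        std::function<float(const std::string&)> score;
        float threshold;
    };

    // Execution flows down the cascade: each stage "accepts" the input
    // if its confidence passes the threshold; otherwise the input
    // falls through to the next (usually larger and slower) stage.
    float cascade_infer(const std::vector<Stage>& stages, const std::string& input) {
        float s = 0.0f;
        for (const auto& stage : stages) {
            s = stage.score(input);
            if (s >= stage.threshold) return s;  // confident: stop early
        }
        return s;  // final stage answers regardless of confidence
    }

    int main() {
        std::vector<Stage> stages = {
            {[](const std::string& x) { return x.size() < 10 ? 0.9f : 0.3f; }, 0.8f},
            {[](const std::string&) { return 0.7f; }, 0.6f},
        };
        std::cout << cascade_infer(stages, "easy") << "\n";              // stage 1 accepts
        std::cout << cascade_infer(stages, "much harder input") << "\n"; // falls to stage 2
        return 0;
    }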
Research papers on cascade optimizations:
- P. Panda, A. Sengupta, and K. Roy, 2016, Conditional deep learning for energy-efficient and enhanced pattern recognition, in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
- Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris, 2023, MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge, 2023 IEEE Symposium on Computers and Communications (ISCC), pp.411-416, 2023. https://ieeexplore.ieee.org/document/10217872
- Oihane Gómez-Carmona, Diego Casado-Mansilla, Diego López-de-Ipiña, Javier García-Zubia, 2022, Optimizing Computational Resources for Edge Intelligence Through Model Cascade Strategies, IEEE Internet of Things Journal, vol.9, no.10, pp.7404-7417, 2022. https://ieeexplore.ieee.org/document/9564246
- Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, Bart Dhoedt, 2017, The cascading neural network: building the Internet of Smart Things, Knowledge and Information Systems, 2017. https://doi.org/10.1007/s10115-017-1029-1
- Wang, X., Luo, Y., Crankshaw, D., Tumanov, A., Yu, F., and Gonzalez, J. E. (2018). Idk cascades: Fast deep learning by learning not to overthink, https://arxiv.org/abs/1706.00885
- Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a Diet, arXiv e-prints (2020), arXiv:2002.06170. https://arxiv.org/abs/2002.06170
- K. Neshatpour, F. Behnia, H. Homayoun, and A. Sasan. 2018, ICNN: An iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation, In Design, Automation, and Test in Europe Conference, pages 551–556, 2018. https://ieeexplore.ieee.org/document/8342068
- Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. 2017, Fractalnet: Ultra-deep neural networks without residuals, In ICLR, 2017 https://arxiv.org/abs/1605.07648 (Not cascades, but similar conceptually.)
- H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. 2015, A convolutional neural network cascade for face detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015. https://ieeexplore.ieee.org/document/7299170
- Y. Sun, X. Wang, and X. Tang. 2013, Deep convolutional network cascade for facial point detection, In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3476–3483. IEEE, 2013. https://ieeexplore.ieee.org/document/6619290
- Thomas Dean, Mark A Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan, and Jay Yagnik. 2013. Fast, accurate detection of 100,000 object classes on a single machine, In Proc. CVPR. https://web.stanford.edu/class/cs231m/references/hashing-dpm.pdf
- A. Kouris, S. I. Venieris, C. Bouganis, 2018, Cascade CNN: Pushing the performance limits of quantisation in convolutional neural networks, in: 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 2018, pp. 155–1557. doi:10.1109/FPL.2018.00034. http://dx.doi.org/10.1109/FPL.2018.00034
- A. Kouris, S. Venieris, C.-S. Bouganis, 2019, A throughput-latency co-optimised cascade of convolutional neural network classifiers, IEEE, 2019. http://hdl.handle.net/10044/1/75445, http://hdl.handle.net/10044/1/75445
- E. S. Marquez, J. S. Hare, M. Niranjan, 2018, Deep cascade learning, IEEE Transactions on Neural Networks and Learning Systems 29 (11) (2018) 5475–5485. doi:10.1109/TNNLS.2018.2805098. http://dx.doi.org/10.1109/TNNLS.2018.2805098
- Berestizshevsky, K., Even, G., 2019, Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence, In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26 (Early exit; somewhat related to cascades.)
- Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q., 2017, Multi-scale dense networks for resource efficient image classification, In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844 https://arxiv.org/abs/1703.09844 (Hierarchical early-exit scheme with multiple models is conceptually similar to cascades.)
- Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P., 2018, Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
- Passalis, N., Raitoharju, J., Tefas, A., Gabbouj, M., 2020, Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits, Pattern Recognition 105, 107346 (2020). https://doi.org/10.1016/j.patcog.2020.107346, PDF: https://hal.science/hal-03265174/document (Hierarchical early exit is similar to cascades.)
- A Moos, 2023, Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs, arXiv preprint arXiv:2309.03530, https://arxiv.org/pdf/2309.03530.pdf (Fast inference for a soccer-playing robot with cascade-like hierarchical early exits.)
- H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. 2015, A convolutional neural network cascade for face detection, CVPR, 2015. https://ieeexplore.ieee.org/document/7299170
- F. Yang, W. Choi, and Y. Lin. 2016, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, CVPR, 2016. https://ieeexplore.ieee.org/document/7780603, PDF: https://www.cvlibs.net/projects/autonomous_vision_survey/literature/Yang2016CVPR.pdf (Cascaded rejection classifiers.)
- Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu May 2015, HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition, https://arxiv.org/abs/1410.0736
- Y Tang, T Iwaguchi, H Kawasaki, 2023, Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy, arXiv preprint arXiv:2309.03445, https://arxiv.org/abs/2309.03445, Code: https://github.com/piggy2009/DM_underwater (Skipping iteratively is somewhat similar to cascading.)
- D. Kang, J. Emmons, F. Abuzaid, P. Bailis and M. Zaharia, 2017, NoScope: Optimizing neural network queries over video at scale, Proc. VLDB Endowment, vol. 10, no. 11, pp. 1586-1597, 2017. https://arxiv.org/abs/1703.02529 (Cascades when analyzing images in video in real-time.)
- P. Viola and M. Jones. 2001, Rapid object detection using a boosted cascade of simple features, In CVPR, 2001. https://ieeexplore.ieee.org/document/990517
- Zhaowei Cai; Mohammad Saberian; Nuno Vasconcelos. 2015, Learning complexity-aware cascades for deep pedestrian detection, In ICCV, 2015. https://ieeexplore.ieee.org/document/8686227
- Rodrigo Verschae, Javier Ruiz-del-Solar & Mauricio Correa, 2008, A unified learning framework for object detection and classification using nested cascades of boosted classifiers, Machine Vision and Applications, 19(2), 2008, https://link.springer.com/article/10.1007/s00138-007-0084-0
- K. Neshatpour, F. Behnia, H. Homayoun, and A. Sasan. 2018, ICNN: An iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation, In Design, Automation, and Test in Europe Conference, pages 551–556, 2018. https://ieeexplore.ieee.org/document/8342068 (Sequences of small feed-forward networks focus on parts of an image.)
- Francesco Daghero, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini, Enrico Macii, Massimo Poncino, 2020, Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence, 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp.1-4, 2020. https://ieeexplore.ieee.org/document/9294863, https://arxiv.org/abs/2204.03431v1 (An improved stopping policy for early exits on easy-input classification tasks.)
- Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023, Tabi: An efficient multi-level inference system for large language models, In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
- P Kavehzadeh, M Valipour, M Tahaei, A Ghodsi, 2023, Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT), arXiv preprint, https://arxiv.org/pdf/2309.08968.pdf (Cascade-like item: SortedNet method unlocks the potential of intermediate layers.)
For research papers on cascade model architectures, see https://www.aussieai.com/research/cascades.
Collaborative Inference
Collaborative inference is a type of multi-model inference where two or more engines combine to perform inference calculations. These architectures have one of two different goals: smarter results or faster inference.
One of the goals of multi-model inference is obviously to have a smarter AI engine overall by combining the calculations of two models. Some examples with this goal include:
- Consensus-based decoding
- Mutually-guided decoding
- Committee-based inference (“wisdom of committees”)
Surprisingly, some of these multi-model inference algorithms are actually speedup optimizations, where faster inference is possible by having two engines working together. The reduced latency can be achieved through parallel calculations and by using a small model in the mix. Particular types of parallel collaborative inference include the following (see the sketch after this list):
- Speculative Decoding
- Big-Little Architectures
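Here is a minimal C++ sketch of the parallel flavor of collaborative inference, where two placeholder models run concurrently via std::async and a trivial combiner picks a result. The model_a/model_b functions and the length-based combiner are hypothetical; a real system would combine token probabilities or use one model to verify the other.

    #include <future>
    #include <iostream>
    #include <string>

    // Placeholder engines; in a real deployment these would be two
    // full inference calls, possibly on different servers or GPUs.
    std::string model_a(const std::string& q) { return "A:" + q; }
    std::string model_b(const std::string& q) { return "B:" + q; }

    int main() {
        std::string query = "example input";
        // Launch both models concurrently, so the latency is roughly
        // that of the slower model rather than the sum of the two.
        auto fa = std::async(std::launch::async, model_a, query);
        auto fb = std::async(std::launch::async, model_b, query);
        std::string a = fa.get();
        std::string b = fb.get();
        // Trivial combiner: prefer the longer answer (illustration only).
        std::cout << (b.size() > a.size() ? b : a) << "\n";
        return 0;
    }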
Research papers on collaborative inference:
- G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding (Twist decoding is a type of collaborative inference.)
- J Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (Impressive thesis includes twist decoding amid other topics.)
- Jinduo Song, Zhicheng Liu, Xiaofei Wang, Chao Qiu, Xu Chen, 2021, Adaptive and Collaborative Edge Inference in Task Stream with Latency Constraint, ICC 2021, IEEE International Conference on Communications, pp.1-6, https://ieeexplore.ieee.org/document/9500892
- C Luo, J Chen, X Feng, J Zhang, J Li, 2023, Sustainable Collaborative Inference in Intelligent Transportation Systems, IEEE Transactions on Intelligent Transportation, https://ieeexplore.ieee.org/document/10239242
- Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, Lingjia Tang, 2017, Neurosurgeon: Collaborative intelligence between the cloud and mobile edge, ACM SIGARCH Comput. Archit. News, vol. 52, no. 4, pp. 615–629, https://dl.acm.org/doi/10.1145/3037697.3037698
- Z. Hao, G. Xu, Y. Luo, H. Hu, J. An, and S. Mao, June 2022, Multi-agent collaborative inference via dnn decoupling: Intermediate feature compression and edge learning, IEEE Trans. Mob. Comput., 2022, https://arxiv.org/abs/2205.11854
- J. Kim, Y. Park, G. Kim, and S. J. Hwang, 2017, Splitnet: Learning to semantically split deep networks for parameter reduction and model parallelization, in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1866–1874. http://proceedings.mlr.press/v70/kim17b/kim17b.pdf
- Y. Kim, J. Kim, D. Chae, D. Kim, and J. Kim, 2019, µlayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization, in Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, March 25-28, 2019, G. Candea, R. van Renesse, and C. Fetzer, Eds. ACM, 2019, pp. 45:1–45:15. https://dl.acm.org/doi/10.1145/3302424.3303950
- T. Mohammed, C. Joe-Wong, R. Babbar, and M. D. Francesco, 2020, Distributed inference acceleration with adaptive DNN partitioning and offloading, in 39th IEEE Conference on Computer Communications, INFOCOM 2020, Toronto, ON, Canada, July 6-9, 2020. IEEE, 2020, pp. 854–863, https://ieeexplore.ieee.org/document/9155237
- S. Yang, Z. Zhang, C. Zhao, X. Song, S. Guo, and H. Li, 2022, CNNPC: end-edge-cloud collaborative CNN inference with joint model partition and compression, IEEE Trans. Parallel Distributed Syst., vol. 33, no. 10, pp. 4039–4056, 2022. https://ieeexplore.ieee.org/document/9782528
- X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks, IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
For research papers on collaborative inference multi-model architectures, see https://www.aussieai.com/research/collaborative.
Speculative Decoding
Speculative execution is the general area of Computer Science theory from which speculative decoding is derived. Various algorithms benefit from speculatively executing in parallel with another pathway. A particular example is “branch prediction” in hardware execution of low-level machine code.
Applying this idea to inference yields “speculative decoding”: an ensemble architecture where a small model generates some possible output tokens (i.e. “speculating” possible outputs from its decoder), and a larger model verifies whether the output of the smaller model is correct. This optimizes inference speed because it is faster for a large model to verify the correctness of suggested output tokens in parallel on an already-generated sequence than to fully generate its own new tokens autoregressively. If the small model predicts poorly, then the bigger model vetoes the suggested tokens and has to “backtrack,” making the whole process slower. However, the smaller model should be correct most of the time, and it can generate multiple speculative tokens each iteration, causing an overall speedup across all of the tokens, while getting very close to the accuracy of the bigger model.
Speculative decoding is technically a subtype of the “big-little architecture”. Another type of big-little architecture involves using a heuristic to detect “easy” requests that are routed to a small model, or “hard” queries that are routed to the big model. Speculative decoding differs because all queries go first to the small model, and are then checked by the larger model, and sometimes the big model overrides the small model's suggestions and re-generates its own.
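To show the control flow, here is a minimal C++ sketch of greedy speculative decoding over toy integer tokens. The draft_tokens and verify_next functions are hypothetical stand-ins for a small draft model and a large verifier; crucially, in a real engine the big model scores all k drafted positions in a single parallel pass, which is where the speedup comes from.

    #include <iostream>
    #include <vector>

    using Token = int;
    using Sequence = std::vector<Token>;

    // Hypothetical small model: cheaply speculates the next k tokens.
    // A deliberate error at position 2 lets the example show a veto.
    Sequence draft_tokens(const Sequence& context, int k) {
        Sequence out;
        for (int i = 0; i < k; ++i)
            out.push_back(i == 2 ? 0 : static_cast<Token>((context.size() + i) % 7));
        return out;
    }

    // Hypothetical big model: the "true" next token for a context
    // (a toy rule here; a real large-model call is expensive).
    Token verify_next(const Sequence& context) {
        return static_cast<Token>(context.size() % 7);
    }

    int main() {
        Sequence context = {3, 1, 4};
        const int k = 4;
        Sequence draft = draft_tokens(context, k);
        for (Token t : draft) {
            Token truth = verify_next(context);
            context.push_back(truth);  // always keep the verified token
            if (t != truth) break;     // big model vetoes: discard the rest
        }
        // Accepted the matching prefix of the draft, plus one correction.
        for (Token t : context) std::cout << t << " ";
        std::cout << "\n";             // prints: 3 1 4 3 4 5
        return 0;
    }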
Research papers on speculative decoding:
- Leviathan, Y., Kalman, M., and Matias, Y., May 2023, Fast inference from transformers via speculative decoding, https://arxiv.org/abs/2211.17192
- D Xu, W Yin, X Jin, Y Zhang, S Wei, M Xu, X Liu, Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv preprint arXiv:2309.04255, https://arxiv.org/pdf/2309.04255.pdf (Keeps a smaller model in memory, improving speed and reducing memory utilization.)
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
- Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, 2023, Speculative Decoding with Big Little Decoder, Sep 2023 (original Feb 2023), https://arxiv.org/abs/2302.07863 (Separates a “fallback policy” when the smaller model detects it needs the bigger model, and a “rollback policy” when the bigger model vetoes output and intervenes, both for deciding when the bigger model controls.)
- Yaniv Leviathan, Matan Kalman, and Yossi Matias. May 2023, Fast inference from transformers via speculative decoding, In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023. https://arxiv.org/abs/2211.17192
- Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023, Accelerating large language model decoding with speculative sampling, DeepMind, arXiv preprint arXiv:2302.01318, 2023. https://arxiv.org/abs/2302.01318
- Heming Xia, Tao Ge, Si-Qing Chen, Furu Wei, and Zhifang Sui. 2022, Speculative decoding: Lossless speedup of autoregressive translation, Openreview, 2022. https://openreview.net/forum?id=H-VlwsYvVi
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487 (Not pure speculative decoding, but an analogous method.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia, Aug 2023, Specinfer: Accelerating generative llm serving with speculative inference and token tree verification, arXiv preprint arXiv:2305.09781, 2023. https://arxiv.org/abs/2305.09781, Code: https://github.com/flexflow/FlexFlow/tree/inference
- Burton, F. W., 1985, Speculative computation, parallelism, and functional programming, IEEE Transactions on Computers, C-34(12):1190–1193, 1985. doi: 10.1109/TC.1985. 6312218. https://ieeexplore.ieee.org/document/6312218 (Algorithmic theory of "speculative computation" from 1985.)
- Hennessy, J. L. and Patterson, 2012, D. A., Computer Architecture: A Quantitative Approach, Morgan Kaufmann, Amsterdam, 5 edition, 2012. ISBN 978-0-12-383872-8. https://dl.acm.org/doi/book/10.5555/1999263 (Includes coverage of speculative algorithms.)
- T. Ge, H. Xia, X. Sun, S. Chen, and F. Wei. 2022, Lossless acceleration for seq2seq generation with aggressive decoding, ArXiv, abs/2205.10350, 2022. https://arxiv.org/abs/2205.10350, Code: https://github.com/microsoft/unilm/tree/master/decoding (The generalized aggressive decoding method has a “draft-and-verify” algorithm that is similar to speculative decoding.)
- M. Stern, N. Shazeer, and J. Uszkoreit. 2018, Blockwise parallel decoding for deep autoregressive models, CoRR, abs/1811.03115, 2018. https://arxiv.org/abs/1811.03115 (Generates various output in parallel and using a scoring method to confirm them.)
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith, Oct 2022, Twist Decoding: Diverse Generators Guide Each Other, https://arxiv.org/abs/2205.09273, Code: https://github.com/jungokasai/twist_decoding
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a “shallow-deep module” and parallel decoding.)
- Kaya Y., Hong S., Dumitras T., 2019, Shallow-deep networks: Understanding and mitigating network overthinking, Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model is analogous to speculative decoding.)
- Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal, Oct 2023, DistillSpec: Improving Speculative Decoding via Knowledge Distillation, https://arxiv.org/abs/2310.08461
For research papers on speculative decoding multi-model architectures, see https://www.aussieai.com/research/speculative-decoding.
Consensus Decoding
Consensus-based decoding involves running the same query on multiple models, and then somehow deciding which answer to use. The key to having a smarter overall model is in deciding which of the models to listen to the most. Should you listen to the loudest one or the quiet achiever sitting in the corner? Various decision methods have been tried to choose the best output:
- Majority decision
- Maximum certainty (highest probability calculated)
- Weighted averages (giving some engines more votes)
There are various pros and cons to the different options. For example, majority decision has a problem if all of the models come up with different answers. Note that the deciding algorithms can consider not only a single output token from each model, but also each model's vector of top-k tokens with their predicted probabilities.
In the basic consensus architecture, all of the models run to completion, so there isn't a speedup by having a smaller model involved. However, a variation is to add a time-dependent cut-off where models that take too long to complete are excluded. This will be faster on average, but the risk to accuracy in this approach is that the entire architecture ends up always following the smaller models.
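Here is a minimal C++ sketch of weighted-average consensus over each model's top-k token probabilities, as described above. The token strings, probabilities, and per-model weights in main are made-up illustrative values.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Each model emits a top-k list of (token, probability) candidates.
    using TopK = std::vector<std::pair<std::string, float>>;

    // Weighted-average consensus: merge every model's candidates,
    // scaled by that model's weight (its number of "votes"), then
    // pick the token with the highest combined score.
    std::string consensus_token(const std::vector<TopK>& outputs,
                                const std::vector<float>& weights) {
        std::map<std::string, float> combined;
        for (size_t m = 0; m < outputs.size(); ++m)
            for (const auto& [tok, p] : outputs[m])
                combined[tok] += weights[m] * p;
        std::string best;
        float best_score = -1.0f;
        for (const auto& [tok, score] : combined)
            if (score > best_score) { best = tok; best_score = score; }
        return best;
    }

    int main() {
        std::vector<TopK> outputs = {
            {{"cat", 0.6f}, {"dog", 0.4f}},   // model 1
            {{"dog", 0.7f}, {"cat", 0.3f}},   // model 2
            {{"cat", 0.5f}, {"fox", 0.5f}},   // model 3
        };
        std::vector<float> weights = {1.0f, 1.0f, 0.5f};  // model 3 trusted less
        std::cout << consensus_token(outputs, weights) << "\n";  // prints "cat"
        return 0;
    }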
Multi-Model Deployment
When running AI engines on a server, there are multiple models running inference, and the server has to decide how to allocate queries to the models efficiently. This is obviously an area that has been implemented many times in industry. If the architecture is multiple copies of the same model running on many servers, then this isn't really an AI problem. It's the well-known scheduling issue of dispersing user requests to multiple servers. Nevertheless, AI engines are a quirky type of application to run on servers, and there are some research papers on the practical deployment aspects of managing multiple AI models in a cloud server.
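For the simple same-model-many-servers case, here is a minimal C++ sketch of a least-loaded dispatcher. The Server struct and in-flight counter are hypothetical; production serving frameworks layer batching, priorities, and per-model routing on top of this basic idea.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Each server runs a copy of the same model and tracks how many
    // queries it currently has in flight.
    struct Server {
        std::string name;
        int in_flight = 0;
    };

    // Dispatch to whichever server is least loaded right now.
    Server& pick_server(std::vector<Server>& servers) {
        return *std::min_element(servers.begin(), servers.end(),
            [](const Server& a, const Server& b) {
                return a.in_flight < b.in_flight;
            });
    }

    int main() {
        std::vector<Server> servers = {{"gpu-0"}, {"gpu-1"}, {"gpu-2"}};
        for (int q = 0; q < 5; ++q) {
            Server& s = pick_server(servers);
            s.in_flight++;  // send query q to this server
            std::cout << "query " << q << " -> " << s.name << "\n";
        }
        return 0;
    }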
Research papers on AI multi-model deployment optimization:
- Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Mahmut Taylan Kandemir, and Chita R Das. Cocktail: A multidimensional optimization for model serving in cloud, In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), April 2022. PDF: https://www.usenix.org/system/files/nsdi22spring_prepub_gunasekaran.pdf, Code: https://github.com/jashwantraj92/cocktail (Serving framework for scheduling and serving queries from multiple ensemble models.)
- Francisco Romero, Qian Li, Neeraja J Yadwadkar, and Christos Kozyrakis. 2021, INFaaS: Automated model-less inference serving, In 2021 USENIX Annual Technical Conference (USENIX ATC 21), July 2021, https://www.usenix.org/conference/atc21/presentation/romero (Choosing models when serving queries from multiple ensemble models.)
- Or Honovich, Thomas Scialom, Omer Levy, Timo Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Training a new model by using another model to automatically create the data set on which to train it.)
- Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel, March 2023, PETALS: Collaborative Inference and Fine-tuning of Large Models, https://arxiv.org/abs/2209.01188, Code: https://petals.ml/ (Swarm deployment that shares the load to multiple servers.)
- Y Liu, C Wang, Y Xiao, Z Lv, L Xiao, X Ji, 2023, Collaborative Inference for MEC Services Based on Multimodal Deep Neural Network, 2023 IEEE/CIC International Conference on Communications in China (ICCC) https://ieeexplore.ieee.org/abstract/document/10233276
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 11, “Efficient Edge Inference by Selective Query (Hybrid Models)”.)
- Li, M., Li, Y., Tian, Y., Jiang, L., and Xu, Q. (2021). Appealnet: An efficient and highly-accurate edge/cloud collaborative architecture for DNN inference, In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 409–414. URL: https://ieeexplore.ieee.org/document/9586176
- Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. (2017). Neurosurgeon: Collaborative intelligence between the cloud and mobile edge, SIGPLAN Notices, 52(4):615–629. https://doi.org/10.1145/3093336.3037698
- Letian Zhang, Lixing Chen, Jie Xu, Feb 2021, Autodidactic Neurosurgeon: Collaborative Deep Inference for Mobile Edge Intelligence via Online Learning, https://arxiv.org/abs/2102.02638
- Li, M., Li, Y., Tian, Y., Jiang, L., and Xu, Q. (2021). Appealnet: An efficient and highly-accurate edge/cloud collaborative architecture for DNN inference, In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 409–414. URL: https://ieeexplore.ieee.org/document/9586176, PDF: https://arxiv.org/pdf/2105.04104v2.pdf
- Y. Long, I. Chakraborty, and K. Roy, 2020, Conditionally deep hybrid neural networks across edge and cloud, arXiv:2005.10851, https://arxiv.org/abs/2005.10851
- Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully, 2023, Enabling All In-Edge Deep Learning: A Literature Review, IEEE Access, vol.11, pp.3431-3460, 2023. https://ieeexplore.ieee.org/document/10007810 https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
- E Samikwa, A Di Maio, T Braun, 2023, DISNET: Distributed Micro-Split Deep Learning in Heterogeneous Dynamic IoT, IEEE internet of things journal, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10243578 (Partitioning methods for a model split over multiple distributed servers.)
- Daniele Jahier Pagliari, Roberta Chiaro, Enrico Macii, Massimo Poncino, 2021, CRIME: Input-Dependent Collaborative Inference for Recurrent Neural Networks, IEEE Transactions on Computers, vol.70, no.10, pp.1626-1639, 2021. https://ieeexplore.ieee.org/document/9184963 (Collaborative inference by sharing workload to multiple devices.)
- Y Zhang, Z Zhang, W Bao, D Yuan, 2023, ITIF: Integrated Transformers Inference Framework for Multiple Tenants on GPU, ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 112–121, https://doi.org/10.1145/3605573.3605585, https://dl.acm.org/doi/abs/10.1145/3605573.3605585
For research papers on deployment of a multi-model architectures, see https://www.aussieai.com/research/ensemble#deployment.