
Types of Ensemble Algorithms

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


Despite the costs, ensemble models have attracted a substantial body of research and a significant number of practical use cases. The two main categories are:

  • Collaborative inference. This is where two or more models combine to complete inference. They could be on separate servers, or together on one server. There are two basic ways that multiple models can do inference: together or separately. One architecture is for both models to do inference and combine their results to complete the operation (e.g. big-small architectures, committee-based decoding, and speculative decoding). The other category is where one model is chosen from multiple options to do all of the inference calculations, which is called a “model selection algorithm” (e.g. Mixture-of-Experts); a minimal routing sketch appears after this list.
  • Multi-model training. There are various other methods of using an ensemble technique to train better models. Knowledge distillation technically uses two models for training (i.e. a teacher and a student model), but it's not usually classed as an ensemble architecture in itself because the larger model is not subsequently used for inference. Some of the more parallel training techniques include bagging, boosting, stacking, and many variants. There is also the simple method of training a model on an input data set generated from the outputs of some other model (sometimes called “dataset distillation”).
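
To make the “model selection” style of collaborative inference more concrete, below is a minimal C++ sketch of a heuristic router that picks between a small and a big model based on query length. The Model struct, the select_model function, and the length threshold are illustrative assumptions for this sketch, not an API from any particular framework.

    // Minimal sketch of heuristic model selection (big-small routing).
    // The Model interface and the length-based heuristic are illustrative
    // assumptions, not a specific framework API.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>

    struct Model {
        std::string name;
        std::function<std::string(const std::string&)> infer;  // runs full inference
    };

    // Heuristic: send short queries to the small model, the rest to the big one.
    const Model& select_model(const std::string& query,
                              const Model& small_model,
                              const Model& big_model,
                              std::size_t length_threshold = 64) {
        return (query.size() <= length_threshold) ? small_model : big_model;
    }

    int main() {
        Model small{"small", [](const std::string& q) { return "small-model answer to: " + q; }};
        Model big{"big",     [](const std::string& q) { return "big-model answer to: " + q; }};

        std::string query = "What is 2+2?";
        const Model& chosen = select_model(query, small, big);
        std::cout << "Routed to the " << chosen.name << " model\n";
        std::cout << chosen.infer(query) << "\n";
        return 0;
    }

In practice, the routing heuristic might instead use a trained classifier, a confidence score from the small model, or some other estimate of query difficulty.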

Example Ensemble Architectures: Within the two major categories, there are several distinct areas of research. Some examples of types of multi-model architectures include:

  • Generative Adversarial Networks (GANs). These use two AI models against each other, one being critical of the output of the other, to improve overall results. This is an architecture that has proven very successful in practice.
  • Knowledge Distillation (KD). This is a well-known and widely used optimization method whereby a large model is built first, and then it is used to “distill” its knowledge into a smaller model, as a “teacher” model to a “student” model. Two models are used in training, but only the smaller model is used for inference. Note that there are also various more advanced forms of “ensemble distillation” (multi-teacher methods) that involve two or more teacher models, plus the student model. See Chapter 45 for more about knowledge distillation.
  • Cascade inference optimizations. Cascades involve selecting different models, or paths through multiple models, as a type of inference optimization. Two or more models are available in the inference phase, and there are various methods for deciding at runtime which to choose.
  • Big-small models. In a specific subtype of the cascade method called big-small models, two models are trained differently. The idea is that during inference a heuristic decides which model to invoke: the faster “small” model handles the common cases, and the slower “big” model is only needed for the rarer, harder cases. This can improve both inference latency and total throughput.
  • Speculative decoding. This method is similar to the big-small architecture, but differs because all queries initially go to the small model. The small, faster model does its best to suggest output tokens (i.e. it “speculates”), and then a bigger model is used to “verify” their correctness. If the bigger model has to override the smaller model, that step is slower, but usually the smaller model is accurate enough for the whole process to be faster on average, with accuracy close to using the larger model alone; a simplified sketch of the draft-and-verify loop appears after this list.
  • Parallel multi-model inference. In some architectures, multiple models process the same input data in parallel. Each model produces its own result, and the ensemble must then choose an overall result. The algorithm for deciding amongst the multiple options could be a maximum (or minimum), a majority vote (counting), a weighted average, or many other combinations; a small sketch of these combination rules is also shown after this list.
  • Hybrid dual transformer architectures. Rather than duplicating entire Transformer models, there has been research on adding extra components to the basic Transformer architecture, such as two decoders or two encoders merged together. This is one of the more theoretical and less-explored areas of multi-model research.
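
To illustrate the control flow of speculative decoding, here is a simplified C++ sketch of the draft-and-verify loop. The draft_tokens and verify_tokens functions are hypothetical stand-ins for the small (draft) model and the big (verifier) model; real implementations accept or reject drafted tokens based on token probabilities and resample at the first rejection, details that are omitted in this sketch.

    // Simplified sketch of the speculative decoding loop (draft-and-verify).
    // draft_tokens() and verify_tokens() are hypothetical stand-ins for the
    // small (draft) model and the big (verifier) model; real implementations
    // compare token probabilities and resample at the first rejected position.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    using Token = int;

    // Small model proposes up to k candidate tokens after the current context.
    std::vector<Token> draft_tokens(const std::vector<Token>& context, int k) {
        std::vector<Token> drafts;
        for (int i = 0; i < k; ++i)
            drafts.push_back(static_cast<Token>(context.size()) + i);  // dummy proposal
        return drafts;
    }

    // Result of the big model checking the drafted tokens: how many it accepts,
    // plus its own corrected token at the first rejected position.
    struct VerifyResult {
        int accepted;
        Token correction;
    };

    VerifyResult verify_tokens(const std::vector<Token>& context,
                               const std::vector<Token>& drafts) {
        // Dummy rule for the sketch: accept all but the last drafted token,
        // then substitute a "corrected" token of the big model's own.
        int accepted = static_cast<int>(drafts.size()) - 1;
        Token correction = drafts.empty() ? 0 : drafts.back() + 100;
        return {accepted, correction};
    }

    int main() {
        std::vector<Token> output;        // generated sequence so far
        const int k = 4;                  // number of tokens drafted per step
        const std::size_t target_length = 12;

        while (output.size() < target_length) {
            std::vector<Token> drafts = draft_tokens(output, k);
            VerifyResult v = verify_tokens(output, drafts);
            // Keep the verified prefix of the drafted tokens...
            output.insert(output.end(), drafts.begin(), drafts.begin() + v.accepted);
            // ...then append the big model's token at the first rejected position.
            output.push_back(v.correction);
        }

        for (Token t : output) std::cout << t << " ";
        std::cout << "\n";
        return 0;
    }

The point of the loop is that each round emits several tokens for roughly the cost of one big-model verification pass, which is where the average speedup comes from.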
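
To make the result-combination step of parallel multi-model inference concrete, here is a small C++ sketch of two common combination rules: majority voting over discrete predictions, and a weighted average over numeric scores. The example labels, scores, and per-model weights are arbitrary illustrative values.

    // Sketch of two common ensemble combination rules: majority voting over
    // discrete predictions, and a weighted average over numeric scores.
    // The example labels, scores, and weights are arbitrary illustrative values.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Majority vote: return the label predicted by the most models (assumes non-empty input).
    std::string majority_vote(const std::vector<std::string>& predictions) {
        std::map<std::string, int> counts;
        for (const auto& p : predictions) ++counts[p];
        return std::max_element(counts.begin(), counts.end(),
                                [](const auto& a, const auto& b) {
                                    return a.second < b.second;
                                })->first;
    }

    // Weighted average: combine per-model scores using per-model weights.
    double weighted_average(const std::vector<double>& scores,
                            const std::vector<double>& weights) {
        double sum = 0.0, weight_sum = 0.0;
        for (std::size_t i = 0; i < scores.size(); ++i) {
            sum += scores[i] * weights[i];
            weight_sum += weights[i];
        }
        return (weight_sum > 0.0) ? sum / weight_sum : 0.0;
    }

    int main() {
        std::vector<std::string> votes = {"cat", "dog", "cat"};   // three models' labels
        std::cout << "Majority vote: " << majority_vote(votes) << "\n";

        std::vector<double> scores  = {0.9, 0.6, 0.8};            // three models' scores
        std::vector<double> weights = {0.5, 0.2, 0.3};            // per-model trust weights
        std::cout << "Weighted average: " << weighted_average(scores, weights) << "\n";
        return 0;
    }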

Research papers on ensemble architectures generally:

  1. Yoshitomo Matsubara, Luca Soldaini, Eric Lind, Alessandro Moschitti, Dec 2022, Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems, https://arxiv.org/abs/2201.05767
  2. Yungeng Zhang, Yuru Pei & Hongbin Zha, Sep 2021, Learning Dual Transformer Network for Diffeomorphic Registration, Medical Image Computing and Computer Assisted Intervention, MICCAI 2021, https://link.springer.com/chapter/10.1007/978-3-030-87202-1_13
  3. Xian-Feng Han, Yi-Fei Jin, Hui-Xian Cheng, Guo-Qiang Xiao, Apr 2021, Dual Transformer for Point Cloud Analysis, https://arxiv.org/abs/2104.13044
  4. Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei, 2023, Dual Vision Transformer, https://arxiv.org/pdf/2207.04976, Code: https://github.com/YehLi/ImageNetModel
  5. Mohammed Alhamid, March 2021, Ensemble Models, https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c
  6. Oliver R. A. Dunbar, Andrew B. Duncan, Andrew M. Stuart, Marie-Therese Wolfram, Jan 2022, Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods, https://arxiv.org/abs/2104.03384
  7. Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou, 2023, Revisiting Vision Transformer from the View of Path Ensemble, https://arxiv.org/abs/2308.06548, PDF: https://arxiv.org/pdf/2308.06548.pdf (Treating the internal components of a Transformer as if they are an ensemble model.)
  8. T. G. Dietterich. 2000, Ensemble methods in machine learning, In Multiple classifier systems, pages 1–15. Springer, 2000, Lecture Notes in Computer Science book series LNCS, volume 1857, https://link.springer.com/chapter/10.1007/3-540-45014-9_1, PDF: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e3b09a777c71a4b88888509ab9bfa12d8bf295ba (Early paper with ensemble idea applied to classifiers, rather than multi-model.)
  9. Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q., 2018, Multi-scale dense networks for resource efficient image classification, In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
  10. Y. Matsubara, M. Levorato, and F. Restuccia, 2022, Split computing and early exiting for deep learning applications: Survey and research challenges, ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505 (Split computing is splitting the inference between server and edge machines.)
  11. L. Li, K. Ota and M. Dong, 2018, Deep learning for smart industry: Efficient manufacture inspection system with fog computing, IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665-4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640 (“Fog computing” is like cloud computing but on servers “nearer” to the ground.)
  12. C. Lo, Y.-Y. Su, C.-Y. Lee and S.-C. Chang, 2017, A dynamic deep neural network design for efficient workload allocation in edge computing, Proc. IEEE Int. Conf. Comput. Design (ICCD), pp. 273-280, Nov. 2017. https://ieeexplore.ieee.org/document/8119222
  13. G Xu, Z Hao, Y Luo, H Hu, J An, S Mao, 2023, DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, arXiv preprint arXiv:2309.05015, https://arxiv.org/abs/2309.05015
  14. Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A., 2020, The right tool for the job: Matching model and instance complexities, In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with “wisdom of committees” decisions.)
  15. Naftaly, U., N. Intrator, and D. Horn. 1997, Optimal ensemble averaging of neural networks, Network: Computation in Neural Systems 8, no. 3 (1997): 283–296. https://www.tau.ac.il/~horn/publications/optimal.pdf
  16. Y. Liu and X. Yao, 1999, Ensemble Learning via Negative Correlation, Neural Networks, Volume 12, Issue 10, December 1999, pp. 1399-1404. doi:10.1016/S0893-6080(99)00073-8, https://www.sciencedirect.com/science/article/abs/pii/S0893608099000738
  17. Z.S.H. Chan; N. Kasabov, 2005, Fast neural network ensemble learning via negative-correlation data correction, IEEE Transactions on Neural Networks, Volume 16, Issue 6, November 2005, https://ieeexplore.ieee.org/document/1528547
  18. E Diao, 2023, Efficient and Collaborative Methods for Distributed Machine Learning, Ph.D. thesis, Department of Electrical and Computer Engineering Duke University, https://www.proquest.com/openview/410ea5eb4275fded25890f04c96a902e/1?pq-origsite=gscholar&cbl=18750&diss=y
  19. X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks, IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
  20. Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707 (Multiple submodels inside a large model.)

For research papers on ensemble architectures in general, see https://www.aussieai.com/research/ensemble.

 
