Aussie AI
Mixture of Experts (MoE)
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Mixture of Experts (MoE) is an ensemble inference optimization method in which multiple expert sub-models are trained and used. The efficiency arises from routing each query to only one (or a few) of the experts, so that only a fraction of the total weights is activated, depending on the input tokens. Each expert model is much smaller than a single merged model containing all of the weights.
This method is based on “divide and conquer”: the routing decision between experts “divides” the problem, and the chosen expert model “conquers” the sub-problem. Conceptually, the MoE architecture has some resemblance to cascades, big-little architectures, and knowledge distillation.
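To make the routing idea concrete, below is a minimal C++ sketch of a top-1 gated MoE layer: a small gating network scores each expert for the current input vector, and only the highest-scoring expert's feed-forward weights are evaluated. This is an illustrative sketch only (the class names and toy values are hypothetical, not from the book); a real MoE layer would add softmax gating, top-k routing, load-balancing losses, and batched matrix kernels.

    // Minimal top-1 MoE routing sketch (illustrative only; names and values are hypothetical).
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <limits>
    #include <vector>

    struct Expert {  // One small feed-forward "expert" sub-model
        std::vector<std::vector<float>> weights;  // [out_dim][in_dim]
        std::vector<float> forward(const std::vector<float>& x) const {
            std::vector<float> y(weights.size(), 0.0f);
            for (std::size_t i = 0; i < weights.size(); ++i) {
                for (std::size_t j = 0; j < x.size(); ++j)
                    y[i] += weights[i][j] * x[j];
                y[i] = std::max(0.0f, y[i]);  // ReLU activation
            }
            return y;
        }
    };

    struct MoELayer {
        std::vector<Expert> experts;                    // assumes at least one expert
        std::vector<std::vector<float>> gate_weights;   // [num_experts][in_dim]

        // Route the input to the single best-scoring expert (top-1 gating),
        // so only that expert's weights are activated for this input.
        std::vector<float> forward(const std::vector<float>& x) const {
            std::size_t best = 0;
            float best_score = -std::numeric_limits<float>::infinity();
            for (std::size_t e = 0; e < experts.size(); ++e) {
                float score = 0.0f;
                for (std::size_t j = 0; j < x.size(); ++j)
                    score += gate_weights[e][j] * x[j];
                if (score > best_score) { best_score = score; best = e; }
            }
            return experts[best].forward(x);  // the other experts are skipped entirely
        }
    };

    int main() {
        // Two toy experts over a 2-dimensional input.
        MoELayer layer;
        layer.experts = { Expert{{{1.0f, 0.0f}, {0.0f, 1.0f}}},
                          Expert{{{-1.0f, 0.0f}, {0.0f, -1.0f}}} };
        layer.gate_weights = { {1.0f, 1.0f}, {-1.0f, -1.0f} };
        std::vector<float> out = layer.forward({0.5f, 0.25f});
        std::cout << out[0] << " " << out[1] << "\n";  // the gate selects expert 0
        return 0;
    }

The key efficiency property is visible in MoELayer::forward: however many experts are configured, only one expert's weight matrix is multiplied against the input, so the per-token compute cost stays close to that of a single small model.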
Research papers on MoE multi-model architectures:
- William Fedus, Barret Zoph, and Noam Shazeer, 2021, Switch transformers: Scaling to trillion-parameter models with simple and efficient sparsity, J. Mach. Learn. Res. 23:1–40, https://arxiv.org/abs/2101.03961
- Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du, 2023, Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts, arXiv preprint, https://arxiv.org/abs/2309.04354 (This paper covers Sparse MoEs for vision transformers.)
- Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus, 2022, Designing effective sparse expert models, arXiv preprint arXiv:2202.08906, https://arxiv.org/abs/2202.08906v1
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538, https://arxiv.org/abs/1701.06538 (An early sparse MoE paper with thousands of expert mini-models.)
- IC Gormley, S Frühwirth-Schnatter, June 2018, Mixture of experts models, https://arxiv.org/abs/1806.08200
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, 2020, GShard: Scaling giant models with conditional computation and automatic sharding, arXiv preprint arXiv:2006.16668, https://arxiv.org/abs/2006.16668 (Sharding technique applied to an MoE model for further optimization.)
- Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui, 2022, GLaM: Efficient scaling of language models with mixture-of-experts, ICML 2022, https://arxiv.org/abs/2112.06905, PDF: https://proceedings.mlr.press/v162/du22c/du22c.pdf
- Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, 2022, DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, ICML 2022, https://arxiv.org/abs/2201.05596, PDF: https://proceedings.mlr.press/v162/rajbhandari22a/rajbhandari22a.pdf
- Z Chen, Y Deng, Y Wu, Q Gu, Y Li, Aug 2022, Towards understanding mixture of experts in deep learning, arXiv preprint arXiv:2208.02813, https://arxiv.org/abs/2208.02813
- Y Krishnamurthy, C Watkins, T Gaertner, 2023, Improving Expert Specialization in Mixture of Experts, arXiv preprint arXiv:2302.14703, https://arxiv.org/pdf/2302.14703
- I Voroneckaja, 2023, Automatic architecture selection for hierarchical mixture of experts models, Ph.D. Thesis, School of Mathematics & Statistics, University of Glasgow, https://theses.gla.ac.uk/83492/1/2023VoroneckajaPhD.pdf
- SE Yuksel, JN Wilson, PD Gader, 2012, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems (Volume 23, Issue 8, August 2012), https://ieeexplore.ieee.org/document/6215056, PDF: https://www.researchgate.net/profile/Seniha-Yuksel/publication/260707711_Twenty_Years_of_Mixture_of_Experts/links/568f68e508aeaa1481b077de/Twenty-Years-of-Mixture-of-Experts.pdf
- Saeed Masoudnia & Reza Ebrahimpour, 2014, Mixture of experts: a literature survey, Artificial Intelligence Review, volume 42, pages 275–293 (2014), https://link.springer.com/article/10.1007/s10462-012-9338-y
- Ran Avnimelech, Nathan Intrator, 1999, Boosted mixture of experts: an ensemble learning scheme, Neural Comput 11(2): 483–497, https://ieeexplore.ieee.org/abstract/document/6790707
- Chen K, Xu L, Chi H, 1999, Improved learning algorithms for mixture of experts in multiclass classification, Neural Netw 12(9): 1229–1252 https://pubmed.ncbi.nlm.nih.gov/12662629/
- Reza Ebrahimpour, Ehsanollah Kabir, Hossein Esteky, Mohammad Reza Yousefi, 2008, View-independent face recognition with mixture of experts, Neurocomputing Volume 71, Issues 4–6, January 2008, Pages 1103-1107, https://www.sciencedirect.com/science/article/abs/pii/S0925231207003074
- Goodband JH, Haas OCL, Mills JA, 2006, A mixture of experts committee machine to design compensators for intensity modulated radiation therapy, Pattern Recogn 39(9): 1704–1714. doi:10.1016/j.patcog.2006.03.018, https://doi.org/10.1016%2Fj.patcog.2006.03.018
- Hansen JV, 1999, Combining predictors: comparison of five meta machine learning methods, Inform Sci 119(1–2): 91–105, https://doi.org/10.1016/S0020-0255(99)00052-3, https://www.sciencedirect.com/science/article/abs/pii/S0020025599000523
- Hong X, Harris CJ, 2001, A mixture of experts network structure construction algorithm for modeling and control, Appl Intell 16(1): 59–69 https://link.springer.com/article/10.1023/A:1012869427428
- Islam MM, Yao X, Murase K, 2003, A constructive algorithm for training cooperative neural network ensembles, IEEE Trans Neural Netw 14(4): 820–834 https://doi.org/10.1109%2FTNN.2003.813832, https://pubmed.ncbi.nlm.nih.gov/18238062/
- R Csordás, K Irie, J Schmidhuber, Oct 2023, Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv preprint arXiv:2310.10837, https://arxiv.org/pdf/2310.10837.pdf
- Adrià Ruiz and Jakob Verbeek. 2019, Adaptative inference cost with convolutional neural mixture models, ICCV, pages 1872–1881, 2019, https://arxiv.org/abs/1908.06694
For more research papers on MoE architectures, see https://www.aussieai.com/research/moe.