Aussie AI
Mixture of Experts (MoE)
Last Updated 8 December, 2024
by David Spuler, Ph.D.
Mixture of Experts (MoE) is an ensemble inference optimization method in which multiple sub-models ("experts") are trained and used together. The efficiency arises from routing each query to only one (or a few) of the experts, so that only a fraction of the total weights is activated, depending on the input tokens. Each expert model is smaller than a single merged model containing all of the experts' weights would be.
The MoE method is based on "divide and conquer": the gating decision between experts "divides" the problem, and the chosen expert model "conquers" the sub-problem. Conceptually, the MoE architecture has some resemblance to cascades, big-little architectures, and knowledge distillation.
The MoE architecture has seen a resurgence as a "hot" research area. Rumors about the architectures of both GPT-4 and Google Gemini describe them as MoE designs; GPT-4 is unofficially reported to be an 8-model MoE architecture with 1.76T weights in total (across the 8 models).
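To make the gating idea concrete, the sketch below shows a top-k routed sparse MoE forward pass in C++. This is a minimal illustrative sketch, not taken from any of the papers below: the Expert network and the gating scores are placeholders with hypothetical names, whereas a real MoE layer uses learned gating weights, per-token routing inside the FFN sub-layer, and usually an auxiliary load-balancing loss.
```cpp
// Minimal sketch of sparse top-k MoE routing (illustrative, hypothetical names).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Placeholder "expert" sub-network (stands in for a full feed-forward block).
struct Expert {
    Vec forward(const Vec& x) const {
        Vec y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = std::max(0.0f, x[i]);   // toy ReLU instead of a learned FFN
        return y;
    }
};

// Placeholder gating network: one score per expert.
// A real gate is a learned linear layer over the token's hidden state.
Vec gate_scores(const Vec& x, std::size_t num_experts) {
    Vec scores(num_experts, 0.0f);
    for (std::size_t e = 0; e < num_experts; ++e)
        for (std::size_t i = 0; i < x.size(); ++i)
            scores[e] += x[i] * 0.01f * float(e + 1);   // stand-in for learned weights
    return scores;
}

// Sparse MoE forward pass: route to the top-k experts only,
// then blend their outputs with softmax weights over the selected scores.
Vec moe_forward(const Vec& x, const std::vector<Expert>& experts, std::size_t k) {
    Vec scores = gate_scores(x, experts.size());

    // Indices of the k highest-scoring experts.
    std::vector<std::size_t> idx(experts.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });

    // Softmax over the selected experts' scores only.
    std::vector<float> w(k);
    float denom = 0.0f;
    for (std::size_t j = 0; j < k; ++j) { w[j] = std::exp(scores[idx[j]]); denom += w[j]; }

    // Only the chosen experts execute; the rest of the weights stay idle.
    Vec out(x.size(), 0.0f);
    for (std::size_t j = 0; j < k; ++j) {
        Vec y = experts[idx[j]].forward(x);
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] += (w[j] / denom) * y[i];
    }
    return out;
}

int main() {
    std::vector<Expert> experts(8);        // e.g., 8 experts
    Vec x = {0.5f, -1.0f, 2.0f, 0.1f};     // toy input vector
    Vec y = moe_forward(x, experts, 2);    // top-2 routing: only 2 of 8 experts run
    return y.empty() ? 1 : 0;
}
```
The efficiency property is visible in moe_forward: only k of the experts run for a given input, so the compute cost scales with k rather than with the total number of experts.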
Research Papers on Mixture of Experts
- William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res, 23:1–40, 2021, https://arxiv.org/abs/2101.03961
- Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du, 2023, Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts, arXiv preprint, https://arxiv.org/abs/2309.04354 (This paper covers Sparse MoEs for vision transformers.)
- Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Designing effective sparse expert models. arXiv preprint arXiv:2202.08906, 2022. https://arxiv.org/abs/2202.08906v1
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, https://arxiv.org/abs/1701.06538 (Early sparse MoE paper with thousands of expert mini-models.)
- IC Gormley, S Frühwirth-Schnatter, June 2018, Mixture of experts models, https://arxiv.org/abs/1806.08200
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020, https://arxiv.org/abs/2006.16668 (Sharding technique applied to an MoE model for further optimization.)
- Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui, Glam: Efficient scaling of language models with mixture-of-experts, ICML 2022, https://arxiv.org/abs/2112.06905, PDF: https://proceedings.mlr.press/v162/du22c/du22c.pdf
- Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, 2022, DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, ICML 2022, https://arxiv.org/abs/2201.05596, PDF: https://proceedings.mlr.press/v162/rajbhandari22a/rajbhandari22a.pdf
- Z Chen, Y Deng, Y Wu, Q Gu, Y Li, Aug 2022, Towards understanding mixture of experts in deep learning, arXiv preprint arXiv:2208.02813, https://arxiv.org/abs/2208.02813
- Y Krishnamurthy, C Watkins, T Gaertner, 2023, Improving Expert Specialization in Mixture of Experts, arXiv preprint arXiv:2302.14703, https://arxiv.org/pdf/2302.14703
- I Voroneckaja, 2023, Automatic architecture selection for hierarchical mixture of experts models, Ph.D. Thesis, School of Mathematics & Statistics, University of Glasgow, https://theses.gla.ac.uk/83492/1/2023VoroneckajaPhD.pdf
- SE Yuksel, JN Wilson, PD Gader, 2012, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems (Volume 23, Issue 8, August 2012), https://ieeexplore.ieee.org/document/6215056, PDF: https://www.researchgate.net/profile/Seniha-Yuksel/publication/260707711Twenty_Years_of_Mixture_of_Experts/links/568f68e508aeaa1481b077de/Twenty-Years-of-Mixture-of-Experts.pdf
- Saeed Masoudnia & Reza Ebrahimpour, 2014, Mixture of experts: a literature survey, Artificial Intelligence Review volume 42, pages 275–293 (2014), https://link.springer.com/article/10.1007/s10462-012-9338-y
- Ran Avnimelech; Nathan Intrator, 1999, Boosted mixture of experts: an ensemble learning scheme. Neural Comput 11(2): 483–497, https://ieeexplore.ieee.org/abstract/document/6790707
- Chen K, Xu L, Chi H, 1999, Improved learning algorithms for mixture of experts in multiclass classification. Neural Netw 12(9): 1229–1252 https://pubmed.ncbi.nlm.nih.gov/12662629/
- Reza Ebrahimpour, Ehsanollah Kabir, Hossein Esteky, Mohammad Reza Yousefi, 2008, View-independent face recognition with mixture of experts. Neurocomputing Volume 71, Issues 4–6, January 2008, Pages 1103-1107, https://www.sciencedirect.com/science/article/abs/pii/S0925231207003074
- Goodband JH, Haas OCL, Mills JA, 2006, A mixture of experts committee machine to design compensators for intensity modulated radiation therapy. Pattern Recogn 39(9): 1704–1714. doi:10.1016/j.patcog.2006.03.018, https://doi.org/10.1016%2Fj.patcog.2006.03.018
- Hansen JV, 1999, Combining predictors: comparison of five meta machine learning methods. Inform Sci 119(1–2): 91–105, https://doi.org/10.1016/S0020-0255(99)00052-3, https://www.sciencedirect.com/science/article/abs/pii/S0020025599000523
- Hong X, Harris CJ, 2001, A mixture of experts network structure construction algorithm for modelling and control. Appl Intell 16(1): 59–69 https://link.springer.com/article/10.1023/A:1012869427428
- Islam MM, Yao X, Murase K, 2003, A constructive algorithm for training cooperative neural network ensembles. IEEE Trans Neural Netw 14(4): 820–834 https://doi.org/10.1109%2FTNN.2003.813832, https://pubmed.ncbi.nlm.nih.gov/18238062/
- R Csordás, K Irie, J Schmidhuber, Oct 2023, Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv preprint arXiv:2310.10837, https://arxiv.org/pdf/2310.10837.pdf
- Adrià Ruiz and Jakob Verbeek. Adaptative inference cost with convolutional neural mixture models. ICCV, pages 1872–1881, 2019, https://arxiv.org/abs/1908.06694
- Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu, 28 Aug 2023, EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models, https://arxiv.org/abs/2308.14352
- Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
- Josef Pichlmeier, Philipp Ross, Andre Luckow, 22 Apr 2024, Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification, https://arxiv.org/abs/2404.15153
- Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini, 12 Apr 2024, CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models, https://arxiv.org/abs/2404.08763 (Sparsity with dynamic control over the thresholds with an effect that is similar to intra-model MoE.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, 8 Jan 2024, Mixtral of Experts, https://arxiv.org/abs/2401.04088 Notes: https://mistral.ai/news/mixtral-of-experts/ (Mistral AI MoE architecture paper.)
- Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 14 Dec 2023, SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987 Code: https://github.com/robertcsordas/moe_attention
- Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda, 8 Apr 2024, Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models, https://arxiv.org/abs/2404.05567 (Examines creating MoE models with fewer parameters than normally required for MoE effectiveness.)
- Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang, 3 Apr 2024, Toward Inference-optimal Mixture-of-Expert Large Language Models, https://arxiv.org/abs/2404.02852
- Qwen Team, March 28, 2024, Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters, https://qwenlm.github.io/blog/qwen-moe/
- Harry Dong, Beidi Chen, Yuejie Chi, 1 Apr 2024, Prompt-prompted Mixture of Experts for Efficient LLM Generation, https://arxiv.org/abs/2404.01365 Code: https://github.com/hdong920/GRIFFIN
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
- Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, Sara Hooker, Sep 2023, Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning, https://arxiv.org/abs/2309.05444 Code: https://github.com/for-ai/parameter-efficient-moe
- E Frantar, D Alistarh, Oct 2023, QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv preprint arXiv:2310.16795, https://arxiv.org/pdf/2310.16795.pdf
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan, 9 Feb 2022 (v2), Unified Scaling Laws for Routed Language Models, https://arxiv.org/abs/2202.01169
- Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov, 26 Oct 2022 (v2), Efficient Large Scale Language Modeling with Mixtures of Experts, https://arxiv.org/abs/2112.10684
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Lamini, June 2024, Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations, https://www.lamini.ai/blog/lamini-memory-tuning PDF: https://github.com/lamini-ai/Lamini-Memory-Tuning/blob/main/research-paper.pdf (Deploy models with many LoRA adapters, selecting between them with MoE.)
- Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
- Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Xu Owen He, 4 Jul 2024, Mixture of A Million Experts, Google DeepMind, https://arxiv.org/abs/2407.04153
- Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang, 16 Jul 2024, Scaling Diffusion Transformers to 16 Billion Parameters, https://arxiv.org/abs/2407.11633 Project: https://github.com/feizc/DiT-MoE
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Tiernan Ray, July 24, 2024, 3 ways Meta's Llama 3.1 is an advance for Gen AI, https://www.zdnet.com/article/3-ways-metas-llama-3-1-is-an-advance-for-gen-ai/
- Zarif Bin Akhtar, Mapping Generative Artificial Intelligence (GAI's) Exciting Future: From Gemini to Q* and Beyond, https://publications.eai.eu/index.php/airo/article/view/5962 https://doi.org/10.4108/airo.5962 PDF: https://publications.eai.eu/index.php/airo/article/view/5962/3329
- Arpita Vats, Rahul Raja, Vinija Jain, Aman Chadha, 2024, The Evolution of MoE: A Survey from Basics to Breakthroughs, https://www.researchgate.net/profile/Aman-Chadha/publication/383127907_The_Evolution_of_Mixture_of_Experts_A_Survey_from_Basics_to_Breakthroughs/links/66c597c24b25ef677f728421/The-Evolution-of-Mixture-of-Experts-A-Survey-from-Basics-to-Breakthroughs.pdf
- Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, 8 Aug 2024 (v2), A Survey on Mixture of Experts, https://arxiv.org/abs/2407.06204 Project: https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts
- Mengyi Yan, Yaoshu Wang, Kehan Pang, Min Xie, Jianxin Li, 24 August 2024, Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing, KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Pages 3690 - 3701, https://doi.org/10.1145/3637528.3671873 https://dl.acm.org/doi/abs/10.1145/3637528.3671873
- Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li, 19 Aug 2024, AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference, https://arxiv.org/abs/2408.10284
- Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193
- Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn, 2 Sep 2024, Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching, https://arxiv.org/abs/2409.01141
- Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi, 3 Sep 2024, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
- Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda, 17 Jan 2024 (v2), Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, https://arxiv.org/abs/2401.08383
- Michael Nuñez, September 19, 2024, Microsoft’s GRIN-MoE AI model takes on coding and math, beating competitors in key benchmarks, https://venturebeat.com/ai/microsofts-grin-moe-ai-model-takes-on-coding-and-math-beating-competitors-in-key-benchmarks/
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024, A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash, 13 Oct 2024, MoIN: Mixture of Introvert Experts to Upcycle an LLM, https://arxiv.org/abs/2410.09687
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
- Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Yi Lin, Binfan Zheng, Li Zeng, Rongqian Zhao, Xin Chen, 30 Aug 2024 (v2), Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection, https://arxiv.org/abs/2406.00023
- Weikai Li, Ding Wang, Zijian Ding, Atefeh Sohrabizadeh, Zongyue Qin, Jason Cong, Yizhou Sun, 25 Oct 2024, Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis, https://arxiv.org/abs/2410.19225
- Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
- M. Xu, D. Cai, W. Yin, S. Wang, X. Jin, X. Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
Sparse MoE
- Soumajyoti Sarkar, Leonard Lausen, Volkan Cevher, Sheng Zha, Thomas Brox, George Karypis, 2 Sep 2024, Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning, https://arxiv.org/abs/2409.01483
- Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi, 3 Sep 2024, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060
- Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
MoE Optimization Techniques
Papers on efficiency and speed optimizations for MoE architectures:
- Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda, 17 Jan 2024 (v2), Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, https://arxiv.org/abs/2401.08383
- Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
- R Cai, Y Ro, GW Kim, P Wang, BE Bejnordi, A Akella, Oct 2024, Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://utns.cs.utexas.edu/assets/papers/neurips24-readme.pdf https://github.com/VITA-Group/READ-ME (Extract multiple smaller MoE expert models from a large LLM.)
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
- Xiaoniu Song, Zihang Zhong, Rong Chen, 29 Oct 2024, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
- Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo, 6 Nov 2024 (v2), HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference, https://arxiv.org/abs/2411.01433
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
- Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang, 11 Nov 2024, WDMoE: Wireless Distributed Mixture of Experts for Large Language Models, https://arxiv.org/abs/2411.06681
- Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu, 5 Dec 2024, Bench-CoE: a Framework for Collaboration of Experts from Benchmark, https://arxiv.org/abs/2412.04167 https://github.com/ZhangXJ199/Bench-CoE
More AI Research
Read more about:
- Cascade Models
- Ensemble Models
- Inference Optimizations
- Loop Optimizations
- Code Optimizations