Aussie AI
Mixture of Experts (MoE)
Last Updated 21 March, 2025
by David Spuler, Ph.D.
Mixture of Experts (MoE) is an ensemble inference optimization method in which multiple sub-models ("experts") are trained and used together. The efficiency gain comes from routing each query to only one or a few of the experts, so that only a fraction of the total weights is activated, depending on the input tokens. Each expert model is smaller than a single merged model containing all of the weights would be.
The MoE method is based on "divide and conquer": a routing decision between experts "divides" the problem, and the chosen expert model "conquers" the sub-problem. Conceptually, the MoE architecture has some resemblance to cascades, big-little architectures, and knowledge distillation.
The MoE architecture has seen a resurgence in research interest. Rumors about the architectures of both GPT-4 and Google Gemini suggest they are MoE models; GPT-4 is unofficially reported to be an 8-model MoE architecture with roughly 1.76T weights in total across the eight models.
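To make the routing idea concrete, here is a minimal sketch of a sparse MoE feed-forward layer with top-k gating, written in plain Python/NumPy. The class name, layer sizes, and the simple softmax router are illustrative assumptions for this page, not the design of any particular paper or production system.

```python
# Minimal sketch of a sparse MoE feed-forward layer with top-k routing (illustrative only).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

class SparseMoELayer:
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router ("gating network"): one score column per expert.
        self.router_w = rng.standard_normal((d_model, num_experts)) * 0.02
        # Each expert is a small 2-layer feed-forward network.
        self.experts = [
            (rng.standard_normal((d_model, d_hidden)) * 0.02,
             rng.standard_normal((d_hidden, d_model)) * 0.02)
            for _ in range(num_experts)
        ]

    def forward(self, x):
        # x: one token's hidden state, a vector of size d_model.
        gate_logits = x @ self.router_w              # score every expert
        gate_probs = softmax(gate_logits)
        top = np.argsort(gate_probs)[-self.top_k:]   # indices of the top-k experts
        weights = gate_probs[top] / gate_probs[top].sum()  # renormalize gate weights
        out = np.zeros_like(x)
        for w, idx in zip(weights, top):             # only top_k experts are evaluated
            w1, w2 = self.experts[idx]
            hidden = np.maximum(x @ w1, 0.0)         # ReLU feed-forward expert
            out += w * (hidden @ w2)                 # weighted sum of expert outputs
        return out

# Usage: only top_k of num_experts experts run per token.
layer = SparseMoELayer()
token = np.random.default_rng(1).standard_normal(64)
y = layer.forward(token)
print(y.shape)  # (64,)
```

Only the top_k selected experts execute for each token, which is where the inference savings come from; real systems such as Switch Transformers or Mixtral (both cited below) add batched expert dispatch, load-balancing losses during training, and expert parallelism across devices.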
Research Papers on Mixture of Experts
- William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res, 23:1–40, 2021, https://arxiv.org/abs/2101.03961
- Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du, 2023, Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts, arXiv preprint, https://arxiv.org/abs/2309.04354 (This paper covers Sparse MoEs for vision transformers.)
- Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Designing effective sparse expert models. arXiv preprint arXiv:2202.08906, 2022. https://arxiv.org/abs/2202.08906v1
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017, https://arxiv.org/abs/1701.06538 (Early sparse MoE paper with thousands of expert mini-models.)
- IC Gormley, S Frühwirth-Schnatter, June 2018, Mixture of experts models, https://arxiv.org/abs/1806.08200
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020, https://arxiv.org/abs/2006.16668 (Sharding technique applied to an MoE model for further optimization.)
- Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui, Glam: Efficient scaling of language models with mixture-of-experts, ICML 2022, https://arxiv.org/abs/2112.06905, PDF: https://proceedings.mlr.press/v162/du22c/du22c.pdf
- Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, 2022, DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, ICML 2022, https://arxiv.org/abs/2201.05596, PDF: https://proceedings.mlr.press/v162/rajbhandari22a/rajbhandari22a.pdf
- Z Chen, Y Deng, Y Wu, Q Gu, Y Li, Aug 2022, Towards understanding mixture of experts in deep learning, arXiv preprint arXiv:2208.02813, https://arxiv.org/abs/2208.02813
- Y Krishnamurthy, C Watkins, T Gaertner, 2023, Improving Expert Specialization in Mixture of Experts, arXiv preprint arXiv:2302.14703, https://arxiv.org/pdf/2302.14703
- I Voroneckaja, 2023, Automatic architecture selection for hierarchical mixture of experts models, Ph.D. Thesis, School of Mathematics & Statistics, University of Glasgow, https://theses.gla.ac.uk/83492/1/2023VoroneckajaPhD.pdf
- SE Yuksel, JN Wilson, PD Gader, 2012, Twenty years of mixture of experts, IEEE Transactions on Neural Networks and Learning Systems (Volume 23, Issue 8, August 2012), https://ieeexplore.ieee.org/document/6215056, PDF: https://www.researchgate.net/profile/Seniha-Yuksel/publication/260707711Twenty_Years_of_Mixture_of_Experts/links/568f68e508aeaa1481b077de/Twenty-Years-of-Mixture-of-Experts.pdf
- Saeed Masoudnia & Reza Ebrahimpour, 2014, Mixture of experts: a literature survey, Artificial Intelligence Review, volume 42, pages 275–293 (2014), https://link.springer.com/article/10.1007/s10462-012-9338-y
- Ran Avnimelech, Nathan Intrator, 1999, Boosted mixture of experts: an ensemble learning scheme. Neural Comput 11(2): 483–497, https://ieeexplore.ieee.org/abstract/document/6790707
- Chen K, Xu L, Chi H, 1999, Improved learning algorithms for mixture of experts in multiclass classification. Neural Netw 12(9): 1229–1252 https://pubmed.ncbi.nlm.nih.gov/12662629/
- Reza Ebrahimpour, Ehsanollah Kabir, Hossein Esteky, Mohammad Reza Yousefi, 2008, View-independent face recognition with mixture of experts. Neurocomputing Volume 71, Issues 4–6, January 2008, Pages 1103-1107, https://www.sciencedirect.com/science/article/abs/pii/S0925231207003074
- Goodband JH, Haas OCL, Mills JA, 2006, A mixture of experts committee machine to design compensators for intensity modulated radiation therapy. Pattern Recogn 39(9): 1704–1714. doi:10.1016/j.patcog.2006.03.018, https://doi.org/10.1016%2Fj.patcog.2006.03.018
- Hansen JV, 1999, Combining predictors: comparison of five meta machine learning methods. Inform Sci 119(1–2): 91–105, https://doi.org/10.1016/S0020-0255(99)00052-3, https://www.sciencedirect.com/science/article/abs/pii/S0020025599000523
- Hong X, Harris CJ, 2001, A mixture of experts network structure construction algorithm for modelling and control. Appl Intell 16(1): 59–69 https://link.springer.com/article/10.1023/A:1012869427428
- Islam MM, Yao X, Murase K, 2003, A constructive algorithm for training cooperative neural network ensembles. IEEE Trans Neural Netw 14(4): 820–834 https://doi.org/10.1109%2FTNN.2003.813832, https://pubmed.ncbi.nlm.nih.gov/18238062/
- R Csordás, K Irie, J Schmidhuber, Oct 2023, Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv preprint arXiv:2310.10837, https://arxiv.org/pdf/2310.10837.pdf
- Adrià Ruiz and Jakob Verbeek. Adaptative inference cost with convolutional neural mixture models. ICCV, pages 1872–1881, 2019, https://arxiv.org/abs/1908.06694
- Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu, 28 Aug 2023, EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models, https://arxiv.org/abs/2308.14352
- Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
- Josef Pichlmeier, Philipp Ross, Andre Luckow, 22 Apr 2024, Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification, https://arxiv.org/abs/2404.15153
- Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini, 12 Apr 2024, CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models, https://arxiv.org/abs/2404.08763 (Sparsity with dynamic control over the thresholds with an effect that is similar to intra-model MoE.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, 8 Jan 2024, Mixtral of Experts, https://arxiv.org/abs/2401.04088 Notes: https://mistral.ai/news/mixtral-of-experts/ (Mistral AI MoE architecture paper.)
- Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 14 Dec 2023, SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987 Code: https://github.com/robertcsordas/moe_attention
- Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda, 8 Apr 2024, Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models, https://arxiv.org/abs/2404.05567 (Examining creating MoE models using fewer parameters than normally required for MoE effectiveness.)
- Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang, 3 Apr 2024, Toward Inference-optimal Mixture-of-Expert Large Language Models, https://arxiv.org/abs/2404.02852
- Qwen Team, March 28, 2024, Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters, https://qwenlm.github.io/blog/qwen-moe/
- Harry Dong, Beidi Chen, Yuejie Chi, 1 Apr 2024, Prompt-prompted Mixture of Experts for Efficient LLM Generation, https://arxiv.org/abs/2404.01365 Code: https://github.com/hdong920/GRIFFIN
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
- Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, Sara Hooker, Sep 2023, Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning, https://arxiv.org/abs/2309.05444 Code: https://github.com/for-ai/parameter-efficient-moe
- E Frantar, D Alistarh, Oct 2023, QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv preprint arXiv:2310.16795, https://arxiv.org/pdf/2310.16795.pdf
- David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan, 9 Feb 2022 (v2), Unified Scaling Laws for Routed Language Models, https://arxiv.org/abs/2202.01169
- Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov, 26 Oct 2022 (v2), Efficient Large Scale Language Modeling with Mixtures of Experts, https://arxiv.org/abs/2112.10684
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Lamini, June 2024, Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations, https://www.lamini.ai/blog/lamini-memory-tuning PDF: https://github.com/lamini-ai/Lamini-Memory-Tuning/blob/main/research-paper.pdf (Deploy models with many LoRA adapters, selecting between them with MoE.)
- Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
- Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Xu Owen He, 4 Jul 2024, Mixture of A Million Experts, Google DeepMind, https://arxiv.org/abs/2407.04153
- Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang, 16 Jul 2024, Scaling Diffusion Transformers to 16 Billion Parameters, https://arxiv.org/abs/2407.11633 Project: https://github.com/feizc/DiT-MoE
- Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Tiernan Ray, July 24, 2024, 3 ways Meta's Llama 3.1 is an advance for Gen AI, https://www.zdnet.com/article/3-ways-metas-llama-3-1-is-an-advance-for-gen-ai/
- Zarif Bin Akhtar, Mapping Generative Artificial Intelligence (GAI's) Exciting Future: From Gemini to Q* and Beyond, https://publications.eai.eu/index.php/airo/article/view/5962 https://doi.org/10.4108/airo.5962 PDF: https://publications.eai.eu/index.php/airo/article/view/5962/3329
- Arpita Vats, Rahul Raja, Vinija Jain, Aman Chadha, 2024, The Evolution of MoE: A Survey from Basics to Breakthroughs, https://www.researchgate.net/profile/Aman-Chadha/publication/383127907_The_Evolution_of_Mixture_of_Experts_A_Survey_from_Basics_to_Breakthroughs/links/66c597c24b25ef677f728421/The-Evolution-of-Mixture-of-Experts-A-Survey-from-Basics-to-Breakthroughs.pdf
- Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang, 8 Aug 2024 (v2), A Survey on Mixture of Experts, https://arxiv.org/abs/2407.06204 Project: https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts
- Mengyi Yan, Yaoshu Wang, Kehan Pang, Min Xie, Jianxin Li, 24 August 2024, Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing, KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Pages 3690 - 3701, https://doi.org/10.1145/3637528.3671873 https://dl.acm.org/doi/abs/10.1145/3637528.3671873
- Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li, 19 Aug 2024, AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference, https://arxiv.org/abs/2408.10284
- Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193
- Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn, 2 Sep 2024, Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching, https://arxiv.org/abs/2409.01141
- Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi, 3 Sep 2024, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Sean Michael Kerner, September 17, 2024, Model routing: The secret weapon for maximizing AI efficiency in enterprises, https://venturebeat.com/ai/why-accenture-and-martian-see-model-routing-as-key-to-enterprise-ai-success/
- Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda, 17 Jan 2024 (v2), Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, https://arxiv.org/abs/2401.08383
- Michael Nuñez, September 19, 2024, Microsoft’s GRIN-MoE AI model takes on coding and math, beating competitors in key benchmarks, https://venturebeat.com/ai/microsofts-grin-moe-ai-model-takes-on-coding-and-math-beating-competitors-in-key-benchmarks/
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash, 13 Oct 2024, MoIN: Mixture of Introvert Experts to Upcycle an LLM, https://arxiv.org/abs/2410.09687
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
- Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Yi Lin, Binfan Zheng, Li Zeng, Rongqian Zhao, Xin Chen, 30 Aug 2024 (v2), Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection, https://arxiv.org/abs/2406.00023
- Weikai Li, Ding Wang, Zijian Ding, Atefeh Sohrabizadeh, Zongyue Qin, Jason Cong, Yizhou Sun, 25 Oct 2024, Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis, https://arxiv.org/abs/2410.19225
- Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
- DeepSeek, Dec 2024, DeepSeek V3 Technical Report, https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf (At the time of writing, DeepSeek V3 was a leading open-source frontier model.)
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Content window over 1 million tokens.)
Sparse MoE
- Soumajyoti Sarkar, Leonard Lausen, Volkan Cevher, Sheng Zha, Thomas Brox, George Karypis, 2 Sep 2024, Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning, https://arxiv.org/abs/2409.01483
- Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi, 3 Sep 2024, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060
- Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu, 18 Jan 2025, FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models, https://arxiv.org/abs/2501.10714
- Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak, 21 Jan 2025, Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models, https://arxiv.org/abs/2501.12370
- Tiernan Ray, Jan. 28, 2025, Apple researchers reveal the secret sauce behind DeepSeek AI: The AI model that shook the world is part of a broad trend to squeeze more out of chips using what's called sparsity. https://www.zdnet.com/article/apple-researchers-reveal-the-secret-sauce-behind-deepseek-ai/ (Sparsity applied to MoE.)
- Wensheng Gan, Zhenyao Ning, Zhenlian Qi, Philip S. Yu, 18 Jan 2025, Mixture of Experts (MoE): A Big Data Perspective, https://arxiv.org/abs/2501.16352
- Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan, 13 Mar 2025, Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores, https://arxiv.org/abs/2503.10725
MoE Optimization Techniques
Papers on efficiency and speed optimizations of MoE architectures:
- Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda, 17 Jan 2024 (v2), Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, https://arxiv.org/abs/2401.08383
- Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
- R Cai, Y Ro, GW Kim, P Wang, BE Bejnordi, A Akella, Oct 2024, Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://utns.cs.utexas.edu/assets/papers/neurips24-readme.pdf https://github.com/VITA-Group/READ-ME (Extract multiple smaller MoE expert models from a large LLM.)
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
- Xiaoniu Song, Zihang Zhong, Rong Chen, 29 Oct 2024, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
- Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo, 6 Nov 2024 (v2), HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference, https://arxiv.org/abs/2411.01433
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Umesh Deshpande, Travis Janssen, Mudhakar Srivatsa, and Swaminathan Sundararaman. 2024. MoEsaic: Shared Mixture of Experts. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC '24). Association for Computing Machinery, New York, NY, USA, 434–442. https://doi.org/10.1145/3698038.3698521 https://dl.acm.org/doi/abs/10.1145/3698038.3698521
- Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang, 11 Nov 2024, WDMoE: Wireless Distributed Mixture of Experts for Large Language Models, https://arxiv.org/abs/2411.06681
- Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
- AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu, 5 Dec 2024, Bench-CoE: a Framework for Collaboration of Experts from Benchmark, https://arxiv.org/abs/2412.04167 https://github.com/ZhangXJ199/Bench-CoE
- Yao Fu, Yinsicheng Jiang, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Kai Zou, Edoardo Ponti, Luo Mai, 10 Dec 2024, MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems, https://arxiv.org/abs/2412.07067 https://huggingface.co/spaces/sparse-generative-ai/open-moe-llm-leaderboard
- Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Content window over 1 million tokens.)
- Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
- Qwen Team, January 21, 2025, Global-batch load balance almost free lunch to improve your MoE LLM training, https://qwenlm.github.io/blog/global-load-balance/
- Nandini Lokesh Reddy, Jan 2025, DeepSeek: Bridging Performance and Efficiency in Modern AI, https://medium.com/@nandinilreddy/deepseek-bridging-performance-and-efficiency-in-modern-ai-106181a85693
- Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak, 25 Jan 2025 (v2), Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models, https://arxiv.org/abs/2501.12370
- Tiernan Ray, Jan. 28, 2025, Apple researchers reveal the secret sauce behind DeepSeek AI: The AI model that shook the world is part of a broad trend to squeeze more out of chips using what's called sparsity. https://www.zdnet.com/article/apple-researchers-reveal-the-secret-sauce-behind-deepseek-ai/ (Sparsity applied to MoE.)
- Wensheng Gan, Zhenyao Ning, Zhenlian Qi, Philip S. Yu, 18 Jan 2025, Mixture of Experts (MoE): A Big Data Perspective, https://arxiv.org/abs/2501.16352
- Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zhen Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie, Yen-Chang Hsu, 25 Jan 2025, ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning, https://arxiv.org/abs/2501.15316
- Tech Fund, Feb 03, 2025, The Winners from DeepSeek, Nvidia, and The Outlook in AI: A tour of the space & AI-exposed stocks, https://www.techinvestments.io/p/the-winners-from-deepseek-nvidia
- Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan, Suchin Gururangan, Mike Lewis, 31 Jan 2025, BTS: Harmonizing Specialized Experts into a Generalist LLM, https://arxiv.org/abs/2502.00075 (Combining multiple fine-tuned expert models via "layer stitching").
- Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu, 4 Feb 2025 (v2), MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs, https://arxiv.org/abs/2502.00997
- Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang, 7 Feb 2025, fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving, https://arxiv.org/abs/2502.05370
- Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 6 Feb 2025, CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference, https://arxiv.org/abs/2502.04416 https://github.com/JarvisPei/CMoE
- Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng, 9 Feb 2025, Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline, https://arxiv.org/abs/2502.06888
- Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu, 27 Feb 2025, Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. https://arxiv.org/abs/2502.19811
- Ashley Goolam, March 4, 2025, DeepSeek Open Source Week: A Complete Summary, https://apidog.com/blog/deepseek-open-source-week/
- Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2025. Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '25). Association for Computing Machinery, New York, NY, USA, 170–182. https://doi.org/10.1145/3710848.3710868 https://dl.acm.org/doi/abs/10.1145/3710848.3710868
- Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
- W Sun, D Lan, T Zhu, X Qu, Y Cheng, Mar 2025, Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts, ICLR 2025 review, https://openreview.net/pdf?id=HKIvuZxGbl
- W Zhang, X Ren, Mar 2025, ReM: Sparsify and MoEfy Models with Post-Hoc ReLU Modulation, ICLR 2025 review, https://openreview.net/pdf?id=cizhOu3CZa (Induce activation sparsity for MoE choice in the model router.)
- Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan, 13 Mar 2025, Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores, https://arxiv.org/abs/2503.10725
More AI Research
Read more about:
- Cascade Models
- Ensemble Models
- Inference Optimizations
- Loop Optimizations
- Code Optimizations