Aussie AI
Training Optimization
Last Updated 2 March, 2025
by David Spuler, Ph.D.
Training is very expensive, which has led to a rise in papers on optimizing model training methods. The cost of a training run is typically many multiples of the cost of a single inference, although total inference cost can overshadow training cost given enough users. Nevertheless, the industry-wide cost of training is likely to remain high, since almost all use cases require not only initial training but also ongoing fine-tuning and re-training.
Research on training algorithms in general:
- Unsupervised learning
- Reinforcement Learning from Human Feedback (RLHF)
- In-context learning (ICL)
- Direct Preference Optimization (DPO); see the loss sketch after this list
- Self-supervised learning (automated AI Feedback)
- Human-In-The-Loop (HITL)
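As an illustration of one of these methods, below is a minimal sketch of the per-pair Direct Preference Optimization (DPO) loss, assuming the sequence log-probabilities have already been computed by the trained policy and a frozen reference model (the function name and example values are illustrative only):

```python
# Minimal sketch of the DPO loss for one preference pair, given per-sequence
# log-probabilities from the trained policy and a frozen reference model.
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(logits))
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Example: the policy prefers the chosen response more strongly than the
# reference model does, so the loss drops below the neutral value log(2).
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))
```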
General concepts in LLM reasoning and model capabilities:
- AGI (human-level intelligence)
- Scaling laws (see the back-of-envelope sketch after this list)
- Inference scaling laws
- Test time compute
- The Wall
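For the scaling-law topics above, here is a minimal back-of-envelope sketch using the commonly cited approximations of roughly 6*N*D training FLOPs for N parameters and D tokens, and the Chinchilla-style rule of thumb of about 20 training tokens per parameter; the exact constants vary across papers:

```python
# Minimal sketch of compute-optimal training estimates, assuming the common
# approximations C ~= 6 * N * D training FLOPs and roughly 20 tokens/parameter.

def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * num_params * num_tokens

def chinchilla_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal token budget for a given model size."""
    return tokens_per_param * num_params

# Example: a 7B-parameter model and its ~140B-token compute-optimal budget.
n = 7e9
d = chinchilla_tokens(n)
print(f"tokens: {d:.3g}, FLOPs: {training_flops(n, d):.3g}")
```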
Information on improving the accuracy and/or speed of training algorithms:
- Training speed optimizations (see the PyTorch sketch after this list)
- Distributed training
- Federated Learning
- Loss functions
- Gradient optimizers
- Early dropout
- Network optimizations
- Training costs
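As a concrete example of two widely used training speed optimizations, here is a minimal PyTorch sketch combining mixed-precision training and gradient accumulation; the model and data are toy placeholders, and the exact AMP API details may vary across PyTorch versions:

```python
# Minimal sketch of mixed-precision training plus gradient accumulation.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

for step in range(8):
    x = torch.randn(16, 512, device=device)
    with torch.autocast(device_type=device, dtype=torch.float16,
                        enabled=(device == "cuda")):
        loss = model(x).pow(2).mean() / accum_steps   # toy loss
    scaler.scale(loss).backward()       # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)          # unscale gradients and apply the update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```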
Improvements in resiliency of the training infrastructure for multi-GPU clusters in data centers:
- Resiliency improvements (overview)
- Stragglers
- Hangs
- Silent Data Corruption (SDC)
- GPU failures
- GPU overheating
- GPU aging
- Transient Soft Errors
- High network latency
- Floating-point computation errors
- Floating-point checkers
- Checkpointing
- In-memory checkpointing
- Asynchronous checkpointing (see the sketch after this list)
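To illustrate the checkpointing items above, here is a minimal sketch of asynchronous checkpointing with an in-memory snapshot: parameters are first copied off the GPU, then written to disk on a background thread so the training loop is not blocked. A production system would also checkpoint optimizer and data-loader state and handle write failures:

```python
# Minimal sketch of asynchronous checkpointing with an in-memory snapshot.
import threading
import torch

def snapshot_to_cpu(model: torch.nn.Module) -> dict:
    """In-memory checkpoint: copy parameters off the accelerator."""
    return {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}

def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
    state = snapshot_to_cpu(model)      # fast, blocking copy into host memory
    t = threading.Thread(target=torch.save, args=(state, path), daemon=True)
    t.start()                           # slow disk write runs off the training thread
    return t

model = torch.nn.Linear(256, 256)
writer = async_checkpoint(model, "checkpoint_step_100.pt")
# ... training continues here while the checkpoint is being written ...
writer.join()  # ensure the write finished before overwriting or exiting
```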
Research on the data used in pre-training is covered in the Data Sets and Synthetic Data sections below.
Modified types of pre-training for models:
Fine-tuning methods include:
- Fine-tuning (traditional full-parameter)
- Parameter-Efficient Fine-Tuning (PEFT)
- LoRA (low-rank adapters); see the sketch after this list
- QLoRA (quantized LoRA)
- Multi-LoRA
- Post-Optimization Fine-Tuning (POFT) (e.g., after quantization, pruning)
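As an illustration of the LoRA approach, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer; only the small A and B matrices are trainable (the class and variable names are illustrative):

```python
# Minimal sketch of a LoRA adapter: the base weight is frozen and a scaled
# low-rank update (alpha/r) * B @ A is trained instead.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # freeze the original weights
        self.lora_a = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(torch.nn.Linear(512, 512))
y = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # only the low-rank A and B matrices are trainable
```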
Lesser-known alternatives to fine-tuning are being researched for improving model capabilities; these require only a single inference step, but may also need a short training-like phase:
- Prompt tuning (extended-vocabulary PEFT, typically with extra soft tokens prepended to the prompt); see the sketch after this list
- Decoding-based reasoning in a single inference step (e.g., tree decoding)
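Below is a minimal sketch of prompt tuning: a handful of trainable soft-token embeddings are prepended to the embedded prompt while the base model stays frozen (the sizes are toy values for illustration):

```python
# Minimal sketch of prompt tuning with trainable "soft token" embeddings.
import torch

vocab_size, d_model, num_soft_tokens = 1000, 64, 8
embedding = torch.nn.Embedding(vocab_size, d_model)
embedding.weight.requires_grad = False           # the base model stays frozen
soft_prompt = torch.nn.Parameter(torch.randn(num_soft_tokens, d_model) * 0.02)

def embed_with_soft_prompt(token_ids: torch.Tensor) -> torch.Tensor:
    """Return [soft tokens ; prompt tokens], ready for the frozen transformer."""
    prompt_embeds = embedding(token_ids)                   # (seq, d_model)
    return torch.cat([soft_prompt, prompt_embeds], dim=0)  # (soft + seq, d_model)

ids = torch.randint(0, vocab_size, (12,))
print(embed_with_soft_prompt(ids).shape)   # torch.Size([20, 64])
```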
Retrieval-based alternatives to fine-tuning for extra LLM capabilities and intelligence/accuracy (without requiring any extra training):
- Plug-ins (data source integrations)
- RAG (Retrieval-Augmented Generation); see the sketch after this list
- RALM (generalized retrieval)
- TAG (database table data)
- Agent architectures (read/write capabilities)
- Agentic RAG
- Agentic workflow (multi-agent, multi-step)
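As an illustration of the basic RAG pattern, here is a minimal sketch that retrieves the most relevant documents for a query and splices them into the prompt; the word-overlap scoring stands in for a real embedding-based retriever, and call_llm is a hypothetical placeholder for an actual model call:

```python
# Minimal sketch of a RAG step: retrieve relevant documents, build the prompt.
documents = [
    "LoRA adds low-rank adapters to frozen weights for cheap fine-tuning.",
    "Mixed-precision training uses FP16 to speed up large model training.",
    "RAG retrieves external documents and adds them to the LLM prompt.",
]

def score(query: str, doc: str) -> int:
    # Toy word-overlap score standing in for embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How does LoRA fine-tuning work?"))
# response = call_llm(build_prompt(...))   # hypothetical model call
```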
Non-retrieval methods of giving LLMs additional context information within a single inference query (without traditional RAG-type data retrieval):
- Tool usage (see the sketch after this list)
- TALM
- Inference "hooks"
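To illustrate tool usage, here is a minimal sketch in which the model emits a JSON tool call that the application parses and dispatches to a real function; the tool registry and the hard-coded model output are illustrative placeholders:

```python
# Minimal sketch of LLM tool usage: parse a JSON tool call emitted by the
# model, dispatch it to a real function, and feed the result back as context.
import json

TOOLS = {
    "get_time": lambda city: f"12:00 in {city}",   # toy tool implementation
}

def dispatch(tool_call_json: str) -> str:
    call = json.loads(tool_call_json)   # e.g. {"tool": "get_time", "args": {"city": "Sydney"}}
    return TOOLS[call["tool"]](**call["args"])

# Pretend the LLM returned this tool call for "What time is it in Sydney?"
llm_output = '{"tool": "get_time", "args": {"city": "Sydney"}}'
result = dispatch(llm_output)
print(result)   # fed back to the model as extra context for its final answer
```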
Prompt engineering enhancements to LLM capabilities (single-step):
- Basic prompting methods (e.g., examples, formatting); see the example after this list
- Step-by-step prompting
- Emotional prompting
- Least-to-Most
- Self-ask prompting
- Concise prompting
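As a small example of the basic methods above, here is a sketch that combines a few-shot example with an explicit step-by-step instruction; it only builds the prompt string, and the actual LLM call is outside the sketch:

```python
# Minimal sketch of two basic prompting patterns: a few-shot example plus an
# explicit step-by-step instruction. Pure string construction only.
few_shot = (
    "Q: A shirt costs $20 and is discounted 25%. What is the new price?\n"
    "A: 25% of 20 is 5, so the price is 20 - 5 = $15.\n"
)
question = "Q: A book costs $40 and is discounted 10%. What is the new price?\n"
prompt = few_shot + question + "A: Let's think step by step."
print(prompt)
```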
Advanced topics in prompt engineering (single-shot):
- Prompt optimization techniques
- Programmatic prompt optimization (auto-prompting)
- Advanced prompt engineering (overview)
Multi-step inference-based reasoning algorithms that combine prompt engineering with inference-time processing of queries:
- Chain-of-Thought (CoT)
- Tree-of-Thought
- Skeleton of Thought
- ReAct (Reason-and-Act)
- Self-reflection (often just called "reflection")
- LLM as Judge
- Best of N (BoN); see the sketch after this list
- Multi-step inference for reasoning (overview)
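To illustrate Best-of-N combined with LLM-as-Judge, here is a minimal sketch that samples several candidate answers and keeps the one the judge scores highest; both generate_candidate and judge_score are hypothetical placeholders for real LLM calls:

```python
# Minimal sketch of Best-of-N (BoN) sampling with an LLM-as-Judge scorer.
import random

def generate_candidate(prompt: str) -> str:
    # Placeholder: a real system would sample from an LLM with temperature > 0.
    return f"candidate answer {random.randint(0, 999)} to: {prompt}"

def judge_score(prompt: str, answer: str) -> float:
    # Placeholder: a real system would ask a judge LLM to rate the answer.
    return random.random()

def best_of_n(prompt: str, n: int = 4) -> str:
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: judge_score(prompt, a))

print(best_of_n("Summarize the benefits of mixed-precision training."))
```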
Addressing limitations of model intelligence:
- LLM Safety (overview)
- Hallucinations
- Toxicity
- Refusal modules
- Jailbreak prevention
- Prompt shield
- Attribution and citation management
- Explainability
- Bias
- Fairness
- Ethics
- Alignment
- Testing safety
- Factuality
- Data Leakage
- Security
- Guardrails
- Privacy
- Safety monitoring
- Safe C++ programming (for AI engines)
Other directions for model intelligence:
- Planning
- Followup questions
- Interactive prompting
- Program execution models (e.g., LLM generates Python code to run)
- Symbolic reasoning
- Concept models ("large concept models" or LCMs)
Survey Papers on Training Optimizations
Survey papers on speeding up training:
- Yarally T, Cruz L, Feitosa D, et al (2023), Uncovering energy-efficient practices in deep learning training: Preliminary steps towards green AI. International Conference on AI Engineering - Software Engineering for AI (CAIN), https://arxiv.org/abs/2303.13972
- A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, A survey on modern trainable activation functions, Neural Networks, vol. 138, pp.14–32, 2021, https://arxiv.org/abs/2005.00817 (Extensive survey all about training with activation functions, e.g. RELU, Swish, Maxout, leaky RELU.)
- R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/ (Survey of on-device training for TinyML/edge computing.)
- P Freire, E Manuylovich, JE Prilepsky, SK Turitsyn, 2023, Artificial neural networks for photonic applications—from algorithms to implementation: tutorial, Advances in Optics and Photonics, Sep 2023, https://opg.optica.org/directpdfaccess/f0ae8746-2f89-4ac4-bb598eda29c7977c_539680/aop-15-3-739.pdf?da=1&id=539680&seq=0&mobile=no (Large survey covering many aspects of the future of training optimization.)
- Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins, André F. T. Martins, Péter Milder, Colin Raffel, Edwin Simpson, Noam Slonim, Niranjan Balasubramanian, Leon Derczynski, Roy Schwartz, Aug 2022, Efficient Methods for Natural Language Processing: A Survey, arXiv:2209.00099 [cs], http://arxiv.org/abs/2209.00099
- MM YAPICI, N Topaloğlu, 2021, Computers and Informatics, Performance comparison of deep learning frameworks https://dergipark.org.tr/en/pub/ci/issue/60236/769457, PDF: https://dergipark.org.tr/en/download/article-file/1201877 (Examines Torch, Theano, Caffe, Caffe2, MXNet, Keras, TensorFlow, and CNTK frameworks in terms of training speed.)
- H. Jahangir, S. K. Goel and S. Khurana, "Scaling Up the Transformers: A Survey of Training and Inference Optimization Techniques," 2024 International Conference on Electrical Electronics and Computing Technologies (ICEECT), Greater Noida, India, 2024, pp. 1-6, doi: 10.1109/ICEECT61758.2024.10739061. https://ieeexplore.ieee.org/abstract/document/10739061
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 20 Feb 2024 (v2), Large Language Models: A Survey, https://arxiv.org/abs/2402.06196
- R Abdulkadirov, P Lyakhov, N Nagornov, 2023, Survey of Optimization Algorithms in Modern Neural Networks https://www.mdpi.com/2227-7390/11/11/2466 https://www.mdpi.com/2227-7390/11/11/2466/pdf
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Zehao Xiao, Cees G. M. Snoek, 6 Nov 2024, Beyond Model Adaptation at Test Time: A Survey. https://arxiv.org/abs/2411.03687
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang, 23 Jan 2025, Parameter-Efficient Fine-Tuning for Foundation Models, https://arxiv.org/abs/2501.13787
Training Speed Optimizations
Papers with specific techniques for optimizing training throughput, latency, or processing speed (rather than accuracy or perplexity of results), selected from literally thousands of candidates:
- Campos, V., Jou, B., i Nieto, X. G., Torres, J., and Chang, S.-F. (2018). Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations. https://openreview.net/forum?id=HkwVAXyCW
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks, https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. Ahn, “Restructuring Batch Normalization to Accelerate CNN Training,” in The Conference on Systems and Machine Learning, 2019, https://arxiv.org/abs/1807.01702
- EPTQ: Enhanced Post-Training Quantization via Label-Free Hessian O Gordon, HV Habi, A Netzer, arXiv preprint arXiv:2309.11531, 2023, https://arxiv.org/pdf/2309.11531.pdf Code: https://github.com/sony/model_optimization
- Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558, 2020. https://arxiv.org/abs/2005.00558
- S Tuli, NK Jha, 2023, TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference, IEEE Transactions on Computer-Aided Design, https://ieeexplore.ieee.org/abstract/document/10144614/, https://arxiv.org/pdf/2303.14882
- M. Mathieu, M. Henaff, and Y. LeCun, 2014, “Fast training of convolutional networks through FFTs,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., https://arxiv.org/abs/1312.5851
- D Zhu, N Yang, L Wang, Y Song, W Wu, F Wei, 2023, PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training https://arxiv.org/abs/2309.10400
- Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism, http://arxiv.org/abs/1811.06965
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034 Code: https://github.com/JonasGeiping/cramming (Note: uses Pytorch nvFuser deep learning compiler, which seems to be deprecated now.)
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Fast training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- GY Lee, T Dam, MM Ferdaus, DP Poenar, VN Duong, Oct 2023, Unlocking the capabilities of explainable fewshot learning in remote sensing, https://arxiv.org/pdf/2310.08619.pdf
- Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu, June 2023, Full parameter fine-tuning for large language models with limited resources, arXiv preprint arXiv:2306.09782, https://arxiv.org/abs/2306.09782 (Fused gradient computation and parameter update saves memory in training kernel by not saving the gradient tensor in memory.)
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Benjue Weng, 13 Apr 2024, Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies, https://arxiv.org/abs/2404.09022 (Reviewing fine-tuning of large models.)
- Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang, 2024, Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining, https://openreview.net/pdf?id=2rPoTgEmjV Code: https://github.com/PKU-ML/LookAheadLookAround (Evaluates autoregressive and masked methods in training.)
- Haikuo Shao; Jinming Lu; Meiqi Wang; Zhongfeng Wang, 2023, An Efficient Training Accelerator for Transformers With Hardware-Algorithm Co-Optimization, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Early Access), https://ieeexplore.ieee.org/document/10251161
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu, Nov 2023, Initializing Models with Larger Ones, https://arxiv.org/abs/2311.18823 Code: https://github.com/OscarXZQ/weight-selection
- Noam Shazeer, Mitchell Stern, Apr 2018, Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, https://arxiv.org/abs/1804.04235
- Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, Feb 2018, Mixed Precision Training, https://arxiv.org/abs/1710.03740
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019, https://arxiv.org/abs/1909.08053
- Ruixiang Tang, Dehan Kong, Longtao Huang, Hui Xue May 2023 Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning, https://arxiv.org/abs/2305.17256
- Diana Hu, 29/03/2024, Building AI Models is faster and cheaper than you probably think, Y Combinator, https://www.ycombinator.com/blog/building-ai-models
- Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 20 Feb 2024 (v2), Large Language Models: A Survey, https://arxiv.org/abs/2402.06196
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Carlo Nicolini, Jacopo Staiano, Bruno Lepri, Raffaele Marino, 13 Mar 2024, The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models, https://arxiv.org/abs/2403.08739 (Understanding how LLM parameters change over time during training.)
- Truong Giang Do, Le Huy Khiem, Quang Pham, TrungTin Nguyen, Thanh-Nam Doan, Binh T. Nguyen, Chenghao Liu, Savitha Ramasamy, Xiaoli Li, Steven HOI, Oct 2023, HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts, EMNLP 2023 Conference, https://openreview.net/forum?id=fL8AKDvELp Code: https://github.com/giangdip2410/hyperrouter
- S Guo, J Xu, LL Zhang, M Yang, Oct 2023, Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models, arXiv preprint arXiv:2310.05015, https://arxiv.org/pdf/2310.05015.pdf Code: https://github.com/microsoft/Moonlit/tree/main/Compresso
- H Woisetschläger, A Isenko, S Wang, R Mayer, 2023, Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly, https://arxiv.org/abs/2310.03150
- Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=Syx4wnEtvH
- Shar Narasimhan. NVIDIA Clocks World’s Fastest BERT Training Time and Largest Transformer Based Model, Paving Path For Advanced Conversational AI, August 2019. https://developer.nvidia.com/blog/training-bert-with-gpus/
- R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/
- R Abdulkadirov, P Lyakhov, N Nagornov, 2023, Survey of Optimization Algorithms in Modern Neural Networks https://www.mdpi.com/2227-7390/11/11/2466 https://www.mdpi.com/2227-7390/11/11/2466/pdf
- David Spuler, March 2024, Chapter 6. Training, Fine-Tuning & RAG, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Kirill Kolodiazhnyi, May 15, 2020, Hands-On Machine Learning with C++: Build, train, and deploy end-to-end machine learning and deep learning pipelines, https://www.amazon.com/Hands-Machine-Learning-end-end/dp/1789955335/
- Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, Tie-yan Liu, 6 Jul 2023 (v2), A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, https://arxiv.org/pdf/2204.09269.pdf
- Adi Gangidi, KR Kishore, Jenya Lee, June 12, 2024, How Meta trains large language models at scale, Meta Research, https://engineering.fb.com/2024/06/12/data-infrastructure/training-large-language-models-at-scale-meta/
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Yi Zhou, Dec 16, 2023, Optimizing GenAI: Comparing Model Training, Fine-Tuning, RAG, and Prompt Engineering, https://medium.com/generative-ai-revolution-ai-native-transformation/optimizing-genai-comparing-model-training-fine-tuning-rag-and-prompt-engineering-7a7c6c65e0f0
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic routing based on easy vs hard queries to optimize training.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Lorenzo Sani, Alex Iacob, Zeyu Cao, Bill Marino, Yan Gao, Tomas Paulik, Wanru Zhao, William F. Shen, Preslav Aleksandrov, Xinchi Qiu, Nicholas D. Lane, 19 Jul 2024 (v2), The Future of Large Language Model Pre-training is Federated, https://arxiv.org/abs/2405.10853
- Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu, 23 Aug 2024, Memory-Efficient LLM Training with Online Subspace Descent, https://arxiv.org/abs/2408.12857 https://github.com/kyleliang919/Online-Subspace-Descent
- Sophia R. Cunningham,Dominique Archambault,Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
- Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
- Jaime Sevilla Edu Roldán, May 28, 2024, Training Compute of Frontier AI Models Grows by 4-5x per Year, Epoch AI blog, https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
- Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu, Dec 2023, Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, https://arxiv.org/abs/2311.03687 (Benchmarks model speed for training, fine-tuning and inference with various optimizations such as ZeRO, quantization, offloading/recomputation, and Flash Attention.)
- Ari Lotter, Jeffrey Quesnelle, Umer H. Adil, Dillon Rolnick, Esteban La Rocca, A Preliminary Report on Distro, 2024, https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf https://venturebeat.com/wp-content/uploads/2024/08/A_Preliminary_Report_on_DisTrO.pdf (Reducing the inter-GPU networking bandwidth cost during training.)
- WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai, 22 Aug 2024, Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters, https://arxiv.org/abs/2408.12596
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos, 9 Oct 2024, TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training, https://arxiv.org/abs/2410.06511
- Byron (Pin-Lun)Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
- Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar, 24 Oct 2024, A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs, https://arxiv.org/abs/2410.18779
- Saleh Ashkboos, Iman Mirzadeh, Keivan Alizadeh, Mohammad Hossein Sekhavat, Moin Nabi, Mehrdad Farajtabar, Fartash Faghri, 25 Oct 2024, Computational Bottlenecks of Training Small-scale Large Language Models, https://arxiv.org/abs/2410.19456
- Wasim Rajput, Oct 30, 2024, Developing Large Language Models (LLMs): A Step-by-Step Guide from Concept to Deployment. How LLMs like ChatGPT, Gemini, and Others are Developed, https://medium.com/the-generator/from-concept-to-deployment-a-practical-guide-to-developing-large-language-models-llms-d60b5841cade
- Zehao Xiao, Cees G. M. Snoek, 6 Nov 2024, Beyond Model Adaptation at Test Time: A Survey. https://arxiv.org/abs/2411.03687
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Sebastian Raschka, October 29, 2024, Build a Large Language Model (From Scratch), Manning, https://github.com/rasbt/LLMs-from-scratch https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167
- Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. 2024. Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP '24). Association for Computing Machinery, New York, NY, USA, 178–194. https://doi.org/10.1145/3694715.3695969 https://dl.acm.org/doi/abs/10.1145/3694715.3695969
- Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl, 13 Nov 2024, Cut Your Losses in Large-Vocabulary Language Models, https://arxiv.org/abs/2411.09009 https://github.com/apple/ml-cross-entropy (Memory-efficient computation of cross-entropy in training.)
- R. Li, D. Fu, C. Shi, Z. Huang and G. Lu, "Efficient LLMs Training and Inference: An Introduction," in IEEE Access, doi: 10.1109/ACCESS.2024.3501358. https://ieeexplore.ieee.org/abstract/document/10756602 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10756602
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training. A review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency, https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- Chenghao Hu and Baochun Li. 2024. Menos: Split Fine-Tuning Large Language Models with Efficient GPU Memory Sharing. In Proceedings of the 25th International Middleware Conference (MIDDLEWARE '24). Association for Computing Machinery, New York, NY, USA, 185–198. https://doi.org/10.1145/3652892.3700758 https://dlnext.acm.org/doi/10.1145/3652892.3700758 https://iqua.ece.toronto.edu/papers/chenghao-middleware24.pdf
- Carl Franzen, August 27, 2024, ‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency, https://venturebeat.com/ai/this-could-change-everything-nous-research-unveils-new-tool-to-train-powerful-ai-models-with-10000x-efficiency/
- Carl Franzen, December 2, 2024, Nous Research is training an AI model using machines distributed across the internet, https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet/
- Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jie Jiang, Bin Cui, 10 Dec 2024, Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences, https://arxiv.org/abs/2412.07894
- Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi, 31 Dec 2024, 2 OLMo 2 Furious, https://arxiv.org/abs/2501.00656
- Zongbiao Li, Xiezhao Li, Yinghao Cui, Yijun Chen, Zhixuan Gu, Yuxuan Liu, Wenbo Zhu, Fei Jia, Ke Liu, Qifeng Li, Junyao Zhan, Jiangtao Zhou, Chenxi Zhang, Qike Liu, 31 Dec 2024, Automatically Planning Optimal Parallel Strategy for Large Language Models, https://arxiv.org/abs/2501.00254
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
- NovaSky, Jan 2025, Sky-T1: Train your own O1 preview model within $450, https://novasky-ai.github.io/posts/sky-t1/
- Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu, 14 Jan 2025, Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models, https://arxiv.org/abs/2501.08453 (Efficient training of text-to-video models.)
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Xunyi Zhao, 2024, Optimizing Memory Usage when Training Deep Neural Networks, Ph.D. Thesis, Computer Science [cs], Université de Bordeaux, France, NNT: 2024BORD0411, tel-04890912, https://theses.hal.science/tel-04890912/file/ZHAO_XUNYI_2024.pdf
- Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
- Tanya Rodchenko, Natasha Noy, Nino Scherrer, Jennifer Prendki, 23 Jan 2025, Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling, https://arxiv.org/abs/2501.13779
- Tech Fund, Feb 03, 2025, The Winners from DeepSeek, Nvidia, and The Outlook in AI: A tour of the space & AI-exposed stocks, https://www.techinvestments.io/p/the-winners-from-deepseek-nvidia
- Thor Olavsrud, How DeepSeek changes the gen AI equation for CIOs, 30 Jan 2025, https://www.cio.com/article/3813555/what-cios-should-learn-now-that-deepseek-is-here.html (" the future of gen AI lies in innovative, cost-efficient approaches")
- Maxwell Zeff, February 5, 2025, Researchers created an open rival to OpenAI’s o1 ‘reasoning’ model for under $50, https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/
- Kyle Wiggers, January 11, 2025, Researchers open source Sky-T1, a ‘reasoning’ AI model that can be trained for less than $450, https://techcrunch.com/2025/01/11/researchers-open-source-sky-t1-a-reasoning-ai-model-that-can-be-trained-for-less-than-450/
- Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Junxue Zhang, Kai Chen, 1 Feb 2025, Enhancing Token Filtering Efficiency in Large Language Model Training with Collider, https://arxiv.org/abs/2502.00340 (Token reduction in training.)
- XYZ Labs, Feb 23, 2025, Open Reasoner Zero: A Breakthrough in AI Training Efficiency Matches DeepSeek with Just 1/30th of Training Steps. Major AI Figures Including Kai-Fu Lee, Harry Shum, and Xiangyu Zhang Unveil Revolutionary Open-Source Training Method. https://xyzlabs.substack.com/p/open-reasoner-zero-a-breakthrough
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Fine-Tuning
Papers on fine-tuning optimizations:
- Libo Qin, Qiguang Chen, Xiachong Feng, Yang Wu, Yongheng Zhang, Yinghui Li, Min Li, Wanxiang Che, Philip S. Yu, 21 May 2024, Large Language Models Meet NLP: A Survey, https://arxiv.org/abs/2405.12819 (A survey of research into how LLMs, with and without fine-tuning, perform in various NLP use cases, such as mathematical reasoning, dialogue understanding, translation, and more.)
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimizes RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Benjue Weng, 13 Apr 2024, Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies, https://arxiv.org/abs/2404.09022 (Reviewing fine-tuning of large models.)
- Tal Peretz, 15 NOV 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
- Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim, 18 Jan 2024, Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, https://arxiv.org/abs/2401.08417
- David Spuler, March 2024, Chapter 6. Training, Fine-Tuning & RAG, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Pranav Patel, 2024, In-depth guide to fine-tuning LLMs with LoRA and QLoRA, https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
- Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu, 6 Jun 2024 (v2), Full Parameter Fine-tuning for Large Language Models with Limited Resources, https://arxiv.org/abs/2306.09782 Code: https://github.com/OpenLMLab/LOMO (Low-memory usage for full-parameter fine-tuning.)
- Louis-François Bouchard, Louie Peters, May 2024, Chapter 10: Fine-Tuning, Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, https://www.amazon.com/Building-LLMs-Production-Reliability-Fine-Tuning/dp/B0D4FFPFW8/
- Valentina Alto, 2024, Chapter 11: Fine-Tuning Large Language Models, Building LLM-Powered Applications: Create intelligent apps and agents with large language models, Packt Publishing, https://www.amazon.com/Building-LLM-Apps-Intelligent-Language/dp/1835462316/
- Aarushi Kansal, Chapter 5: Fine-Tuning: The Theory, Chapter 6: Fine-Tuning: Hands-On,, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Yi Zhou, Dec 16, 2023, Optimizing GenAI: Comparing Model Training, Fine-Tuning, RAG, and Prompt Engineering, https://medium.com/generative-ai-revolution-ai-native-transformation/optimizing-genai-comparing-model-training-fine-tuning-rag-and-prompt-engineering-7a7c6c65e0f0
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- OpenAI, August 20, 2024, Fine-tuning now available for GPT-4o, https://openai.com/index/gpt-4o-fine-tuning/
- Judy Hanwen Shen, Inioluwa Deborah Raji, Irene Y. Chen, 8 Aug 2024, The Data Addition Dilemma, https://arxiv.org/abs/2408.04154
- Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen, 28 May 2024 (v3) Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark, https://arxiv.org/abs/2402.11592 Code: https://github.com/ZO-Bench/ZO-LLM
- Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, 24 Sep 2024, Empirical Insights on Fine-Tuning Large Language Models for Question-Answering, https://arxiv.org/abs/2409.15825
- Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
- Foundry AI, Oct 2024, When Should You Move Beyond Prompting and Start Fine-Tuning? https://thefoundryai.com/blog/fine-tuning
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Angels Balaguer, Vinamra Benara, Renato Luiz de Freitas Cunha, Roberto de M. Estevão Filho, Todd Hendry, Daniel Holstein, Jennifer Marsman, Nick Mecklenburg, Sara Malvar, Leonardo O. Nunes, Rafael Padilha, Morris Sharp, Bruno Silva, Swati Sharma, Vijay Aski, Ranveer Chandra, 30 Jan 2024 (v3), RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, https://arxiv.org/abs/2401.08406
- Towards AI, December 24, 2024, Llm Fine Tuning Guide: Do You Need It and How to Do It https://towardsai.net/p/artificial-intelligence/llm-fine-tuning-guide-do-you-need-it-and-how-to-do-it-4
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
Data Sets
Research papers on datasets used for training:
- Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
- Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
- Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Foltin Martin & Cong Xu, 2024, DELE: Data Efficient LLM Evaluation, Hewlett Packard Labs, Navigating and Addressing Data Problems for Foundation Models (DPFM) Workshop, ICLR 2024, https://openreview.net/pdf?id=I8bsxPWLNF
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, Jan 2024, Understanding LLMs: A Comprehensive Overview from Training to Inference https://arxiv.org/abs/2401.02038
- Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen, Nov 2023, A Survey of Large Language Models, https://arxiv.org/abs/2303.18223
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
- Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
- Cobus Greyling, Dec 2023, A Comprehensive Survey of Large Language Models (LLMs), https://cobusgreyling.medium.com/a-comprehensive-survey-of-large-language-models-llms-946a30d9288e
- Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
- Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt, 17 Jun 2024, MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens, https://arxiv.org/abs/2406.11271
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Piotr Skalski, June 20, 2024, Florence-2: Open Source Vision Foundation Model by Microsoft, https://blog.roboflow.com/florence-2/
- Sharon Goldman, August 24, 2024, The hidden reason AI costs are soaring—and it’s not because Nvidia chips are more expensive, https://fortune.com/2024/08/23/data-labeling-ai-scaleai-snorkel-costs/ (The high cost of data labeling.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 20 Feb 2024 (v2), Large Language Models: A Survey, https://arxiv.org/abs/2402.06196
- Bloomberg, 16 Feb 2024, Reddit Signs AI Content Licensing Deal Ahead of IPO, https://www.bloomberg.com/news/articles/2024-02-16/reddit-is-said-to-sign-ai-content-licensing-deal-ahead-of-ipo?srnd=undefined&sref=b0SdE1lu&tpcc=NL_Marketing
- Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn, Jun 06, 2024, Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data, Epoch AI, https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
- Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla, 9 Mar 2024, Algorithmic progress in language models, https://arxiv.org/abs/2403.05812
- Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
- Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang, 23 Sep 2024, MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding, https://arxiv.org/abs/2409.14818
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Pierre-Carl Langlais, Anastasia Stasenko, Catherine Arnett, November 13, 2024, Releasing the largest multilingual open pretraining dataset, https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open
- Arindam Mitra , Ahmed Awadallah , Yash Lara , November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
- Paul Sawers, Dec 2024, Harvard and Google to release 1 million public-domain books as AI training dataset, https://techcrunch.com/2024/12/12/harvard-and-google-to-release-1-million-public-domain-books-as-ai-training-dataset/
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
Synthetic Data
Research papers on LLM-generated synthetic data for training:
- Skurzhanskyi, O.H., Marchenko, O.O. & Anisimov, A.V., 2024, Specialized Pre-Training of Neural Networks on Synthetic Data for Improving Paraphrase Generation. Cybern Syst Anal 2024 https://doi.org/10.1007/s10559-024-00658-7 https://link.springer.com/article/10.1007/s10559-024-00658-7
- Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, 29 Jan 2024, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, https://arxiv.org/abs/2401.16380
- André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster, 4 Jan 2024, Comprehensive Exploration of Synthetic Data Generation: A Survey https://arxiv.org/abs/2401.02524
- Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
- David Spuler, March 2024, Chapter 45. Knowledge Distillation, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
- Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
- Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
- X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
- S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
- X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
- Z Yuan, J Liu, Q Zi, M Liu, X Peng, Y Lou, 2023, Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation, https://arxiv.org/abs/2308.01240
- W AlShikh, M Daaboul, K Goddard, B Imel, 2023, Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, https://arxiv.org/abs/2307.03692
- Z He, Z Xie, R Jha, H Steck, D Liang, Y Feng, 2023, Large Language Models as Zero-Shot Conversational Recommenders, https://arxiv.org/abs/2308.10053
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Michael Nuñez, July 18, 2024, Groq’s open-source Llama AI model tops leaderboard, outperforming GPT-4o and Claude in function calling, https://venturebeat.com/ai/groq-open-source-llama-ai-model-tops-leaderboard-outperforming-gpt-4o-and-claude-in-function-calling/
- Louie Peters, Aug 27, 2024, Two Paths to Small LMs? Synthetic Data (Phi 3.5) vs Pruning & Distillation (Llama-3.1-Minitron), https://newsletter.towardsai.net/p/114-two-paths-to-small-lms-synthetic
- Aatish Bhatia, Aug. 25, 2024, When A.I.’s Output Is a Threat to A.I. Itself: As A.I.-generated data becomes harder to detect, it’s increasingly likely to be ingested by future A.I., leading to worse results, NY Times, https://www.nytimes.com/interactive/2024/08/26/upshot/ai-synthetic-data.html
- Shumailov, I., Shumaylov, Z., Zhao, Y. et al. 2024, AI models collapse when trained on recursively generated data. Nature 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y https://www.nature.com/articles/s41586-024-07566-y
- Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel, 12 Jun 2024, Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences, https://arxiv.org/abs/2407.09499
- Ryan McNeal, Aug 27, 2024, ChatGPT and GPT-4 could get a sweet upgrade this fall with 'strawberry', https://www.androidauthority.com/openai-strawberry-ai-3475682/
- Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai, 10 Aug 2024 (v2), Best Practices and Lessons Learned on Synthetic Data, https://arxiv.org/abs/2404.07503
- Georgia Argyro, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou, 10 Sep 2024, Prompt2Fashion: An automatically generated fashion dataset, https://arxiv.org/abs/2409.06442
- Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli, 12 Sep 2024, Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources, https://arxiv.org/abs/2409.08239
- Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi, 29 Aug 2024, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737
- Ulyana Piterbarg, Lerrel Pinto, Rob Fergus, 3 Oct 2024, Training Language Models on Synthetic Edit Sequences Improves Code Synthesis, https://arxiv.org/abs/2410.02749
- Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang, 16 Oct 2024, A Survey on Data Synthesis and Augmentation for Large Language Models, https://arxiv.org/abs/2410.12896
- Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He, 23 Oct 2024, SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains, https://arxiv.org/abs/2410.17952
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Arindam Mitra, Ahmed Awadallah, Yash Lara, November 14, 2024, Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/
- Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig, 4 Dec 2024, Evaluating Language Models as Synthetic Data Generators, https://arxiv.org/abs/2412.03679
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544
- Sebastian Raschka, PhD, Jan 15, 2025, Noteworthy AI Research Papers of 2024 (Part Two). Six influential AI papers from July to December, https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2 (Examines multimodal LLama3 models and the different multimodal architectures.)
- FZ Subah, Oct 2024, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Master's Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
- Chetan Harsha, Karmvir Singh Phogat, Sridhar Dasaratha, Sai Akhil Puranam, Shashishekar Ramakrishna, Jan 2025, Synthetic Data Generation Using Large Language Models for Financial Question Answering, Proceedings of the Joint Workshop of the 9th FinNLP, the 6th FNP, and the 1st LLMFinLegal, pages 76–95 January 19–20, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.finnlp-1.7.pdf
- Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen, 25 Jan 2025, LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion, https://arxiv.org/abs/2501.15089
- Minsang Kim, Seungjun Baek, 6 Feb 2025, Syntriever: How to Train Your Retriever with Synthetic Data from LLMs, https://arxiv.org/abs/2502.03824
- Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
- Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen, 18 Feb 2025, Theorem Prover as a Judge for Synthetic Data Generation, https://arxiv.org/abs/2502.13137
Unnatural Instructions (Synthetic Data)
Research papers on "unnatural instructions," a type of synthetic data for training:
- A Gudibande, E Wallace, C Snell, X Geng, H Liu 2023, The false promise of imitating proprietary llms, https://arxiv.org/abs/2305.15717
- Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang 2023, Aligning large language models with human: A survey, https://arxiv.org/abs/2307.12966
- Y Gu, L Dong, F Wei, M Huang, 2023, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
- X Wan, R Sun, H Dai, SO Arik, T Pfister, 2023, Better zero-shot reasoning with self-adaptive prompting, https://arxiv.org/abs/2305.14106
- S Horawalavithana, S Munikoti, I Stewart, 2023, SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, https://arxiv.org/abs/2307.01139
- X Daull, P Bellot, E Bruno, V Martin, 2023, Complex QA and language models hybrid architectures, Survey, https://arxiv.org/abs/2302.09051
- Z Yuan, J Liu, Q Zi, M Liu, X Peng, Y Lou, 2023, Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation, https://arxiv.org/abs/2308.01240
- W AlShikh, M Daaboul, K Goddard, B Imel, 2023, Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, https://arxiv.org/abs/2307.03692
- Z He, Z Xie, R Jha, H Steck, D Liang, Y Feng, 2023, Large Language Models as Zero-Shot Conversational Recommenders, https://arxiv.org/abs/2308.10053
Distributed Training
Distributed training is the optimization of spreading training computations across multiple GPUs or multiple servers. Trillion-parameter models are trained on large clusters of 100,000+ GPUs, with complex multi-server, multi-GPU architectures. Distributed training can also be performed on much more geographically spread-out architectures, with servers communicating over the internet.
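As a concrete illustration of the simplest case, the sketch below shows single-node data-parallel training with PyTorch's DistributedDataParallel, one building block of the larger multi-server architectures covered in the papers below. The model, dataset, and hyperparameters are placeholders rather than a real LLM setup, and the script assumes a torchrun launch with one process per GPU.

    # Minimal data-parallel training sketch (assumes: torchrun --nproc_per_node=<gpus> train_ddp.py)
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    def main():
        dist.init_process_group(backend="nccl")            # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder for an LLM
        model = DDP(model, device_ids=[local_rank])            # all-reduce gradients across GPUs
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
        sampler = DistributedSampler(dataset)              # each rank sees a distinct data shard
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        for epoch in range(2):
            sampler.set_epoch(epoch)                       # reshuffle shards each epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                loss = torch.nn.functional.mse_loss(model(x), y)
                optimizer.zero_grad()
                loss.backward()                            # DDP overlaps gradient all-reduce with backward
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()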
Some of the research papers on distributed training:
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai, 22 Aug 2024, Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters, https://arxiv.org/abs/2408.12596
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (Extension to ADAM optimizer that greatly reduces network communication in training.)
- Carl Franzen, August 27, 2024, ‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency, https://venturebeat.com/ai/this-could-change-everything-nous-research-unveils-new-tool-to-train-powerful-ai-models-with-10000x-efficiency/
- Carl Franzen, December 2, 2024, Nous Research is training an AI model using machines distributed across the internet, https://venturebeat.com/ai/nous-research-is-training-an-ai-model-using-machines-distributed-across-the-internet/
- Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu, 17 Dec 2024, Echo: Simulating Distributed Training At Scale, https://arxiv.org/abs/2412.12487
- Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
Training Costs
Research on the total costs of performing LLM training:
- Will Henshall June 3, 2024, The Billion-Dollar Price Tag of Building AI, Time, https://time.com/6984292/cost-artificial-intelligence-compute-epoch-report/
- Epoch AI, 2024, How Much Does It Cost to Train Frontier AI Models? https://epochai.org/blog/how-much-does-it-cost-to-train-frontier-ai-models
- Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, David Owen, 31 May 2024, The rising costs of training frontier AI models, https://arxiv.org/abs/2405.21015
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- NovaSky, Jan 2025, Sky-T1: Train your own O1 preview model within $450, https://novasky-ai.github.io/posts/sky-t1/
- Alberto Romero, Jan 2025, DeepSeek, a little-known Chinese startup, released R1 yesterday, https://substack.com/@thealgorithmicbridge/note/c-87664591-
- Maxwell Zeff, February 5, 2025, Researchers created an open rival to OpenAI’s o1 ‘reasoning’ model for under $50, https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/
- Kyle Wiggers, January 11, 2025, Researchers open source Sky-T1, a ‘reasoning’ AI model that can be trained for less than $450, https://techcrunch.com/2025/01/11/researchers-open-source-sky-t1-a-reasoning-ai-model-that-can-be-trained-for-less-than-450/
Federated Learning
Research on federated learning, a type of distributed training for LLMs in which data stays on the client devices and only model updates are shared (a minimal sketch follows this list):
- Caelin Kaplan, Tareq Si Salem, Angelo Rodio, Chuan Xu, Giovanni Neglia, 7 May 2024, Federated Learning for Cooperative Inference Systems: The Case of Early Exit Networks, https://arxiv.org/abs/2405.04249
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf
- H Woisetschläger, A Isenko, S Wang, R Mayer, 2023, Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly, https://arxiv.org/abs/2310.03150
- Lorenzo Sani, Alex Iacob, Zeyu Cao, Bill Marino, Yan Gao, Tomas Paulik, Wanru Zhao, William F. Shen, Preslav Aleksandrov, Xinchi Qiu, Nicholas D. Lane, 19 Jul 2024 (v2), The Future of Large Language Model Pre-training is Federated, https://arxiv.org/abs/2405.10853
- Jaxpruner: A Concise Library for Sparsity Research, Joo Hyung Lee, Wonpyo Park, Nicole Elyse Mitchell, Jonathan Pilault, Johan Samir Obando Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Woohyun Han, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart J.C. Bik, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci, Conference on Parsimony and Learning, PMLR 234:515-528, 2024. https://proceedings.mlr.press/v234/lee24a.html https://proceedings.mlr.press/v234/lee24a/lee24a.pdf https://openreview.net/forum?id=H2rCZCfXkS https://openreview.net/pdf?id=H2rCZCfXkS
- Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- Shengwen Ding, Chenhui Hu, 24 Nov 2024, eFedLLM: Efficient LLM Inference Based on Federated Learning, https://arxiv.org/abs/2411.16003
- Natalie Lang, Alejandro Cohen, Nir Shlezinger, 27 Mar 2024, Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates, https://arxiv.org/abs/2403.18375
- Chengxi Li, Ming Xiao, Mikael Skoglund, 22 Mar 2024, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
- Andrew Hard, Antonious M. Girgis, Ehsan Amid, Sean Augenstein, Lara McConnaughey, Rajiv Mathews, Rohan Anil, 14 Mar 2024, Learning from straggler clients in federated learning, https://arxiv.org/abs/2403.09086
- Hongpeng Guo, Haotian Gu, Xiaoyang Wang, Bo Chen, Eun Kyung Lee, Tamar Eilam, Deming Chen, Klara Nahrstedt, 31 Jan 2024, FedCore: Straggler-Free Federated Learning with Distributed Coresets, https://arxiv.org/abs/2402.00219
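As a concrete illustration of the federated setup described above, the following is a minimal Federated Averaging (FedAvg) sketch in PyTorch: each simulated client trains a private copy of the model on its own data, and the server averages the returned weights. The tiny model, random client data, and single communication round are illustrative assumptions, not a production federated system.

    # Minimal FedAvg sketch: local client updates followed by server-side weight averaging.
    import copy
    import torch

    def local_update(global_model, data, targets, epochs=1, lr=0.01):
        # One client's local training pass on its private data.
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            loss = torch.nn.functional.mse_loss(model(data), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model.state_dict()

    def fed_avg(client_states):
        # Server step: element-wise average of the clients' weight tensors.
        avg = copy.deepcopy(client_states[0])
        with torch.no_grad():
            for key in avg:
                for sd in client_states[1:]:
                    avg[key] = avg[key] + sd[key]
                avg[key] = avg[key] / len(client_states)
        return avg

    # One communication round with three simulated clients holding different data.
    global_model = torch.nn.Linear(16, 1)
    client_states = [
        local_update(global_model, torch.randn(64, 16), torch.randn(64, 1))
        for _ in range(3)
    ]
    global_model.load_state_dict(fed_avg(client_states))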
Mixed-Precision Training
Research on mixed-precision training, which mixes lower-precision number formats (such as FP16 or BF16) with higher-precision values for the weights and accumulations when performing LLM training (a minimal sketch follows this list):
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, Feb 2018, Mixed Precision Training, https://arxiv.org/abs/1710.03740
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MHLA), and multi-token decoding.)
- Nandini Lokesh Reddy, Jan 2025, DeepSeek: Bridging Performance and Efficiency in Modern AI, https://medium.com/@nandinilreddy/deepseek-bridging-performance-and-efficiency-in-modern-ai-106181a85693
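As mentioned above, here is a minimal sketch of the standard pattern using PyTorch automatic mixed precision (AMP): autocast runs eligible operations in FP16 while the parameters stay in FP32, and a gradient scaler rescales the loss to guard against FP16 underflow. The model, data, and step count are placeholders.

    # Minimal mixed-precision training loop with PyTorch AMP (requires a CUDA GPU).
    import torch

    model = torch.nn.Linear(1024, 1024).cuda()        # placeholder for a transformer block
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling against FP16 underflow

    for step in range(100):
        x = torch.randn(32, 1024, device="cuda")
        y = torch.randn(32, 1024, device="cuda")

        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):  # FP16 matmuls; parameters remain FP32
            loss = torch.nn.functional.mse_loss(model(x), y)

        scaler.scale(loss).backward()   # backward pass on the scaled loss
        scaler.step(optimizer)          # unscales gradients; skips the step if inf/NaN detected
        scaler.update()                 # adjusts the loss scale for the next iteration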
Model Merging
Model merging is a technique whereby two separate LLMs, typically sharing the same architecture, are combined to create a new model with the combined expertise of the two individual models. Surprisingly, the two sets of weights can often simply be combined element-wise, such as by averaging or addition.
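A minimal sketch of the simplest flavor of this idea appears below, assuming two fine-tuned models that share the same architecture: the merged weights are an element-wise weighted average of the two state dictionaries. Published merging methods add refinements such as parameter conflict resolution, but the core operation looks like this.

    # Minimal model merging sketch: element-wise weighted average of two same-shape models.
    import copy
    import torch

    def merge_weights(model_a, model_b, alpha=0.5):
        # Return a state_dict equal to alpha*A + (1-alpha)*B, parameter by parameter.
        sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
        merged = copy.deepcopy(sd_a)
        with torch.no_grad():
            for key in merged:
                merged[key] = alpha * sd_a[key] + (1.0 - alpha) * sd_b[key]
        return merged

    # Two fine-tuned variants of the same base architecture (placeholders here).
    model_a = torch.nn.Linear(128, 128)
    model_b = torch.nn.Linear(128, 128)

    merged_model = torch.nn.Linear(128, 128)
    merged_model.load_state_dict(merge_weights(model_a, model_b, alpha=0.5))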
Research papers on model merging:
- Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, Dacheng Tao, 15 Aug 2024 (v2), Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities, https://arxiv.org/abs/2408.07666 Project: https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications (An extensive review of merging two models.)
- Cameron R. Wolfe, Sep 16, 2024, Model Merging: A Survey: From modern LLM applications to the early days of machine learning research, https://cameronrwolfe.substack.com/p/model-merging
- Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu, 2 Oct 2024, Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models, https://arxiv.org/abs/2410.01335
- Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
- Sean Michael Kerner, October 23, 2024, Differentiable Adaptive Merging is accelerating SLMs for enterprises, https://venturebeat.com/ai/differentiable-adaptive-merging-is-accelerating-slms-for-enterprises/
- Mingyang Zhang, Jing Liu, Ganggui Ding, Xinyi Yu, Linlin Ou, Bohan Zhuang, 18 Dec 2024, Channel Merging: Preserving Specialization for Merged Experts, https://arxiv.org/abs/2412.15283
- Sakana.ai, March 21, 2024, Evolving New Foundation Models: Unleashing the Power of Automating Model Development, https://sakana.ai/evolutionary-model-merge/
- Sakana.ai, December 03, 2024, Population-based Model Merging via Quality Diversity, https://sakana.ai/cycleqd/
- Ayoub Ben Chaliah, Hela Dellagi, 31 Dec 2024, Superposition in Transformers: A Novel Way of Building Mixture of Experts, https://arxiv.org/abs/2501.00530 (Effectively model merging to combine a base model and its fine-tuned version, to avoid catastrophic forgetting.)
- Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
- Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan, Suchin Gururangan, Mike Lewis, 31 Jan 2025, BTS: Harmonizing Specialized Experts into a Generalist LLM, https://arxiv.org/abs/2502.00075 (Combining multiple fine-tuned expert models via "layer stitching").
- Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu, 4 Feb 2025 (v2), MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs, https://arxiv.org/abs/2502.00997
- Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu, 11 Feb 2025 (v2), Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing, https://arxiv.org/abs/2502.04411
Early Dropout
Early dropout is an LLM training optimization whereby some training computations are skipped by "dropping out" early during a training cycle. In some LLM training research, it is simply called "dropout." However, it should not be confused with "early exiting," which is an LLM inference optimization involving the skipping of layers.
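As an illustration, here is a minimal PyTorch sketch of standard dropout combined with a simple "early dropout" schedule that switches dropout off after an initial fraction of the training steps; the model, data, and cut-off point are illustrative assumptions rather than a specific published recipe.

    # Minimal dropout sketch with an early cut-off: dropout active only for the first steps.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(256, 256),
        nn.ReLU(),
        nn.Dropout(p=0.1),          # randomly zeroes 10% of activations during training
        nn.Linear(256, 10),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    total_steps = 1000
    dropout_cutoff = 200            # hypothetical: keep dropout only for the first 20% of steps

    for step in range(total_steps):
        if step == dropout_cutoff:
            for module in model.modules():
                if isinstance(module, nn.Dropout):
                    module.p = 0.0  # disable dropout for the remainder of training

        x = torch.randn(32, 256)
        y = torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()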
Research papers on early dropout in LLM training:
- Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM, Volume 60, Issue 6, June 2017, pp 84–90, https://doi.org/10.1145/3065386 https://dl.acm.org/doi/10.1145/3065386 PDF: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf Code: http://code.google.com/p/cuda-convnet/ (The early paper that introduced a grouped convolution architecture for multi-GPUs, later the basis of AlexNet, which was a famous image recognition CNN.)
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014), https://dl.acm.org/doi/abs/10.5555/2627435.2670313
- Srivastava, Nitish. Improving neural networks with dropout. 2013, Master’s thesis, U. Toronto, https://www.semanticscholar.org/paper/Improving-Neural-Networks-with-Dropout-Srivastava/5d5d4f49d6443c8529a6f5ebef5c499d47a869da PDF: http://www.cs.toronto.edu/~nitish/msc_thesis.pdf
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 29 Apr 2024 (v2), LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710
- Alperen Gormez, 2024, Efficient Neural Network Inference and Training Using Early Exit Strategies, Ph.D. Thesis, Electrical and Computer Engineering, University of Illinois at Chicago, 2024, https://alperengormez.github.io/assets/phd/agormez_phd_thesis.pdf (Early exit in inference, training, and fine-tuning.)
More AI Research
Read more about: