Aussie AI

Training Optimization

  • Last Updated 2 March, 2025
  • by David Spuler, Ph.D.

Training is very expensive, which has led to a growing body of papers on optimizing model training methods. The cost of training a model is typically many multiples of the cost of a single inference, although the total inference cost can overshadow training cost given enough users. Nevertheless, the total cost of training to the industry is likely to remain high, since almost all use cases require not only initial training, but also ongoing fine-tuning and re-training.

Research on training algorithms in general:

  • Unsupervised learning
  • Reinforcement Learning from Human Feedback (RLHF)
  • In-context learning (ICL)
  • Direct Preference Optimization (DPO)
  • Self-supervised learning (automated AI Feedback)
  • Human-In-The-Loop (HITL)

General concepts in LLM reasoning and model capabilities:

Information on improving the accuracy and/or speed of training algorithms:

Improvements in resiliency of the training infrastructure for multi-GPU clusters in data centers:

Research on the data used in pre-training:

Modified types of pre-training for models:

Fine-tuning methods include:

Lesser-known alternatives to fine-tuning, being researched as ways to improve model capabilities, that require only a single inference step but may also need a short training-like phase:

  • Prompt tuning (extended vocabulary PEFT, typically with extra soft tokens prepended to the prompt)
  • Decoding-based reasoning in a single inference step (e.g., tree decoding)

Retrieval-based alternatives to fine-tuning for extra LLM capabilities and intelligence/accuracy (without requiring any extra training):

Non-retrieval methods of giving LLMs additional context information for their queries, using only a single inference step (and without traditional RAG-type data retrieval):

Prompt engineering enhancements to LLM capabilities (single-step):

Advanced topics in prompt engineering (single-shot):

Inference-based reasoning algorithms with multiple steps combining prompt engineering and inference processing of queries:

Addressing limitations of model intelligence:

Other directions for model intelligence:

  • Planning
  • Followup questions
  • Interactive prompting
  • Program execution models (e.g., LLM generates Python code to run)
  • Symbolic reasoning
  • Concept models ("large concept models" or LCMs)

Survey Papers on Training Optimizations

Survey papers on speeding up training:

  • Yarally T, Cruz L, Feitosa D, et al (2023), Uncovering energy-efficient practices in deep learning training: Preliminary steps towards green AI. International Conference on AI Engineering - Software Engineering for AI (CAIN), https://arxiv.org/abs/2303.13972
  • A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, A survey on modern trainable activation functions, Neural Networks, vol. 138, pp. 14–32, 2021, https://arxiv.org/abs/2005.00817 (Extensive survey of trainable activation functions, e.g., ReLU, Swish, Maxout, leaky ReLU.)
  • R. Immonen, T. Hämäläinen et al., Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, 2022, https://www.hindawi.com/journals/js/2022/7437023/ (Survey of on-device training for TinyML/edge computing.)
  • P Freire, E Manuylovich, JE Prilepsky, SK Turitsyn, 2023, Artificial neural networks for photonic applications—from algorithms to implementation: tutorial, Advances in Optics and Photonics, Sep 2023, https://opg.optica.org/directpdfaccess/f0ae8746-2f89-4ac4-bb598eda29c7977c_539680/aop-15-3-739.pdf?da=1&id=539680&seq=0&mobile=no (Large survey covering many aspects of the future of training optimization.)
  • Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins, André F. T. Martins, Péter Milder, Colin Raffel, Edwin Simpson, Noam Slonim, Niranjan Balasubramanian, Leon Derczynski, Roy Schwartz, Aug 2022, Efficient Methods for Natural Language Processing: A Survey, arXiv:2209.00099 [cs], August 2022, http://arxiv.org/abs/2209.00099
  • MM YAPICI, N Topaloğlu, 2021, Computers and Informatics, Performance comparison of deep learning frameworks https://dergipark.org.tr/en/pub/ci/issue/60236/769457, PDF: https://dergipark.org.tr/en/download/article-file/1201877 (Examines Torch, Theano, Caffe, Caffe2, MXNet, Keras, TensorFlow, and CNTK frameworks in terms of training speed.)
  • H. Jahangir, S. K. Goel and S. Khurana, "Scaling Up the Transformers: A Survey of Training and Inference Optimization Techniques," 2024 International Conference on Electrical Electronics and Computing Technologies (ICEECT), Greater Noida, India, 2024, pp. 1-6, doi: 10.1109/ICEECT61758.2024.10739061. https://ieeexplore.ieee.org/abstract/document/10739061
  • Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
  • Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao, 20 Feb 2024 (v2), Large Language Models: A Survey, https://arxiv.org/abs/2402.06196
  • R Abdulkadirov, P Lyakhov, N Nagornov, 2023, Survey of Optimization Algorithms in Modern Neural Networks https://www.mdpi.com/2227-7390/11/11/2466 https://www.mdpi.com/2227-7390/11/11/2466/pdf
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Zehao Xiao, Cees G. M. Snoek, 6 Nov 2024, Beyond Model Adaptation at Test Time: A Survey. https://arxiv.org/abs/2411.03687
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang, 23 Jan 2025, Parameter-Efficient Fine-Tuning for Foundation Models, https://arxiv.org/abs/2501.13787

Training Speed Optimizations

Papers with specific techniques for optimization of training in terms of throughput, latency or processing speed, rather than accuracy or perplexity of results (chosen out of literally thousands):

Fine-Tuning

Papers on fine-tuning optimizations:

Data Sets

Research papers on datasets used for training:

Synthetic Data

Research paper on LLM-generated synthetic data for training:

Unnatural Instructions (Synthetic Data)

Research papers on "unnatural instructions," a type of synthetic data for training:

Distributed Training

Distributed training is the optimization of spreading training computations across multiple GPUs or multiple servers. Trillion parameter models are trained on large clusters of 100,000+ GPUs, with complex multi-server multi-GPU architectures. Distributed training can also be performed on much more spread-out architectures with servers communicating over the internet.
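
As a concrete illustration of the basic data-parallel approach, below is a minimal sketch using PyTorch's DistributedDataParallel (DDP). The model, dataset, and hyperparameters are placeholders chosen for illustration; real LLM training stacks pipeline and tensor parallelism, activation checkpointing, and fault-tolerance mechanisms on top of this pattern.

    # Minimal data-parallel training sketch with PyTorch DistributedDataParallel (DDP).
    # Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    def main():
        dist.init_process_group(backend="nccl")            # one process per GPU
        rank = dist.get_rank()
        device = rank % torch.cuda.device_count()
        torch.cuda.set_device(device)

        model = torch.nn.Linear(1024, 1024).to(device)     # placeholder for a transformer
        model = DDP(model, device_ids=[device])            # gradients are all-reduced across GPUs
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
        sampler = DistributedSampler(data)                 # each rank trains on a distinct shard
        loader = DataLoader(data, batch_size=32, sampler=sampler)

        for epoch in range(2):
            sampler.set_epoch(epoch)                       # reshuffle the shards each epoch
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                loss = torch.nn.functional.mse_loss(model(x), y)
                optimizer.zero_grad()
                loss.backward()                            # DDP overlaps gradient sync with backprop
                optimizer.step()
            if rank == 0:
                print(f"epoch {epoch}: loss {loss.item():.4f}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()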

Some of the research papers on distributed training:

Training Costs

Research on the total costs of performing LLM training:

Federated Learning

Research on federated learning, a type of distributed training for LLMs:
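
To make the idea concrete, below is a minimal federated-averaging (FedAvg) style sketch in PyTorch. The simulated clients, their random private data, and the tiny model are assumptions for illustration only; a real federated deployment adds client sampling, secure aggregation, and communication over a network.

    # Minimal federated averaging (FedAvg) sketch in PyTorch.
    import copy
    import torch

    def local_update(global_model, data, targets, epochs=1, lr=0.01):
        """Each client fine-tunes a copy of the global model on its private data."""
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            loss = torch.nn.functional.mse_loss(model(data), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model.state_dict()

    def federated_average(client_states):
        """Server averages the clients' weights to form the new global model."""
        avg = copy.deepcopy(client_states[0])
        for key in avg:
            for state in client_states[1:]:
                avg[key] += state[key]
            avg[key] /= len(client_states)
        return avg

    global_model = torch.nn.Linear(16, 1)      # placeholder model
    for round_num in range(3):                 # a few communication rounds
        client_states = [
            local_update(global_model,
                         torch.randn(64, 16),  # each client's private data (simulated)
                         torch.randn(64, 1))
            for _ in range(4)                  # four simulated clients
        ]
        global_model.load_state_dict(federated_average(client_states))
        print(f"round {round_num} complete")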

Mixed-Precision Training

Research on using mixtures of numeric precision for the underlying data values and weights when performing LLM training:
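
As background to these papers, below is a minimal mixed-precision training sketch using PyTorch's automatic mixed precision (AMP); the model and data are placeholders. The key elements are FP16 compute inside autocast, model weights kept in FP32, and loss scaling to avoid gradient underflow.

    # Minimal mixed-precision training sketch using PyTorch automatic mixed precision (AMP).
    # Model parameters stay in FP32; autocast runs the forward pass in FP16 where safe,
    # and GradScaler rescales the loss to avoid FP16 gradient underflow.
    import torch

    device = "cuda"
    model = torch.nn.Linear(1024, 1024).to(device)        # placeholder for a transformer
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()

    for step in range(100):
        x = torch.randn(32, 1024, device=device)
        y = torch.randn(32, 1024, device=device)
        with torch.cuda.amp.autocast(dtype=torch.float16):    # FP16 compute in the forward pass
            loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        scaler.scale(loss).backward()       # backprop on the scaled loss
        scaler.step(optimizer)              # unscales gradients, then updates the FP32 weights
        scaler.update()                     # adjusts the loss-scale factor dynamically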

Model Merging

Model merging is a technique whereby two separate LLMs are combined to create a new model with the combined expertise of the two individual models. Surprisingly, the two sets of weights can often simply be combined element-wise, such as by addition or averaging.
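
For illustration, below is a minimal sketch that merges two checkpoints of the same architecture by simple element-wise interpolation; the two models are placeholders. Practical merging methods (e.g., task arithmetic, TIES, SLERP) add weighting, sign-conflict resolution, or spherical interpolation on top of this basic idea.

    # Minimal model-merging sketch: element-wise interpolation of two checkpoints
    # that share the same architecture.
    import torch

    def merge_models(model_a, model_b, alpha=0.5):
        """Linearly interpolate two state dicts: alpha * A + (1 - alpha) * B."""
        state_a, state_b = model_a.state_dict(), model_b.state_dict()
        merged = {}
        for key in state_a:
            merged[key] = alpha * state_a[key] + (1.0 - alpha) * state_b[key]
        return merged

    # Two fine-tuned variants of the same base architecture (placeholders here).
    model_a = torch.nn.Linear(16, 16)
    model_b = torch.nn.Linear(16, 16)

    merged_model = torch.nn.Linear(16, 16)
    merged_model.load_state_dict(merge_models(model_a, model_b, alpha=0.5))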

Research papers on model merging:

  • Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, Dacheng Tao, 15 Aug 2024 (v2), Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities, https://arxiv.org/abs/2408.07666 Project: https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications (An extensive review of merging two models.)
  • Cameron R. Wolfe, Sep 16, 2024, Model Merging: A Survey: From modern LLM applications to the early days of machine learning research, https://cameronrwolfe.substack.com/p/model-merging
  • Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu, 2 Oct 2024, Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models, https://arxiv.org/abs/2410.01335
  • Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
  • Sean Michael Kerner, October 23, 2024, Differentiable Adaptive Merging is accelerating SLMs for enterprises, https://venturebeat.com/ai/differentiable-adaptive-merging-is-accelerating-slms-for-enterprises/
  • Mingyang Zhang, Jing Liu, Ganggui Ding, Xinyi Yu, Linlin Ou, Bohan Zhuang, 18 Dec 2024, Channel Merging: Preserving Specialization for Merged Experts, https://arxiv.org/abs/2412.15283
  • Sakana.ai, March 21, 2024, Evolving New Foundation Models: Unleashing the Power of Automating Model Development, https://sakana.ai/evolutionary-model-merge/
  • Sakana.ai, December 03, 2024, Population-based Model Merging via Quality Diversity, https://sakana.ai/cycleqd/
  • Ayoub Ben Chaliah, Hela Dellagi, 31 Dec 2024, Superposition in Transformers: A Novel Way of Building Mixture of Experts, https://arxiv.org/abs/2501.00530 (Effectively model merging to combine a base model and its fine-tuned version, to avoid catastrophic forgetting.)
  • Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
  • Qizhen Zhang, Prajjwal Bhargava, Chloe Bi, Chris X. Cai, Jakob Foerster, Jeremy Fu, Punit Singh Koura, Ruan Silva, Sheng Shen, Emily Dinan, Suchin Gururangan, Mike Lewis, 31 Jan 2025, BTS: Harmonizing Specialized Experts into a Generalist LLM, https://arxiv.org/abs/2502.00075 (Combining multiple fine-tuned expert models via "layer stitching").
  • Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu, 4 Feb 2025 (v2), MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs, https://arxiv.org/abs/2502.00997
  • Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu, 11 Feb 2025 (v2), Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing, https://arxiv.org/abs/2502.04411

Early Dropout

Early dropout is an LLM training optimization whereby training computations are skipped by "dropping out early" during a training cycle. In some LLM training research, it is simply called "dropout." However, it should not be confused with "early exiting," which is an LLM inference optimization that skips layers.
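
As one illustrative interpretation, the sketch below enables standard dropout only during the early portion of training and then disables it. The cutoff step, dropout rate, model, and data are assumptions chosen for illustration, not settings taken from any particular paper.

    # Sketch of early dropout: dropout is active only for the early part of training,
    # then disabled. The cutoff step and dropout rate are assumed values for illustration.
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.ReLU(),
        torch.nn.Dropout(p=0.1),   # randomly zeroes activations during training
        torch.nn.Linear(256, 10),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    total_steps = 10_000
    early_phase = 2_000            # dropout is active only for these early steps (assumed cutoff)

    for step in range(total_steps):
        if step == early_phase:
            # Disable dropout for the remainder of training.
            for module in model.modules():
                if isinstance(module, torch.nn.Dropout):
                    module.p = 0.0
        x = torch.randn(32, 256)                 # placeholder training batch
        y = torch.randint(0, 10, (32,))
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()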

Research papers on early dropout in LLM training:

More AI Research

Read more about: