Aussie AI

Resiliency in Large-Scale Datacenters

  • Last Updated 2 March, 2025
  • by David Spuler, Ph.D.

Resiliency in Datacenter AI

Resiliency is the correct handling of failures that occur in the data center training infrastructure. Achieving resiliency in your AI backend is important for delivering a high level of quality in any application. High accuracy and fast speed are desirable for both training and inference workloads, but they are more critical for training, because the impact is more central to the performance of one huge job. Supercomputing clusters of 100,000+ GPU chips amplify this importance significantly, and raise a whole new level of challenges. Although the "optimizer" algorithm is important for training results in terms of both accuracy and convergence time, and has consequently attracted an enormous amount of research, there are also lower-level technical issues related to the underlying infrastructure that runs these training algorithms. There are various issues related to the GPU chips, the server hardware, and the networking communications layers between them.

Types of Datacenter Resilience Issues

Supercomputing clusters running AI training across 100,000+ GPUs are somewhat fickle. Some of the general types of technical issues affecting distributed training resiliency on a multi-GPU AI platform include:

  • Stragglers (slow workers)
  • Hangs (never-finishing workers)
  • High network latency

Failures can occur in almost any component:

  • CPU
  • GPU
  • Memory
  • Disk
  • Power supply
  • Cooling
  • Networking hardware
  • Other hardware infrastructure

Failures can even occur in the hardware or software that's supposed to detect or correct failures! For example, these can fail:

  • Monitoring interfaces
  • Checkpoint/restart infrastructure
  • Out-of-band networking components

The GPU is itself a complicated piece of equipment that has a non-zero failure rate. Some of the hardware issues specific to the GPU include:

  • Silent Data Corruption (SDC) errors
  • Overheating GPUs
  • Aging GPUs ("silicon wear-out")
  • Transient soft errors (e.g., random bit flips from radiation)
  • Early life failures

And the software layer can contribute insidious errors in various ways:

  • Silent GPU floating-point exceptions
  • Silent software kernel errors
  • Bounds violations hidden in contiguous blocks

Problems that arise in the networking layer between GPUs, whether in the same multi-GPU server or across multiple distributed servers, include:

  • Network latency
  • Network congestion
  • Timeouts
  • Network error states

If you're looking for an easy fix for a small server room in your building's basement, here's a suggestion: sort out the air-conditioning system so that the server room is a few degrees cooler. That will lower your failure rate for multiple types of hardware component. But if you've got 100,000 servers running from a hydro-electric power plant next door, you can't just click the thermostat down a couple notches.

Stragglers and Hangs

Stragglers are software processes that run slowly in a multi-GPU AI training sequence, returning their resultant weight updates with a delay. Hangs are similar, but are worker processes that fail to ever return successfully. When farming out training tasks to various "worker" nodes, the slowest workers are called "stragglers" (slowest to return) and the ones that never complete are "hangs." Distributed training is constrained to progress at the rate of the slowest straggler, so addressing them is not only a resiliency improvement, but also a training speed optimization.

Stragglers are a general problem with distributed workloads, and there's not a single cause of a job that's slow to return its results. Problems can arise due to:

  • Hardware problems (GPU or CPU).
  • Software kernel errors (e.g., poorly handled edge cases).
  • Network issues (various types).

Stragglers may be repeat offenders (e.g., a recurring GPU, CPU, or other server issue), or the problem may be dispersed across different servers at random (e.g., network congestion randomly slowing down some outgoing messages or responses).

Straggler mitigation is an AI training optimization that aims to speed up training by reducing slowdowns from straggler workers. Mitigation strategies can include isolating a single GPU or single server, if one is repeatedly causing problems, or addressing any underlying causes such as network congestion issues.
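
As a simple illustration of straggler detection, the minimal sketch below flags workers whose step completion time is far above the median of their peers. This is not from any particular framework; the worker names, timing data, and the 1.5x-median threshold are all hypothetical.

    # Minimal sketch: flag straggler workers from per-step completion times.
    # The coordinator is assumed to collect per-worker timings (in seconds);
    # the 1.5x-median threshold is an arbitrary illustrative choice.
    from statistics import median

    def find_stragglers(step_times: dict[str, float], slack: float = 1.5) -> list[str]:
        """Return worker IDs whose step time exceeds slack * median time."""
        typical = median(step_times.values())
        return [worker for worker, t in step_times.items() if t > slack * typical]

    # Example: worker "gpu-017" is clearly lagging behind its peers.
    timings = {"gpu-001": 1.02, "gpu-002": 0.98, "gpu-017": 2.75, "gpu-042": 1.05}
    print(find_stragglers(timings))   # ['gpu-017']

In a real cluster, a worker flagged this way repeatedly would be a candidate for isolation or replacement, while one-off flags spread across many different workers point more towards network congestion.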

Research papers on stragglers and straggler mitigation in AI datacenters:

  • Amir Javadpour, Guojun Wang, Samira Rezaei, Kuan Ching Li, 13 Apr 2020, Detecting Straggler MapReduce Tasks in Big Data Processing Infrastructure by Neural Network, https://arxiv.org/abs/2004.05868
  • Yi Wang, Rohan Varma, April 07, 2023, Straggler Mitigation On PyTorch DDP By Hierarchical SGD, https://pytorch.org/blog/straggler-mitigation/
  • Yang, E., Kang, DK. & Youn, CH. BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster. J Supercomput 76, 47–67 (2020). https://doi.org/10.1007/s11227-019-02845-2 https://link.springer.com/article/10.1007/s11227-019-02845-2
  • Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui, 17 Oct 2024, Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization, https://arxiv.org/abs/2410.13333
  • H. Kim, C. Song, H. Lee and H. Yu, "Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters," 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 2023, pp. 1-6, doi: 10.1109/ICCE56470.2023.10043527. https://ieeexplore.ieee.org/document/10043527
  • Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training A review of the challenges in Synchronous distributed training and best solutions for stragglers and high latency https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
  • Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
  • Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang, 16 Oct 2024, FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training, https://arxiv.org/abs/2410.12588
  • Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper, 6 Nov 2024, Exploiting Stragglers in Distributed Computing Systems with Task Grouping, https://arxiv.org/abs/2411.03645 (Reduce straggler work loss by using more granular workloads.)
  • Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton, 9 Aug 2024, Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge, https://arxiv.org/abs/2408.05152
  • Aditya Ramamoorthy, Ruoyu Meng, Vrinda S. Girimaji, 18 Nov 2024 (v2), Leveraging partial stragglers within gradient coding, https://arxiv.org/abs/2405.19509
  • Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou, 15 Apr 2024, AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes, https://arxiv.org/abs/2404.09679
  • Natalie Lang, Alejandro Cohen, Nir Shlezinger, 27 Mar 2024, Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates, https://arxiv.org/abs/2403.18375
  • Chengxi Li, Ming Xiao, Mikael Skoglund, 22 Mar 2024, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
  • Chengxi Li, Mikael Skoglund, 19 Mar 2024, Distributed Learning based on 1-Bit Gradient Coding in the Presence of Stragglers, https://arxiv.org/abs/2403.14716
  • Andrew Hard, Antonious M. Girgis, Ehsan Amid, Sean Augenstein, Lara McConnaughey, Rajiv Mathews, Rohan Anil, 14 Mar 2024, Learning from straggler clients in federated learning, https://arxiv.org/abs/2403.09086
  • Chengxi Li, Mikael Skoglund, 14 Jun 2024 (v3), Gradient Coding in Decentralized Learning for Evading Stragglers, https://arxiv.org/abs/2402.04193
  • Hongpeng Guo, Haotian Gu, Xiaoyang Wang, Bo Chen, Eun Kyung Lee, Tamar Eilam, Deming Chen, Klara Nahrstedt, 31 Jan 2024, FedCore: Straggler-Free Federated Learning with Distributed Coresets, https://arxiv.org/abs/2402.00219

Silent Data Corruption (SDC)

Silent Data Corruption (SDC) is a computational error that produces incorrect results without triggering an exception. SDCs are a specific type of hardware error, usually originating in the GPU manufacturing process. They are an insidious type of error that causes anomalous computations, but does not trigger any exceptions (i.e., "silent"). Programmers are aware of numerous types of coding errors that cause problems without warnings, and this can occur in hardware, too.

SDCs are usually quite obscure, because the affected chip must have passed GPU acceptance testing as part of the manufacturing process. If you have a GPU in your gaming PC, it's not very likely that you have one, but if you're running an AI training workload on a datacenter supercomputer with 100,000 GPUs, the odds are much higher.

SDCs are caused by random fluctuations in the intricate nanometer-scale processes that create GPUs. Hence, SDCs usually have characteristics such as:

  • Affect individual chips (i.e., a minuscule manufacturing defect).
  • Specific to a particular microcode instruction or processing sequence.
  • Localized to one region of the single chip.
  • Not always the same type of error.
  • Sometimes intermittent.

Note that SDCs are not typically considered to include:

  • GPU acceptance testing failures (i.e., not silent).
  • Large GPU failures from overheating (although SDCs can also be heat-dependent).
  • Microcoding or hardware design errors (affecting all chips).

Given their obscurity, SDCs are also:

  • Hard to detect
  • Problematic to prove (even if suspected)
  • Difficult to mitigate against
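
One common way to hunt for a suspected SDC is redundant execution: run the same computation twice on independent hardware and compare the results. The PyTorch sketch below illustrates the idea; the matrix sizes, the use of the CPU as the reference device, and the comparison tolerance are illustrative assumptions rather than a production screening method.

    # Sketch: look for silent data corruption by recomputing a matrix multiply
    # on an independent device and comparing results. The tolerance must be
    # loose enough to allow for normal rounding differences between devices.
    import torch

    def possible_sdc(a: torch.Tensor, b: torch.Tensor, atol: float = 1e-2) -> bool:
        """Return True if the GPU result disagrees with an independent reference."""
        result_gpu = a @ b                                          # suspect GPU
        result_ref = (a.double().cpu() @ b.double().cpu()).float()  # reference device
        return not torch.allclose(result_gpu.cpu(), result_ref, atol=atol)

    a = torch.randn(512, 512, device="cuda")
    b = torch.randn(512, 512, device="cuda")
    if possible_sdc(a, b):
        print("Possible SDC: GPU result disagrees with the reference computation")

Fleet-level screening typically works on the same principle, running curated test kernels across every GPU and repeating them over time to catch intermittent faults.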

Research on SDCs

Research papers on SDCs include:

GPU Overheating

GPU overheating is where a GPU becomes too hot, leading to a failure or an incorrect computation. Overheating is more common under heavy loads and with aged GPUs (near their end-of-life) or brand new GPUs (early-life failures). Overheating can cause a GPU to fail either catastrophically, or with a partial computation failure in one or more tiles. Such failures may arise suddenly from very high temperatures or more gradually over time due to GPU aging.
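
Temperature is one of the few failure precursors that is easy to monitor directly. As a minimal sketch, the snippet below polls GPU temperatures through NVIDIA's NVML bindings (the pynvml package); the 85°C alert threshold is an arbitrary example value, not a vendor recommendation.

    # Sketch: poll GPU temperatures with NVML and flag chips running hot.
    # Requires the pynvml package; the 85 C threshold is illustrative only.
    import pynvml

    def hot_gpus(threshold_c: int = 85) -> list[tuple[int, int]]:
        """Return (gpu_index, temperature_c) pairs above the threshold."""
        pynvml.nvmlInit()
        try:
            hot = []
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                if temp > threshold_c:
                    hot.append((i, temp))
            return hot
        finally:
            pynvml.nvmlShutdown()

    for index, temp in hot_gpus():
        print(f"GPU {index} running hot: {temp} C")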

Research papers on overheating:

Transient Soft Errors

Transient soft errors are hardware errors that don't trigger an exception, and do not recur. They are also known simply as "soft errors," because the failure is not permanent. There is some overlap with Silent Data Corruption (SDCs), since a transient soft error is also silent. However, soft errors are not only caused by manufacturing errors in the silicon, but can occur intermittently due to the effects of atmospheric radiation.

Bizarrely, the nanometer scale of silicon circuitry is so tiny that individual transistors can be affected by a single particle (e.g., a neutron), and such particles arise spontaneously from cosmic rays in the wild. The effect is harmless in many cases, but occasionally it directly causes a "bit flip" in one of the circuits. Various physical shielding techniques can reduce these problems, but not avoid them completely.
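
Because a bit flip leaves no exception behind, one practical defense is to checksum critical data, such as model weights, and periodically re-verify it. Below is a minimal PyTorch sketch of that idea; the hashing scheme and the point at which the re-check runs are illustrative assumptions.

    # Sketch: detect silent bit flips in model weights by comparing checksums.
    # Hash the weights once, then re-hash later; a mismatch in a tensor that
    # should not have changed suggests memory corruption or a soft error.
    import hashlib
    import torch

    def weight_checksums(model: torch.nn.Module) -> dict[str, str]:
        """Return a SHA-256 digest of each parameter tensor's raw bytes."""
        return {
            name: hashlib.sha256(p.detach().cpu().numpy().tobytes()).hexdigest()
            for name, p in model.named_parameters()
        }

    model = torch.nn.Linear(16, 4)
    baseline = weight_checksums(model)
    # ... later, after a period with no intentional weight updates ...
    corrupted = [n for n, h in weight_checksums(model).items() if h != baseline[n]]
    if corrupted:
        print("Possible bit flip in:", corrupted)

ECC memory performs a similar check-and-correct function in hardware, which is one reason datacenter GPUs ship with ECC while most consumer cards do not.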

Research papers on GPU soft errors include:

High Network Latency

High network latency is slow transmission of data during LLM training across a multi-GPU data center training stack. The speed of the network is critical to both the performance and the resiliency of an AI training job. Network load fluctuates in a typical training workload, with bursts of network traffic as computation segments are farmed out to workers, followed by a lull during large-scale parallel computation, and then another burst as results are returned from the leaf nodes back to the center.
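
A simple way to keep an eye on this during training is to time the collective communication operations themselves. The sketch below measures average all-reduce latency with torch.distributed; it assumes the job was launched with a standard launcher (e.g., torchrun) so that the NCCL process group can initialize, and the tensor size and iteration count are arbitrary.

    # Sketch: measure all-reduce latency across a torch.distributed job.
    # Assumes a standard distributed launch so init_process_group() can
    # read the usual environment variables (rank, world size, master addr).
    import time
    import torch
    import torch.distributed as dist

    def time_allreduce(num_elements: int = 1_000_000, iters: int = 10) -> float:
        """Return the average all-reduce latency in milliseconds."""
        tensor = torch.ones(num_elements, device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) * 1000.0 / iters

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    latency = time_allreduce()
    if dist.get_rank() == 0:
        print(f"Average all-reduce latency: {latency:.2f} ms")

Logging numbers like this over time makes latency spikes and slow links visible before they show up as mysterious end-to-end slowdowns.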

Research papers on AI network optimizations include:

Silent Floating-Point Computation Errors

Floating-point exceptions are often silent in GPU kernels, so GPU software needs to take extra care. Whereas a CPU computation might trigger SIGFPE, the GPU is likely to quietly continue. This can lead to incorrect results that are insidious, or it may produce special erroneous values such as NaN (not-a-number) and Inf (infinity, either positive or negative).
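
Because the GPU will happily keep computing with NaN or Inf values, training code often has to check for them explicitly. Here is a minimal PyTorch sketch of such a check; where it gets called (e.g., on the loss after each step) is up to the training loop.

    # Sketch: explicitly check tensors for NaN/Inf, since GPU kernels
    # typically propagate them silently instead of raising an exception.
    import torch

    def assert_finite(name: str, tensor: torch.Tensor) -> None:
        """Raise an error if the tensor contains any NaN or +/-Inf values."""
        if not torch.isfinite(tensor).all():
            bad = (~torch.isfinite(tensor)).sum().item()
            raise FloatingPointError(f"{name} contains {bad} non-finite values")

    x = torch.tensor([1.0, 0.0])
    y = x / 0.0            # silently yields [inf, nan]; no SIGFPE is raised
    assert_finite("y", y)  # raises FloatingPointError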

Research papers on floating point errors:

Floating-Point Runtime Error Checkers

Since floating-point errors are often silent in GPUs, it is advantageous to use runtime tools that can detect them. There are a variety of such tools under development in research, but as yet there is no mainstream tool in widespread use.
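
In the meantime, a rudimentary runtime checker can be assembled from standard framework features. The sketch below uses PyTorch forward hooks to scan every module's output for non-finite values as the forward pass runs; this is an illustrative approach, not a reference to any specific research tool, and it adds noticeable overhead.

    # Sketch: a simple runtime floating-point checker built from forward hooks.
    # Every module's output is scanned for NaN/Inf during the forward pass.
    import torch

    def install_fp_checker(model: torch.nn.Module) -> None:
        def check(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise FloatingPointError(
                    f"Non-finite output from {module.__class__.__name__}")
        for module in model.modules():
            module.register_forward_hook(check)

    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
    install_fp_checker(model)
    out = model(torch.randn(2, 8))   # raises only if a layer produces NaN/Inf
    # torch.autograd.set_detect_anomaly(True) provides a related check for
    # NaNs produced during the backward pass, at a significant speed cost.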

Research papers on tools that detect floating-point errors and exceptions at runtime:

Checkpointing

Checkpointing is a resilience technique that stores a copy of the current application state, called a "checkpoint." In AI training, this is effectively a backup of the calculated weights up to the current point of training. A checkpoint can be used as a restart point when a failure is detected, or as a way to pause a training job temporarily.

Checkpoints can be used in LLM training to achieve several different aims:

  • Backup of the training state for fast recovery from training failures.
  • Pausing and later resuming a training procedure.
  • Comparing models across different parts of the training sequence.

Checkpointing is most often used as a reliability improvement for LLM training. If a failure occurs, the training application can re-load the checkpoint data and re-start from that point, rather than starting from scratch. Hence, the idea with LLM training is to store the computed parameter values at regular checkpoints, offloaded to CPU memory rather than using up precious GPU VRAM. Training progress up to that point is thereby preserved, and won't be lost even after a serious failure. It's kind of like the Microsoft Word "Autosave" feature, if you turn your head and squint sideways.
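
As a concrete illustration, the basic save-and-restore checkpoint pattern in PyTorch looks something like the sketch below; the file name and the choice of what to include in the checkpoint are illustrative.

    # Sketch: basic checkpointing of model, optimizer, and step counter.
    import torch

    def save_checkpoint(path, model, optimizer, step):
        torch.save({
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }, path)

    def load_checkpoint(path, model, optimizer):
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]   # training resumes from this step

    model = torch.nn.Linear(32, 32)
    optimizer = torch.optim.AdamW(model.parameters())
    save_checkpoint("ckpt_step1000.pt", model, optimizer, step=1000)
    start_step = load_checkpoint("ckpt_step1000.pt", model, optimizer)

At LLM scale the same pattern applies, but the state is sharded across many workers and the save itself becomes a bottleneck, which is what the optimizations discussed below address.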

Given the size of LLMs, and the need to store all parameters during training, the amount of data is large. This can cause bottlenecks due to:

    (a) network bandwidth, and

    (b) write storage latency.

There is a need to take checkpoints at short intervals, so as not to lose much work in a rollback scenario, but frequent checkpointing increases the overall cost of using checkpoints for recovery from failures. The delay in training while awaiting storage of a checkpoint is sometimes called a "checkpoint stall."

To address the inefficiencies inherent to checkpointing a large LLM, various optimizations to checkpointing have been developed:

  • Asynchronous checkpointing
  • Incremental checkpointing
  • Quantized checkpointing
  • Distributed checkpointing
  • In-memory checkpointing
  • Lazy checkpointing
  • Checkpointing network optimizations (e.g., overlapping or interleaving checkpoint network traffic with training traffic).
  • Checkpoint compression (smaller sizes)

Research papers on checkpointing for AI training workloads:

Note that some types of checkpointing/offloading algorithms are more focused on speed optimization than on making a checkpoint/backup for resiliency purposes. One speed optimization of LLM training is to offload some of the model parameters from GPU memory to CPU memory. These model weights are then later re-loaded, or merged back into tensors via recomputation/re-materialization, for further computations. The aim of this type of checkpointing is more memory-efficient training.

In-Memory Checkpointing

In-memory checkpointing is an AI training optimization whereby a "checkpoint," or backup of the current state, is stored in memory. This is more efficient than on-disk checkpointing, because the delay due to storing a large amount of data to disk or SSD is avoided. Using memory to store checkpoints allows faster completion of each checkpoint, and therefore more frequent checkpointing. When a failure is detected, the system can recover from an in-memory checkpoint more efficiently than by loading the checkpoint data from disk.
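
A minimal form of the idea is simply to clone the model state into CPU RAM instead of writing it to disk, as in the sketch below. Production systems such as GEMINI also replicate these snapshots to other machines' memory so they survive a full node failure; this sketch shows only the local GPU-to-CPU copy.

    # Sketch: keep a checkpoint in CPU memory instead of writing it to disk.
    import torch

    def snapshot_to_cpu(model: torch.nn.Module) -> dict:
        """Clone all model state into CPU memory as an in-memory checkpoint."""
        return {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}

    def restore_from_cpu(model: torch.nn.Module, snapshot: dict) -> None:
        model.load_state_dict(snapshot)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(64, 64).to(device)
    checkpoint = snapshot_to_cpu(model)   # fast: no disk or network write
    # ... training continues; if this segment fails, roll back:
    restore_from_cpu(model, checkpoint)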

Research papers on in-memory checkpointing to CPU memory, the current SOTA, include:

  • Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 364–381. https://doi.org/10.1145/3600006.3613145 https://dl.acm.org/doi/10.1145/3600006.3613145 https://www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf (First paper on in-memory checkpointing to CPU memory, and also covers interleaving of checkpointing network traffic with training traffic.)
  • Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu, 19 Aug 2024 (v4), Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing, https://arxiv.org/abs/2310.12670
  • Zhuang Wang, Zhen Jia, October 25, 2023, More-efficient recovery from failures during large-ML-model training. Novel “checkpointing” scheme that uses CPU memory reduces the time wasted on failure recovery by more than 92%. https://www.amazon.science/blog/more-efficient-recovery-from-failures-during-large-ml-model-training
  • S. Wang, Q. Cao, K. Zhou, J. Xu, Z. Guo and J. Guo, "ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 183-190, doi: 10.1109/ICCD63220.2024.00036. https://ieeexplore.ieee.org/abstract/document/10818161/ (Generalizing in-memory checkpoints by storing data in shards across multiple storage areas including CPU memory and SSDs.)

Asynchronous Checkpointing

Asynchronous checkpointing is where the LLM training job requests a checkpoint to be stored, but does not await completion of the storage of the checkpoint. It is often used with in-memory checkpointing, but can be combined with any checkpointing method. The async checkpointing algorithm must ensure that additional training updates that occur after the checkpoint request, but during the checkpoint storage, are not erroneously stored as part of the checkpoint.
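
A minimal version of the pattern is to take the GPU-to-CPU snapshot synchronously, so that later weight updates cannot leak into it, and then persist that frozen snapshot in a background thread, as sketched below; the threading scheme is an illustration rather than a production implementation.

    # Sketch: asynchronous checkpointing. The snapshot is taken synchronously
    # so later weight updates can't contaminate it; the slow disk write then
    # proceeds in a background thread while training continues.
    import threading
    import torch

    def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
        # Step 1 (synchronous): freeze a copy of the weights in CPU memory.
        snapshot = {k: v.detach().to("cpu", copy=True)
                    for k, v in model.state_dict().items()}
        # Step 2 (asynchronous): persist the frozen snapshot without blocking.
        writer = threading.Thread(target=torch.save, args=(snapshot, path))
        writer.start()
        return writer   # join() before shutdown to be sure the write finished

    model = torch.nn.Linear(128, 128)
    pending = async_checkpoint(model, "ckpt_async.pt")
    # ... training continues immediately ...
    pending.join()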

Research papers on asynchronous checkpointing, which is now a standard technique:

GPU Failures and Reliability

GPU failures are where a GPU performs an incorrect calculation or triggers an exception. Catastrophic GPU failures are where the entire GPU burns out, but less severe failures can include single-tile burnouts or transient errors such as Silent Data Corruption (SDC) and other transient soft errors.
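
One practical reliability signal that is easy to collect is the GPU's ECC error counters. The sketch below reads them through the pynvml bindings, assuming ECC is enabled (as it typically is on datacenter GPUs); a growing uncorrected-error count is a strong hint that a GPU should be drained and inspected.

    # Sketch: query per-GPU ECC error counters via NVML as a basic health check.
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            uncorrected = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,   # counts since the last driver reload
            )
        except pynvml.NVMLError:
            continue   # ECC not supported or not enabled on this GPU
        if uncorrected > 0:
            print(f"GPU {i}: {uncorrected} uncorrected ECC errors -- investigate")
    pynvml.nvmlShutdown()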

Research papers on the issues of GPU errors/failures and overall GPU reliability:

Fault Tolerance

Research on fault tolerance in AI systems:

More AI Research

Read more about: