Aussie AI

Resiliency in Large-Scale Datacenters

  • Last Updated 2 March, 2025
  • by David Spuler, Ph.D.

Resiliency in Datacenter AI

Resiliency is the correct handling of failures that occur in the data center training infrastructure. Achieving resiliency in your AI backend is important for delivering a high level of quality in any application. High accuracy and fast speed are desirable for both training and inference workloads, but they are more critical for training, because the impact is more central to the performance of one huge job. Supercomputing clusters of 100,000+ GPU chips amplify this importance significantly, and raise a whole new level of challenges. Although the "optimizer" algorithm is important for training results in terms of both accuracy and convergence time, and has consequently attracted an enormous amount of research, there are also lower-level technical issues related to the underlying infrastructure that runs these training algorithms. There are various issues related to the GPU chips, the server hardware, and the networking communications layers between them.

Types of Datacenter Resilience Issues

Supercomputing clusters running AI training across 100,000+ GPUs are somewhat fickle. Some of the general types of technical issues affecting distributed training resiliency on a multi-GPU AI platform include:

  • Stragglers (slow workers)
  • Hangs (never-finishing workers)
  • High network latency

Failures can occur in almost any component:

  • CPU
  • GPU
  • Memory
  • Disk
  • Power supply
  • Cooling
  • Networking hardware
  • Other hardware infrastructure

Failures can even occur in the hardware or software that's supposed to detect or correct failures! For example, these can fail:

  • Monitoring interfaces
  • Checkpoint/restart infrastructure
  • Out-of-band networking components

The GPU is itself a complicated piece of equipment that has a non-zero failure rate. Some of the hardware issues specific to the GPU include:

  • Silent Data Corruption (SDC) errors
  • Overheating GPUs
  • Aging GPUs ("silicon wear-out")
  • Transient soft errors (e.g., random bit flips from radiation)
  • Early life failures

And the software layer can contribute insidious errors in various ways:

  • Silent GPU floating-point exceptions
  • Silent software kernel errors
  • Bounds violations hidden in contiguous blocks

Problems that arise in the networking layer between GPUs, whether in the same multi-GPU server or across multiple distributed servers, include:

  • Network latency
  • Network congestion
  • Timeouts
  • Network error states

If you're looking for an easy fix for a small server room in your building's basement, here's a suggestion: sort out the air-conditioning system so that the server room is a few degrees cooler. That will lower your failure rate for multiple types of hardware component. But if you've got 100,000 servers running from a hydro-electric power plant next door, you can't just click the thermostat down a couple notches.

Stragglers and Hangs

Stragglers are software processes that run slowly in a multi-GPU AI training sequence, returning their resultant weight updates with a delay. Hangs are similar, but are worker processes that fail to ever return successfully. When farming out training tasks to various "worker" nodes, the slowest workers are called "stragglers" (slowest to return) and the ones that never complete are "hangs." Distributed training is constrained to progress at the rate of the slowest straggler, so addressing them is not only a resiliency improvement, but also a training speed optimization.

Stragglers are a general problem with distributed workloads, and there's not a single cause of a job that's slow to return its results. Problems can arise due to:

  • Hardware problems (GPU or CPU).
  • Software kernel errors (e.g., poorly handled edge cases).
  • Network issues (various types).

Stragglers may be repeat offenders (e.g., a recurring GPU, CPU, or other server issue), or the problem may be dispersed across different servers at random (e.g., network congestion randomly slowing down some outgoing messages or responses).

Straggler mitigation is an AI training optimization that aims to speed up training by reducing slowdowns from straggler workers. Mitigation strategies can include isolating a single GPU or single server, if one is repeatedly causing problems, or addressing any underlying causes such as network congestion issues.
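
As a simple illustration of straggler detection, the minimal sketch below flags workers whose step completion time is far above the median of their peers. This is not from any particular framework; the worker names, timing data, and the 1.5x-median threshold are all hypothetical.

    # Minimal sketch: flag straggler workers from per-step completion times.
    # The coordinator is assumed to collect per-worker timings (in seconds);
    # the 1.5x-median threshold is an arbitrary illustrative choice.
    from statistics import median

    def find_stragglers(step_times: dict[str, float], slack: float = 1.5) -> list[str]:
        """Return worker IDs whose step time exceeds slack * median time."""
        typical = median(step_times.values())
        return [worker for worker, t in step_times.items() if t > slack * typical]

    # Example: worker "gpu-017" is clearly lagging behind its peers.
    timings = {"gpu-001": 1.02, "gpu-002": 0.98, "gpu-017": 2.75, "gpu-042": 1.05}
    print(find_stragglers(timings))   # ['gpu-017']

In a real cluster, a worker flagged this way repeatedly would be a candidate for isolation or replacement, while one-off flags spread across many different workers point more towards network congestion.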

Research papers on stragglers and straggler mitigation in AI datacenters:

  • Amir Javadpour, Guojun Wang, Samira Rezaei, Kuan Ching Li, 13 Apr 2020, Detecting Straggler MapReduce Tasks in Big Data Processing Infrastructure by Neural Network, https://arxiv.org/abs/2004.05868
  • Yi Wang, Rohan Varma, April 07, 2023, Straggler Mitigation On PyTorch DDP By Hierarchical SGD, https://pytorch.org/blog/straggler-mitigation/
  • Yang, E., Kang, DK. & Youn, CH. BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster. J Supercomput 76, 47–67 (2020). https://doi.org/10.1007/s11227-019-02845-2 https://link.springer.com/article/10.1007/s11227-019-02845-2
  • Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui, 17 Oct 2024, Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization, https://arxiv.org/abs/2410.13333
  • H. Kim, C. Song, H. Lee and H. Yu, "Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters," 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 2023, pp. 1-6, doi: 10.1109/ICCE56470.2023.10043527. https://ieeexplore.ieee.org/document/10043527
  • Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training A review of the challenges in Synchronous distributed training and best solutions for stragglers and high latency https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
  • Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
  • Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang, 16 Oct 2024, FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training, https://arxiv.org/abs/2410.12588
  • Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper, 6 Nov 2024, Exploiting Stragglers in Distributed Computing Systems with Task Grouping, https://arxiv.org/abs/2411.03645 (Reduce straggler work loss by using more granular workloads.)
  • Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton, 9 Aug 2024, Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge, https://arxiv.org/abs/2408.05152
  • Aditya Ramamoorthy, Ruoyu Meng, Vrinda S. Girimaji, 18 Nov 2024 (v2), Leveraging partial stragglers within gradient coding, https://arxiv.org/abs/2405.19509
  • Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou, 15 Apr 2024, AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes, https://arxiv.org/abs/2404.09679
  • Natalie Lang, Alejandro Cohen, Nir Shlezinger, 27 Mar 2024, Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates, https://arxiv.org/abs/2403.18375
  • Chengxi Li, Ming Xiao, Mikael Skoglund, 22 Mar 2024, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
  • Chengxi Li, Mikael Skoglund, 19 Mar 2024, Distributed Learning based on 1-Bit Gradient Coding in the Presence of Stragglers, https://arxiv.org/abs/2403.14716
  • Andrew Hard, Antonious M. Girgis, Ehsan Amid, Sean Augenstein, Lara McConnaughey, Rajiv Mathews, Rohan Anil, 14 Mar 2024, Learning from straggler clients in federated learning, https://arxiv.org/abs/2403.09086
  • Chengxi Li, Mikael Skoglund, 14 Jun 2024 (v3), Gradient Coding in Decentralized Learning for Evading Stragglers, https://arxiv.org/abs/2402.04193
  • Hongpeng Guo, Haotian Gu, Xiaoyang Wang, Bo Chen, Eun Kyung Lee, Tamar Eilam, Deming Chen, Klara Nahrstedt, 31 Jan 2024, FedCore: Straggler-Free Federated Learning with Distributed Coresets, https://arxiv.org/abs/2402.00219

Silent Data Corruption (SDC)

Silent Data Corruption (SDC) is a computational error that produces incorrect results without triggering an exception. SDCs are a specific type of hardware error, usually originating in the GPU manufacturing process. They are an insidious type of error that causes anomalous computations, but does not trigger any exceptions (i.e., "silent"). Programmers are aware of numerous types of coding errors that cause problems without warnings, and this can occur in hardware, too.

SDCs are usually quite obscure, because the affected chip must have passed GPU acceptance testing as part of the manufacturing process. If you have a GPU in your gaming PC, it's not very likely that you have one, but if you're running an AI training workload on a datacenter supercomputer with 100,000 GPUs, the odds are much higher.

SDCs are caused by random fluctuations in the intricate nanometer-scale processes that create GPUs. Hence, SDCs usually have characteristics such as:

  • Affect individual chips (i.e., a minuscule manufacturing defect).
  • Specific to a particular microcode instruction or processing sequence.
  • Localized to one region of the single chip.
  • Not always the same type of error.
  • Sometimes intermittent.

Note that SDCs are not typically considered to include:

  • GPU acceptance testing failures (i.e., not silent).
  • Large GPU failures from overheating (although SDCs can also be heat-dependent).
  • Microcoding or hardware design errors (affecting all chips).

Given their obscurity, SDCs are also:

  • Hard to detect
  • Problematic to prove (even if suspected)
  • Difficult to mitigate against
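
One common way to hunt for a suspected SDC is redundant execution: run the same computation twice on independent hardware and compare the results. The PyTorch sketch below illustrates the idea; the matrix sizes, the use of the CPU as the reference device, and the comparison tolerance are illustrative assumptions rather than a production screening method.

    # Sketch: look for silent data corruption by recomputing a matrix multiply
    # on an independent device and comparing results. The tolerance must be
    # loose enough to allow for normal rounding differences between devices.
    import torch

    def possible_sdc(a: torch.Tensor, b: torch.Tensor, atol: float = 1e-2) -> bool:
        """Return True if the GPU result disagrees with an independent reference."""
        result_gpu = a @ b                                          # suspect GPU
        result_ref = (a.double().cpu() @ b.double().cpu()).float()  # reference device
        return not torch.allclose(result_gpu.cpu(), result_ref, atol=atol)

    a = torch.randn(512, 512, device="cuda")
    b = torch.randn(512, 512, device="cuda")
    if possible_sdc(a, b):
        print("Possible SDC: GPU result disagrees with the reference computation")

Fleet-level screening typically works on the same principle, running curated test kernels across every GPU and repeating them over time to catch intermittent faults.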

Research on SDCs

Research papers on SDCs include:

GPU Overheating

GPU overheating is where a GPU becomes too hot, leading to a failure or an incorrect computation. Overheating is more common under heavy loads and with aged GPUs (near their end-of-life) or brand new GPUs (early-life failures). Overheating can cause a GPU to fail either catastrophically, or with a partial computation failure in one or more tiles. Such failures may arise suddenly from very high temperatures or more gradually over time due to GPU aging.
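
Temperature is one of the few failure precursors that is easy to monitor directly. As a minimal sketch, the snippet below polls GPU temperatures through NVIDIA's NVML bindings (the pynvml package); the 85°C alert threshold is an arbitrary example value, not a vendor recommendation.

    # Sketch: poll GPU temperatures with NVML and flag chips running hot.
    # Requires the pynvml package; the 85 C threshold is illustrative only.
    import pynvml

    def hot_gpus(threshold_c: int = 85) -> list[tuple[int, int]]:
        """Return (gpu_index, temperature_c) pairs above the threshold."""
        pynvml.nvmlInit()
        try:
            hot = []
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                if temp > threshold_c:
                    hot.append((i, temp))
            return hot
        finally:
            pynvml.nvmlShutdown()

    for index, temp in hot_gpus():
        print(f"GPU {index} running hot: {temp} C")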

Research papers on overheating:

Transient Soft Errors

Transient soft errors are hardware errors that don't trigger an exception, and do not recur. They are also known simply as "soft errors," because the failure is not permanent. There is some overlap with Silent Data Corruption (SDCs), since a transient soft error is also silent. However, soft errors are not only caused by manufacturing errors in the silicon, but can occur intermittently due to the effects of atmospheric radiation.

Bizarrely, the nanometer scale of silicon circuitry is so tiny that individual transistors can be affected by a single particle (e.g., a neutron), and such particles arise spontaneously from cosmic rays in the wild. The effect is harmless in many cases, but occasionally it directly causes a "bit flip" in one of the circuits. Various physical shielding techniques can reduce these problems, but not avoid them completely.
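
Because a bit flip leaves no exception behind, one practical defense is to checksum critical data, such as model weights, and periodically re-verify it. Below is a minimal PyTorch sketch of that idea; the hashing scheme and the point at which the re-check runs are illustrative assumptions.

    # Sketch: detect silent bit flips in model weights by comparing checksums.
    # Hash the weights once, then re-hash later; a mismatch in a tensor that
    # should not have changed suggests memory corruption or a soft error.
    import hashlib
    import torch

    def weight_checksums(model: torch.nn.Module) -> dict[str, str]:
        """Return a SHA-256 digest of each parameter tensor's raw bytes."""
        return {
            name: hashlib.sha256(p.detach().cpu().numpy().tobytes()).hexdigest()
            for name, p in model.named_parameters()
        }

    model = torch.nn.Linear(16, 4)
    baseline = weight_checksums(model)
    # ... later, after a period with no intentional weight updates ...
    corrupted = [n for n, h in weight_checksums(model).items() if h != baseline[n]]
    if corrupted:
        print("Possible bit flip in:", corrupted)

ECC memory performs a similar check-and-correct function in hardware, which is one reason datacenter GPUs ship with ECC while most consumer cards do not.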

Research papers on GPU soft errors include:

High Network Latency

High network latency is slow transmission of data during LLM training across a multi-GPU data center training stack. The speed of the network is critical to both the performance and the resiliency of an AI training job. Network load fluctuates in a typical training workload, with bursts of network traffic as computation segments are farmed out to workers, followed by a lull during large-scale parallel computation, and then another burst as results are returned from the leaf nodes back to the center.
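
A simple way to keep an eye on this during training is to time the collective communication operations themselves. The sketch below measures average all-reduce latency with torch.distributed; it assumes the job was launched with a standard launcher (e.g., torchrun) so that the NCCL process group can initialize, and the tensor size and iteration count are arbitrary.

    # Sketch: measure all-reduce latency across a torch.distributed job.
    # Assumes a standard distributed launch so init_process_group() can
    # read the usual environment variables (rank, world size, master addr).
    import time
    import torch
    import torch.distributed as dist

    def time_allreduce(num_elements: int = 1_000_000, iters: int = 10) -> float:
        """Return the average all-reduce latency in milliseconds."""
        tensor = torch.ones(num_elements, device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) * 1000.0 / iters

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    latency = time_allreduce()
    if dist.get_rank() == 0:
        print(f"Average all-reduce latency: {latency:.2f} ms")

Logging numbers like this over time makes latency spikes and slow links visible before they show up as mysterious end-to-end slowdowns.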

Research papers on AI network optimizations include:

Silent Floating-Point Computation Errors

Floating-point exceptions are often silent in GPU kernels, so GPU software needs to take extra care. Whereas a CPU computation might trigger SIGFPE, the GPU is likely to quietly continue. This can lead to incorrect results that are insidious, or it may produce special erroneous values such as NaN (not-a-number) and Inf (infinity, either positive or negative).
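
Because the GPU will happily keep computing with NaN or Inf values, training code often has to check for them explicitly. Here is a minimal PyTorch sketch of such a check; where it gets called (e.g., on the loss after each step) is up to the training loop.

    # Sketch: explicitly check tensors for NaN/Inf, since GPU kernels
    # typically propagate them silently instead of raising an exception.
    import torch

    def assert_finite(name: str, tensor: torch.Tensor) -> None:
        """Raise an error if the tensor contains any NaN or +/-Inf values."""
        if not torch.isfinite(tensor).all():
            bad = (~torch.isfinite(tensor)).sum().item()
            raise FloatingPointError(f"{name} contains {bad} non-finite values")

    x = torch.tensor([1.0, 0.0])
    y = x / 0.0            # silently yields [inf, nan]; no SIGFPE is raised
    assert_finite("y", y)  # raises FloatingPointError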

Research papers on floating point errors:

Floating-Point Runtime Error Checkers

Since floating-point errors are often silent in GPUs, it is advantageous to use runtime tools that can detect them. There are a variety of such tools under development in research, but as yet there is no mainstream tool in widespread use.
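
In the meantime, a rudimentary runtime checker can be assembled from standard framework features. The sketch below uses PyTorch forward hooks to scan every module's output for non-finite values as the forward pass runs; this is an illustrative approach, not a reference to any specific research tool, and it adds noticeable overhead.

    # Sketch: a simple runtime floating-point checker built from forward hooks.
    # Every module's output is scanned for NaN/Inf during the forward pass.
    import torch

    def install_fp_checker(model: torch.nn.Module) -> None:
        def check(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise FloatingPointError(
                    f"Non-finite output from {module.__class__.__name__}")
        for module in model.modules():
            module.register_forward_hook(check)

    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
    install_fp_checker(model)
    out = model(torch.randn(2, 8))   # raises only if a layer produces NaN/Inf
    # torch.autograd.set_detect_anomaly(True) provides a related check for
    # NaNs produced during the backward pass, at a significant speed cost.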

Research papers on tools that detect floating-point errors and exceptions at runtime:

Checkpointing

Checkpointing is a resilience technique that stores a copy of the current application state, called a "checkpoint." In AI training, this is effectively a backup of the calculated weights up to the current point of training. A checkpoint can be used as a restart point when a failure is detected, or as a way to pause a training job temporarily.

Checkpoints can be used in LLM training to achieve several different aims:

  • Backup of the training state for fast recovery from training failures.
  • Pausing and later resuming a training procedure.
  • Comparing models across different parts of the training sequence.

Checkpointing is most often used as a reliability improvement for LLM training. If a failure occurs, the training application can re-load the checkpoint data and re-start from that point, rather than starting from scratch. Hence, the idea with LLM training is to store the computed parameter values at regular checkpoints, offloaded to CPU memory rather than using up precious GPU VRAM. Training progress up to that point is thereby preserved, and won't be lost even after a serious failure. It's kind of like the Microsoft Word "Autosave" feature, if you turn your head and squint sideways.
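
As a concrete illustration, the basic save-and-restore checkpoint pattern in PyTorch looks something like the sketch below; the file name and the choice of what to include in the checkpoint are illustrative.

    # Sketch: basic checkpointing of model, optimizer, and step counter.
    import torch

    def save_checkpoint(path, model, optimizer, step):
        torch.save({
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }, path)

    def load_checkpoint(path, model, optimizer):
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]   # training resumes from this step

    model = torch.nn.Linear(32, 32)
    optimizer = torch.optim.AdamW(model.parameters())
    save_checkpoint("ckpt_step1000.pt", model, optimizer, step=1000)
    start_step = load_checkpoint("ckpt_step1000.pt", model, optimizer)

At LLM scale the same pattern applies, but the state is sharded across many workers and the save itself becomes a bottleneck, which is what the optimizations discussed below address.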

Given the size of LLMs, and the need to store all parameters during training, the amount of data is large. This can cause bottlenecks due to:

    (a) network bandwidth, and

    (b) write storage latency.

There is a need to take checkpoints at short intervals, so as not to lose much work in a rollback scenario, but frequent checkpointing increases the overall cost of using checkpoints for recovery from failures. The delay in training while awaiting storage of a checkpoint is sometimes called a "checkpoint stall."

To address the inefficiencies inherent to checkpointing a large LLM, various optimizations to checkpointing have been developed:

  • Asynchronous checkpointing
  • Incremental checkpointing
  • Quantized checkpointing
  • Distributed checkpointing
  • In-memory checkpointing
  • Lazy checkpointing
  • Checkpointing network optimizations (e.g., overlapping or interleaving checkpoint network traffic with training traffic).
  • Checkpoint compression (smaller sizes)

Research papers on checkpointing for AI training workloads:

Note that some types of checkpointing/offloading algorithms are more focused on speed optimization than on making a checkpoint/backup for resiliency purposes. One speed optimization of LLM training is to offload some of the model parameters from GPU memory to CPU memory. These model weights are then later re-loaded, or merged back into tensors via recomputation/re-materialization, for further computations. The aim of this type of checkpointing is more memory-efficient training.

In-Memory Checkpointing

In-memory checkpointing is an AI training optimization whereby a "checkpoint," or backup of the current state, is stored in memory. This is more efficient than on-disk checkpointing, because the delay due to storing a large amount of data to disk or SSD is avoided. Using memory to store checkpoints allows faster completion of each checkpoint, and therefore more frequent checkpointing. When a failure is detected, the system can recover from an in-memory checkpoint more efficiently than by loading the checkpoint data from disk.
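
A minimal form of the idea is simply to clone the model state into CPU RAM instead of writing it to disk, as in the sketch below. Production systems such as GEMINI also replicate these snapshots to other machines' memory so they survive a full node failure; this sketch shows only the local GPU-to-CPU copy.

    # Sketch: keep a checkpoint in CPU memory instead of writing it to disk.
    import torch

    def snapshot_to_cpu(model: torch.nn.Module) -> dict:
        """Clone all model state into CPU memory as an in-memory checkpoint."""
        return {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}

    def restore_from_cpu(model: torch.nn.Module, snapshot: dict) -> None:
        model.load_state_dict(snapshot)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(64, 64).to(device)
    checkpoint = snapshot_to_cpu(model)   # fast: no disk or network write
    # ... training continues; if this segment fails, roll back:
    restore_from_cpu(model, checkpoint)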

Research papers on in-memory checkpointing to CPU memory, the current SOTA, include:

  • Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 364–381. https://doi.org/10.1145/3600006.3613145 https://dl.acm.org/doi/10.1145/3600006.3613145 https://www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf (First paper on in-memory checkpointing to CPU memory, and also covers interleaving of checkpointing network traffic with training traffic.)
  • Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu, 19 Aug 2024 (v4), Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing, https://arxiv.org/abs/2310.12670
  • Zhuang Wang, Zhen Jia, October 25, 2023, More-efficient recovery from failures during large-ML-model training. Novel “checkpointing” scheme that uses CPU memory reduces the time wasted on failure recovery by more than 92%. https://www.amazon.science/blog/more-efficient-recovery-from-failures-during-large-ml-model-training
  • S. Wang, Q. Cao, K. Zhou, J. Xu, Z. Guo and J. Guo, "ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 183-190, doi: 10.1109/ICCD63220.2024.00036. https://ieeexplore.ieee.org/abstract/document/10818161/ (Generalizing in-memory checkpoints by storing data in shards across multiple storage areas including CPU memory and SSDs.)

Asynchronous Checkpointing

Asynchronous checkpointing is where the LLM training job requests a checkpoint to be stored, but does not await completion of the storage of the checkpoint. It is often used with in-memory checkpointing, but can be combined with any checkpointing method. The async checkpointing algorithm must ensure that additional training updates that occur after the checkpoint request, but during the checkpoint storage, are not erroneously stored as part of the checkpoint.
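
A minimal version of the pattern is to take the GPU-to-CPU snapshot synchronously, so that later weight updates cannot leak into it, and then persist that frozen snapshot in a background thread, as sketched below; the threading scheme is an illustration rather than a production implementation.

    # Sketch: asynchronous checkpointing. The snapshot is taken synchronously
    # so later weight updates can't contaminate it; the slow disk write then
    # proceeds in a background thread while training continues.
    import threading
    import torch

    def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
        # Step 1 (synchronous): freeze a copy of the weights in CPU memory.
        snapshot = {k: v.detach().to("cpu", copy=True)
                    for k, v in model.state_dict().items()}
        # Step 2 (asynchronous): persist the frozen snapshot without blocking.
        writer = threading.Thread(target=torch.save, args=(snapshot, path))
        writer.start()
        return writer   # join() before shutdown to be sure the write finished

    model = torch.nn.Linear(128, 128)
    pending = async_checkpoint(model, "ckpt_async.pt")
    # ... training continues immediately ...
    pending.join()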

Research papers on asynchronous checkpointing, which is now a standard technique:

GPU Failures and Reliability

GPU failures are where a GPU performs an incorrect calculation or triggers an exception. Catastrophic GPU failures are where the entire GPU burns out, but less severe failures can include single-tile burnouts or transient errors such as Silent Data Corruption (SDC) and other transient soft errors.
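
One practical reliability signal that is easy to collect is the GPU's ECC error counters. The sketch below reads them through the pynvml bindings, assuming ECC is enabled (as it typically is on datacenter GPUs); a growing uncorrected-error count is a strong hint that a GPU should be drained and inspected.

    # Sketch: query per-GPU ECC error counters via NVML as a basic health check.
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            uncorrected = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,   # counts since the last driver reload
            )
        except pynvml.NVMLError:
            continue   # ECC not supported or not enabled on this GPU
        if uncorrected > 0:
            print(f"GPU {i}: {uncorrected} uncorrected ECC errors -- investigate")
    pynvml.nvmlShutdown()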

Research papers on the issues of GPU errors/failures and overall GPU reliability:

Fault Tolerance

Research on fault tolerance in AI systems:

More AI Research

Read more about: