Aussie AI
Resiliency in Large-Scale Datacenters
Last Updated 2 March, 2025
by David Spuler, Ph.D.
Resiliency in Datacenter AI
Resiliency is the correct handling of failures that occur in the data center training infrastructure. Achieving resiliency in your AI backend is important for delivering a high level of quality in any application. High accuracy and fast speed are desirable for both training and inference workloads, but they are more critical for training, because any failure affects the progress of one huge job. Supercomputing clusters of 100,000+ GPU chips amplify this importance significantly, and raise a whole new level of challenges. Although the "optimizer" algorithm is important for training results in terms of both accuracy and convergence time, and consequently attracts an enormous volume of research papers, there are also lower-level technical issues related to the underlying infrastructure that runs these training algorithms. There are various issues related to the GPU chips, server hardware, and the networking communication layers between them.
Types of Datacenter Resilience Issues
Supercomputing clusters running AI training on 100,000+ GPUs are somewhat fickle. Some of the general types of resiliency issues that arise on a multi-GPU platform during distributed training include:
- Stragglers (slow workers)
- Hangs (never-finishing workers)
- High network latency
Failures can occur in almost any component:
- CPU
- GPU
- Memory
- Disk
- Power supply
- Cooling
- Networking hardware
- Other hardware infrastructure
Failures can even occur in the hardware or software that's supposed to detect or correct failures! For example, these can fail:
- Monitoring interfaces
- Checkpoint/restart infrastructure
- Out-of-band networking components
The GPU is itself a complicated piece of equipment that has a non-zero failure rate. Some of the hardware issues specific to the GPU include:
- Silent Data Corruption (SDC) errors
- Overheating GPUs
- Aging GPUs ("silicon wear-out")
- Transient soft errors (e.g., random bit flips from radiation)
- Early life failures
And the software layer can contribute insidious errors in various ways:
- Silent GPU floating-point exceptions
- Silent software kernel errors
- Bounds violations hidden in contiguous blocks
Problems that arise in the networking layer between GPUs, whether in the same multi-GPU server or across multiple distributed servers, include:
- Network latency
- Network congestion
- Timeouts
- Network error states
If you're looking for an easy fix for a small server room in your building's basement, here's a suggestion: sort out the air-conditioning system so that the server room is a few degrees cooler. That will lower your failure rate for multiple types of hardware component. But if you've got 100,000 servers running from a hydro-electric power plant next door, you can't just click the thermostat down a couple notches.
Stragglers and Hangs
Stragglers are software processes in a multi-GPU training run that execute slowly and return their resultant weight updates with a delay. Hangs are similar, except that the process never returns successfully. In other words, when training tasks are farmed out to "worker" nodes, the slowest workers are called "stragglers" (slow returns) and those that never complete are called "hangs." Distributed training is constrained to progress at the rate of the slowest straggler, so addressing stragglers is not only a resiliency improvement, but also a training speed optimization.
Stragglers are a general problem with distributed workloads, and there is no single cause of a job being slow to return its results. Problems can arise due to:
- Hardware problems (GPU or CPU).
- Software kernel errors (e.g., poorly handled edge cases).
- Network issues (various types).
Stragglers may be repeat offenders (e.g., a faulty GPU, CPU, or other server component), or the slowdowns may be scattered randomly across different servers (e.g., network congestion randomly delaying some outgoing messages or responses).
Straggler mitigation is an AI training optimization that aims to speed up training by reducing slowdowns from straggler workers. Mitigation strategies can include isolating a single GPU or single server, if one is repeatedly causing problems, or addressing any underlying causes such as network congestion issues.
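As a concrete illustration of straggler detection, here is a minimal Python sketch (not taken from any of the papers below) in which a coordinator compares per-worker step times against the median and flags outliers. The 1.5x slowdown factor, the hang timeout, and the worker-reporting interface are illustrative assumptions only.

    import statistics

    def find_stragglers(step_times, slow_factor=1.5, hang_timeout=600.0):
        """Flag workers whose last step time is far above the median (stragglers),
        or that have not reported back at all (treated as possible hangs).
        step_times: dict mapping worker id -> seconds for the last step,
        with None meaning the worker has not yet returned its result."""
        finished = {w: t for w, t in step_times.items() if t is not None}
        hangs = [w for w, t in step_times.items() if t is None]
        if not finished:
            return [], hangs
        median = statistics.median(finished.values())
        stragglers = [w for w, t in finished.items()
                      if t > slow_factor * median or t > hang_timeout]
        return stragglers, hangs

    # Example: worker 3 is a straggler, worker 4 has not returned (possible hang).
    times = {0: 10.2, 1: 9.8, 2: 10.5, 3: 31.0, 4: None}
    print(find_stragglers(times))  # ([3], [4])

A real scheduler would track these statistics over many steps to distinguish a one-off network delay from a repeat-offender node.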
Research papers on stragglers and straggler mitigation in AI datacenters:
- Amir Javadpour, Guojun Wang, Samira Rezaei, Kuan Ching Li, 13 Apr 2020, Detecting Straggler MapReduce Tasks in Big Data Processing Infrastructure by Neural Network, https://arxiv.org/abs/2004.05868
- Yi Wang, Rohan Varma, April 07, 2023, Straggler Mitigation On PyTorch DDP By Hierarchical SGD, https://pytorch.org/blog/straggler-mitigation/
- Yang, E., Kang, DK. & Youn, CH. BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster. J Supercomput 76, 47–67 (2020). https://doi.org/10.1007/s11227-019-02845-2 https://link.springer.com/article/10.1007/s11227-019-02845-2
- Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui, 17 Oct 2024, Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization, https://arxiv.org/abs/2410.13333
- H. Kim, C. Song, H. Lee and H. Yu, "Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters," 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 2023, pp. 1-6, doi: 10.1109/ICCE56470.2023.10043527. https://ieeexplore.ieee.org/document/10043527
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training A review of the challenges in Synchronous distributed training and best solutions for stragglers and high latency https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang, 16 Oct 2024, FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training, https://arxiv.org/abs/2410.12588
- Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper, 6 Nov 2024, Exploiting Stragglers in Distributed Computing Systems with Task Grouping, https://arxiv.org/abs/2411.03645 (Reduce straggler work loss by using more granular workloads.)
- Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton, 9 Aug 2024, Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge, https://arxiv.org/abs/2408.05152
- Aditya Ramamoorthy, Ruoyu Meng, Vrinda S. Girimaji, 18 Nov 2024 (v2), Leveraging partial stragglers within gradient coding, https://arxiv.org/abs/2405.19509
- Youshao Xiao, Lin Ju, Zhenglei Zhou, Siyuan Li, Zhaoxin Huan, Dalong Zhang, Rujie Jiang, Lin Wang, Xiaolu Zhang, Lei Liang, Jun Zhou, 15 Apr 2024, AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes, https://arxiv.org/abs/2404.09679
- Natalie Lang, Alejandro Cohen, Nir Shlezinger, 27 Mar 2024, Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates, https://arxiv.org/abs/2403.18375
- Chengxi Li, Ming Xiao, Mikael Skoglund, 22 Mar 2024, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
- Chengxi Li, Mikael Skoglund, 19 Mar 2024, Distributed Learning based on 1-Bit Gradient Coding in the Presence of Stragglers, https://arxiv.org/abs/2403.14716
- Andrew Hard, Antonious M. Girgis, Ehsan Amid, Sean Augenstein, Lara McConnaughey, Rajiv Mathews, Rohan Anil, 14 Mar 2024, Learning from straggler clients in federated learning, https://arxiv.org/abs/2403.09086
- Chengxi Li, Mikael Skoglund, 14 Jun 2024 (v3), Gradient Coding in Decentralized Learning for Evading Stragglers, https://arxiv.org/abs/2402.04193
- Hongpeng Guo, Haotian Gu, Xiaoyang Wang, Bo Chen, Eun Kyung Lee, Tamar Eilam, Deming Chen, Klara Nahrstedt, 31 Jan 2024, FedCore: Straggler-Free Federated Learning with Distributed Coresets, https://arxiv.org/abs/2402.00219
Silent Data Corruption (SDC)
Silent Data Corruption (SDC) is a computational error in AI training that causes incorrect results without triggering an exception. SDCs are a specific type of hardware fault, typically originating in the GPU manufacturing process. They are an insidious class of error that causes anomalous computations, but does not raise any exceptions (i.e., "silent"). Programmers are familiar with numerous types of coding bugs that produce wrong answers without warning, and this can occur in hardware, too.
SDCs are usually quite obscure, because they must have passed the GPU acceptance testing as part of the manufacturing process. If you have a GPU in your gaming PC, it's not that likely that you have one, but if you're running an AI training workload on a datacenter supercomputer with 100,000 GPUs, the odds are higher.
SDCs are caused by random fluctuations in the intricate nanometer-scale processes that create GPUs. Hence, SDCs usually have characteristics such as:
- Affect individual chips (i.e., a minuscule manufacturing defect).
- Specific to a particular microcode instruction or processing sequence.
- Localized to one region of the single chip.
- Not always the same type of error.
- Sometimes intermittent.
Note that SDCs are not typically considered to include:
- GPU acceptance testing failures (i.e., not silent).
- Large GPU failures from overheating (although SDCs can also be heat-dependent).
- Microcoding or hardware design errors (affecting all chips).
Given their obscurity, SDCs are also:
- Hard to detect
- Problematic to prove (even if suspected)
- Difficult to mitigate against
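One pragmatic, if costly, way to catch an SDC in software is redundant execution: run the same computation twice and compare the results. The PyTorch sketch below is a simple illustration of that idea (the double-execution overhead and exact-match check are illustrative choices, not a recommendation); several of the papers below describe much cheaper selective-duplication schemes.

    import torch

    def matmul_with_sdc_check(a, b):
        """Run the same matrix multiply twice and compare the results.
        A deterministic kernel should produce bit-identical outputs, so a
        mismatch suggests a transient or silent hardware error rather than
        ordinary floating-point rounding. Note: kernels that use atomics or
        other non-deterministic reductions can legitimately differ, so this
        check only applies to deterministic operations."""
        out1 = torch.matmul(a, b)
        out2 = torch.matmul(a, b)
        if not torch.equal(out1, out2):
            raise RuntimeError("Possible silent data corruption: "
                               "repeated GPU computation disagreed")
        return out1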
Research on SDCs
Research papers on SDCs include:
- M. Vishwanathan, R. Shah, K. K. Kim and M. Choi, "Silent Data Corruption (SDC) vulnerability of GPU on various GPGPU workloads," 2015 International SoC Design Conference (ISOCC), Gyeongju, Korea (South), 2015, pp. 11-12, doi: 10.1109/ISOCC.2015.7401681. https://ieeexplore.ieee.org/document/7401681
- Wei, X., Jiang, N., Wang, X., Yue, H. (2021). Detecting SDCs in GPGPUs Through an Efficient Instruction Duplication Mechanism. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science(), vol 12817. Springer, Cham. https://doi.org/10.1007/978-3-030-82153-1_47 https://link.springer.com/chapter/10.1007/978-3-030-82153-1_47
- Anne Meixner, March 12th, 2024, Strategies For Detecting Sources Of Silent Data Corruption, https://semiengineering.com/strategies-for-detecting-sources-of-silent-data-corruption/
- K. S. Yim, C. Pham, M. Saleheen, Z. Kalbarczyk and R. Iyer, "Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU," 2011 IEEE International Parallel & Distributed Processing Symposium, Anchorage, AK, USA, 2011, pp. 287-300, doi: 10.1109/IPDPS.2011.36. https://ieeexplore.ieee.org/document/6012845
- Jyotika Athavale, Randy Fish, Jul 24, 2024, Examining Silent Data Corruption: A Lurking, Persistent Problem in Computing, https://www.synopsys.com/blogs/chip-design/what-is-silent-data-corruption-sdc.html
- AR Anwer, G Li, K Pattabiraman, M Sullivan, T Tsai, SKS Hari, 2020, GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs, SC20: International Conference for High Performance Computing, https://research.nvidia.com/sites/default/files/pubs/2020-10_GPU-Trident%3A-Efficient-Modeling//SC_2020_GPU_Trident.pdf
- Y. Huang, S. Guo, S. Di, G. Li and F. Cappello, "Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs," SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2022, pp. 1-14, doi: 10.1109/SC41404.2022.00022. https://ieeexplore.ieee.org/abstract/document/10046091 https://hyfshishen.github.io/publications/SC22-paper.pdf
- M. H. Rahman, S. Di, S. Guo, X. Lu, G. Li and F. Cappello, "Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications," 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, USA, 2024, pp. 582-594, doi: 10.1109/IPDPS57955.2024.00058. https://ieeexplore.ieee.org/abstract/document/10579167
- X. Wei et al., "ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 4, pp. 1051-1064, April 2024, doi: 10.1109/TCAD.2023.3330821. https://ieeexplore.ieee.org/abstract/document/10312777
- Y Huang, S Di, Z Zhang, X Lu, G Li, 2024, Versatile Datapath Soft Error Detection on the Cheap for HPC Applications, https://www.computer.org/csdl/proceedings-article/sc/2024/529100a870/21HUW3yatUc (Using static analysis and code transformations to detect soft errors.)
- Öz, I., Karadaş, Ö.F. Regional soft error vulnerability and error propagation analysis for GPGPU applications. J Supercomput 78, 4095–4130 (2022). https://doi.org/10.1007/s11227-021-04026-6 https://link.springer.com/article/10.1007/s11227-021-04026-6
- Hengshan Yue, Xiaohui Wei, Guangli Li, Jianpeng Zhao, Nan Jiang, and Jingweijia Tan. 2021. G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 54, 1–15. https://doi.org/10.1145/3458817.3476170 https://dl.acm.org/doi/abs/10.1145/3458817.3476170
- Z. He, H. Xu and G. Li, "A Fast Low-Level Error Detection Technique," 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Brisbane, Australia, 2024, pp. 90-98, doi: 10.1109/DSN58291.2024.00023. https://ieeexplore.ieee.org/abstract/document/10646930 https://dsn2024uq.github.io/Proceedings/pdfs/DSN2024-6rvE3SSpzFYmysif75Dkid/410500a090/410500a090.pdf
- Z. Li et al., "A Visual Comparison of Silent Error Propagation," in IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 7, pp. 3268-3282, July 2024, doi: 10.1109/TVCG.2022.3230636. https://ieeexplore.ieee.org/abstract/document/9993758
- X Wei, Y Wu, N Jiang, H Yue, 2023, Detecting SDCs in GPGPUs Through Efficient Partial Thread Redundancy, https://link.springer.com/chapter/10.1007/978-981-97-0862-8_14
- Siva Kumar Sastry Hari, Paolo Rech, Timothy Tsai, Mark Stephenson, Arslan Zulfiqar, Michael Sullivan, Philip Shirvani, Paul Racunas, Joel Emer, Stephen W. Keckler, 28 Apr 2020, Estimating Silent Data Corruption Rates Using a Two-Level Model, https://arxiv.org/abs/2005.01445
- Abdul Rehman Anwer, Guanpeng Li, Karthik Pattabiraman, Siva Hari, Michael B. Sullivan, Timothy Tsai, March 27, 2019, Towards analytically evaluating the error resilience of GPU Programs, https://d1qx31qr3h6wln.cloudfront.net/publications/SELSE2019_GPUTrident.pdf
- Bautista Gomez Leonardo, Balaprakash Prasanna, Benoit Anne, Cappello Franck, Robert Yves, Unsal Osman, Di Sheng, Hori Atsushi, Gerofi Balazs, Snir Marc, Nov 2024, New Techniques to Design Silent Data Corruption Detectors, https://jlesc.github.io/projects/sdc_detection/
- Alireza Tajary, Hamid R. Zarandi, and Nader Bagherzadeh. 2020. IRHT: An SDC detection and recovery architecture based on value locality of instruction binary codes. Microprocess. Microsyst. 77, C (Sep 2020). https://doi.org/10.1016/j.micpro.2020.103159 https://dl.acm.org/doi/10.1016/j.micpro.2020.103159
- Ahmad H., Sedaghat Y., 2024, An automated framework for selectively tolerating SDC errors based on rigorous instruction-level vulnerability assessment, Future Generation Computer Systems, vol. 157, pp. 392-407, 18 Jul 2024, https://doi.org/10.1016/j.future.2024.04.006 https://dl.acm.org/doi/10.1016/j.future.2024.04.006
- Qining Lu, Guanpeng Li, Karthik Pattabiraman, Meeta S. Gupta, and Jude A. Rivers. 2017. Configurable Detection of SDC-causing Errors in Programs. ACM Trans. Embed. Comput. Syst. 16, 3, Article 88 (August 2017), 25 pages. https://doi.org/10.1145/3014586 https://dl.acm.org/doi/10.1145/3014586
- Fang, W., Gu, J., Yan, Z., Wang, Q. (2021). SDC Error Detection by Exploring the Importance of Instruction Features. In: Liu, Z., Wu, F., Das, S.K. (eds) Wireless Algorithms, Systems, and Applications. WASA 2021. Lecture Notes in Computer Science(), vol 12937. Springer, Cham. https://doi.org/10.1007/978-3-030-85928-2_28 https://link.springer.com/chapter/10.1007/978-3-030-85928-2_28
- Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar, 22 Feb 2021, Silent Data Corruptions at Scale, Facebook Research, https://arxiv.org/abs/2102.11245
GPU Overheating
GPU overheating is where a GPU becomes too hot, leading to a failure or an incorrect computation. Overheating is more common under heavy loads and with aged GPUs (near their end-of-life) or brand new GPUs (early-life failures). Overheating can cause a GPU to fail either catastrophically, or partially with an incorrect computation in one or more tiles. Such failures may arise suddenly from very high temperatures or develop gradually over time due to GPU aging.
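As a basic monitoring example, the sketch below polls GPU temperatures by shelling out to nvidia-smi. The query flags shown are the standard ones but may vary by driver version, and the 85-degree alert threshold is an arbitrary illustrative choice; real limits depend on the GPU model and the datacenter's thermal policy.

    import subprocess

    def gpu_temperatures():
        """Return the current temperature (degrees C) of each visible NVIDIA GPU,
        read via the nvidia-smi command-line tool (assumed to be on PATH)."""
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
        return [int(line) for line in result.stdout.splitlines() if line.strip()]

    # Illustrative alert threshold only; consult the GPU's rated limits.
    if any(temp > 85 for temp in gpu_temperatures()):
        print("Warning: at least one GPU is running hot")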
Research papers on overheating:
- D. Defour and E. Petit, "GPUburn: A system to test and mitigate GPU hardware failures," 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), Agios Konstantinos, Greece, 2013, pp. 263-270, doi: 10.1109/SAMOS.2013.6621133. https://hal.science/hal-00827588/document
- M. Platini, T. Ropars, B. Pelletier and N. De Palma, "CPU Overheating Characterization in HPC Systems: A Case Study," 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, 2018, pp. 59-68, doi: 10.1109/FTXS.2018.00010. https://ieeexplore.ieee.org/abstract/document/8564488
- Shan Abdul, Nov 8, 2021, GPU Overheating: Causes, Symptoms & How to Cool It Down, https://www.makeuseof.com/gpu-overheating-causes-symptoms/
- Jon Perez-Cerrolaza, Jaume Abella, Leonidas Kosmidis, Alejandro J. Calderon, Francisco Cazorla, and Jose Luis Flores. 2022. GPU Devices for Safety-Critical Systems: A Survey. ACM Comput. Surv. 55, 7, Article 147 (July 2023), 37 pages. https://doi.org/10.1145/3549526 https://dl.acm.org/doi/abs/10.1145/3549526 https://upcommons.upc.edu/bitstream/handle/2117/386460/Perez%20et%20al.pdf?sequence=5
- Marco Ottavi, Dimitris Gizopoulos, Salvatore Pontarelli, 2018, Dependable Multicore Architectures at Nanoscale, https://link.springer.com/book/10.1007/978-3-319-54422-9
- NVIDIA, Nov 2024, NVIDIA Validation Suite User Guide, https://docs.nvidia.com/deploy/nvvs-user-guide/index.html https://docs.nvidia.com/deploy/pdf/NVIDIA_Validation_Suite_User_Guide.pdf
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 3 Apr 2024 (v2), Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648
- wilicc, 2024, gpu-burn: Multi-GPU CUDA stress test, http://wili.cc/blog/gpu-burn.html https://github.com/wilicc/gpu-burn
Transient Soft Errors
Transient soft errors are hardware errors that don't trigger an exception and do not recur. They are also known simply as "soft errors," because the failure is not permanent. There is some overlap with Silent Data Corruption (SDC), since a transient soft error is also silent. However, soft errors are not caused only by manufacturing flaws in the silicon; they can also occur intermittently due to the effects of atmospheric radiation.
Bizarrely, the nanometer scale of silicon circuitry is so tiny that an individual transistor can be affected by a single particle (e.g., a neutron), and these arrive spontaneously from cosmic rays in the wild. The effect is harmless in many cases, but occasionally it directly causes a "bit flip" in one of the circuits. Various physical shielding techniques can reduce these problems, but not eliminate them completely.
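One software-level way to notice a spontaneous bit flip in data that should never change (frozen weights, checkpoint shards, replicated buffers) is to fingerprint the data and re-check the fingerprint later. The sketch below is a minimal illustration of that idea, not a production technique; hashing requires copying the tensor to the CPU, so it would only be applied occasionally and only to immutable data.

    import hashlib
    import torch

    def tensor_fingerprint(t):
        """Hash a tensor's raw bytes so that a later re-hash can reveal a
        spontaneous bit flip in data that is supposed to be immutable."""
        return hashlib.sha256(t.detach().cpu().numpy().tobytes()).hexdigest()

    weights = torch.randn(1024, 1024)   # stand-in for a frozen weight tensor
    baseline = tensor_fingerprint(weights)
    # ... much later, for data that should not have changed ...
    if tensor_fingerprint(weights) != baseline:
        print("Bit flip detected in supposedly unchanged tensor")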
Research papers on GPU soft errors include:
- Y Huang, S Di, Z Zhang, X Lu, G Li, 2024, Versatile Datapath Soft Error Detection on the Cheap for HPC Applications, https://www.computer.org/csdl/proceedings-article/sc/2024/529100a870/21HUW3yatUc (Using static analysis and code transformations to detect soft errors.)
- Öz, I., Karadaş, Ö.F. Regional soft error vulnerability and error propagation analysis for GPGPU applications. J Supercomput 78, 4095–4130 (2022). https://doi.org/10.1007/s11227-021-04026-6 https://link.springer.com/article/10.1007/s11227-021-04026-6
- Hengshan Yue, Xiaohui Wei, Guangli Li, Jianpeng Zhao, Nan Jiang, and Jingweijia Tan. 2021. G-SEPM: building an accurate and efficient soft error prediction model for GPGPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 54, 1–15. https://doi.org/10.1145/3458817.3476170 https://dl.acm.org/doi/abs/10.1145/3458817.3476170
- Z. He, H. Xu and G. Li, "A Fast Low-Level Error Detection Technique," 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Brisbane, Australia, 2024, pp. 90-98, doi: 10.1109/DSN58291.2024.00023. https://ieeexplore.ieee.org/abstract/document/10646930 https://dsn2024uq.github.io/Proceedings/pdfs/DSN2024-6rvE3SSpzFYmysif75Dkid/410500a090/410500a090.pdf
- M. B. Sullivan et al., "Characterizing and Mitigating Soft Errors in GPU DRAM," in IEEE Micro, vol. 42, no. 4, pp. 69-77, 1 July-Aug. 2022, doi: 10.1109/MM.2022.3163122. https://ieeexplore.ieee.org/document/9744333
- Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar, 22 Feb 2021, Silent Data Corruptions at Scale, Facebook Research, https://arxiv.org/abs/2102.11245
High Network Latency
High network latency means slow transmission of data between nodes during LLM training in a multi-GPU data center. The speed of the network is critical to both the performance and the resiliency of an AI training job. Network load fluctuates during a typical training workload, with a burst of traffic as computation segments are farmed out to workers, followed by a lull during the large-scale parallel computation, and then another burst as results are returned from the leaf nodes back to the center.
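A simple way to baseline inter-GPU communication performance is to time a collective operation directly. The sketch below times an all-reduce with torch.distributed, assuming the process group has already been initialized (e.g., via torchrun) and a CUDA device is bound to each rank; the buffer size and iteration count are illustrative. NVIDIA's nvbandwidth and DCGM tools (listed below) provide more thorough measurements.

    import time
    import torch
    import torch.distributed as dist

    def time_allreduce(num_elements=25_000_000, iters=10):
        """Return the average wall-clock time (seconds) of an all-reduce over
        a ~100 MB fp32 buffer across all workers in the current process group.
        Assumes dist.init_process_group() has already been called and that
        this rank has a CUDA device selected."""
        buf = torch.ones(num_elements, device="cuda")
        torch.cuda.synchronize()
        start = time.monotonic()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        return (time.monotonic() - start) / iters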
Research papers on AI network optimizations include:
- Ari Lotter, Jeffrey Quesnelle, Umer H. Adil, Dillon Rolnick, Esteban La Rocca, A Preliminary Report on Distro, 2024, https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf https://venturebeat.com/wp-content/uploads/2024/08/A_Preliminary_Report_on_DisTrO.pdf (Reducing the inter-GPU networking bandwidth cost during training.)
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
- Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
- Nir Barazida, Mar 9, 2022, Distributed training of deep learning models: handling stragglers and latency in synchronous training A review of the challenges in Synchronous distributed training and best solutions for stragglers and high latency https://towardsdatascience.com/stragglers-and-latency-in-synchronous-distributed-training-of-deep-learning-models-43783b0266d9
- Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Rafal Jozefowicz, 21 Mar 2017 (v3), Revisiting Distributed Synchronous SGD, https://arxiv.org/abs/1604.00981
- Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India), 16 Nov 2024, Improving training time and GPU utilization in geo-distributed language model training, https://arxiv.org/abs/2411.14458
- Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng Li, 24 Nov 2024, Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution, https://arxiv.org/abs/2411.15871
- Greg Gutmann, Sep 2020, Peer-to-peer Memory Copy with NVLink: CUDA Feature Testing, https://codingbyexample.com/2020/09/14/p2p-memcpy-with-nvlink/
- Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram, 26 Nov 2024, Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation, https://arxiv.org/abs/2411.17089 (Overlapping/optimizing CPU-GPU network bandwidth for KV cache with some recomputation.)
- Bowen Peng, Jeffrey Quesnelle, Diederik P. Kingma, 29 Nov 2024, DeMo: Decoupled Momentum Optimization, https://arxiv.org/abs/2411.19870 https://github.com/bloc97/DeMo (Extension to ADAM optimizer that greatly reduces network communication in training.)
- Leigh Engel and Anthony Larijani, Dec 11, 2024, Deploying NVIDIA H200 NVL at Scale with New Enterprise Reference Architecture, https://developer.nvidia.com/blog/deploying-nvidia-h200-nvl-at-scale-with-new-enterprise-reference-architecture/
- Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen, 17 Dec 2024, A System for Microserving of LLMs, https://arxiv.org/abs/2412.12488 (Disaggregated prefill and decoding combined with context cache migration for sending the KV cache over the network.)
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
- NVIDIA, 2024, nvbandwidth: A tool for bandwidth measurements on NVIDIA GPUs. https://github.com/NVIDIA/nvbandwidth
- NVIDIA, 2024, DCGM Diagnostics, https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- Nandini Lokesh Reddy, Jan 2025, DeepSeek: Bridging Performance and Efficiency in Modern AI, https://medium.com/@nandinilreddy/deepseek-bridging-performance-and-efficiency-in-modern-ai-106181a85693
Silent Floating-Point Computation Errors
Floating-point exceptions are often silent in GPU kernels, so GPU software needs to take extra care. Whereas a CPU computation might trigger SIGFPE, the GPU is likely to quietly continue. This can lead to insidiously incorrect results, or to special erroneous values such as NaN (not-a-number) and Inf (infinity, either positive or negative).
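Because nothing is raised automatically, a common defensive habit is to explicitly check tensors for non-finite values at key points in the pipeline (loss values, logits, gradients). Below is a minimal sketch of such a check in PyTorch; frameworks also offer heavier tools such as torch.autograd anomaly detection, and the dedicated checkers surveyed below go much further.

    import torch

    def check_finite(name, t):
        """Raise an error if a tensor produced by a GPU kernel contains NaN or
        Inf values, since the GPU itself will not signal the problem."""
        if torch.isnan(t).any() or torch.isinf(t).any():
            raise FloatingPointError(f"Non-finite values detected in {name}")
        return t

    x = torch.tensor([1.0, float("inf"), 3.0])
    check_finite("x", x)   # raises FloatingPointError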
Research papers on floating point errors:
- GPU-NBDetect Oct 2024 (accessed), Comprehensive-Study-on-GPU-Program-Numerical-Issues, https://github.com/GPU-Program-Bug-Study/Comprehensive-Study-on-GPU-Program-Numerical-Issues.github.io/tree/main/GPU-NBDetect
- FP Checker, Jul 19, 2021, Floating-point Exceptions and GPU Applications, https://fpchecker.org/2021-07-12-exceptions.html
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna Feb 4, 2020, Improving Reliability Through Analyzing and Debugging Floating-Point Software, 2020 ECP Annual Meeting, https://fpanalysistools.org/slides/ignacio_laguna_ECP_2020.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Floris Gorter, Enrico Barberis, Raphael Isemann, Erik van der Kouwe, Cristiano Giuffrida, Herbert Bos, November 1, 2023, FloatZone: How Floating Point Additions can Detect Memory Errors, https://download.vusec.net/papers/floatzone_sec23.pdf https://github.com/vusec/floatzone
Floating-Point Runtime Error Checkers
Since floating-point errors are often silent in GPUs, it is advantageous to use runtime tools that can detect them. There are a variety of such tools under development in research, but there's not yet a mainstream tool that is widely used.
Research papers on tools that detect floating-point errors and exceptions at runtime:
- Lawrence Livermore National Laboratory, Oct 2024, FP Checker: dynamic analysis tool to detect floating-point errors in HPC applications, https://fpchecker.org/index.html https://github.com/LLNL/FPChecker
- Laguna, Ignacio. "FPChecker: Detecting Floating-point Exceptions in GPU Applications." In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1126-1129. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8952258 https://www.osti.gov/servlets/purl/1574625
- Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Ignacio Laguna, Harshitha Menon, Tristan Vanderbruggen, Cindy Rubio González, 2020, FPChecker Detecting Floating-Point Exceptions in GPUs, https://fpanalysistools.org/pearc19/slides/Module-FPChecker.pdf
- Ignacio Laguna, Xinyi Li, and Ganesh Gopalakrishnan. 2022. BinFPE: accurate floating-point exception detection for GPU applications. In Proceedings of the 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2022). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3520313.3534655 https://dl.acm.org/doi/10.1145/3520313.3534655 https://dl.acm.org/doi/pdf/10.1145/3520313.3534655 https://github.com/LLNL/BinFPE
- Xinyi Li, Ignacio Laguna, Katarzyna Swirydowicz, Bo Fang, Ang Li, and Ganesh Gopalakrishnan. Design and evaluation of GPU-FPX: A low-overhead tool for floating-point exception detection in NVIDIA GPUs. In ACM HPDC 2023, 2023. doi:10.11578/dc.20230713.4. https://dl.acm.org/doi/pdf/10.1145/3588195.3592991
- Peter Dinda, Alex Bernat, and Conor Hetland. Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications. In HPDC, pages 5–16. ACM, 2020. doi:10.1145/3369583.3392673. http://pdinda.org/Papers/hpdc20.pdf
Checkpointing
Checkpointing is a resilience technique that stores a copy of the current application state, called a "checkpoint." In AI training, this is effectively a backup of the calculated weights up to the current point of training. A checkpoint can be used as a restart point when a failure is detected, or as a way to pause a training job temporarily.
Checkpoints can be used in LLM training to achieve several different aims:
- Backup of the training state for fast recovery from training failures.
- Pausing and later resuming a training procedure.
- Comparing models across different parts of the training sequence.
Checkpointing is most commonly used as a reliability improvement for LLM training. If a failure occurs, the training application can re-load the checkpoint data and restart from that point, rather than starting from scratch. Hence, the idea with LLM training is to store the computed parameter values at regular checkpoints, offloaded to CPU memory rather than using up precious GPU VRAM. Progress on training up to that point is thereby kept, and won't be lost even after a serious failure. It's kind of like the Microsoft Word "Autosave" feature, if you turn your head and squint sideways.
Given the size of LLMs, and the need to store all parameters during training, the amount of data is large. This can cause bottlenecks due to:
(a) network bandwidth, and
(b) write storage latency.
Checkpoints need to be taken at short intervals, so as not to lose much work in a rollback scenario, but more frequent checkpoints increase the overall cost of using checkpointing for failure recovery. The delay in training while awaiting storage of a checkpoint is sometimes called a "checkpoint stall."
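For reference, the simplest form of checkpointing is a synchronous save of the model and optimizer state to disk, as in the PyTorch sketch below (the dictionary layout is just one common convention, not a required format). The optimizations listed next all aim to reduce the stall that this blocking write causes.

    import torch

    def save_checkpoint(path, model, optimizer, step):
        """Write a basic synchronous training checkpoint to disk; training is
        stalled until the write completes."""
        torch.save({
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }, path)

    def load_checkpoint(path, model, optimizer):
        """Restore model and optimizer state, returning the step to resume from."""
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]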
To address the inefficiencies inherent to checkpointing a large LLM, various optimizations to checkpointing have been developed:
- Asynchronous checkpointing
- Incremental checkpointing
- Quantized checkpointing
- Distributed checkpointing
- In-memory checkpointing
- Lazy checkpointing
- Checkpointing network optimizations (e.g., overlapping or interleaving checkpoint network traffic with training traffic).
- Checkpoint compression (smaller sizes)
Research papers on checkpointing for AI training workloads:
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 22 Apr 2016 (v2), Training Deep Nets with Sublinear Memory Cost, https://arxiv.org/abs/1604.06174
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 3 Apr 2024 (v2), Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648
- Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu, 2021, Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery Proceedings of Machine Learning and Systems 3 (MLSys 2021), https://proceedings.mlsys.org/paper_files/paper/2021/hash/f0f9e98bc2e2f0abc3e315eaa0d808fc-Abstract.html
- Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, Murali Annavaram, 4 May 2021 (v2), Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models, https://arxiv.org/abs/2010.08679 (Uses incremental checkpointing of only changed parameters, and also quantizes the stored checkpoint.)
- Jayashree Mohan, Amar Phanishayee, Vijay Chidambaram, 2021, CheckFreq: Frequent, Fine-Grained DNN Checkpointing, https://www.usenix.org/conference/fast21/presentation/mohan https://www.mcs.anl.gov/~wozniak/papers/DeepFreeze_2020.pdf
- B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror and F. Cappello, "VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale," 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 2019, pp. 911-920, doi: 10.1109/IPDPS.2019.00099. https://ieeexplore.ieee.org/document/8821049 https://hal.science/hal-02184203/document
- F. Shahzad, M. Wittmann, T. Zeiser, G. Hager and G. Wellein, "An Evaluation of Different I/O Techniques for Checkpoint/Restart," 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, Cambridge, MA, USA, 2013, pp. 1708-1716, doi: 10.1109/IPDPSW.2013.145. https://ieeexplore.ieee.org/document/6651069 https://pdfs.semanticscholar.org/f980/3801e3c3ebd7d6be74874f2e4dde71e0c5fb.pdf
- Basma Abdel Azeem, Manal Helal, 29 Nov 2023, Performance Evaluation of Checkpoint/Restart Techniques, https://arxiv.org/abs/2311.17545
- D. Tiwari, S. Gupta and S. S. Vazhkudai, "Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems," 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Atlanta, GA, USA, 2014, pp. 25-36, doi: 10.1109/DSN.2014.101. https://ieeexplore.ieee.org/document/6903564
- TensorFlow, Nov 2024 (accessed), tf.train.CheckpointManager, https://www.tensorflow.org/api_docs/python/tf/train/CheckpointManager
- Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 364–381. https://doi.org/10.1145/3600006.3613145 https://dl.acm.org/doi/10.1145/3600006.3613145 https://www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf (First paper on in-memory checkpointing to CPU memory, and also covers interleaving of checkpointing network traffic with training traffic.)
- Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu, 19 Aug 2024 (v4), Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing, https://arxiv.org/abs/2310.12670
- Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He, 19 Jun 2024, FastPersist: Accelerating Model Checkpointing in Deep Learning, https://arxiv.org/abs/2406.13768
- Y. Li, T. Wu, G. Li, Y. Song and S. Yin, "Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy," 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), Jersey City, NJ, USA, 2024, pp. 59-70, doi: 10.1109/ICDCS60910.2024.00015. https://ieeexplore.ieee.org/abstract/document/10630969 (Asynchronous checkpointing using RDMA for network optimization.)
- Tanmaey Gupta, Sanjeev Krishnan, Rituraj Kumar, Abhishek Vijeev, Bhargav Gulavani, Nipun Kwatra, Ramachandran Ramjee, and Muthian Sivathanu. 2024. Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures. In Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys '24). Association for Computing Machinery, New York, NY, USA, 1110–1125. https://doi.org/10.1145/3627703.3650085 https://dl.acm.org/doi/abs/10.1145/3627703.3650085
- Jia, J., Liu, Y., Liu, Y., Chen, Y., Lin, F. (2024). AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_24 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_24
- Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang, 16 Oct 2024, FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training, https://arxiv.org/abs/2410.12588
- Giskard, Nov 2024, Machine Learning Checkpointing, https://www.giskard.ai/glossary/machine-learning-checkpointing
- Zhuang Wang, Zhen Jia, October 25, 2023, More-efficient recovery from failures during large-ML-model training. Novel “checkpointing” scheme that uses CPU memory reduces the time wasted on failure recovery by more than 92%. https://www.amazon.science/blog/more-efficient-recovery-from-failures-during-large-ml-model-training
- Isabella Richard, Oct 4, 2024, Understanding the Impact of Checkpoints on AI Efficiency, https://www.ddn.com/blog/understanding-the-impact-of-checkpoints-on-ai-efficiency/
- DDN, Sep 10, 2024, LLM Checkpointing Efficiency is a Critical Blocker to AI Productivity, https://www.ddn.com/resources/whitepapers/checkpointing-efficiency-is-a-critical-blocker-to-ai-productivity/
- Eyal Zakkay, Jan 3, 2019, Advanced Keras — Accurately Resuming a Training Process, https://towardsdatascience.com/resuming-a-training-process-with-keras-3e93152ee11a
- W Xu, X Huang, S Meng, W Zhang, L Guo, K Sato, Nov 2024, An Efficient Checkpointing System for Large Machine Learning Model Training, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a896/555400a896.pdf
- Simon Karasik, Mar 12, 2024, Tips and tricks for performing large model checkpointing, https://medium.com/nebius/tips-and-tricks-for-performing-large-model-checkpointing-3ea4a73c7ee9
- Weka, 2024, Checkpointing for Resiliency and Performance in AI Pipelines, https://www.weka.io/wp-content/uploads/files/resources/2024/05/checkpointing-resiliency-performance-ai-pipelines.pdf
- Admin Staff, June 13, 2023, Understanding AI Model Checkpoints: A Simplified Guide, https://nightcafe.studio/blogs/info/understanding-ai-model-checkpoints-a-simplified-guide
- Nirmal Raj Saxena, Saurabh Hukerikar, Mikolaj Blaz, Swapna Raj, 15 Oct 2024, Optimal Checkpoint Interval with Availability as an Objective Function, https://arxiv.org/abs/2410.18124
- Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque, 3 Sep 2024, Checkpoint and Restart: An Energy Consumption Characterization in Clusters, https://arxiv.org/abs/2409.02214
- Xiang Fu (Nanchang Hangkong University), Weiping Zhang (Nanchang Hangkong University), Xin Huang (Nanchang Hangkong University), Wubiao Xu (Nanchang Hangkong University), Shiman Meng (Nanchang Hangkong University), Luanzheng Guo (Pacific Northwest National Laboratory), Kento Sato (R-CCS, RIKEN), 5 Nov 2024 (v3), AutoCheck: Automatically Identifying Variables for Checkpointing by Data Dependency Analysis, https://arxiv.org/abs/2408.06082
- Yao Xu, Gene Cooperman, 5 Aug 2024, Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach, https://arxiv.org/abs/2408.02218
- Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu, 10 Oct 2024 (v2), ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development, https://arxiv.org/abs/2407.20143
- Madan Timalsina, Lisa Gerhardt, Nicholas Tyler, Johannes P. Blaschke, William Arndt, 26 Jul 2024, Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC, https://arxiv.org/abs/2407.19117
- Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang, 28 Jun 2024 (v2), Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training, https://arxiv.org/abs/2406.18820
- Nikhil Mehta, Jonathan Lorraine, Steve Masson, Ramanathan Arunachalam, Zaid Pervaiz Bhat, James Lucas, Arun George Zachariah, 26 Jun 2024, Improving Hyperparameter Optimization with Checkpointed Model Weights, https://arxiv.org/abs/2406.18630 https://github.com/NVlabs/forecasting-model-search
- Wenshuo Li, Xinghao Chen, Han Shu, Yehui Tang, Yunhe Wang, 17 Jun 2024, ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking, https://arxiv.org/abs/2406.11257 https://github.com/Gaffey/ExCP
- Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae, 15 Jun 2024, DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models, https://arxiv.org/abs/2406.10707
- Zhuobin Huang, Xingda Wei, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen, 20 May 2024, PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation, https://arxiv.org/abs/2405.12079
- S. Wang, Q. Cao, K. Zhou, J. Xu, Z. Guo and J. Guo, "ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 183-190, doi: 10.1109/ICCD63220.2024.00036. https://ieeexplore.ieee.org/abstract/document/10818161/ (Generalizing in-memory checkpoints by storing data in shards across multiple storage areas including CPU memory and SSDs.)
- Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li, 21 Jan 2025, A Survey on Memory-Efficient Large-Scale Model Training in AI for Science, https://arxiv.org/abs/2501.11847
- Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno, 23 Feb 2025, CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads, https://arxiv.org/abs/2502.16631
Note that some checkpointing/offloading algorithms are aimed more at speed optimization than at making a checkpoint/backup for resiliency purposes. One speed optimization for LLM training is to offload some of the model parameters out of GPU memory into CPU memory. These model weights are later re-loaded, or reconstructed via recomputation/re-materialization, when they are needed in further computations. The idea of this type of checkpointing is more memory-efficient training.
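To make the distinction concrete, the sketch below shows the bare mechanism of parameter offloading in PyTorch: a weight tensor is staged in pinned CPU memory while unused and copied back to the GPU before it is needed. This is an illustrative fragment only; real systems overlap these transfers with computation on separate CUDA streams and manage many blocks at once.

    import torch

    def offload_to_cpu(t):
        """Copy a GPU tensor into pinned host memory so its GPU memory can be freed."""
        cpu_copy = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        cpu_copy.copy_(t, non_blocking=True)
        return cpu_copy

    def reload_to_gpu(cpu_t):
        """Copy an offloaded tensor back to the GPU before it is needed again."""
        return cpu_t.to("cuda", non_blocking=True)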
In-Memory Checkpointing
In-memory checkpointing is an AI training optimization whereby a "checkpoint" (backup) of the current state is stored in memory. This is more efficient than on-disk checkpointing, because it avoids the delay of writing a large amount of data to disk or SSD. Using memory to store checkpoints allows each checkpoint to complete faster, and permits more frequent checkpointing. When a failure is detected, the system can recover from an in-memory checkpoint more quickly than by re-loading checkpoint data from disk.
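A minimal sketch of the core idea in PyTorch is shown below: the training state is cloned into host RAM rather than written to a file. This keeps only a single local snapshot per process; systems such as GEMINI (below) additionally replicate snapshots across machines so that a failed node's state survives.

    import copy
    import torch

    def snapshot_to_cpu_memory(model, optimizer):
        """Take an in-memory checkpoint: clone all training state into CPU RAM.
        Recovery simply copies this state back, avoiding any disk I/O."""
        return {
            "model": {k: v.detach().to("cpu", copy=True)
                      for k, v in model.state_dict().items()},
            "optimizer": copy.deepcopy(optimizer.state_dict()),
        }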
Research papers on in-memory checkpointing to CPU memory, the current SOTA, include:
- Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). Association for Computing Machinery, New York, NY, USA, 364–381. https://doi.org/10.1145/3600006.3613145 https://dl.acm.org/doi/10.1145/3600006.3613145 https://www.cs.rice.edu/~eugeneng/papers/SOSP23.pdf (First paper on in-memory checkpointing to CPU memory, and also covers interleaving of checkpointing network traffic with training traffic.)
- Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu, 19 Aug 2024 (v4), Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing, https://arxiv.org/abs/2310.12670
- Zhuang Wang, Zhen Jia, October 25, 2023, More-efficient recovery from failures during large-ML-model training. Novel “checkpointing” scheme that uses CPU memory reduces the time wasted on failure recovery by more than 92%. https://www.amazon.science/blog/more-efficient-recovery-from-failures-during-large-ml-model-training
- S. Wang, Q. Cao, K. Zhou, J. Xu, Z. Guo and J. Guo, "ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 183-190, doi: 10.1109/ICCD63220.2024.00036. https://ieeexplore.ieee.org/abstract/document/10818161/ (Generalizing in-memory checkpoints by storing data in shards across multiple storage areas including CPU memory and SSDs.)
Asynchronous Checkpointing
Asynchronous checkpointing is where the LLM training job requests a checkpoint to be stored, but does not wait for the storage of the checkpoint to complete. It is often used with in-memory checkpointing, but can be combined with any checkpointing method. The async checkpointing algorithm must ensure that weight updates occurring after the checkpoint request, while the checkpoint is still being stored, are not erroneously included in the checkpoint.
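The sketch below illustrates the basic pattern in Python: the training state is snapshotted first (while the step is paused), and the slow serialization and disk write are handed to a background thread so the training loop can continue. It is a simplified illustration only; a production system would also limit the number of in-flight writes and use atomic rename/fsync for crash consistency.

    import copy
    import threading
    import torch

    def checkpoint_async(model, optimizer, step, path):
        """Snapshot the training state, then write it to disk on a background
        thread. Copying the state before returning is what keeps later weight
        updates out of the saved checkpoint."""
        snapshot = {
            "step": step,
            "model": {k: v.detach().cpu().clone()
                      for k, v in model.state_dict().items()},
            "optimizer": copy.deepcopy(optimizer.state_dict()),
        }
        writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
        writer.start()
        return writer   # caller can join() before taking the next checkpoint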
Research papers on asynchronous checkpointing, which is now a standard technique:
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 3 Apr 2024 (v2), Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648
- Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu, 2021, Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery Proceedings of Machine Learning and Systems 3 (MLSys 2021), https://proceedings.mlsys.org/paper_files/paper/2021/hash/f0f9e98bc2e2f0abc3e315eaa0d808fc-Abstract.html
- Jayashree Mohan, Amar Phanishayee, Vijay Chidambaram, 2021, CheckFreq: Frequent, Fine-Grained DNN Checkpointing, https://www.usenix.org/conference/fast21/presentation/mohan https://www.mcs.anl.gov/~wozniak/papers/DeepFreeze_2020.pdf
- B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror and F. Cappello, "VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale," 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 2019, pp. 911-920, doi: 10.1109/IPDPS.2019.00099. https://ieeexplore.ieee.org/document/8821049 https://hal.science/hal-02184203/document
- F. Shahzad, M. Wittmann, T. Zeiser, G. Hager and G. Wellein, "An Evaluation of Different I/O Techniques for Checkpoint/Restart," 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, Cambridge, MA, USA, 2013, pp. 1708-1716, doi: 10.1109/IPDPSW.2013.145. https://ieeexplore.ieee.org/document/6651069 https://pdfs.semanticscholar.org/f980/3801e3c3ebd7d6be74874f2e4dde71e0c5fb.pdf
- Basma Abdel Azeem, Manal Helal, 29 Nov 2023, Performance Evaluation of Checkpoint/Restart Techniques, https://arxiv.org/abs/2311.17545
- D. Tiwari, S. Gupta and S. S. Vazhkudai, "Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems," 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Atlanta, GA, USA, 2014, pp. 25-36, doi: 10.1109/DSN.2014.101. https://ieeexplore.ieee.org/document/6903564
- Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu, 19 Aug 2024 (v4), Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing, https://arxiv.org/abs/2310.12670
- Y. Li, T. Wu, G. Li, Y. Song and S. Yin, "Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy," 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), Jersey City, NJ, USA, 2024, pp. 59-70, doi: 10.1109/ICDCS60910.2024.00015. https://ieeexplore.ieee.org/abstract/document/10630969 (Asynchronous checkpointing using RDMA for network optimization.)
- Tanmaey Gupta, Sanjeev Krishnan, Rituraj Kumar, Abhishek Vijeev, Bhargav Gulavani, Nipun Kwatra, Ramachandran Ramjee, and Muthian Sivathanu. 2024. Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures. In Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys '24). Association for Computing Machinery, New York, NY, USA, 1110–1125. https://doi.org/10.1145/3627703.3650085 https://dl.acm.org/doi/abs/10.1145/3627703.3650085
- Jia, J., Liu, Y., Liu, Y., Chen, Y., Lin, F. (2024). AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_24 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_24
- Simon Karasik, Mar 12, 2024, Tips and tricks for performing large model checkpointing, https://medium.com/nebius/tips-and-tricks-for-performing-large-model-checkpointing-3ea4a73c7ee9
- Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae, 15 Jun 2024, DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models, https://arxiv.org/abs/2406.10707
GPU Failures and Reliability
GPU failures are where a GPU performs an incorrect calculation or triggers an exception. A catastrophic GPU failure is where the entire GPU burns out, but less severe failures include single-tile burnouts and transient errors such as Silent Data Corruption (SDC) and other transient soft errors.
Research papers on the issues of GPU errors/failures and overall GPU reliability:
- Marco Ottavi, Dimitris Gizopoulos, Salvatore Pontarelli, 2018, Dependable Multicore Architectures at Nanoscale, https://link.springer.com/book/10.1007/978-3-319-54422-9
- Jon Perez-Cerrolaza, Jaume Abella, Leonidas Kosmidis, Alejandro J. Calderon, Francisco Cazorla, and Jose Luis Flores. 2022. GPU Devices for Safety-Critical Systems: A Survey. ACM Comput. Surv. 55, 7, Article 147 (July 2023), 37 pages. https://doi.org/10.1145/3549526 https://dl.acm.org/doi/abs/10.1145/3549526 https://www.researchgate.net/publication/362138939_GPU_Devices_for_Safety-Critical_Systems_A_Survey
- NVIDIA, Nov 2024, NVIDIA Validation Suite User Guide, https://docs.nvidia.com/deploy/nvvs-user-guide/index.html https://docs.nvidia.com/deploy/pdf/NVIDIA_Validation_Suite_User_Guide.pdf
- D. Tiwari et al., "Understanding GPU errors on large-scale HPC systems and the implications for system design and operation," 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, USA, 2015, pp. 331-342, doi: 10.1109/HPCA.2015.7056044. https://ieeexplore.ieee.org/abstract/document/7056044/ https://www.osti.gov/servlets/purl/1185857
- Yuichi Ozaki, Sousuke Kanamoto, Hiroaki Yamamoto, and Kenichi Kourai. 2019. Detecting System Failures with GPUs and LLVM. In Proceedings of the 10th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '19). Association for Computing Machinery, New York, NY, USA, 47–53. https://doi.org/10.1145/3343737.3343749 https://dl.acm.org/doi/abs/10.1145/3343737.3343749 https://kyutech.repo.nii.ac.jp/record/6433/files/RECN_2019-06.pdf
- Fernando Fernandes Dos Santos, Luigi Carro, Flavio Vella, and Paolo Rech. 2024. Assessing the Impact of Compiler Optimizations on GPUs Reliability. ACM Trans. Archit. Code Optim. 21, 2, Article 26 (June 2024), 22 pages. https://doi.org/10.1145/3638249 https://dl.acm.org/doi/full/10.1145/3638249 https://dl.acm.org/doi/pdf/10.1145/3638249
- AR Anwer, G Li, K Pattabiraman, M Sullivan, T Tsai, SKS Hari, 2020, GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs, SC20: International Conference for High Performance Computing, https://research.nvidia.com/sites/default/files/pubs/2020-10_GPU-Trident%3A-Efficient-Modeling//SC_2020_GPU_Trident.pdf
- Y. Zhang and C. Jung, "Featherweight Soft Error Resilience for GPUs," 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Chicago, IL, USA, 2022, pp. 245-262, doi: 10.1109/MICRO56248.2022.00030. https://ieeexplore.ieee.org/abstract/document/9923801 https://par.nsf.gov/servlets/purl/10380636
- L Yang, G Papadimitriou, D Sartzetakis, 2024, GPU Reliability Assessment: Insights Across the Abstraction Layers, https://ieeexplore.ieee.org/abstract/document/10740838/ https://lishanyang.github.io/CLUSTER24_Yang.pdf
- Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar, 22 Feb 2021, Silent Data Corruptions at Scale, Facebook Research, https://arxiv.org/abs/2102.11245
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 3 Apr 2024 (v2), Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648
Fault Tolerance
Research on fault tolerance in AI systems:
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Robert S. Hanmer, 12 July 2013, Patterns for Fault Tolerant Software (Wiley Software Patterns Series) 1st Edition, Wiley, https://www.amazon.com.au/Patterns-Fault-Tolerant-Software-Wiley-ebook/dp/B00DXK33SK/
- Elena Dubrova, March 2013, Fault-Tolerant Design, Springer, https://www.amazon.com.au/Fault-Tolerant-Design-Elena-Dubrova-ebook/dp/B00C0QKAFW/
- Gerardus Blokdyk, 2018, Software fault tolerance, Second Edition, https://www.amazon.com.au/Software-tolerance-Second-Gerardus-Blokdyk-ebook/dp/B07GVT67W2/
More AI Research
Read more about:
- Training Optimizations
- Inference Optimizations
- Loop Optimizations
- Code Optimizations
- « Research Home