Aussie AI
Debugging AI Models and Frameworks
Last Updated 7 December, 2024
by David Spuler, Ph.D.
I heard a rumor that AI frameworks are just code, and AI models are just data. So this means there must be bugs! And this article is about real, hard-core coding bugs, the nasty kind that sneaks in with all of the performance tuning that's going around, not the higher-level AI problems of safety and accuracy.
The reality is that an AI engine is some of the most difficult code you'll ever see. Parallelized code of any kind (e.g. low-level hardware acceleration, multi-threaded, multi-GPU, etc.) multiplies this complexity by another order of magnitude. Hence, the basics of high-quality coding practices matter more than ever (see the sketch after this list), such as:
- Unit tests
- Assertions and self-testing code
- Debug tracing code
- Automated system tests (regression testing)
- Error handling (e.g. starting with checking error return codes)
- Exception handling (wrapping code in a full exception handling stack)
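As a concrete illustration of assertions, self-testing code, and debug tracing, here is a minimal C++ sketch (the MY_ASSERT and MY_TRACE macro names are illustrative, not from any particular library):

#include <cstdio>

// Assertion macro: reports failures with file/line details, but doesn't
// abort, so a production engine can keep running after a failed self-test.
#define MY_ASSERT(cond) \
    do { \
        if (!(cond)) { \
            fprintf(stderr, "ASSERT FAILED: %s, %s:%d\n", \
                #cond, __FILE__, __LINE__); \
        } \
    } while (0)

// Debug tracing macro, compiled away unless MY_DEBUG is defined.
#if MY_DEBUG
#define MY_TRACE(...) fprintf(stderr, __VA_ARGS__)
#else
#define MY_TRACE(...) ((void)0)
#endif

// Example: self-testing code in a small numeric routine.
float safe_divide(float x, float y)
{
    MY_ASSERT(y != 0.0f);                      // Self-test the precondition
    MY_TRACE("safe_divide(%g, %g)\n", x, y);   // Debug tracing
    if (y == 0.0f) return 0.0f;                // Error-handling fallback
    return x / y;
}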
All of these techniques involve a significant chunk of extra coding work. Theory says that full exception handling can be 80% of a finalized software product, so it's a four-fold amount of extra work! Maybe that estimate is a little outdated, given improvements in modern tech stacks, but it still contains many grains of truth.
There are many programming tools to help with the debugging cycle:
- C++ memory debugging tools (e.g. Valgrind on Linux)
- Performance profiling tools (for "de-slugging")
- Memory usage tracking (i.e. allocated memory measurement; see the sketch after this list)
- Interactive debugging tools (e.g. in the IDE, GNU gdb, etc.)
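For the memory usage tracking point, one simple approach in C++ is to overload the global operator new to count bytes. A minimal sketch, assuming a single global counter (a real tracker would also handle operator new[]/delete[], alignment overloads, and per-pointer sizes):

#include <cstdio>
#include <cstdlib>
#include <new>

// Global count of bytes allocated via operator new.
static size_t g_total_allocated = 0;

void* operator new(std::size_t size)
{
    g_total_allocated += size;    // Track every heap allocation
    void* ptr = std::malloc(size);
    if (!ptr) throw std::bad_alloc();
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    std::free(ptr);  // Size is not subtracted in this simple version
}

int main()
{
    int* x = new int(42);
    printf("Heap allocated so far: %zu bytes\n", g_total_allocated);
    delete x;
    return 0;
}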
Random Number Seeds
Neural network code often uses random numbers: to improve accuracy, as part of a stochastic algorithm, or even just for randomized testing. Random numbers need a "seed" to get started, which is set via the "srand" function in C++. The typical way to initialize the random number generator so that it's truly random is to use the current time:
srand(time(NULL));
But that's not good for debugging! We don't want randomness when we're trying to reproduce a bug!
A general plan is to have a debugging or regression-testing mode where the seed is fixed:
if (g_yapi_debug_srand_seed != 0) {
    srand(g_yapi_debug_srand_seed);  // Non-random randomness!
}
else {  // Normal run
    srand(time(NULL));
}
The test harness has to set the global debug variable when it's doing a regression test: for example, it can be hard-coded in a testing function, or set via a command-line argument to your test harness executable.
This is better, but if we have a bug in production, we won't know the seed number. So the better code also prints out the seed number in case you need to use it later to reproduce a bug that occurred live.
if (g_yapi_debug_srand_seed != 0) {
    srand(g_yapi_debug_srand_seed);  // Non-random randomness!
}
else {  // Normal run
    long int iseed = (long)time(NULL);
    fprintf(stderr, "INFO: Random number seed: %ld 0x%lx\n", iseed, iseed);
    srand(iseed);
}
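The test harness then needs a way to set the global seed variable. A minimal sketch, assuming a hypothetical "--seed" command-line option:

#include <cstdlib>
#include <cstring>

long g_yapi_debug_srand_seed = 0;  // 0 means a normal time-based seed

int main(int argc, char* argv[])
{
    // Scan for "--seed <number>" (the option name is illustrative)
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "--seed") == 0) {
            g_yapi_debug_srand_seed = atol(argv[i + 1]);
        }
    }
    // ... then run the regression tests, which execute the seeding code above ...
    return 0;
}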
Research on Debugging AI Framework Code
Papers on the issues of debugging the actual code that runs AI models, including the code inside frameworks and ML compilers, include:
- H Guan, Y Xiao, J Li, Y Liu, G Bai, May 2023, A comprehensive study of real-world bugs in machine learning model optimization, 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), https://ieeexplore.ieee.org/document/10172690, PDF: https://yepangliu.github.io/files/ICSE2023-MOB.pdf, PDF: https://baigd.github.io/files/ICSE23-MOB.pdf (Extensive coverage of bugs in the AI tech stack, including those introduced by optimizing models!)
- J Han, E Shihab, Z Wan, S Deng, X Xia, 2020, Empirical Software Engineering, What do programmers discuss about deep learning frameworks, https://link.springer.com/article/10.1007/s10664-020-09819-6, PDF: https://zhiyuan-wan.github.io/assets/publications/han_emse_20_dl_discussion.pdf
- Juneyoung Lee, Chung-Kil Hur, Ralf Jung, Zhengyang Liu, John Regehr, and Nuno P. Lopes, 2018, Reconciling High-Level Optimizations and Low-Level Code in LLVM, Proc. of the ACM on Programming Languages, Volume 2 Issue OOPSLA, Nov. 2018. PDF: http://web.ist.utl.pt/nuno.lopes/pubs/llvmmem-oopsla18.pdf (This paper has discussion of "bounds checking" and "out of memory" handling.)
- F Mince, D Dinh, J Kgomo, N Thompson, S Hooker, 2023, The Grand Illusion: The Myth of Software Portability and Implications for ML Progress, arXiv preprint arXiv:2309.07181, PDF: https://arxiv.org/pdf/2309.07181.pdf (Examines ML software frameworks TensorFlow, Pytorch, and JAX, from the perspective of portability across hardware.)
- Harshil Patel, 11th August, 2023, Model Debugging Strategies: Machine Learning Guide, MLOps Blog, Neptune Labs, https://neptune.ai/blog/model-debugging-strategies-machine-learning
- Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Aug 2023, A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance, https://arxiv.org/pdf/2308.13504.pdf (Examines avoiding computation overflow in quantization.)
- Guanpeng Li, Siva Kumar Sastry Hari, Michael Sullivan, Timothy Tsai, Karthik Pattabiraman, Joel Emer, and Stephen W Keckler. Understanding error propagation in deep learning neural network (dnn) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 8. ACM, 2017. https://ieeexplore.ieee.org/document/9926241, PDF: https://people.csail.mit.edu/emer/media/papers/2017.11.sc.error_propagation_in_DNNs.pdf
- Vincenzo Piuri. Analysis of fault tolerance in artificial neural networks. Journal of Parallel and Distributed Computing, 61(1):18–48, 2001. https://www.sciencedirect.com/science/article/abs/pii/S0743731500916630
- Cesar Torres-Huitzil, Bernard Girau, Fault and Error Tolerance in Neural Networks: A Review, IEEE Access, Volume 5, DOI: 10.1109/ACCESS.2017.2742698, https://ieeexplore.ieee.org/document/8013784
- H Liu, V Singh, M Filipiuk, SKS Hari, Oct 2023, ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures, arXiv preprint arXiv:2310.03841, https://arxiv.org/pdf/2310.03841.pdf (Detects and tolerates errors in GEMM at runtime, including testing with some error injection tests by flipping bits in neurons.)
- Zitao Chen, Guanpeng Li, and Karthik Pattabiraman. 2021, A low-cost fault corrector for deep neural networks through range restriction. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–13, https://ieeexplore.ieee.org/document/9505066/, https://arxiv.org/abs/2003.13874
- Y Guan, Y Qiu, J Leng, F Yang, S Yu, Y Liu, Y Feng, 2023, Amanda: Unified Instrumentation Framework for Deep Neural Networks, Conference, April 27–May 1, 2024, San Diego, CA, Association for Computing Machinery, https://www.cs.sjtu.edu.cn/~leng-jw/resources/Files/guan2024asplos-amanda.pdf (AI tracing and instrumentation methods)
- M. A. Hanif, R. Hafiz, and M. Shafique, Error resilience analysis for systematically employing approximate computing in convolutional neural networks, Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 913–916. https://ieeexplore.ieee.org/document/8342139
- A. Marchisio, V. Mrazek, M. A. Hanif, and M. Shafique, ReD-CaNe: A systematic methodology for resilience analysis and design of capsule networks under approximations, in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1205–1210. https://arxiv.org/abs/1912.00700
- Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin, 23 Apr 2024, NExT: Teaching Large Language Models to Reason about Code Execution, https://arxiv.org/abs/2404.14662
- Francisco Ribeiro, José Nuno Castro de Macedo, Kanae Tsushima, Rui Abreu, João Saraiva, 2023, GPT-3-Powered Type Error Debugging: Investigating the Use of Large Language Models for Code Repair, SLE 2023: Proceedings of the 16th ACM SIGPLAN International Conference on Software Language Engineering, October 2023, Pages 111–124, https://doi.org/10.1145/3623476.3623522 (Code corrections are a type of GEC.)
- Stephen Macneil, Paul Denny, Andrew Tran, Juho Leinonen, Seth Bernstein, Arto Hellas, Sami Sarsa, Joanne Kim, 2024, Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models, ACE '24: Proceedings of the 26th Australasian Computing Education Conference, January 2024, Pages 11–18, https://doi.org/10.1145/3636243.3636245, https://dl.acm.org/doi/abs/10.1145/3636243.3636245
- Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen, May 2023, Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs, https://arxiv.org/abs/2305.01024 (Focuses on error tolerance of failures within matrix multiplication algorithms.)
- Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà, 2 May 2024 (v2), A Primer on the Inner Workings of Transformer-based Language Models, https://arxiv.org/pdf/2405.00208 (Analyzes the theory of the Transformer architecture, including an interesting separation of the effects of attention versus FFNs on logits to give attributions.)
- Lan Chu, Jan 2024, LLM Output — Evaluating, debugging, and interpreting, Towards AI, https://pub.towardsai.net/llm-output-evaluating-debugging-and-interpreting-f3bd29e7d14d
- Youcheng Sun, Xiaowei Huang, Daniel Kroening, James Sharp, Matthew Hill, Rob Ashmore, Structural Test Coverage Criteria for Deep Neural Networks, https://doi.org/10.1145/3358233
- Youcheng Sun, Xiaowei Huang, Daniel Kroening, James Sharp, Matthew Hill, Rob Ashmore, Apr 2019, Testing Deep Neural Networks, https://arxiv.org/pdf/1803.04792.pdf
- DeepConcolic (Testing for Deep Neural Networks), https://github.com/TrustAI/DeepConcolic
- Jing Yu, Shukai Duan, Xiaojun Ye, 2023, A White-Box Testing for Deep Neural Networks Based on Neuron Coverage, IEEE Transactions on Neural Networks and Learning Systems (Volume 34, Issue 11, November 2023), https://ieeexplore.ieee.org/document/9737039
- David Spuler, March 2024, Chapter 42. Debugging, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Samantha Robertson, Zijie J. Wang, Dominik Moritz, Mary Beth Kery, Fred Hohman, 12 Apr 2023, Angler: Helping Machine Translation Practitioners Prioritize Model Improvements, https://arxiv.org/abs/2304.05967 https://machinelearning.apple.com/research/helping-machine-translation Code: https://github.com/apple/ml-translate-vis
- David Hashe, 19 July 2024, How to build highly-debuggable C++ binaries, https://dhashe.com/how-to-build-highly-debuggable-c-binaries.html
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
- Victor Dibia, Jingya Chen, Gagan Bansal, Suff Syed, Adam Fourney, Erkang Zhu, Chi Wang, Saleema Amershi, 9 Aug 2024, AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems, https://arxiv.org/abs/2408.15247
- Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. 2024. Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 648, 1–19. https://doi.org/10.1145/3613904.3642628 https://dl.acm.org/doi/full/10.1145/3613904.3642628
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Xiang Chen, Chaoyang Gao, Chunyang Chen, Guangbei Zhang, Yong Liu, 12 Aug 2024 (v2), An Empirical Study on Challenges for LLM Developers, https://arxiv.org/abs/2408.05002
- Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, Yuqun Zhang, 29 May 2019 (v3), Characterizing and Detecting CUDA Program Bugs, https://arxiv.org/abs/1905.01833 (Study of CUDA bugs in several production-level CUDA projects, including memory resource issues and synchronization errors.)
- M. Wu, Y. Ouyang, H. Zhou, L. Zhang, C. Liu and Y. Zhang, "Simulee: Detecting CUDA Synchronization Bugs via Memory-Access Modeling," 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea (South), 2020, pp. 937-948, doi: 10.1145/3377811.3380358. https://ieeexplore.ieee.org/document/9284094 (Simulation tool to detect CUDA bugs by interpreting the LLVM byte code.)
- Pengcheng Li, Chen Ding, Xiaoyu Hu, Tolga Soyata, 2014, LDetector: A Low Overhead Race Detector For GPU Programs, https://wodet.cs.washington.edu/wp-content/uploads/2014/02/wodet2014-final14.pdf
- S. Lagouvardos, J. Dolby, N. Grech, A. Antoniadis, and Y. Smaragdakis, 2020, “Static analysis of shape in Tensorflow programs,” in 34th European Conference on Object-Oriented Programming (ECOOP 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020. https://drops.dagstuhl.de/entities/document/10.4230/DARTS.6.2.6 PDF: https://drops.dagstuhl.de/storage/05darts/darts-vol006/darts-vol006-issue002_ecoop2020/DARTS.6.2.6/DARTS.6.2.6.pdf
- H. Y. Jhoo, S. Kim, W. Song, K. Park, D. Lee, and K. Yi, "A static analyzer for detecting tensor shape errors in deep neural network training code," in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 337–338. https://arxiv.org/abs/2112.09037
- S. Hong, H. Sun, X. Gao and S. H. Tan, "Investigating and Detecting Silent Bugs in PyTorch Programs," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 272-283, doi: 10.1109/SANER60148.2024.00035. https://ieeexplore.ieee.org/abstract/document/10589839 PDF: https://gaoxiang9430.github.io/papers/saner24a.pdf
- Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol, 1 Sep 2023 (v2), Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow, https://arxiv.org/abs/2112.13314
- M Hattori, N Kobayashi, R Sato, 2023, Gradual tensor shape checking, PDF: https://library.oapen.org/bitstream/handle/20.500.12657/63011/1/978-3-031-30044-8.pdf#page=210
- Yiran Wang, José Antonio Hernández López, Ulf Nilsson, and Dániel Varró. 2024. Using Run-Time Information to Enhance Static Analysis of Machine Learning Code in Notebooks. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE 2024). Association for Computing Machinery, New York, NY, USA, 497–501. https://doi.org/10.1145/3663529.3663785 https://dl.acm.org/doi/abs/10.1145/3663529.3663785 PDF: https://dl.acm.org/doi/pdf/10.1145/3663529.3663785
- Aparna Dhinakaran, Sep 2024, Choosing Between LLM Agent Frameworks. The tradeoffs between building bespoke code-based agents and the major agent frameworks. https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
- Yash Belhe & Zhefan Xu, Sep 2024 (accessed), Neural Networks and Debugging, https://deeplearning.cs.cmu.edu/S20/document/recitation/recitation-4.pdf (Detailed discussion of debugging your AI model.)
- LangChain, Nov 7, 2024. SCIPE - Systematic Chain Improvement and Problem Evaluation, https://blog.langchain.dev/scipe-systematic-chain-improvement-and-problem-evaluation/ https://github.com/garg-ankush/scipe/tree/main
- Xiaoyu Zhang, Weipeng Jiang, Chao Shen, Qi Li, Qian Wang, Chenhao Lin, Xiaohong Guan, 27 Apr 2024, A Survey of Deep Learning Library Testing Methods, https://arxiv.org/abs/2404.17871
- Utpal Bora, Saurabh Joshi, Gautam Muduganti, Ramakrishna Upadrasta, 21 Nov 2024, LLOR: Automated Repair of OpenMP Programs, https://arxiv.org/abs/2411.14590 (Addressing data race errors in programs.)
- Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar, 22 Feb 2021, Silent Data Corruptions at Scale, Facebook Research, https://arxiv.org/abs/2102.11245
- Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang, 3 Apr 2024 (v2), Characterization of Large Language Model Development in the Datacenter, https://arxiv.org/abs/2403.07648
- Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan, 22 Mar 2024, FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions, https://arxiv.org/abs/2403.15632 https://juliahub.com/ui/Packages/FloatTracker/dBXig
- Khairul Alam, Kartik Mittal, Banani Roy, Chanchal Roy, 22 Nov 2024 (v2), Developer Challenges on Large Language Models: A Study of Stack Overflow and OpenAI Developer Forum Posts, https://arxiv.org/abs/2411.10873
- Shengming Zhao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma, 29 Nov 2024, Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems, https://arxiv.org/abs/2411.19463
General Debugging Techniques Research
Research on general program debugging methods:
- TM Austin, SE Breach, GS Sohi, 1994, Efficient detection of all pointer and array access errors, PLDI '94: Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, August 1994, Pages 290–301, https://dl.acm.org/doi/abs/10.1145/178243.178446, PDF: https://dl.acm.org/doi/pdf/10.1145/178243.178446