Aussie AI
Parallelization
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Parallelization is the use of parallel computation algorithms in AI engines. This is typically achieved via GPU hardware or CPU SIMD capabilities, but it can also be simulated in software via multi-core and multi-threaded algorithms. Adding parallelization to an AI engine in C++ requires accessing the hardware acceleration capabilities using the intrinsic functions that are specific to the type of hardware being used.
Software Kernel Parallelization
Data doesn't just magically end up in the GPU. There has to be software written to send the data there, and there are a lot of possible optimizations that are used in writing such software. This software is often called the "kernel" of the AI engine. The sub-components of the engine often get called the MatMul kernel, Softmax kernel, normalization kernel, and so on.
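To make the terminology concrete, here is a minimal sketch of a matrix-vector "kernel" in plain C++, before any parallelization is applied (the names and shapes are illustrative, not from any particular engine):

```cpp
#include <cstddef>
#include <vector>

// Naive "MatMul kernel": multiply an NxM row-major weight matrix by an
// M-element input vector, producing an N-element output vector.
// Real engine kernels compute the same arithmetic, but restructured
// for GPU or CPU SIMD parallelism.
void matvec_kernel(const std::vector<float>& weights,  // N*M, row-major
                   const std::vector<float>& input,    // M elements
                   std::vector<float>& output,         // N elements
                   std::size_t n, std::size_t m) {
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < m; ++j) {
            sum += weights[i * m + j] * input[j];
        }
        output[i] = sum;
    }
}
```

The optimizations listed below are, broadly speaking, different ways of restructuring loops like these so that the parallel hardware stays busy.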
Speed optimizations: Software techniques that improve parallelization, primarily to increase throughput and reduce latency, include:
- Vectorization
- Multi-threading
- Kernel fusion
- Kernel fission
- Pipelining
- Overlapping
- Scheduling algorithms
Memory usage optimizations: Software optimizations that improve memory usage, thereby lowering memory access overhead and allowing more parallelism, include:
- Tiling
- Data locality optimizations
- Dataflow optimizations
- Memory management optimizations
- Cache management
- Prefetching
- Offloading
Vectorization
Vectorization is the name given to transforming a software loop from running sequentially over an array of data to performing the same computation in parallel, by sending the data to a GPU or to the CPU's SIMD extensions. Vectorization draws on loop optimization techniques to transform loops into faster, parallelizable versions: examples include loop unrolling, which "unrolls" a loop into its element-wise actions, loop distribution, which splits one loop into several simpler loops, and loop sectioning (strip mining), which breaks the array into segments that are the right size to fit in parallel into your GPU or CPU SIMD extensions. In theory, a good optimizing compiler can do vectorization automatically for simple loops, but in practice you often have to do it yourself.
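As a concrete illustration, here is a minimal sketch of vectorizing a simple element-wise loop using AVX intrinsics (assuming an x86 CPU with AVX support and a compiler flag such as -mavx; the function names are illustrative):

```cpp
#include <immintrin.h>
#include <cstddef>

// Scalar version: processes one float per loop iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Vectorized version: the loop is sectioned into chunks of 8 floats,
// and each chunk is processed by one 256-bit AVX addition.
void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)   // leftover elements that don't fill a full vector
        out[i] = a[i] + b[i];
}
```

Conceptually, the same transformation applies to GPU code, where each "section" maps to a block of GPU threads rather than a SIMD register.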
Research on Vectorization: There are various papers expressly on vectorization, and many papers that do it implicitly or by other names. See also research papers on loop unrolling and loop distribution.
- Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S., Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. http://doi.acm.org/10.1145/2491956.2462176, PDF: https://people.csail.mit.edu/jrk/halide-pldi13.pdf
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Vectorization covered in extensive survey of software optimizations to improve GPU latency and throughput.)
- Shibata, N. and Petrogalli, F., 2020, SLEEF: A portable vectorized library of C standard mathematical functions, IEEE Trans. Parallel Distrib. Syst. 31, 1316–1327. https://dx.doi.org/10.1109/TPDS.2019.2960333
- Anderson A, Muralidharan S, Gregg D, 2017, Efficient multibyte floating point data formats using vectorization. IEEE Trans Comput 66(12):2081–2096. https://ieeexplore.ieee.org/document/7950938 (Vectorization of floating point bit formats)
- Andrew Anderson, David Gregg, 2016, Vectorization of Multibyte Floating Point Data Formats, https://arxiv.org/abs/1601.07789 (Bitwise vectorization algorithms for floating point math.)
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Fast training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- S Jain, S VenkataKeerthy, R Aggarwal, 2022, Reinforcement Learning assisted Loop Distribution for Locality and Vectorization, 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), https://ieeexplore.ieee.org/abstract/document/10026979/, PDF: https://www.researchgate.net/profile/Dibyendu-Das/publication/365475992_Reinforcement_Learning_assisted_Loop_Distribution_for_Locality_and_Vectorization/links/637679e937878b3e87bb988e/Reinforcement-Learning-assisted-Loop-Distribution-for-Locality-and-Vectorization.pdf
- Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
- R Paktinatkeleshteri, 2023, Efficient Auto-Vectorization for Control-flow Dependent Loops through Data Permutation, Masters Thesis, Department of Computing Science, University of Alberta, http://webdocs.cs.ualberta.ca/~amaral/thesis/RouzbehPaktinatkeleshteriMSc.pdf (Automatic vectorization optimizations by compilers.)
- Y He, A Podobas, S Markidis, Aug 2023 Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries, arXiv preprint arXiv:2308.00497, https://arxiv.org/abs/2308.00497
- D. Levine, D. Callahan, and J. Dongarra, 1991. A comparative study of automatic vectorizing compilers, Parallel Computing, vol. 17, no. 10-11, pp. 1223–1244, https://dl.acm.org/doi/10.5555/165357.165391 PDF: https://netlib.org/utk/people/JD/JackDongarra/PAPERS/Comparative-Study-of-Automatic-Vectorizing-Compilers.pdf
- S. Maleki, Y. Gao, M. J. Garzar, T. Wong, D. A. Padua, et al., 2011, An evaluation of vectorizing compilers, 2011 International Conference on Parallel Architectures and Compilation Techniques, IEEE, 2011, pp. 372–382, https://ieeexplore.ieee.org/document/6113845, PDF: http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf
- A. Pohl, B. Cosenza, and B. Juurlink, Control Flow Vectorization for ARM NEON, in Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems, Sankt Goar Germany: ACM, May 2018, pp. 66–75, isbn: 978-1-4503-5780-7. doi: 10.1145/3207719.3207721. https://dl.acm.org/doi/10.1145/3207719.3207721, PDF: https://www.cosenza.eu/papers/PohlSCOPES18.pdf
- N. Sreraman and R. Govindarajan, 2000, A vectorizing compiler for multimedia extensions, International Journal of Parallel Programming, vol. 28, pp. 363–400, 2000. https://link.springer.com/article/10.1023/A:1007559022013a
- H Liu, F Deng, 2023, Convolutional Acceleration Algorithm Combining Loop Optimization and Automatic Scheduling, 2023 International Conference for Advancement in Technology (ICONAT), https://ieeexplore.ieee.org/abstract/document/10080410/
- R Maruthamuthu, D Dhabliya, 2023, Advancements in Compiler Design and Optimization Techniques, E3S Web of Conferences, Volume 399 (2023), https://www.e3s-conferences.org/articles/e3sconf/abs/2023/36/e3sconf_iconnect2023_04047/e3sconf_iconnect2023_04047.html, https://www.e3s-conferences.org/articles/e3sconf/pdf/2023/36/e3sconf_iconnect2023_04047.pdf (Examines automatic vectorization by programming language compilers.)
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Many of the loop optimizations are relevant to vectorization, even back in 1994.)
- David A. Padua, Michael J. Wolfe, 1986, Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12, 1184–1201, https://dl.acm.org/doi/10.1145/7902.7904, PDF: https://dl.acm.org/doi/epdf/10.1145/7902.7904
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- David Spuler, March 2024, Chapter 30. Vectorization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- EventHelix, Jan 1, 2022, Auto-vectorize C and C++ code, https://medium.com/software-design/auto-vectorize-c-and-c-code-34569b2b5f1e
- Mike H.B. Gray, 2024, Implementation of Floating-Point Arithmetic Coding Using x86-64 AVX-256 Assembly Language, https://www.opastpublishers.com/open-access-articles/implementation-of-floatingpoint-arithmetic-coding-using-x8664-avx256-assembly-language.pdf
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
Single Instruction Multiple Data (SIMD)
Research papers on SIMD instructions also include:
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD).
- C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
- David Spuler, March 2024, Chapter 17. AVX Intrinsics, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 Code: https://github.com/tonyzhang617/nomad-dist (Converts 4-bit vector dot products to using SIMD registers as lookup tables on CPUs.)
- Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng, 6 Jun 2024, Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp, https://arxiv.org/abs/2406.10816
- Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, Nam Sung Kim, 2024, Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference, IEEE Computer Architecture Letters, vol. 23, pp. 117-120, Jan.-Jun. 2024, DOI: 10.1109/LCA.2024.3397747, https://www.computer.org/csdl/journal/ca/2024/01/10538369/1XcOWKoKwfe
- Z Zhong, January 22nd, 2024, Enhancing SIMD Assembly Language Development with Visualization Techniques, Masters Thesis, Department of Computer Science and Communications Engineering, Master of Engineering, Waseda University, Japan, https://waseda.repo.nii.ac.jp/record/2001309/files/t5121F099.pdf
- Mike H.B. Gray, 2024, Implementation of Floating-Point Arithmetic Coding Using x86-64 AVX-256 Assembly Language, https://www.opastpublishers.com/open-access-articles/implementation-of-floatingpoint-arithmetic-coding-using-x8664-avx256-assembly-language.pdf
- Z. Zhang, Y. Chen, B. He and Z. Zhang, June 2023, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 6, pp. 1982-1995, June 2023, doi: 10.1109/TPDS.2023.3269530, https://ieeexplore.ieee.org/abstract/document/10107474
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
Tensor Parallelism
Research papers on tensor parallelism include:
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, 18 Jun 2024 (v4), FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase, 23 Sep 2024, Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, https://arxiv.org/abs/2409.15241 (Using tensor parallelism to overlap communications and computation.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush, 12 Nov 2024, Towards Low-bit Communication for Tensor Parallel LLM Inference, https://arxiv.org/abs/2411.07942
Multi-Query Parallelism
Research papers on multi-query parallelism include:
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of thought: Large language models can do parallel decoding,” arXiv preprint arXiv:2307.15337, 2023. https://arxiv.org/abs/2307.15337
- S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024. https://arxiv.org/abs/2402.12280
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
Data Parallelism
Research papers on data parallelism include:
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang, 24 Jun 2024, Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars, https://arxiv.org/abs/2406.17812
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Kabir Nagrecha, Oct 2024, Thesis, Orchestration Systems to Support Deep Learning at Scale Doctor of Philosophy, Computer Science, University of California San Diego, https://escholarship.org/content/qt3pp6k1p4/qt3pp6k1p4_noSplash_457f4c7c0435172a3d0a17428455894c.pdf (Pipeline and data parallelism systems.)
Sequence Parallelism
Research papers on sequence parallelism include:
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
Pipelining
Pipelining is the software technique of making sure the GPU has a full queue of work. If the GPU has a full pipeline, it is maxed out, and its throughput is at the highest it can achieve. That's the profound pinnacle of optimal optimization.
Bubbles are bad: Sometimes the GPU is so fast that it can outpace the CPU's ability to send it data (often due to memory access and data transfer overhead in the CPU), and then the GPU has nothing to do for a time. This is called a "bubble" in the pipeline, and even though we're talking about only milliseconds, it represents a squandered opportunity to go faster.
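As a rough illustration of the idea on the CPU side, here is a minimal two-stage software pipeline sketch in standard C++: one thread prepares the next batch of work while another thread computes on the previous batch, so the compute stage (standing in for the GPU) is rarely left idle. The batch type and stage functions are placeholders, not from any real engine:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Batch = std::vector<float>;   // placeholder for one unit of prepared work

std::queue<Batch> g_queue;          // batches ready for the compute stage
std::mutex g_mutex;
std::condition_variable g_cv;
bool g_done = false;

Batch prepare_batch(int i) { return Batch(1024, float(i)); }  // stage 1 (stub)
void compute_batch(const Batch&) { /* e.g., launch GPU kernels here */ }

void producer(int num_batches) {
    for (int i = 0; i < num_batches; ++i) {
        Batch b = prepare_batch(i);               // overlaps with compute stage
        std::lock_guard<std::mutex> lock(g_mutex);
        g_queue.push(std::move(b));
        g_cv.notify_one();
    }
    std::lock_guard<std::mutex> lock(g_mutex);
    g_done = true;
    g_cv.notify_one();
}

void consumer() {
    while (true) {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_cv.wait(lock, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty() && g_done) break;     // pipeline fully drained
        Batch b = std::move(g_queue.front());
        g_queue.pop();
        lock.unlock();
        compute_batch(b);                         // keep the compute stage busy
    }
}

int main() {
    std::thread t1(producer, 100), t2(consumer);
    t1.join();
    t2.join();
}
```

If prepare_batch is slower than compute_batch, the queue runs dry and the consumer stalls waiting on the condition variable; that stall is exactly the "bubble" described above.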
Research on Pipelining: There are various papers expressly on pipelining, such as:
- Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism, http://arxiv.org/abs/1811.06965
- Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. Polymage: Automatic optimization for image processing pipelines. ACM SIGARCH Comput. Arch. News 43, 1 (2015), 429–443. https://doi.org/10.1145/2694344.2694364, PDF: https://www.csa.iisc.ac.in/TR/2015/3/polymage-tr.pdf
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4 (2016), 1–11. https://doi.org/10.1145/2897824.2925952, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952
- Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
- R Ma, J Wang, Q Qi, X Yang, H Sun, Z Zhuang, J Liao, September 2023, PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing, ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference, Pages 1126–1128, https://doi.org/10.1145/3603269.3610856, https://dl.acm.org/doi/abs/10.1145/3603269.3610856
- A Agrawal, A Panwar, J Mohan, N Kwatra, BS Gulavani, 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, arXiv preprint, https://arxiv.org/abs/2308.16369
- Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi, Aug 2023, IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency, https://arxiv.org/abs/2308.12871
- GI Yu, JS Jeong, GW Kim, S Kim, BG Chun, 2022, Orca: A distributed serving system for Transformer-Based generative models, 16th USENIX Symposium, https://www.usenix.org/conference/osdi22/presentation/yu, PDF: https://www.usenix.org/system/files/osdi22-yu.pdf (Improved parallelization/pipelining with latency reduction from iteration-level scheduling across multiple requests.)
- Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. A^3: Accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 328–341. IEEE, 2020. https://arxiv.org/abs/2002.10941
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Junki Park, Hyunsung Yoon, Daehyun Ahn, Jungwook Choi, and Jae-Joon Kim. 2020. OPTIMUS: OPTImized Matrix Multiplication Structure for Transformer neural network accelerator. Proceedings of Machine Learning and Systems 2 (2020), 363-378. https://proceedings.mlsys.org/paper_files/paper/2020/file/91ba7292e5388b90b58d0b839a7f19ec-Paper.pdf
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054, Code (part): https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- J Zhang, G Niu, Q Dai, H Li, Z Wu, F Dong, Z Wu, 2023, PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters, Neurocomputing, https://www.sciencedirect.com/science/article/pii/S0925231223007841
- C Wang, D Sun, Y Bai, 2023, PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs, https://arxiv.org/pdf/2301.00391
- G Huang, Y Bai, L Liu, Y Wang, B Yu, 2023, ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs, Proceedings of Machine Learning and Systems (MLSys 2023), https://arxiv.org/abs/2210.16691, PDF: https://proceedings.mlsys.org/paper_files/paper/2023/file/12a304a31e42dfefa21c82431e849124-Paper-mlsys2023.pdf
- S Raskar, JM Monsalve Diaz, T Applencourt, 2023, Implementation of Dataflow Software Pipelining for Codelet Model, https://research.spec.org/icpe_proceedings/2023/proceedings/p161.pdf
- Yuke Wang, Boyuan Feng, Zheng Wang, Tong Geng, Kevin Barker, Ang Li, Yufei Ding, June 2023, MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms, https://arxiv.org/abs/2209.06800, PDF: https://www.usenix.org/system/files/osdi23-wang-yuke.pdf
- KKW Ng, HM Demoulin, V Liu, October 2023, Paella: Low-latency Model Serving with Software-defined GPU Scheduling, SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles, Pages 595–610, https://dl.acm.org/doi/abs/10.1145/3600006.3613163
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Examines pipelining along with extensive coverage of numerous compiler auto-optimizations of program code.)
- Eric LaForest, March 19, 2010, Survey of Loop Transformation Techniques, ECE 1754, http://fpgacpu.ca/writings/SurveyLoopTransformations.pdf
- Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations.)
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, Yi Liu, Zhongzhi Luan, Depei Qian, 24 Feb 2024, Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding, https://arxiv.org/abs/2402.15678 (Speculative decoding with a focus on improving poor acceptance rates using majority-voting consensus decoding with multiple small draft models, achieving 80-90% acceptance with OPT-13B, and around 50% acceptance for Llama-70B, and also pipelining of inference computations for speedup.)
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- J Dong, W Li, Z Huang, L Xu, 2023, Method to Deploy Lightweight Models with a Novel Pipeline for Efficient Inference, https://ieeexplore.ieee.org/abstract/document/10221956/
- H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, https://dl.acm.org/doi/abs/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
- Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, Jianxin Liao, June 16-21, 2024, HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 1–9, Association for Computational Linguistics, https://aclanthology.org/2024.naacl-industry.1.pdf
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Aaron Archer, Matthew Fahrbach, Kuikui Liu, Prakash Prabhu, July 2024, Practical Performance Guarantees for Pipelined DNN Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1655-1671, 2024, https://proceedings.mlr.press/v235/archer24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/archer24a/archer24a.pdf
- Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari, 16 Jul 2024, PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation, https://arxiv.org/abs/2407.11798 (Optimizes speculative decoding further using pipelining.)
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya, 9 Sep 2024 (v2), Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices, https://arxiv.org/abs/2409.04249 (Pipelining of model layer-wise loading and inference for memory-efficient inference.)
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Zhuohan Li, Oct 2024, Empowering Large Language Models with Efficient and Automated Systems, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt2kp379f3/qt2kp379f3.pdf (Examines pipeline parallelism and paged attention.)
- Kabir Nagrecha, Oct 2024, Thesis, Orchestration Systems to Support Deep Learning at Scale Doctor of Philosophy, Computer Science, University of California San Diego, https://escholarship.org/content/qt3pp6k1p4/qt3pp6k1p4_noSplash_457f4c7c0435172a3d0a17428455894c.pdf (Pipeline and data parallelism systems.)
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
Overlapping
Overlapping is a type of parallelization that is closely related to pipelining. The idea is to run two types of actions or computations in parallel. There are multiple types of overlapping:
- Overlap communication and computation
- Overlap memory accesses and computation
- Overlap recomputation/rematerialization and computation
- Overlap CPU and GPU computations
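The first kind, overlapping communication and computation, is the most common on GPUs. Below is a minimal CUDA C++ sketch (assuming an NVIDIA GPU and the CUDA runtime; the kernel and sizes are purely illustrative) that splits a buffer across two streams so that copying one half to the device overlaps with computing on the other half:

```cpp
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;            // total elements
    const int half = n / 2;

    float* h_data = nullptr;          // pinned host memory enables async copies
    cudaMallocHost((void**)&h_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each half gets its own stream: while one half is being computed,
    // the other half's copy is in flight, so the copy engine and the
    // compute units are busy at the same time.
    for (int c = 0; c < 2; ++c) {
        int offset = c * half;
        cudaMemcpyAsync(d_data + offset, h_data + offset, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale_kernel<<<(half + 255) / 256, 256, 0, s[c]>>>(d_data + offset, half, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data + offset, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

The same structure generalizes to more chunks and more streams; the papers below cover far more sophisticated versions of this overlap in training and inference systems.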
Here are some research papers on overlapping optimizations:
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et. al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, June 2024, FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 402–416, 2022. https://arxiv.org/abs/2105.05720
- S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang, et al., “Overlap communication with dependent computation via decomposition in large deep learning models,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 93–106, 2023. https://dl.acm.org/doi/10.1145/3567955.3567959 PDF: https://dl.acm.org/doi/pdf/10.1145/3567955.3567959
- F. Li, S. Zhao, Y. Qing, X. Chen, X. Guan, S. Wang, G. Zhang, and H. Cui, “Fold3d: Rethinking and parallelizing computational and communicational tasks in the training of large dnn models,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1432–1449, 2023. https://ieeexplore.ieee.org/document/10050126 https://i.cs.hku.hk/~heming/papers/tpds23-fold3d.pdf
- G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, Zero++: Extremely efficient collective communication for giant model training, arXiv preprint arXiv:2306.10209, 2023. https://arxiv.org/abs/2306.10209
- Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. 2024. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), Vol. 3. Association for Computing Machinery, New York, NY, USA, 178–191. https://doi.org/10.1145/3620666.3651379 https://dl.acm.org/doi/abs/10.1145/3620666.3651379
- Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen, 27 Jun 2024 (v2), Optimizing Large Model Training through Overlapped Activation Recomputation, https://arxiv.org/abs/2406.08756
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 30 Apr 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, https://arxiv.org/abs/2404.19429
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/339caf45a6fa281cae8adc6465343464-Abstract-Conference.html PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf
- Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. 2024. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), Vol. 2. Association for Computing Machinery, New York, NY, USA, 1146–1164. https://doi.org/10.1145/3620665.3640410 https://dl.acm.org/doi/abs/10.1145/3620665.3640410 PDF: https://dl.acm.org/doi/pdf/10.1145/3620665.3640410
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase, 23 Sep 2024, Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, https://arxiv.org/abs/2409.15241 (Using tensor parallelism to overlap communications and computation.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 14 Nov 2024, Pie: Pooling CPU Memory for LLM Inference, https://arxiv.org/abs/2411.09317
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
- Vincent Abbott, Gioele Zardini, 4 Dec 2024, FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness, https://arxiv.org/abs/2412.03317
Overlapping Communication and Computation
Here are some research papers on overlapping communication (sending data) with computation, usually on the GPU:
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, June 2024, FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 402–416, 2022. https://arxiv.org/abs/2105.05720
- S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang, et al., “Overlap communication with dependent computation via decomposition in large deep learning models,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 93–106, 2023. https://dl.acm.org/doi/10.1145/3567955.3567959 PDF: https://dl.acm.org/doi/pdf/10.1145/3567955.3567959
- F. Li, S. Zhao, Y. Qing, X. Chen, X. Guan, S. Wang, G. Zhang, and H. Cui, “Fold3d: Rethinking and parallelizing computational and communicational tasks in the training of large dnn models,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1432–1449, 2023. https://ieeexplore.ieee.org/document/10050126 https://i.cs.hku.hk/~heming/papers/tpds23-fold3d.pdf
- G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, Zero++: Extremely efficient collective communication for giant model training, arXiv preprint arXiv:2306.10209, 2023. https://arxiv.org/abs/2306.10209
- Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. 2024. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), Vol. 3. Association for Computing Machinery, New York, NY, USA, 178–191. https://doi.org/10.1145/3620666.3651379 https://dl.acm.org/doi/abs/10.1145/3620666.3651379
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 30 Apr 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, https://arxiv.org/abs/2404.19429
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/339caf45a6fa281cae8adc6465343464-Abstract-Conference.html PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf
- Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. 2024. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), Vol. 2. Association for Computing Machinery, New York, NY, USA, 1146–1164. https://doi.org/10.1145/3620665.3640410 https://dl.acm.org/doi/abs/10.1145/3620665.3640410 PDF: https://dl.acm.org/doi/pdf/10.1145/3620665.3640410
- Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase, 23 Sep 2024, Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, https://arxiv.org/abs/2409.15241 (Using tensor parallelism to overlap communications and computation.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
Skeleton-of-Thought (Query Parallelism)
Skeleton-of-thought is a type of query parallelism for multi-step inference. The idea is that, rather than generating a very long answer sequentially, the execution uses three steps:
1. Generate a brief outline of points or paragraphs (quickly), and
2. In parallel, flesh out each of these points in the outline, and
3. Combine them into the full answer (quickly).
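A minimal sketch of this pattern in C++ follows, assuming a hypothetical llm_complete() function (a stub here, standing in for a real request to the model-serving layer) and using std::async to expand the outline points in parallel:

```cpp
#include <future>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stub standing in for a real request to the LLM serving layer.
std::string llm_complete(const std::string& prompt) {
    return "[LLM output for: " + prompt + "]";
}

std::string skeleton_of_thought(const std::string& question) {
    // Step 1: one short sequential request to get the outline ("skeleton").
    std::string outline = llm_complete(
        "List a few brief bullet points that answer: " + question);

    // Parse the outline, assuming one point per line (parsing is simplified).
    std::vector<std::string> points;
    std::istringstream in(outline);
    for (std::string line; std::getline(in, line); )
        if (!line.empty()) points.push_back(line);

    // Step 2: expand every point in parallel, each as a separate LLM request.
    std::vector<std::future<std::string>> futures;
    for (const auto& p : points)
        futures.push_back(std::async(std::launch::async, llm_complete,
            "Expand this point into a paragraph: " + p));

    // Step 3: combine the expanded points into the full answer.
    std::string answer;
    for (auto& f : futures)
        answer += f.get() + "\n\n";
    return answer;
}

int main() {
    std::cout << skeleton_of_thought("Why is the sky blue?") << "\n";
}
```

The total latency is roughly the outline request plus the slowest single expansion, rather than the sum of all the expansions, at the cost of issuing more requests in parallel.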
Research papers on skeleton-of-thought:
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Xuefei Ning, Zinan Lin, November 17, 2023, Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/ Code: https://github.com/imagination-research/sot/
- S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024. https://arxiv.org/abs/2402.12280
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Steven Kolawole, Keshav Santhanam, Virginia Smith, Pratiksha Thaker, Nov 2024, Extracting Parallelism from Large Language Model Queries, https://openreview.net/pdf?id=CZHt9kLS5S
Offloading
The term "offloading" in Computer Science theory generally refers to a low-power device offloading computation to a separate more powerful device. In the AI context, this would seem to refer to a CPU offloading processing to a GPU. However, the term "offloading" is not really used with this meaning in AI research. Other terms are used such as "hardware acceleration", "parallelization", and vectorization.
In much AI research, the term "offloading" actually refers to the opposite situation: moving data from the GPU back to the CPU (i.e., from a high-power device down to a low-power device). This type of offloading is done to save memory space, not to speed up parallel processing (although indirectly it can do that, too). The offloading occurs in the reverse direction and is often combined with recomputation or rematerialization. The overall goal is to reduce processing time by optimizing memory usage: the data (e.g., a tensor) is offloaded out of the GPU to save GPU RAM space, and then later re-sent to the GPU (or recomputed) if it is needed again. When this memory optimization is used during training, it is often called "checkpointing" or "gradient checkpointing". See details on recomputation and checkpointing for memory optimization.
The term "offloading" is used with its theoretical meaning in the research area of "edge computing" because it transfers processing from a low-resource device to a high-powered server. Edge computing is the theory of devices on the "edge," which can mean IoT network devices (e.g. modems or security cameras) or personal mobile phones. In this context, the term "offloading" really means "uploading" because it refers to sending the request from the edge device up into the cloud. The AI inference processing is done in the cloud, on a much more powerful server with dozens of mega-GPUs bought with unicorn venture capital funding, and then the results are sent back down to the device or phone.
More research on offloading:
- Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Uses a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
- Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu, Dec 2023, Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, https://arxiv.org/abs/2311.03687 (Benchmarks model speed for training, fine-tuning and inference with various optimizations such as ZeRO, quantization, offloading/recomputation, and Flash Attention.)
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
- Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari, 17 Jun 2024, Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference, https://arxiv.org/abs/2406.11674
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Ying He, Jingcheng Fang, F. Richard Yu, Victor C. Leung, 2024, Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach, PrePrints pp. 1-12, DOI: 10.1109/TMC.2024.3415661, https://www.computer.org/csdl/journal/tm/5555/01/10591707/1YraFlDdKYo
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Xunyi Zhao, Lionel Eyraud-Dubois, Théotime Le Hellard, Julia Gusak, Olivier Beaumont, 24 July, 2024, OFFMATE: full fine-tuning of LLMs on a single GPU by re-materialization and offloading, https://hal.science/hal-04660745/document
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- R. Narmeen, P. Mach, Z. Becvar and I. Ahmad, 16 August 2024, Joint Exit Selection and Offloading Decision for Applications Based on Deep Neural Networks, IEEE Internet of Things Journal, doi: 10.1109/JIOT.2024.3444898, https://doi.org/10.1109/JIOT.2024.3444898 https://ieeexplore.ieee.org/abstract/document/10638073
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- X. Yuan, H. Li, K. Ota and M. Dong, 2024, Generative Inference of Large Language Models in Edge Computing: An Energy Efficient Approach, 2024 International Wireless Communications and Mobile Computing (IWCMC), Ayia Napa, Cyprus, 2024, pp. 244-249, doi: 10.1109/IWCMC61514.2024.10592339, https://ieeexplore.ieee.org/document/10592339
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Othmane Friha, Mohamed Amine Ferrag, Burak Kantarci, Burak Cakmak, Arda Ozgun, Nassira Ghoualmi-Zine, 2024, LLM-based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10669603
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Offloading the memory-bound processing of KV caches in attention kernels during decoding to bandwidth-focused GPUs, while reserving compute-bound computations like FFNs and prefill for powerful GPUs.)
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
- Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
- Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
- Jiaming Qiu, Ruiqi Wang, Brooks Hu, Roch Guerin, Chenyang Lu, 24 Oct 2024, Optimizing Edge Offloading Decisions for Object Detection, https://arxiv.org/abs/2410.18919
- Xiaoniu Song, Zihang Zhong, Rong Chen, 29 Oct 2024, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
- Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2 Nov 2024, NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference, https://arxiv.org/abs/2411.01142
- Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
Kernel Tiling
Tiling is an optimization that breaks a computation up into "tiles" or "blocks". It mainly refers to matrix multiplication kernels, in which case the tiles or blocks are smaller square or rectangular submatrices of the bigger matrix, sized so that they fit into caches, registers, or GPU shared memory. Various tiled MatMul variants can be used to speed up GEMM kernels. This technique is closely related to loop tiling.
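Here is a minimal sketch of a tiled (blocked) matrix multiplication in plain C++, illustrating the basic idea; the tile size is an illustrative choice, and a production GEMM kernel would also add vectorization, multi-threading, and hardware-specific tuning.

    // Minimal sketch of a tiled (blocked) MatMul in plain C++ (row-major storage).
    // The tile size is illustrative; real kernels tune it to the cache hierarchy.
    #include <vector>
    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t TILE = 32;  // tile edge chosen so sub-blocks stay cache-resident

    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, std::size_t n) {
        std::fill(C.begin(), C.end(), 0.0f);
        for (std::size_t i0 = 0; i0 < n; i0 += TILE)
            for (std::size_t k0 = 0; k0 < n; k0 += TILE)
                for (std::size_t j0 = 0; j0 < n; j0 += TILE)
                    // Multiply one TILE x TILE sub-block; the inner loops touch only
                    // small submatrices of A, B, and C, improving cache locality.
                    for (std::size_t i = i0; i < std::min(i0 + TILE, n); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + TILE, n); ++k) {
                            float a = A[i * n + k];
                            for (std::size_t j = j0; j < std::min(j0 + TILE, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }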
Research on Tiling: Papers on tiling computation optimizations include those below (see also research on loop tiling):
- P Tillet, HT Kung, D Cox, 2019, Triton: an intermediate language and compiler for tiled neural network computations, Proceedings of the 3rd ACM SIGPLAN, http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (jul 2016), 11 pages. https://doi.org/10.1145/2897824.2925952, https://research.google/pubs/pub45525/, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952
- H Li, J Choi, Y Kwon, JH Ahn, Oct 2023, A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models, IEEE Computer Architecture Letters, https://ieeexplore.ieee.org/abstract/document/10285300/
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf
- C Zhang, P Li, G Sun, Y Guan, B Xiao, 2015, Optimizing FPGA-based accelerator design for deep convolutional neural networks, FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 2015, Pages 161–170, https://dl.acm.org/doi/abs/10.1145/2684746.2689060, PDF: https://iceory.github.io/2018/04/25/fpga-based-cnn/FPGA-BASED-CNN.pdf
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
- Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop performance, stochastic algorithms, approximate arithmetic, quantization, pruning and low-rank.)
- Eunji Kwon; Jongho Yoon; Seokhyeong Kang, Dec 2023, Mobile Transformer Accelerator Exploiting Various Line Sparsity and Tile-Based Dynamic Quantization, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10375766
- Robert A. van de Geijn, Enrique S. Quintana-Ortí, 2007, The Science of Programming Matrix Computations, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf Code: http://www.cs.utexas.edu/users/flame/
- X Xie, H Peng, A Hasan, S Huang, J Zhao, 2023, Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks https://arxiv.org/abs/2308.11825 (Kernel for sparse matrix multiplication with block-level tiling as example.)
- Kazushige Goto, Robert A. van de Geijn, 2008, Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, Volume 34, Issue 3, Article No.: 12, pp 1–25, https://dl.acm.org/doi/10.1145/1356052.1356053 (The GotoBLAS algorithm for matrix multiplication.)
- Yufan Xu, Saurabh Raje, Atanas Rountev, Gerald Sabin, Aravind Sukumaran-Rajam, and P Sadayappan. Training of deep learning pipelines on memory-constrained gpus via segmented fused-tiled execution. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pages 104–116, 2022. https://arxiv.org/pdf/2310.12109.pdf
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2; uses tiling and memory-aware kernels to optimize attention.)
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet, 16 Jun 2024, Optimized Speculative Sampling for GPU Hardware Accelerators, https://arxiv.org/abs/2406.11016 (Speculative decoding accelerated with multiple GPUs using approaches such as tiling, and uses a fused sigmoid replacing Softmax.)
- Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari, 18 Jun 2024, Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices, SAMOS2024 conference, https://arxiv.org/abs/2406.12478 Code: https://github.com/eml-eda/depthwise-separable-fusion
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo, 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533, https://www.computer.org/csdl/journal/tc/5555/01/10436533/1UwVolp0wta
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- David Spuler, March 2024, Loop Tiling or Blocking, in Generative AI in C++, https://www.aussieai.com/book/ch15-loop-tiling-blocking
- David Spuler, March 2024, Tiled Matrix-Vector Multiplication, in Generative AI in C++, https://www.aussieai.com/book/ch34-tiled-matrix-vector-multiplication
- Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers https://arxiv.org/abs/2410.17957
- Z. Zhang, D. Yang, X. Zhou and D. Cheng, "MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators," in 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis SC, Atlanta, GA, United States, 2024, pp. 528-542, doi: 10.1109/SC41406.2024.00040. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a528/21HUVuG3S8M
- Inas Bachiri, September 2024, A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration, DOI:10.13140/RG.2.2.10612.16009, Thesis for: Master in Computer Science, https://www.researchgate.net/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration https://www.researchgate.net/profile/Inas-Bachiri/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration/links/66ed912c6b101f6fa4f3d6ce/A-Literature-Review-on-Combining-Neural-Architecture-Search-and-Compiler-Optimizations-for-Neural-Network-Acceleration.pdf
- Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu, 20 Nov 2024, MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices, https://arxiv.org/abs/2411.17720
General Research on Parallelization
General papers on parallelism:
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Daniel Kusswurm, 2022, Modern Parallel Programming with C++ and Assembly Language: X86 SIMD Development Using AVX, AVX2, and AVX-512, 1st Edition, Apress, https://www.amazon.com/Modern-Parallel-Programming-Assembly-Language/dp/1484279174/, Code: https://github.com/Apress/modern-parallel-programming-cpp-assembly
- Wikipedia, 2023 (accessed), AoS and SoA, https://en.wikipedia.org/wiki/AoS_and_SoA (Structure of arrays vs array of structures.)
- William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko, 2023, High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023, Pages 119–134, https://dl.acm.org/doi/abs/10.1145/3572848.3577475, PDF: https://dl.acm.org/doi/pdf/10.1145/3572848.3577475
- Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, and David Brooks. Exploiting parallelism opportunities with deep learning frameworks. ACM Trans. Archit. Code Optim., 18(1), 2021. https://arxiv.org/abs/1908.04705 (Analysis of parallelization in terms of overhead of scheduling and intra-operator parallelization such as multi-threaded MatMul operations.)
- Stanford University, 2018, General Matrix Multiply (GeMM), https://spatial-lang.org/gemm
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture parallelized the layers of the Transformer.)
- Liangzhen Lai, Naveen Suda, Vikas Chandra, 2018, CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs, arXiv preprint arXiv:1801.06601, https://arxiv.org/abs/1801.06601 PDF: https://arxiv.org/pdf/1801.06601
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Minsik Cho, Mohammad Rastegari, Devang Naik, 8 May 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://arxiv.org/abs/2405.05329 (Parallelization of KV cache generation in prefill phase.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
- M Davies, I McDougall, S Anandaraj, D Machchhar, April 2024, A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 20–36, https://doi.org/10.1145/3620665.3640367 https://dl.acm.org/doi/abs/10.1145/3620665.3640367 (Benchmarking analysis of GPU execution extending MLPerf.)
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024. https://arxiv.org/abs/2305.10601
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
- Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He, 16 Mar 2024, DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing, https://arxiv.org/abs/2403.10913 (Attention optimizations in Vision Transformer with pruning of feature maps, and extensive parallelization with consideration of the hardware layer.)
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD.)
- Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen, May 2023, Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs, https://arxiv.org/abs/2305.01024 (Focuses on error tolerance of failures within matrix multiplication algorithms.)
- Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
- Radha Gulhane, 2024, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication, Masters Thesis, Computer Science and Engineering, The Ohio State University, https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=osu1713381834648517&disposition=inline
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, May 2022, Reducing Activation Recomputation in Large Transformer Models, https://arxiv.org/abs/2205.05198
- Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM, Volume 60, Issue 6, June 2017, pp 84–90, https://doi.org/10.1145/3065386 https://dl.acm.org/doi/10.1145/3065386 PDF: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf Code: http://code.google.com/p/cuda-convnet/ (The early paper that introduced a grouped convolution architecture for multi-GPUs, later the basis of AlexNet, which was a famous image recognition CNN.)
- David Spuler, March 2024, Part III: Parallel C++ Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of thought: Large language models can do parallel decoding,” arXiv preprint arXiv:2307.15337, 2023. https://arxiv.org/abs/2307.15337
- S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024. https://arxiv.org/abs/2402.12280
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., "Graph of thoughts: Solving elaborate problems with large language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17682–17690. https://arxiv.org/abs/2308.09687
- Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang, 24 Jun 2024, Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars, https://arxiv.org/abs/2406.17812
- Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, Jianxin Liao, June 16-21, 2024, HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 1–9, Association for Computational Linguistics, https://aclanthology.org/2024.naacl-industry.1.pdf
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, Yuxiong He, 2018, DeepCPU: serving RNN-based deep learning models 10x faster, USENIX ATC '18: Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, July 2018, Pages 951–965, https://dl.acm.org/doi/10.5555/3277355.3277446 https://www.microsoft.com/en-us/research/publication/deepcpu-serving-rnn-based-deep-learning-models-10x-faster/ PDF: https://www.usenix.org/system/files/conference/atc18/atc18-zhang-minjia.pdf (Microsoft DeepCPU paper shows details of code optimizations to parallelize matrix multiplications.)
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024 (v2), Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation https://arxiv.org/abs/2404.06910 (Process each RAG chunk in parallel and choose a final output.)
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair, 3 Sep 2024, Global Optimizations & Lightweight Dynamic Logic for Concurrency, https://arxiv.org/abs/2409.02227 (Parallelizing GEMM at a granular level.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Kabir Nagrecha, Oct 2024, Thesis, Orchestration Systems to Support Deep Learning at Scale Doctor of Philosophy, Computer Science, University of California San Diego, https://escholarship.org/content/qt3pp6k1p4/qt3pp6k1p4_noSplash_457f4c7c0435172a3d0a17428455894c.pdf (Pipeline and data parallelism systems.)
- S Durvasula, A Zhao, R Kiguru, Y Guan, Z Chen, Oct 2024, ACE: Efficient GPU Kernel Concurrency for Input-Dependent Irregular Computational Graphs, PACT ’24, October 14–16, 2024, Southern California, CA, USA, https://www.embarclab.com/static/media/ace.1c73b44bc2ad143f7b9f.pdf (Identify parallel kernels at runtime.)
- Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. 2024. Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP '24). Association for Computing Machinery, New York, NY, USA, 178–194. https://doi.org/10.1145/3694715.3695969 https://dl.acm.org/doi/abs/10.1145/3694715.3695969
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
More AI Research
Read more about: