Aussie AI
Parallelization
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Parallelization is the use of parallel computation algorithms in AI engines. This is typically achieved via GPU hardware or CPU SIMD capabilities, but it can also be simulated in software via multi-core and multi-threaded algorithms. Adding parallelization to an AI engine in C++ requires accessing the hardware acceleration capabilities using the intrinsic functions that are specific to the type of hardware being used.
Software Kernel Parallelization
Data doesn't just magically end up in the GPU. There has to be software written to send the data there, and there are a lot of possible optimizations that are used in writing such software. This software is often called the "kernel" of the AI engine. The sub-components of the engine often get called the MatMul kernel, Softmax kernel, normalization kernel, and so on.
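To make the terminology concrete, here is a minimal sketch of a matrix-vector "kernel" in plain C++, before any parallelization is applied (the names and shapes are illustrative, not from any particular engine):

```cpp
#include <cstddef>
#include <vector>

// Naive "MatMul kernel": multiply an NxM row-major weight matrix by an
// M-element input vector, producing an N-element output vector.
// Real engine kernels compute the same arithmetic, but restructured
// for GPU or CPU SIMD parallelism.
void matvec_kernel(const std::vector<float>& weights,  // N*M, row-major
                   const std::vector<float>& input,    // M elements
                   std::vector<float>& output,         // N elements
                   std::size_t n, std::size_t m) {
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < m; ++j) {
            sum += weights[i * m + j] * input[j];
        }
        output[i] = sum;
    }
}
```

The optimizations listed below are, broadly speaking, different ways of restructuring loops like these so that the parallel hardware stays busy.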
Speed optimizations: Software techniques that improve parallelization, primarily to increase throughput and reduce latency, include:
- Vectorization
- Multi-threading
- Kernel fusion
- Kernel fission
- Pipelining
- Overlapping
- Scheduling algorithms
Memory usage optimizations: Software optimizations that improve memory usage, thereby lowering memory access overhead and allowing more parallelism, include:
- Tiling
- Data locality optimizations
- Dataflow optimizations
- Memory management optimizations
- Cache management
- Prefetching
- Offloading
Vectorization
Vectorization is the name given to transforming a software loop from running sequentially over an array of data to performing the same computation in parallel, by sending the data to a GPU or to the CPU's SIMD extensions. Vectorization draws on loop optimization techniques to transform loops into faster, parallelizable versions: examples include loop unrolling, which "unrolls" a loop into its element-wise actions, loop distribution, which splits one loop into several simpler loops, and loop sectioning (strip mining), which breaks the array into segments that are the right size to fit in parallel into your GPU or CPU SIMD extensions. In theory, a good optimizing compiler can do vectorization automatically for simple loops, but in practice you often have to do it yourself.
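As a concrete illustration, here is a minimal sketch of vectorizing a simple element-wise loop using AVX intrinsics (assuming an x86 CPU with AVX support and a compiler flag such as -mavx; the function names are illustrative):

```cpp
#include <immintrin.h>
#include <cstddef>

// Scalar version: processes one float per loop iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Vectorized version: the loop is sectioned into chunks of 8 floats,
// and each chunk is processed by one 256-bit AVX addition.
void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)   // leftover elements that don't fill a full vector
        out[i] = a[i] + b[i];
}
```

Conceptually, the same transformation applies to GPU code, where each "section" maps to a block of GPU threads rather than a SIMD register.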
Research on Vectorization: There are various papers expressly on vectorization, and many papers that do it implicitly or by other names. See also research papers on loop unrolling and loop distribution.
- Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S., Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. http://doi.acm.org/10.1145/2491956.2462176, PDF: https://people.csail.mit.edu/jrk/halide-pldi13.pdf
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Vectorization covered in extensive survey of software optimizations to improve GPU latency and throughput.)
- Shibata, N. and Petrogalli, F., 2020, SLEEF: A portable vectorized library of C standard mathematical functions, IEEE Trans. Parallel Distrib. Syst. 31, 1316–1327. https://dx.doi.org/10.1109/TPDS.2019.2960333
- Anderson A, Muralidharan S, Gregg D, 2017, Efficient multibyte floating point data formats using vectorization. IEEE Trans Comput 66(12):2081–2096. https://ieeexplore.ieee.org/document/7950938 (Vectorization of floating point bit formats)
- Andrew Anderson, David Gregg, 2016, Vectorization of Multibyte Floating Point Data Formats, https://arxiv.org/abs/1601.07789 (Bitwise vectorization algorithms for floating point math.)
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Fast training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- S Jain, S VenkataKeerthy, R Aggarwal, 2022, Reinforcement Learning assisted Loop Distribution for Locality and Vectorization, 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), https://ieeexplore.ieee.org/abstract/document/10026979/, PDF: https://www.researchgate.net/profile/Dibyendu-Das/publication/365475992_Reinforcement_Learning_assisted_Loop_Distribution_for_Locality_and_Vectorization/links/637679e937878b3e87bb988e/Reinforcement-Learning-assisted-Loop-Distribution-for-Locality-and-Vectorization.pdf
- Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
- R Paktinatkeleshteri, 2023, Efficient Auto-Vectorization for Control-flow Dependent Loops through Data Permutation, Masters Thesis, Department of Computing Science, University of Alberta, http://webdocs.cs.ualberta.ca/~amaral/thesis/RouzbehPaktinatkeleshteriMSc.pdf (Automatic vectorization optimizations by compilers.)
- Y He, A Podobas, S Markidis, Aug 2023 Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries, arXiv preprint arXiv:2308.00497, https://arxiv.org/abs/2308.00497
- D. Levine, D. Callahan, and J. Dongarra, 1991. A comparative study of automatic vectorizing compilers, Parallel Computing, vol. 17, no. 10-11, pp. 1223–1244, https://dl.acm.org/doi/10.5555/165357.165391 PDF: https://netlib.org/utk/people/JD/JackDongarra/PAPERS/Comparative-Study-of-Automatic-Vectorizing-Compilers.pdf
- S. Maleki, Y. Gao, M. J. Garzar, T. Wong, D. A. Padua, et al., 2011, An evaluation of vectorizing compilers, 2011 International Conference on Parallel Architectures and Compilation Techniques, IEEE, 2011, pp. 372–382, https://ieeexplore.ieee.org/document/6113845, PDF: http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf
- A. Pohl, B. Cosenza, and B. Juurlink, Control Flow Vectorization for ARM NEON, in Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems, Sankt Goar Germany: ACM, May 2018, pp. 66–75, isbn: 978-1-4503-5780-7. doi: 10.1145/3207719.3207721. https://dl.acm.org/doi/10.1145/3207719.3207721, PDF: https://www.cosenza.eu/papers/PohlSCOPES18.pdf
- N. Sreraman and R. Govindarajan, 2000, A vectorizing compiler for multimedia extensions, International Journal of Parallel Programming, vol. 28, pp. 363–400, 2000. https://link.springer.com/article/10.1023/A:1007559022013a
- H Liu, F Deng, 2023, Convolutional Acceleration Algorithm Combining Loop Optimization and Automatic Scheduling, 2023 International Conference for Advancement in Technology (ICONAT), https://ieeexplore.ieee.org/abstract/document/10080410/
- R Maruthamuthu, D Dhabliya, 2023, Advancements in Compiler Design and Optimization Techniques, E3S Web of Conferences, Volume 399 (2023), https://www.e3s-conferences.org/articles/e3sconf/abs/2023/36/e3sconf_iconnect2023_04047/e3sconf_iconnect2023_04047.html, https://www.e3s-conferences.org/articles/e3sconf/pdf/2023/36/e3sconf_iconnect2023_04047.pdf (Examines automatic vectorization by programming language compilers.)
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Many of the loop optimizations are relevant to vectorization, even back in 1994.)
- David A. Padua, Michael J. Wolfe, 1986, Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12, 1184–1201, https://dl.acm.org/doi/10.1145/7902.7904, PDF: https://dl.acm.org/doi/epdf/10.1145/7902.7904
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- David Spuler, March 2024, Chapter 30. Vectorization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- EventHelix, Jan 1, 2022, Auto-vectorize C and C++ code, https://medium.com/software-design/auto-vectorize-c-and-c-code-34569b2b5f1e
- Mike H.B. Gray, 2024, Implementation of Floating-Point Arithmetic Coding Using x86-64 AVX-256 Assembly Language, https://www.opastpublishers.com/open-access-articles/implementation-of-floatingpoint-arithmetic-coding-using-x8664-avx256-assembly-language.pdf
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
Single Instruction Multiple Data (SIMD)
Research papers on SIMD instructions also include:
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD).
- C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
- David Spuler, March 2024, Chapter 17. AVX Intrinsics, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 Code: https://github.com/tonyzhang617/nomad-dist (Converts 4-bit vector dot products to using SIMD registers as lookup tables on CPUs.)
- Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng, 6 Jun 2024, Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp, https://arxiv.org/abs/2406.10816
- Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, Nam Sung Kim, 2024, Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference, IEEE Computer Architecture Letters, vol. 23, pp. 117-120, Jan.-Jun. 2024, DOI: 10.1109/LCA.2024.3397747, https://www.computer.org/csdl/journal/ca/2024/01/10538369/1XcOWKoKwfe
- Z Zhong, January 22nd, 2024, Enhancing SIMD Assembly Language Development with Visualization Techniques, Masters Thesis, Department of Computer Science and Communications Engineering, Master of Engineering, Waseda University, Japan, https://waseda.repo.nii.ac.jp/record/2001309/files/t5121F099.pdf
- Mike H.B. Gray, 2024, Implementation of Floating-Point Arithmetic Coding Using x86-64 AVX-256 Assembly Language, https://www.opastpublishers.com/open-access-articles/implementation-of-floatingpoint-arithmetic-coding-using-x8664-avx256-assembly-language.pdf
- Z. Zhang, Y. Chen, B. He and Z. Zhang, June 2023, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 6, pp. 1982-1995, June 2023, doi: 10.1109/TPDS.2023.3269530, https://ieeexplore.ieee.org/abstract/document/10107474
- NVIDIA, Sep 2024, SIMD Intrinsics, https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html
- Justin Luitjens, Dec 04, 2013, CUDA Pro Tip: Increase Performance with Vectorized Memory Access, https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
Tensor Parallelism
Research papers on tensor parallelism include:
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, 18 Jun 2024 (v4), FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase, 23 Sep 2024, Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, https://arxiv.org/abs/2409.15241 (Using tensor parallelism to overlap communications and computation.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush, 12 Nov 2024, Towards Low-bit Communication for Tensor Parallel LLM Inference, https://arxiv.org/abs/2411.07942
Multi-Query Parallelism
Research papers on multi-query parallelism include:
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of thought: Large language models can do parallel decoding,” arXiv preprint arXiv:2307.15337, 2023. https://arxiv.org/abs/2307.15337
- S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024. https://arxiv.org/abs/2402.12280
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
Data Parallelism
Research papers on data parallelism include:
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang, 24 Jun 2024, Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars, https://arxiv.org/abs/2406.17812
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Kabir Nagrecha, Oct 2024, Thesis, Orchestration Systems to Support Deep Learning at Scale Doctor of Philosophy, Computer Science, University of California San Diego, https://escholarship.org/content/qt3pp6k1p4/qt3pp6k1p4_noSplash_457f4c7c0435172a3d0a17428455894c.pdf (Pipeline and data parallelism systems.)
Sequence Parallelism
Research papers on sequence parallelism include:
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
Pipelining
Pipelining is the software technique of making sure the GPU has a full queue of work. If the GPU has a full pipeline, it is maxed out, and its throughput is at the highest it can achieve. That's the profound pinnacle of optimal optimization.
Bubbles are bad: Sometimes the GPU is so fast that it can outpace the CPU's ability to send it data (often due to memory access and data transfer overhead in the CPU), and then the GPU has nothing to do for a time. This is called a "bubble" in the pipeline, and even though we're talking about only milliseconds, it represents a squandered opportunity to go faster.
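As a rough illustration of the idea on the CPU side, here is a minimal two-stage software pipeline sketch in standard C++: one thread prepares the next batch of work while another thread computes on the previous batch, so the compute stage (standing in for the GPU) is rarely left idle. The batch type and stage functions are placeholders, not from any real engine:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Batch = std::vector<float>;   // placeholder for one unit of prepared work

std::queue<Batch> g_queue;          // batches ready for the compute stage
std::mutex g_mutex;
std::condition_variable g_cv;
bool g_done = false;

Batch prepare_batch(int i) { return Batch(1024, float(i)); }  // stage 1 (stub)
void compute_batch(const Batch&) { /* e.g., launch GPU kernels here */ }

void producer(int num_batches) {
    for (int i = 0; i < num_batches; ++i) {
        Batch b = prepare_batch(i);               // overlaps with compute stage
        std::lock_guard<std::mutex> lock(g_mutex);
        g_queue.push(std::move(b));
        g_cv.notify_one();
    }
    std::lock_guard<std::mutex> lock(g_mutex);
    g_done = true;
    g_cv.notify_one();
}

void consumer() {
    while (true) {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_cv.wait(lock, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty() && g_done) break;     // pipeline fully drained
        Batch b = std::move(g_queue.front());
        g_queue.pop();
        lock.unlock();
        compute_batch(b);                         // keep the compute stage busy
    }
}

int main() {
    std::thread t1(producer, 100), t2(consumer);
    t1.join();
    t2.join();
}
```

If prepare_batch is slower than compute_batch, the queue runs dry and the consumer stalls waiting on the condition variable; that stall is exactly the "bubble" described above.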
Research on Pipelining: There are various papers expressly on pipelining, such as:
- Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism, http://arxiv.org/abs/1811.06965
- Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. Polymage: Automatic optimization for image processing pipelines. ACM SIGARCH Comput. Arch. News 43, 1 (2015), 429–443. https://doi.org/10.1145/2694344.2694364, PDF: https://www.csa.iisc.ac.in/TR/2015/3/polymage-tr.pdf
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4 (2016), 1–11. https://doi.org/10.1145/2897824.2925952, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952
- Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491, https://arxiv.org/abs/1812.01776
- R Ma, J Wang, Q Qi, X Yang, H Sun, Z Zhuang, J Liao, September 2023, PipeLLM: Pipeline LLM Inference on Heterogeneous Devices with Sequence Slicing, ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference, Pages 1126–1128, https://doi.org/10.1145/3603269.3610856, https://dl.acm.org/doi/abs/10.1145/3603269.3610856
- A Agrawal, A Panwar, J Mohan, N Kwatra, BS Gulavani, 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, arXiv preprint, https://arxiv.org/abs/2308.16369
- Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi, Aug 2023, IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency, https://arxiv.org/abs/2308.12871
- GI Yu, JS Jeong, GW Kim, S Kim, BG Chun, 2022, Orca: A distributed serving system for Transformer-Based generative models, 16th USENIX Symposium, https://www.usenix.org/conference/osdi22/presentation/yu, PDF: https://www.usenix.org/system/files/osdi22-yu.pdf (Improved parallelization/pipelining with latency reduction from iteration-level scheduling across multiple requests.)
- Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. A^3: Accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 328–341. IEEE, 2020. https://arxiv.org/abs/2002.10941
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Junki Park, Hyunsung Yoon, Daehyun Ahn, Jungwook Choi, and Jae-Joon Kim. 2020. OPTIMUS: OPTImized Matrix Multiplication Structure for Transformer neural network accelerator. Proceedings of Machine Learning and Systems 2 (2020), 363-378. https://proceedings.mlsys.org/paper_files/paper/2020/file/91ba7292e5388b90b58d0b839a7f19ec-Paper.pdf
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054, Code (part): https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- J Zhang, G Niu, Q Dai, H Li, Z Wu, F Dong, Z Wu, 2023, PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters, Neurocomputing, https://www.sciencedirect.com/science/article/pii/S0925231223007841
- C Wang, D Sun, Y Bai, 2023, PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs, https://arxiv.org/pdf/2301.00391
- G Huang, Y Bai, L Liu, Y Wang, B Yu, 2023, ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs, Proceedings of Machine Learning and Systems (MLSys 2023), https://arxiv.org/abs/2210.16691, PDF: https://proceedings.mlsys.org/paper_files/paper/2023/file/12a304a31e42dfefa21c82431e849124-Paper-mlsys2023.pdf
- S Raskar, JM Monsalve Diaz, T Applencourt, 2023, Implementation of Dataflow Software Pipelining for Codelet Model, https://research.spec.org/icpe_proceedings/2023/proceedings/p161.pdf
- Yuke Wang, Boyuan Feng, Zheng Wang, Tong Geng, Kevin Barker, Ang Li, Yufei Ding, June 2023, MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms, https://arxiv.org/abs/2209.06800, PDF: https://www.usenix.org/system/files/osdi23-wang-yuke.pdf
- KKW Ng, HM Demoulin, V Liu, October 2023, Paella: Low-latency Model Serving with Software-defined GPU Scheduling, SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles, Pages 595–610, https://dl.acm.org/doi/abs/10.1145/3600006.3613163
- D. F. Bacon, S. L. Graham, and O. J. Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420. https://dl.acm.org/doi/10.1145/197405.197406, PDF: https://people.eecs.berkeley.edu/~fateman/264/papers/bacon.pdf (Examines pipelining along with extensive coverage of numerous compiler auto-optimizations of program code.)
- Eric LaForest, March 19, 2010, Survey of Loop Transformation Techniques, ECE 1754, http://fpgacpu.ca/writings/SurveyLoopTransformations.pdf
- Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations.)
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, Yi Liu, Zhongzhi Luan, Depei Qian, 24 Feb 2024, Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding, https://arxiv.org/abs/2402.15678 (Speculative decoding with a focus on improving poor acceptance rates using majority-voting consensus decoding with multiple small draft models, achieving 80-90% acceptance with OPT-13B, and around 50% acceptance for Llama-70B, and also pipelining of inference computations for speedup.)
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- J Dong, W Li, Z Huang, L Xu, 2023, Method to Deploy Lightweight Models with a Novel Pipeline for Efficient Inference, https://ieeexplore.ieee.org/abstract/document/10221956/
- H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, https://dl.acm.org/doi/abs/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
- Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, Jianxin Liao, June 16-21, 2024, HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 1–9, Association for Computational Linguistics, https://aclanthology.org/2024.naacl-industry.1.pdf
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Amr Elmeleegy, Shivam Raj, Brian Slechta and Vishal Mehta, Jun 12, 2024, Demystifying AI Inference Deployments for Trillion Parameter Large Language Models, NVIDIA Technical Blog, https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/
- Aaron Archer, Matthew Fahrbach, Kuikui Liu, Prakash Prabhu, July 2024, Practical Performance Guarantees for Pipelined DNN Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:1655-1671, 2024, https://proceedings.mlr.press/v235/archer24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/archer24a/archer24a.pdf
- Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari, 16 Jul 2024, PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation, https://arxiv.org/abs/2407.11798 (Optimizes speculative decoding further using pipelining.)
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Xueyuan Han, Zinuo Cai, Yichu Zhang, Chongxin Fan, Junhan Liu, Ruhui Ma, Rajkumar Buyya, 9 Sep 2024 (v2), Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices, https://arxiv.org/abs/2409.04249 (Pipelining of model layer-wise loading and inference for memory-efficient inference.)
- Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou, 31 Aug 2024 (v2), Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, DeepSeek AI, https://www.arxiv.org/abs/2408.14158
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Zhuohan Li, Oct 2024, Empowering Large Language Models with Efficient and Automated Systems, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt2kp379f3/qt2kp379f3.pdf (Examines pipeline parallelism and paged attention.)
- Kabir Nagrecha, Oct 2024, Thesis, Orchestration Systems to Support Deep Learning at Scale Doctor of Philosophy, Computer Science, University of California San Diego, https://escholarship.org/content/qt3pp6k1p4/qt3pp6k1p4_noSplash_457f4c7c0435172a3d0a17428455894c.pdf (Pipeline and data parallelism systems.)
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
Overlapping
Overlapping is a type of parallelization that is closely related to pipelining. The idea is to run two types of actions or computations in parallel. There are multiple types of overlapping:
- Overlap communication and computation
- Overlap memory accesses and computation
- Overlap recomputation/rematerialization and computation
- Overlap CPU and GPU computations
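The first kind, overlapping communication and computation, is the most common on GPUs. Below is a minimal CUDA C++ sketch (assuming an NVIDIA GPU and the CUDA runtime; the kernel and sizes are purely illustrative) that splits a buffer across two streams so that copying one half to the device overlaps with computing on the other half:

```cpp
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;            // total elements
    const int half = n / 2;

    float* h_data = nullptr;          // pinned host memory enables async copies
    cudaMallocHost((void**)&h_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each half gets its own stream: while one half is being computed,
    // the other half's copy is in flight, so the copy engine and the
    // compute units are busy at the same time.
    for (int c = 0; c < 2; ++c) {
        int offset = c * half;
        cudaMemcpyAsync(d_data + offset, h_data + offset, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale_kernel<<<(half + 255) / 256, 256, 0, s[c]>>>(d_data + offset, half, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data + offset, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

The same structure generalizes to more chunks and more streams; the papers below cover far more sophisticated versions of this overlap in training and inference systems.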
Here are some research papers on overlapping optimizations:
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et. al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, June 2024, FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 402–416, 2022. https://arxiv.org/abs/2105.05720
- S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang, et al., “Overlap communication with dependent computation via decomposition in large deep learning models,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 93–106, 2023. https://dl.acm.org/doi/10.1145/3567955.3567959 PDF: https://dl.acm.org/doi/pdf/10.1145/3567955.3567959
- F. Li, S. Zhao, Y. Qing, X. Chen, X. Guan, S. Wang, G. Zhang, and H. Cui, “Fold3d: Rethinking and parallelizing computational and communicational tasks in the training of large dnn models,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1432–1449, 2023. https://ieeexplore.ieee.org/document/10050126 https://i.cs.hku.hk/~heming/papers/tpds23-fold3d.pdf
- G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, Zero++: Extremely efficient collective communication for giant model training, arXiv preprint arXiv:2306.10209, 2023. https://arxiv.org/abs/2306.10209
- Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. 2024. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), Vol. 3. Association for Computing Machinery, New York, NY, USA, 178–191. https://doi.org/10.1145/3620666.3651379 https://dl.acm.org/doi/abs/10.1145/3620666.3651379
- Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen, 27 Jun 2024 (v2), Optimizing Large Model Training through Overlapped Activation Recomputation, https://arxiv.org/abs/2406.08756
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 30 Apr 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, https://arxiv.org/abs/2404.19429
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/339caf45a6fa281cae8adc6465343464-Abstract-Conference.html PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf
- Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. 2024. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), Vol. 2. Association for Computing Machinery, New York, NY, USA, 1146–1164. https://doi.org/10.1145/3620665.3640410 https://dl.acm.org/doi/abs/10.1145/3620665.3640410 PDF: https://dl.acm.org/doi/pdf/10.1145/3620665.3640410
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase, 23 Sep 2024, Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, https://arxiv.org/abs/2409.15241 (Using tensor parallelism to overlap communications and computation.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
- Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 14 Nov 2024, Pie: Pooling CPU Memory for LLM Inference, https://arxiv.org/abs/2411.09317
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
- Vincent Abbott, Gioele Zardini, 4 Dec 2024, FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness, https://arxiv.org/abs/2412.03317
Overlapping Communication and Computation
Here are some research papers on overlapping communication (sending data) with computation, usually on the GPU:
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, June 2024, FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 402–416, 2022. https://arxiv.org/abs/2105.05720
- S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang, et al., “Overlap communication with dependent computation via decomposition in large deep learning models,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 93–106, 2023. https://dl.acm.org/doi/10.1145/3567955.3567959 PDF: https://dl.acm.org/doi/pdf/10.1145/3567955.3567959
- F. Li, S. Zhao, Y. Qing, X. Chen, X. Guan, S. Wang, G. Zhang, and H. Cui, “Fold3d: Rethinking and parallelizing computational and communicational tasks in the training of large dnn models,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1432–1449, 2023. https://ieeexplore.ieee.org/document/10050126 https://i.cs.hku.hk/~heming/papers/tpds23-fold3d.pdf
- G. Wang, H. Qin, S. A. Jacobs, C. Holmes, S. Rajbhandari, O. Ruwase, F. Yan, L. Yang, and Y. He, 2023, Zero++: Extremely efficient collective communication for giant model training, arXiv preprint arXiv:2306.10209, 2023. https://arxiv.org/abs/2306.10209
- Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. 2024. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24), Vol. 3. Association for Computing Machinery, New York, NY, USA, 178–191. https://doi.org/10.1145/3620666.3651379 https://dl.acm.org/doi/abs/10.1145/3620666.3651379
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 30 Apr 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, https://arxiv.org/abs/2404.19429
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang, 2024, Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/339caf45a6fa281cae8adc6465343464-Abstract-Conference.html PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/339caf45a6fa281cae8adc6465343464-Paper-Conference.pdf
- Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. 2024. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), Vol. 2. Association for Computing Machinery, New York, NY, USA, 1146–1164. https://doi.org/10.1145/3620665.3640410 https://dl.acm.org/doi/abs/10.1145/3620665.3640410 PDF: https://dl.acm.org/doi/pdf/10.1145/3620665.3640410
- Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase, 23 Sep 2024, Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping, https://arxiv.org/abs/2409.15241 (Using tensor parallelism to overlap communications and computation.)
- Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)
Skeleton-of-Thought (Query Parallelism)
Skeleton-of-thought is a type of query parallelism for multi-step inference. The idea is that, rather than generating a very long answer sequentially, the execution uses three steps:
1. Generate a brief outline of points or paragraphs (quickly), and
2. In parallel, flesh out each of these points in the outline, and
3. Combine them into the full answer (quickly).
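A minimal sketch of this pattern in C++ follows, assuming a hypothetical llm_complete() function (a stub here, standing in for a real request to the model-serving layer) and using std::async to expand the outline points in parallel:

```cpp
#include <future>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stub standing in for a real request to the LLM serving layer.
std::string llm_complete(const std::string& prompt) {
    return "[LLM output for: " + prompt + "]";
}

std::string skeleton_of_thought(const std::string& question) {
    // Step 1: one short sequential request to get the outline ("skeleton").
    std::string outline = llm_complete(
        "List a few brief bullet points that answer: " + question);

    // Parse the outline, assuming one point per line (parsing is simplified).
    std::vector<std::string> points;
    std::istringstream in(outline);
    for (std::string line; std::getline(in, line); )
        if (!line.empty()) points.push_back(line);

    // Step 2: expand every point in parallel, each as a separate LLM request.
    std::vector<std::future<std::string>> futures;
    for (const auto& p : points)
        futures.push_back(std::async(std::launch::async, llm_complete,
            "Expand this point into a paragraph: " + p));

    // Step 3: combine the expanded points into the full answer.
    std::string answer;
    for (auto& f : futures)
        answer += f.get() + "\n\n";
    return answer;
}

int main() {
    std::cout << skeleton_of_thought("Why is the sky blue?") << "\n";
}
```

The total latency is roughly the outline request plus the slowest single expansion, rather than the sum of all the expansions, at the cost of issuing more requests in parallel.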
Research papers on skeleton-of-thought:
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- Xuefei Ning, Zinan Lin, November 17, 2023, Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/ Code: https://github.com/imagination-research/sot/
- S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024. https://arxiv.org/abs/2402.12280
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Steven Kolawole, Keshav Santhanam, Virginia Smith, Pratiksha Thaker, Nov 2024, Extracting Parallelism from Large Language Model Queries, https://openreview.net/pdf?id=CZHt9kLS5S
Offloading
The term "offloading" in Computer Science theory generally refers to a low-power device offloading computation to a separate more powerful device. In the AI context, this would seem to refer to a CPU offloading processing to a GPU. However, the term "offloading" is not really used with this meaning in AI research. Other terms are used such as "hardware acceleration", "parallelization", and vectorization.
In much AI research, the term "offloading" actually refers to the opposite situation: moving data from the GPU back to the CPU (i.e., from a high-power device down to a low-power device). This type of offloading is done to save memory space, not to speed up parallel processing (although indirectly it can do that, too). The offloading occurs in the reverse direction and is often combined with recomputation or rematerialization. The overall goal is to reduce processing time by optimizing memory usage: the data (e.g., a tensor) is offloaded out of the GPU to save GPU RAM space, and then later re-sent to the GPU (or recomputed) if it is needed again. When this memory optimization is used during training, it is often called "checkpointing" or "gradient checkpointing". See details on recomputation and checkpointing for memory optimization.
The term "offloading" is used with its theoretical meaning in the research area of "edge computing" because it transfers processing from a low-resource device to a high-powered server. Edge computing is the theory of devices on the "edge," which can mean IoT network devices (e.g. modems or security cameras) or personal mobile phones. In this context, the term "offloading" really means "uploading" because it refers to sending the request from the edge device up into the cloud. The AI inference processing is done in the cloud, on a much more powerful server with dozens of mega-GPUs bought with unicorn venture capital funding, and then the results are sent back down to the device or phone.
More research on offloading:
- Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Uses a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
- Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 25 Jan 2024, ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models, https://arxiv.org/abs/2401.14351 Code: https://github.com/ServerlessLLM/ServerlessLLM
- Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu, Dec 2023, Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, https://arxiv.org/abs/2311.03687 (Benchmarks model speed for training, fine-tuning and inference with various optimizations such as ZeRO, quantization, offloading/recomputation, and Flash Attention.)
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 4 Jun 2024, SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices, https://arxiv.org/abs/2406.02532 (Speculative decoding with draft trees on low-resource consumer hardware with offloading.)
- Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari, 17 Jun 2024, Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference, https://arxiv.org/abs/2406.11674
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Ying He, Jingcheng Fang, F. Richard Yu, Victor C. Leung, 2024, Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach, PrePrints pp. 1-12, DOI: 10.1109/TMC.2024.3415661, https://www.computer.org/csdl/journal/tm/5555/01/10591707/1YraFlDdKYo
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Xunyi Zhao, Lionel Eyraud-Dubois, Théotime Le Hellard, Julia Gusak, Olivier Beaumont, 24 July, 2024, OFFMATE: full fine-tuning of LLMs on a single GPU by re-materialization and offloading, https://hal.science/hal-04660745/document
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- R. Narmeen, P. Mach, Z. Becvar and I. Ahmad, 16 August 2024, Joint Exit Selection and Offloading Decision for Applications Based on Deep Neural Networks, IEEE Internet of Things Journal, doi: 10.1109/JIOT.2024.3444898, https://doi.org/10.1109/JIOT.2024.3444898 https://ieeexplore.ieee.org/abstract/document/10638073
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- X. Yuan, H. Li, K. Ota and M. Dong, 2024, Generative Inference of Large Language Models in Edge Computing: An Energy Efficient Approach, 2024 International Wireless Communications and Mobile Computing (IWCMC), Ayia Napa, Cyprus, 2024, pp. 244-249, doi: 10.1109/IWCMC61514.2024.10592339, https://ieeexplore.ieee.org/document/10592339
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang, 8 Sep 2024, InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference, https://arxiv.org/abs/2409.04992
- Othmane Friha, Mohamed Amine Ferrag, Burak Kantarci, Burak Cakmak, Arda Ozgun, Nassira Ghoualmi-Zine, 2024, LLM-based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10669603
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Offloading the memory-bound processing of KV caches in attention kernels during decoding to bandwidth-focused GPUs, while reserving compute-bound computations like FFNs and prefill for powerful GPUs.)
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- J. Niu, W. Zhang, C. J. Xue and N. Guan, 2024, "RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices," 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Sokcho, Korea, Republic of, 2024, pp. 21-30, doi: 10.1109/RTCSA62462.2024.00013. https://ieeexplore.ieee.org/abstract/document/10695719
- Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen, 23 Oct 2024 (v2), Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching, https://arxiv.org/abs/2410.14740
- Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon, 23 Oct 2024, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference, https://arxiv.org/abs/2410.17954
- Jiaming Qiu, Ruiqi Wang, Brooks Hu, Roch Guerin, Chenyang Lu, 24 Oct 2024, Optimizing Edge Offloading Decisions for Object Detection, https://arxiv.org/abs/2410.18919
- Xiaoniu Song, Zihang Zhong, Rong Chen, 29 Oct 2024, ProMoE: Fast MoE-based LLM Serving using Proactive Caching, https://arxiv.org/abs/2410.22134
- Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2 Nov 2024, NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference, https://arxiv.org/abs/2411.01142
- Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 18 Nov 2024, MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs, https://arxiv.org/abs/2411.11217
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
Kernel Tiling
Tiling is an optimization that breaks a computation up into "tiles" or "blocks". It mainly refers to matrix multiplication kernels, in which case the tiles or blocks are smaller square or rectangular submatrices of the bigger matrix, sized so that they fit into caches, registers, or GPU shared memory. Various tiled MatMul variants can be used to speed up GEMM kernels. This technique is closely related to loop tiling.
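Here is a minimal sketch of a tiled (blocked) matrix multiplication in plain C++, illustrating the basic idea; the tile size is an illustrative choice, and a production GEMM kernel would also add vectorization, multi-threading, and hardware-specific tuning.

    // Minimal sketch of a tiled (blocked) MatMul in plain C++ (row-major storage).
    // The tile size is illustrative; real kernels tune it to the cache hierarchy.
    #include <vector>
    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t TILE = 32;  // tile edge chosen so sub-blocks stay cache-resident

    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, std::size_t n) {
        std::fill(C.begin(), C.end(), 0.0f);
        for (std::size_t i0 = 0; i0 < n; i0 += TILE)
            for (std::size_t k0 = 0; k0 < n; k0 += TILE)
                for (std::size_t j0 = 0; j0 < n; j0 += TILE)
                    // Multiply one TILE x TILE sub-block; the inner loops touch only
                    // small submatrices of A, B, and C, improving cache locality.
                    for (std::size_t i = i0; i < std::min(i0 + TILE, n); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + TILE, n); ++k) {
                            float a = A[i * n + k];
                            for (std::size_t j = j0; j < std::min(j0 + TILE, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }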
Research on Tiling: Papers on tiling computation optimizations include those below (see also research on loop tiling):
- P Tillet, HT Kung, D Cox, 2019, Triton: an intermediate language and compiler for tiled neural network computations, Proceedings of the 3rd ACM SIGPLAN, http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (jul 2016), 11 pages. https://doi.org/10.1145/2897824.2925952, https://research.google/pubs/pub45525/, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952
- H Li, J Choi, Y Kwon, JH Ahn, Oct 2023, A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models, IEEE Computer Architecture Letters, https://ieeexplore.ieee.org/abstract/document/10285300/
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf
- C Zhang, P Li, G Sun, Y Guan, B Xiao, 2015, Optimizing FPGA-based accelerator design for deep convolutional neural networks, FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 2015, Pages 161–170, https://dl.acm.org/doi/abs/10.1145/2684746.2689060, PDF: https://iceory.github.io/2018/04/25/fpga-based-cnn/FPGA-BASED-CNN.pdf
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
- Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop performance, stochastic algorithms, approximate arithmetic, quantization, pruning and low-rank.)
- Eunji Kwon; Jongho Yoon; Seokhyeong Kang, Dec 2023, Mobile Transformer Accelerator Exploiting Various Line Sparsity and Tile-Based Dynamic Quantization, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10375766
- Robert A. van de Geijn, Enrique S. Quintana-Ortí, 2007, The Science of Programming Matrix Computations, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf Code: http://www.cs.utexas.edu/users/flame/
- X Xie, H Peng, A Hasan, S Huang, J Zhao, 2023, Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks https://arxiv.org/abs/2308.11825 (Kernel for sparse matrix multiplication with block-level tiling as example.)
- Kazushige Goto, Robert A. van de Geijn, 2008, Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, Volume 34, Issue 3, Article No.: 12, pp 1–25, https://dl.acm.org/doi/10.1145/1356052.1356053 (The GotoBLAS algorithm for matrix multiplication.)
- Yufan Xu, Saurabh Raje, Atanas Rountev, Gerald Sabin, Aravind Sukumaran-Rajam, and P Sadayappan. Training of deep learning pipelines on memory-constrained gpus via segmented fused-tiled execution. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pages 104–116, 2022. https://arxiv.org/pdf/2310.12109.pdf
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2; uses tiling and memory-aware kernels to optimize attention.)
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet, 16 Jun 2024, Optimized Speculative Sampling for GPU Hardware Accelerators, https://arxiv.org/abs/2406.11016 (Speculative decoding accelerated with multiple GPUs using approaches such as tiling, and uses a fused sigmoid replacing Softmax.)
- Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari, 18 Jun 2024, Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices, SAMOS2024 conference, https://arxiv.org/abs/2406.12478 Code: https://github.com/eml-eda/depthwise-separable-fusion
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo, 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533, https://www.computer.org/csdl/journal/tc/5555/01/10436533/1UwVolp0wta
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- David Spuler, March 2024, Loop Tiling or Blocking, in Generative AI in C++, https://www.aussieai.com/book/ch15-loop-tiling-blocking
- David Spuler, March 2024, Tiled Matrix-Vector Multiplication, in Generative AI in C++, https://www.aussieai.com/book/ch34-tiled-matrix-vector-multiplication
- Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers https://arxiv.org/abs/2410.17957
- Z. Zhang, D. Yang, X. Zhou and D. Cheng, "MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators," in 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis SC, Atlanta, GA, United States, 2024, pp. 528-542, doi: 10.1109/SC41406.2024.00040. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a528/21HUVuG3S8M
- Inas Bachiri, September 2024, A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration, DOI:10.13140/RG.2.2.10612.16009, Thesis for: Master in Computer Science, https://www.researchgate.net/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration https://www.researchgate.net/profile/Inas-Bachiri/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration/links/66ed912c6b101f6fa4f3d6ce/A-Literature-Review-on-Combining-Neural-Architecture-Search-and-Compiler-Optimizations-for-Neural-Network-Acceleration.pdf
- Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu, 20 Nov 2024, MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices, https://arxiv.org/abs/2411.17720
General Research on Parallelization
General papers on parallelism:
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Daniel Kusswurm, 2022, Modern Parallel Programming with C++ and Assembly Language: X86 SIMD Development Using AVX, AVX2, and AVX-512, 1st Edition, Apress, https://www.amazon.com/Modern-Parallel-Programming-Assembly-Language/dp/1484279174/, Code: https://github.com/Apress/modern-parallel-programming-cpp-assembly
- Wikipedia, 2023 (accessed), AoS and SoA, https://en.wikipedia.org/wiki/AoS_and_SoA (Structure of arrays vs array of structures.)
- William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko, 2023, High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023, Pages 119–134, https://dl.acm.org/doi/abs/10.1145/3572848.3577475, PDF: https://dl.acm.org/doi/pdf/10.1145/3572848.3577475
- Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, and David Brooks. Exploiting parallelism opportunities with deep learning frameworks. ACM Trans. Archit. Code Optim., 18(1), 2021. https://arxiv.org/abs/1908.04705 (Analysis of parallelization in terms of overhead of scheduling and intra-operator parallelization such as multi-threaded MatMul operations.)
- Stanford University, 2018, General Matrix Multiply (GeMM), https://spatial-lang.org/gemm
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture parallelized the layers of the Transformer.)
- Liangzhen Lai, Naveen Suda, Vikas Chandra, 2018, CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs, arXiv preprint arXiv:1801.06601, https://arxiv.org/abs/1801.06601 PDF: https://arxiv.org/pdf/1801.06601
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Minsik Cho, Mohammad Rastegari, Devang Naik, 8 May 2024, KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation, https://arxiv.org/abs/2405.05329 (Parallelization of KV cache generation in prefill phase.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica, 22 Apr 2024, Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity, https://arxiv.org/abs/2404.14527 Code: https://github.com/tyler-griggs/melange-release
- M Davies, I McDougall, S Anandaraj, D Machchhar, April 2024, A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 20–36, https://doi.org/10.1145/3620665.3640367 https://dl.acm.org/doi/abs/10.1145/3620665.3640367 (Benchmarking analysis of GPU execution extending MLPerf.)
- L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., Dec 2023, Efficiently programming large language models using SGLang, arXiv preprint arXiv:2312.07104, 2023, https://arxiv.org/abs/2312.07104 (Uses a radix attention method, a trie or prefix tree, for KV caching.)
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024. https://arxiv.org/abs/2305.10601
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
- Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He, 16 Mar 2024, DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing, https://arxiv.org/abs/2403.10913 (Attention optimizations in Vision Transformer with pruning of feature maps, and extensive parallelization with consideration of the hardware layer.)
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD.)
- Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen, May 2023, Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs, https://arxiv.org/abs/2305.01024 (Focuses on error tolerance of failures within matrix multiplication algorithms.)
- Zao Zhang, 23 May 2024, Design Efficient Deep Neural Networks with System Optimization, Ph.D. Thesis, School of Electrical and Information Engineering, Faculty of Engineering, The University of Sydney, Australia, PDF: https://ses.library.usyd.edu.au/bitstream/handle/2123/32642/zhang_z_thesis.pdf?sequence=1&isAllowed=y https://ses.library.usyd.edu.au/handle/2123/32642 https://hdl.handle.net/2123/32642
- Radha Gulhane, 2024, Accelerated and Memory-Efficient Distributed Deep Learning: Leveraging Quantization, Parallelism Techniques, and Mix-Match Runtime Communication, Masters Thesis, Computer Science and Engineering, The Ohio State University, https://etd.ohiolink.edu/acprod/odb_etd/ws/send_file/send?accession=osu1713381834648517&disposition=inline
- Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo, 15 Mar 2024, ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference, https://arxiv.org/abs/2404.07947 (Scheduling and pipelining of inference calculations.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuanfu Zhang, Zibin Zheng, 5 Jan 2024, Training and Serving System of Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2401.02643
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, May 2022, Reducing Activation Recomputation in Large Transformer Models, https://arxiv.org/abs/2205.05198
- Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM, Volume 60, Issue 6, June 2017, pp 84–90, https://doi.org/10.1145/3065386 https://dl.acm.org/doi/10.1145/3065386 PDF: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf Code: http://code.google.com/p/cuda-convnet/ (The early paper that introduced a grouped convolution architecture for multi-GPUs, later the basis of AlexNet, which was a famous image recognition CNN.)
- David Spuler, March 2024, Part III: Parallel C++ Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of thought: Large language models can do parallel decoding,” arXiv preprint arXiv:2307.15337, 2023. https://arxiv.org/abs/2307.15337
- S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024. https://arxiv.org/abs/2402.12280
- M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024. https://arxiv.org/abs/2401.06761
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., "Graph of thoughts: Solving elaborate problems with large language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17682–17690. https://arxiv.org/abs/2308.09687
- Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang, 24 Jun 2024, Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars, https://arxiv.org/abs/2406.17812
- Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, Jianxin Liao, June 16-21, 2024, HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 1–9, Association for Computational Linguistics, https://aclanthology.org/2024.naacl-industry.1.pdf
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Rohan Baskar Prabhakar, Hengrui Zhang, David Wentlzaff, 14 Aug 2024, Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference, https://arxiv.org/abs/2408.07802 (Modified Transformer architecture with parallelized sub-layers of attention and FFN.)
- Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, Yuxiong He, 2018, DeepCPU: serving RNN-based deep learning models 10x faster, USENIX ATC '18: Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, July 2018, Pages 951–965, https://dl.acm.org/doi/10.5555/3277355.3277446 https://www.microsoft.com/en-us/research/publication/deepcpu-serving-rnn-based-deep-learning-models-10x-faster/ PDF: https://www.usenix.org/system/files/conference/atc18/atc18-zhang-minjia.pdf (Microsoft DeepCPU paper shows details of code optimizations to parallelize matrix multiplications.)
- Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 22 Aug 2024, NanoFlow: Towards Optimal Large Language Model Serving Throughput, https://arxiv.org/abs/2408.12757
- Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024 (v2), Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation https://arxiv.org/abs/2404.06910 (Process each RAG chunk in parallel and choose a final output.)
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair, 3 Sep 2024, Global Optimizations & Lightweight Dynamic Logic for Concurrency, https://arxiv.org/abs/2409.02227 (Parallelizing GEMM at a granular level.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse, 25 Sep 2024, Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations, https://arxiv.org/abs/2409.17264
- Kabir Nagrecha, Oct 2024, Thesis, Orchestration Systems to Support Deep Learning at Scale Doctor of Philosophy, Computer Science, University of California San Diego, https://escholarship.org/content/qt3pp6k1p4/qt3pp6k1p4_noSplash_457f4c7c0435172a3d0a17428455894c.pdf (Pipeline and data parallelism systems.)
- S Durvasula, A Zhao, R Kiguru, Y Guan, Z Chen, Oct 2024, ACE: Efficient GPU Kernel Concurrency for Input-Dependent Irregular Computational Graphs, PACT ’24, October 14–16, 2024, Southern California, CA, USA, https://www.embarclab.com/static/media/ace.1c73b44bc2ad143f7b9f.pdf (Identify parallel kernels at runtime.)
- Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. 2024. Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP '24). Association for Computing Machinery, New York, NY, USA, 178–194. https://doi.org/10.1145/3694715.3695969 https://dl.acm.org/doi/abs/10.1145/3694715.3695969
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
More AI Research
Read more about: