Aussie AI
Dataflow Optimizations of LLMs
-
Last Updated 7 November, 2024
-
by David Spuler, Ph.D.
Dataflow optimizations are a broad category of techniques that speed up LLM inference in Transformer architectures. The idea is to better manage the movement and handling of the large volumes of data in both weights and activations, and thereby gain efficiency.
The sources of improvement may include:
- Computation reuse (avoiding redundant computations)
- Memory access reduction (avoiding the cost of accessing memory)
- A combination of both (a simple sketch of these two ideas follows this list).
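As a minimal illustration, here is a C++ sketch of both ideas; the function names, the RMS-style scaling pass, and the loop structure are illustrative assumptions, not taken from any particular LLM kernel. The naive version recomputes an invariant value and re-reads the weight vector for every output element, while the optimized version computes it once (computation reuse) and makes a single pass over each array (memory access reduction).

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Illustrative only: a tiny "scale by inverse RMS" pass over an activation vector.

// Naive version: recomputes the same scale factor for every element,
// re-reading the whole weight vector each time.
void scale_naive(const std::vector<float>& w, std::vector<float>& act) {
    for (std::size_t i = 0; i < act.size(); ++i) {
        float sum = 0.0f;                        // recomputed on every iteration
        for (float x : w) sum += x * x;          // re-reads all of w every time
        act[i] *= 1.0f / std::sqrt(sum + 1e-6f);
    }
}

// Dataflow-optimized version: the scale factor is computed once (computation reuse),
// and each vector is traversed exactly once (memory access reduction).
void scale_reuse(const std::vector<float>& w, std::vector<float>& act) {
    float sum = 0.0f;
    for (float x : w) sum += x * x;              // single pass over the weights
    const float inv = 1.0f / std::sqrt(sum + 1e-6f);  // cached and reused below
    for (float& a : act) a *= inv;               // single pass over the activations
}
```

The same principle scales up to real inference kernels, where the reused quantity might be a cached key/value tensor or a precomputed normalization factor rather than a single scalar.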
Types of Dataflow Optimizations
Some of the possible types of dataflow optimizations include:
- Computation reuse
- Conditional computation
- Pipelining
- Data marshalling improvements
- Data locality (e.g., tiling)
- Kernel fusion (tiling and fusion are illustrated in the sketch after this list)
- Caching
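To make two of these concrete, the following C++ sketch assumes a row-major matrix-vector product followed by a ReLU activation; the sizes, tile size, and function names are illustrative assumptions rather than any specific library's kernels. The baseline stores the full output and then re-reads it in a second pass to apply the activation, while the second version processes the columns in cache-sized tiles so each tile of the input vector is reused across all rows (data locality), and applies the ReLU inside the same loop (kernel fusion), avoiding the extra pass.

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Illustrative only: y = relu(M * x), with M stored row-major (rows x cols)
// and y sized to hold rows elements.

// Baseline: compute the full matrix-vector product, then a separate pass
// re-reads all of y just to apply the activation function.
void matvec_then_relu(const std::vector<float>& M, const std::vector<float>& x,
                      std::vector<float>& y, std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            acc += M[r * cols + c] * x[c];
        y[r] = acc;
    }
    for (std::size_t r = 0; r < rows; ++r)       // separate "kernel": extra pass over y
        y[r] = std::max(0.0f, y[r]);
}

// Tiled and fused: columns are processed in tiles so each tile of x stays in
// cache while it is reused across every row (data locality), and the ReLU is
// applied inside the loop on the final tile (kernel fusion), so y needs no
// separate activation pass.
void matvec_relu_tiled_fused(const std::vector<float>& M, const std::vector<float>& x,
                             std::vector<float>& y, std::size_t rows, std::size_t cols) {
    const std::size_t TILE = 256;                // illustrative tile size
    std::fill(y.begin(), y.end(), 0.0f);
    for (std::size_t c0 = 0; c0 < cols; c0 += TILE) {
        const std::size_t c1 = std::min(cols, c0 + TILE);
        for (std::size_t r = 0; r < rows; ++r) {
            float acc = y[r];                    // partial sum accumulated per tile
            for (std::size_t c = c0; c < c1; ++c)
                acc += M[r * cols + c] * x[c];   // x[c0..c1) is reused for all rows
            y[r] = (c1 == cols) ? std::max(0.0f, acc) : acc;  // fuse ReLU on the last tile
        }
    }
}
```

Production inference kernels fuse much longer operator chains (e.g., GEMM with bias, activation, and quantization or normalization steps) and pick tile sizes based on cache or GPU shared-memory capacity, but the dataflow idea is the same: keep data resident while it is being reused, and avoid writing intermediate results to memory only to read them back.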
Research Papers on Dataflow Optimizations
Papers on the use of dataflow optimizations in LLMs and Transformer architectures:
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
- C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
- Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao, 19 Apr 2024 (v2), A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats, https://arxiv.org/abs/2307.15517
- Cyrus Zhou, Zack Hassman, Ruize Xu, Dhirpal Shah, Vaugnn Richard, Yanjing Li, 23 Nov 2023 (v3), YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs, https://arxiv.org/abs/2310.00574
- Lois Orosa, Skanda Koppula, Yaman Umuroglu, Konstantinos Kanellopoulos, Juan Gomez-Luna, Michaela Blott, Kees Vissers, Onur Mutlu, 4 Feb 2022, EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators, https://arxiv.org/abs/2202.02310
- G Abarajithan, Chamira U. S. Edussooriya, 6 Dec 2021, Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks, https://arxiv.org/abs/2112.02793
- Dingqing Yang, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, Mieszko Lis, 23 Sep 2020, Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training, https://arxiv.org/abs/2009.10976
- SC Kao, S Subramanian, G Agrawal, 2023, FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks https://dl.acm.org/doi/pdf/10.1145/3575693.3575747
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
- Chen, C, 2024, Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of individual non-linear functions to end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- J Liu, 2024, Data-driven Performance Optimization for Data-intensive Applications, Ph.D. Thesis, Electrical Engineering and Computer Science, University of California, Merced, https://escholarship.org/content/qt6gn2p8mn/qt6gn2p8mn.pdf (Optimization of data movement intensive algorithms, mostly non-AI applications.)
- Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
- Marcin Rogowski, 2024, Addressing Data Movement Challenges in High-Performance Computing, Ph.D. Thesis, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia, https://repository.kaust.edu.sa/bitstreams/6a297b08-e7a1-48b9-b0d4-bf2d101636c3/download
- Ruhai Lin, Rui-Jie Zhu, Jason K. Eshraghian, 12 Oct 2024, Reducing Data Bottlenecks in Distributed, Heterogeneous Neural Networks, https://arxiv.org/abs/2410.09650
- David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar, 31 Oct 2024, Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance, https://arxiv.org/abs/2410.23668