Aussie AI

Dataflow Optimizations of LLMs

  • Last Updated 7 November, 2024
  • by David Spuler, Ph.D.

Dataflow optimizations are a broad category of techniques for speeding up LLM inference in Transformer architectures. The idea is to better manage the movement of the large amounts of data in both weights and activations, and thereby gain efficiency.

The sources of improvement may include:

  • Computation reuse (avoiding redundant computations; see the sketch after this list)
  • Memory access reduction (avoiding the cost of accessing memory)
  • A combination of these.
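
As a concrete illustration of computation reuse, the sketch below precomputes an activation function into a lookup table so that later evaluations reuse stored results instead of recomputing them, which also trades expensive math for a single cache-friendly memory read. This is a minimal C++ sketch; the class name GeluTable, the table size, and the clamped input range are illustrative assumptions, not taken from any particular paper.

    // Minimal sketch of computation reuse via a precomputed lookup table.
    #include <cmath>
    #include <vector>

    // Standard tanh-based GELU approximation (computed exactly only at table build time).
    static float gelu(float x) {
        return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
    }

    class GeluTable {
    public:
        GeluTable(float min_x, float max_x, int n)
            : min_x_(min_x), step_((max_x - min_x) / (n - 1)), table_(n) {
            for (int i = 0; i < n; ++i)
                table_[i] = gelu(min_x_ + i * step_);   // compute once, reuse later
        }
        float lookup(float x) const {
            int i = static_cast<int>((x - min_x_) / step_ + 0.5f);
            if (i < 0) i = 0;
            if (i >= static_cast<int>(table_.size())) i = static_cast<int>(table_.size()) - 1;
            return table_[i];   // reuse: no expensive math on this path
        }
    private:
        float min_x_, step_;
        std::vector<float> table_;
    };

    int main() {
        GeluTable t(-8.0f, 8.0f, 4096);        // precompute once (illustrative size/range)
        std::vector<float> activations = {-1.5f, 0.0f, 2.25f};
        for (float& a : activations)
            a = t.lookup(a);                   // reused results, no recomputation
        return 0;
    }
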

Types of Dataflow Optimizations

Some of the possible types of dataflow optimizations include:

  • Computation reuse
  • Conditional computation
  • Pipelining
  • Data marshalling improvements
  • Data locality (e.g., tiling; see the first sketch after this list)
  • Kernel fusion (see the second sketch after this list)
  • Caching
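
As an illustration of data locality, the first sketch below shows classic loop tiling applied to a matrix multiply: the loops are blocked so that each tile of the operands stays resident in cache while it is reused. The tile size and the assumption that the matrix dimension is a multiple of the tile size are illustrative simplifications.

    // Minimal sketch of loop tiling for data locality in a matrix multiply.
    #include <vector>

    constexpr int TILE = 32;   // illustrative block size; real kernels tune this to the cache

    // C (n x n) += A (n x n) * B (n x n), row-major; n assumed to be a multiple of TILE.
    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n) {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    // Work on one TILE x TILE block at a time so the operands
                    // stay in cache while they are reused.
                    for (int i = ii; i < ii + TILE; ++i)
                        for (int k = kk; k < kk + TILE; ++k) {
                            float a = A[i * n + k];
                            for (int j = jj; j < jj + TILE; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }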

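As an illustration of kernel fusion, the second sketch fuses a bias add and a ReLU activation into a matrix-vector product, so the intermediate result is never written to memory and read back by a separate pass. This is a plain CPU sketch of the idea with assumed function and parameter names; production fused kernels are typically written for GPUs.

    // Minimal sketch of kernel fusion: bias add and ReLU are fused into the
    // matrix-vector loop, avoiding a separate pass over an intermediate vector.
    #include <vector>
    #include <algorithm>

    // y = relu(W * x + b), where W is rows x cols, row-major.
    void matvec_bias_relu_fused(const std::vector<float>& W,
                                const std::vector<float>& x,
                                const std::vector<float>& b,
                                std::vector<float>& y,
                                int rows, int cols) {
        for (int i = 0; i < rows; ++i) {
            float sum = b[i];                      // fused bias add
            for (int j = 0; j < cols; ++j)
                sum += W[i * cols + j] * x[j];
            y[i] = std::max(0.0f, sum);            // fused ReLU, no intermediate write-out
        }
    }
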
Research Papers on Dataflow Optimizations

Papers on the use of dataflow optimizations in LLMs and Transformer architectures:

  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang, 22 Mar 2024, Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems, https://arxiv.org/abs/2403.15069
  • Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
  • C Zhou, Z Hassman, R Xu, D Shah, V Richard, Y Li, Oct 2023, SIMD Dataflow Co-optimization for Efficient Neural Networks Inferences on CPUs, arXiv preprint arXiv:2310.00574, https://arxiv.org/pdf/2310.00574.pdf
  • Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao, 19 Apr 2024 (v2), A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats, https://arxiv.org/abs/2307.15517
  • Cyrus Zhou, Zack Hassman, Ruize Xu, Dhirpal Shah, Vaugnn Richard, Yanjing Li, 23 Nov 2023 (v3), YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs, https://arxiv.org/abs/2310.00574
  • Lois Orosa, Skanda Koppula, Yaman Umuroglu, Konstantinos Kanellopoulos, Juan Gomez-Luna, Michaela Blott, Kees Vissers, Onur Mutlu, 4 Feb 2022, EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators, https://arxiv.org/abs/2202.02310
  • G Abarajithan, Chamira U. S. Edussooriya, 6 Dec 2021, Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks, https://arxiv.org/abs/2112.02793
  • Dingqing Yang, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, Mieszko Lis, 23 Sep 2020, Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training, https://arxiv.org/abs/2009.10976
  • SC Kao, S Subramanian, G Agrawal, 2023, FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks https://dl.acm.org/doi/pdf/10.1145/3575693.3575747
  • Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
  • Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of individual non-linear functions to end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • J Liu, 2024, Data-driven Performance Optimization for Data-intensive Applications, Ph.D. Thesis, Electrical Engineering and Computer Science, University of California, Merced, https://escholarship.org/content/qt6gn2p8mn/qt6gn2p8mn.pdf (Optimization of data movement intensive algorithms, mostly non-AI applications.)
  • Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
  • Marcin Rogowski, 2024, Addressing Data Movement Challenges in High-Performance Computing, Ph.D. Thesis, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia, https://repository.kaust.edu.sa/bitstreams/6a297b08-e7a1-48b9-b0d4-bf2d101636c3/download
  • Ruhai Lin, Rui-Jie Zhu, Jason K. Eshraghian, 12 Oct 2024, Reducing Data Bottlenecks in Distributed, Heterogeneous Neural Networks, https://arxiv.org/abs/2410.09650
  • David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar, 31 Oct 2024, Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance, https://arxiv.org/abs/2410.23668

More AI Research

Read more about: