Aussie AI
Kernel Operator Fusion
Last Updated 7 December, 2024
by David Spuler, Ph.D.
What is Kernel Fusion?
Kernel fusion, also called kernel operator fusion, refers to the optimization of merging different algorithmic components in an AI inference engine. Combining two operations together is often faster than running them sequentially. In neural network papers it is often abbreviated to "kernel fusion", "operator fusion", or just "fusion". The opposite of kernel fusion is "kernel fission", which involves splitting a kernel operator into two operators. Both are closely related to the loop transformation optimizations of loop fusion and loop fission.
Both kernel fusion and kernel fission involve changes to the kernel code doing the inference rather than the model's data, and are a type of dynamic inference optimization. Many kernel optimizations can apply to both inference and training.
The primary goal of kernel fusion is often to reduce data movement and memory usage, rather than computation reduction, but the two metrics are closely linked and often both goals are achieved. By merging two operations into one, the cost of storing intermediate results to memory is avoided. Hence, it is a type of memory optimization technique for AI inference engines.
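To make the memory argument concrete, here is a minimal C++ sketch (illustrative only, not taken from any particular engine) of two elementwise operations on a vector: first run as two separate passes with an intermediate buffer, and then fused into a single pass that never stores the intermediate result. Kernel fission is simply the reverse transformation, splitting the fused loop back into two.

```cpp
#include <cstddef>
#include <vector>

// Unfused: two elementwise passes, with an intermediate vector written to memory.
std::vector<float> scale_then_shift_unfused(const std::vector<float>& x,
                                            float scale, float shift) {
    std::vector<float> tmp(x.size());            // intermediate result stored in memory
    for (std::size_t i = 0; i < x.size(); ++i)
        tmp[i] = x[i] * scale;                   // pass 1: scale
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = tmp[i] + shift;                 // pass 2: re-reads the intermediate
    return out;
}

// Fused: a single pass with no intermediate vector; the result is exactly the same.
// (Kernel fission is the reverse transformation: splitting this loop back into two.)
std::vector<float> scale_then_shift_fused(const std::vector<float>& x,
                                          float scale, float shift) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = x[i] * scale + shift;           // both operations per element
    return out;
}
```

Both versions return exactly the same output; the fused version simply makes one pass over the data and allocates one less vector.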
Kernel fusion is distinct from layer fusion, which usually refers to weight sharing across multiple layers of a model. Operator fusion merges the algorithmic steps, but does not reduce weights or share parameters. Nor does operator fusion usually skip any model steps, which makes it different from techniques such as pruning (e.g. layer pruning). Kernel fusion merges two major steps so that the intermediate data need not be written out and re-read by the subsequent computation, but both steps are still effectively performed.
Operator fusion has been an active area of research for deep learning compilers for years. These compiler frameworks can apply many optimizations on the low-level final execution version of the model. However, operator fusion can also be used in the higher-level kernel of Transformer inference engines, without resorting to a model compiler framework. Researchers have found various ways to speed up a vanilla Transformer using kernel fusion techniques. Several of the major component steps in the Transformer algorithm (e.g. attention heads, normalization, FFN, Softmax, etc.) can be merged using kernel fusion ideas.
Overall, kernel fusion is one of many possible inference optimization techniques. Other comparable techniques include memory optimizations, skipping methods, conditional computation, and caching.
Example of Kernel Fusion: MatMul-add-bias
Kernel fusion works by combining the code for two operations into a single merged operation. For example, a typical linear layer will do a matrix multiplication ("MatMul"), followed by adding the bias vector ("add-bias").
The fused version uses exactly the same operands as the unfused one. The MatMul is a matrix-vector multiply with two operands: a weights matrix and a vector of calculated internal data. The add-bias is a vector addition with two operands: a bias vector and the vector calculated by the MatMul. In the non-fused method, the MatMul first multiplies the weights matrix by the input vector, creating a new vector, and the add-bias operation then adds the bias vector to that vector, outputting the final result. Both MatMul and add-bias operate on the same vector of propagated internal data, so these two operations can be "fused" to create a combined "MatMul-add-bias" operator. The result is a "three-way" fused operator, which has three inputs (the weights matrix, the bias vector, and the incoming calculated vector) and a single vector output.
Kernel fusion is an exact calculation done faster, not an approximation. The result of the merged MatMul-add-bias combined operation is not approximated. The output is exactly the same vector that would result from sequentially doing a MatMul and then an add-bias.
How is this any faster? It's true that the fused operator does the same number of multiplications and additions (i.e. the same FLOPs, assuming floating-point data). The speedup comes from "fusing" the add-bias calculations into the same code as the MatMul, which removes overhead such as scanning through the output vector twice. It also eliminates an entire vector of temporary data (the intermediate result of the MatMul before it goes into the add-bias), which reduces data transfers and improves memory usage and dataflow, also benefiting wall clock speed.
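As a concrete illustration, here is a hedged C++ sketch (illustrative only, using simple nested-vector types rather than a real tensor library) of the unfused and fused versions, assuming a row-major weights matrix W of size n x m, an input vector x of length m, and a bias vector b of length n:

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // simple row-major n x m weights

// Unfused: matrix-vector multiply, then a separate add-bias pass over the output.
std::vector<float> matmul_then_addbias(const Matrix& W, const std::vector<float>& x,
                                       const std::vector<float>& b) {
    const std::size_t n = W.size(), m = x.size();
    std::vector<float> y(n, 0.0f);               // intermediate MatMul result
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            y[i] += W[i][j] * x[j];
    for (std::size_t i = 0; i < n; ++i)          // second pass over the output vector
        y[i] += b[i];
    return y;
}

// Fused MatMul-add-bias: the bias is added as each output element is produced.
std::vector<float> fused_matmul_addbias(const Matrix& W, const std::vector<float>& x,
                                        const std::vector<float>& b) {
    const std::size_t n = W.size(), m = x.size();
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        float sum = b[i];                        // start from the bias value
        for (std::size_t j = 0; j < m; ++j)
            sum += W[i][j] * x[j];
        y[i] = sum;                              // written once; no second pass needed
    }
    return y;
}
```

The fused version starts each output element at its bias value and accumulates the dot product on top, so the bias addition costs no extra pass over the output vector and no intermediate vector is stored.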
Compiler-Optimized Kernel Operator Fusion
Many machine learning compilers implement operator fusion methods. This is a long-standing area of research with many papers:
- Huan Song, Jayaraman J. Thiagarajan, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy, Andreas Spanias, Dec 2016, A Deep Learning Approach To Multiple Kernel Fusion, https://arxiv.org/abs/1612.09007
- X Cai, Y Wang, L Zhang, 2022, Optimus: An operator fusion framework for deep neural networks, ACM Transactions on Embedded Computing Systems, Vol. 22, No. 1, Article 1. October 2022, https://dl.acm.org/doi/pdf/10.1145/3520142 (Operator fusion theory for hardware accelerators and deep learning compilers.)
- S Zheng, S Chen, P Song, R Chen, X Li, 2023, Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/abstract/document/10071018/ (Analysis of operator fusion on hardware accelerators.)
- A. Ashari, S. Tatikonda, M. Boehm, B. Reinwald, K. Campbell, J. Keenleyside, et al., "On optimizing machine learning workloads via kernel fusion", Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming PPoPP 2015, pp. 173-182, February 7-11, 2015, https://doi.org/10.1145/2688500.2688521
- Matthias Boehm, Berthold Reinwald, Dylan Hutchison, Alexandre V. Evfimievski, and Prithviraj Sen. 2018. On optimizing operator fusion plans for large-scale machine learning in systemml. arXiv:1801.00829. https://arxiv.org/abs/1801.00829
- Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 47–62. https://doi.org/10.1145/3341301.3359630 PDF: https://cs.stanford.edu/~padon/taso-sosp19.pdf
- Rasmus Munk Larsen and Tatiana Shpeisman. 2023. TensorFlow graph optimization with Grappler, https://www.tensorflow.org/guide/graph_optimization
- Daniel Snider, Ruofan Liang, Jan 2023, Operator Fusion in XLA: Analysis and Evaluation, https://arxiv.org/pdf/2301.13062.pdf
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Experiments with various fusion methods in training. Note: code uses deprecated nvFuser compiler.)
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf (Extensive analysis of ML compiler optimizations. Table 8 lists the fusion operations supported by various ML compilers, as of 2019.)
- B Qiao, O Reiche, F Hannig, 2019, From loop fusion to kernel fusion: A domain-specific approach to locality optimization, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), https://ieeexplore.ieee.org/document/8661176 (Theory of loop fusion generalized to graph kernel fusion for image processing.)
- H Peng, R Ran, Y Luo, J Zhao, S Huang, K Thorat, 2023, LinGCN: Structural Linearized Graph Convolutional Network for Homomorphically Encrypted Inference, https://arxiv.org/pdf/2309.14331.pdf (Has "operator fusion for node-wise activation functions"; see also "Section A.4 Further Detail of Operator Fusion".)
- Christian Sarofeen, Piotr Bialecki, Jie Jiang, Kevin Stephano, Masaki Kozuki, Neal Vaidya, and Stas Bekman, August 2022, Introducing nvFuser, a deep learning compiler for PyTorch, https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/, Project: https://github.com/pytorch/pytorch/projects/30 (Note: nvFuser deep learning compiler for Pytorch, but seems to be deprecated.)
- N. Rotem, J. Fix, S. Abdulrasool, S. Deng, R. Dzhabarov, J. Hegeman, R. Levenstein, B. Maher, N. Satish, J. Olesen, J. Park, A. Rakhov, and M. Smelyanskiy, “Glow: Graph lowering compiler techniques for neural networks,” CoRR, vol. abs/1805.00907, 2018. http://arxiv.org/abs/1805.00907 Code: http://github.com/pytorch/glow (Paper introduces PyTorch Glow, with much coverage of its kernel operator fusion and "operator stacking" methods.)
- T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, TVM: end-to-end optimization stack for deep learning, CoRR, vol. abs/1802.04799, 2018. http://arxiv.org/abs/1802.04799
- Xia, C., Zhao, J., Sun, Q., Wang, Z., Wen, Y., Feng, X., Cui, H., 2023, Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 27 Apr-01 May 2023, San Diego, USA. https://eprints.whiterose.ac.uk/203681/, PDF: https://eprints.whiterose.ac.uk/203681/1/asplos24.pdf
- Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2101.01332
- Zhen Zheng, Xuanda Yang, et al. 2022. AStitch: enabling a new multidimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373. https://dl.acm.org/doi/abs/10.1145/3503222.3507723
- Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 193–205. https://doi.org/10.1109/CGO.2019.8661197
- Huanting Wang, Zhanyong Tang, et al. 2022. Automating Reinforcement Learning Architecture Design for Code Optimization. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (Seoul, South Korea) (CC 2022). Association for Computing Machinery, New York, NY, USA, 129–143. https://doi.org/10.1145/3497776.3517769
- Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943–948, https://ieeexplore.ieee.org/document/8115709
- Jia, Z., Padon, O., Thomas, J., Warszawski, T., Zaharia, M., Aiken, A., 2019, Taso: Optimizing deep learning computation with automatic generation of graph substitutions. Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP’19, pp. 47–62, New York, NY, USA, 2019. ACM. ISBN 9781450368735. doi: 10.1145/3341301.3359630. https://doi.org/10.1145/3341301.3359630, https://dl.acm.org/doi/10.1145/3341301.3359630
- Jangda, A. and Bondhugula, U., An Effective Fusion and Tile Size Model for PolyMage, ACM Trans. Program. Lang. Syst., 42(3), November 2020. ISSN 0164-0925. doi: 10.1145/3404 https://dl.acm.org/doi/10.1145/3404846
- Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., Hu, W., Yang, F., Zhang, L., and Zhou, L., RAMMER: enabling holistic deep learning compiler optimizations with rtasks, In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 881–897. USENIX Association, November 2020. ISBN 978-1-939133-19-9. https://www.usenix.org/conference/osdi20/presentation/ma, https://dl.acm.org/doi/10.5555/3488766.3488816
- Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S., Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. http://doi.acm.org/10.1145/2491956.2462176, PDF: https://people.csail.mit.edu/jrk/halide-pldi13.pdf
- Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., Devito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A., 2019, The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated gpu kernels, automatically. ACM Trans. Archit. Code Optim., 16(4), October 2019. ISSN 1544-3566. doi: 10.1145/3355606. https://doi.org/10.1145/3355606, https://dl.acm.org/doi/fullHtml/10.1145/3355606
- Ding, Y., Zhu, L., Jia, Z., Pekhimenko, G., and Han, S., 2021, IOS: Inter-operator scheduler for CNN acceleration. In Smola, A., Dimakis, A., and Stoica, I. (eds.), Proceedings of Machine Learning and Systems, volume 3, pp. 1–14, 2021. https://arxiv.org/abs/2011.01302, PDF: https://proceedings.mlsys.org/paper/2021/file/38b3eff8baf56627478ec76a704e9b52-Paper.pdf, Code: https://github.com/mit-han-lab/inter-operator-scheduler (Extends single-operator "intra-operator" parallelization to parallelization across multiple operators, "inter-operator".)
- Irigoin, F. and Triolet, R., Supernode partitioning. In Proc. of the 15th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, POPL’88, pp. 319–329, New York, NY, USA, 1988. ACM. ISBN 0-89791-252-7. doi: 10.1145/73560.73588. http://doi.acm.org/10.1145/73560.73588
- Jie Zhao, Xiong Gao, Ruijie Xia, Zhaochuang Zhang, Deshi Chen, Lei Chen, Renwei Zhang, Zhen Geng, Bin Cheng, Xuefeng Jin, 2022, Apollo: Automatic partition-based operator fusion through layer by layer optimization, https://proceedings.mlsys.org/paper_files/paper/2022/hash/e175e8a86d28d935be4f43719651f86d-Abstract.html PDF: https://proceedings.mlsys.org/paper_files/paper/2022/file/e175e8a86d28d935be4f43719651f86d-Paper.pdf, PDF: https://yaozhujia.github.io/assets/pdf/mlsys2022-paper.pdf
Convolution Optimizations
Convolutional Neural Networks (CNNs) use "convolution" operations, which are conceptually similar to simplified feed-forward networks, except they are not fully interconnected. Various further optimizations of convolutions have been researched over the years, many of which are an early type of "kernel fusion" optimization, although they don't always call it by that term. Examples of convolution optimizations include the following (a sketch of a depthwise-separable convolution appears after this list):
- Depth-wise separable convolutions
- Grouped convolutions
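For illustration, here is a hedged C++ sketch of a depthwise-separable convolution (the first item above), assuming 3x3 depthwise filters with "same" zero padding followed by a 1x1 pointwise convolution; the intermediate tensor between the two stages is exactly the kind of temporary data that a fused convolution kernel would avoid writing to memory:

```cpp
#include <cstddef>
#include <vector>

using Tensor3 = std::vector<std::vector<std::vector<float>>>;  // [channel][row][col]

// Depthwise stage: each input channel is convolved with its own 3x3 filter
// ("same" zero padding), with no mixing between channels.
// Pointwise stage: a 1x1 convolution mixes the channels at each pixel.
Tensor3 depthwise_separable_conv(const Tensor3& in,
                                 const Tensor3& dw,                            // [C][3][3] depthwise filters
                                 const std::vector<std::vector<float>>& pw) {  // [C_out][C] pointwise weights
    const std::size_t C = in.size(), H = in[0].size(), W = in[0][0].size();
    const std::size_t C_out = pw.size();

    // Depthwise 3x3 convolution, one filter per channel.
    Tensor3 mid(C, std::vector<std::vector<float>>(H, std::vector<float>(W, 0.0f)));
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t y = 0; y < H; ++y)
            for (std::size_t x = 0; x < W; ++x)
                for (int ky = -1; ky <= 1; ++ky)
                    for (int kx = -1; kx <= 1; ++kx) {
                        const int yy = (int)y + ky, xx = (int)x + kx;
                        if (yy >= 0 && yy < (int)H && xx >= 0 && xx < (int)W)
                            mid[c][y][x] += in[c][yy][xx] * dw[c][ky + 1][kx + 1];
                    }

    // Pointwise 1x1 convolution mixes channels into C_out output channels.
    Tensor3 out(C_out, std::vector<std::vector<float>>(H, std::vector<float>(W, 0.0f)));
    for (std::size_t o = 0; o < C_out; ++o)
        for (std::size_t y = 0; y < H; ++y)
            for (std::size_t x = 0; x < W; ++x)
                for (std::size_t c = 0; c < C; ++c)
                    out[o][y][x] += mid[c][y][x] * pw[o][c];
    return out;
}
```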
Depthwise-separable convolutions: Papers on depthwise-separable convolutional neural network optimizations:
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” ISCA, 2016. https://ieeexplore.ieee.org/document/7551407, PDF: http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf, PDF Slides: https://eems.mit.edu/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf, Project: http://eyeriss.mit.edu/ (Examines kernel operator fusion but calls it "folding".)
- Zhiqi Huang, Lu Hou, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2021. GhostBERT: Generate More Features with Cheap Operations for BERT. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6512–6523, Online. Association for Computational Linguistics, https://aclanthology.org/2021.acl-long.509.pdf (Uses depthwise-separable convolutions.)
- Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
- Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. 2019. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 3435–3444. https://arxiv.org/abs/1904.05049
- Kamran Chitsaz, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi, and Shahram Shirani. 2020. Acceleration of convolutional neural network using FFT-based split convolutions. arXiv preprint arXiv:2003.12621 (2020), https://arxiv.org/abs/2003.12621
- Zhuo Chen, Jiyuan Zhang, Ruizhou Ding, and Diana Marculescu. 2020. Vip: Virtual pooling for accelerating cnn-based image classification and object detection. In The IEEE Winter Conference on Applications of Computer Vision. 1180–1189. https://arxiv.org/abs/1906.07912 Code: https://github.com/cmu-enyac/VirtualPooling (Optimizing convolutions with a larger stride size.)
- Abolfazl Younesi, Mohsen Ansari, MohammadAmin Fazli, Alireza Ejlali, Muhammad Shafique, Jörg Henkel, 28 Feb 2024 (v2), A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends, https://arxiv.org/abs/2402.15490 (A large survey of architectural issues and optimizations in CNNs.)
- Grigor Gatchev, Valentin Mollov, 4 Apr 2021, Faster Convolution Inference Through Using Pre-Calculated Lookup Tables, https://arxiv.org/abs/2104.01681
- Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114, Long Beach, California, USA. PMLR. URL: http://proceedings.mlr.press/v97/tan19a.html
- Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari, 18 Jun 2024, Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices, SAMOS2024 conference, https://arxiv.org/abs/2406.12478 Code: https://github.com/eml-eda/depthwise-separable-fusion
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, "RepVGG: Making VGG-style ConvNets great again," in CVPR, 2021. https://arxiv.org/abs/2101.03697 https://github.com/megvii-model/RepVGG
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang, 2024, Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, DOI 10.1109/JSTARS.2024.3452321, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10660473
- Peiming Liu, Alexander J Root, Anlun Xu, Yinying Li, Fredrik Kjolstad, Aart C. Bik, 2024, Compiler Support for Sparse Tensor Convolutions, https://rootjalex.github.io/publications/oopsla2024-spconv.pdf
- Stefan Hadjis, Firas Abuzaid, Ce Zhang, Christopher Ré, 26 May 2015 (v2), Caffe con Troll: Shallow Ideas to Speed Up Deep Learning, https://arxiv.org/abs/1504.04343
- Enrique Galvez, Feb 2024, A study of Convolutions for Efficient Inference of Deep Neural Networks on Embedded Processors, Master's Thesis, France, University de Lyon, https://largo.lip6.fr/~cassagnea/docs/reports/Galvez2024%20-%20A%20Study%20of%20Convolutions%20for%20Efficient%20Inference%20of%20Deep%20Neural%20Networks%20on%20Embedded%20Processors.pdf
- Yuxiao Fan, 2024, Design and research of high-performance convolutional neural network accelerator based on Chipyard, ICAITA-2024, Journal of Physics: Conference Series 2858 (2024) 012001, IOP Publishing, doi:10.1088/1742-6596/2858/1/012001, https://iopscience.iop.org/article/10.1088/1742-6596/2858/1/012001/pdf (Convolution optimization on a RISC-V architecture.)
- Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. CVPR, pages 2752–2761, 2018. https://arxiv.org/abs/1711.09224, Code: https://github.com/ShichenLiu/CondenseNet
Grouped Convolutions: Research papers on grouped convolutions:
- Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, https://arxiv.org/abs/1704.04861 (Uses depthwise-separable convolutions combined with narrowing at each layer.)
- Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM, Volume 60, Issue 6, June 2017, pp 84–90, https://doi.org/10.1145/3065386, https://dl.acm.org/doi/10.1145/3065386, PDF: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf, Code: http://code.google.com/p/cuda-convnet/ (The early paper that introduced a grouped convolution architecture for multi-GPUs, later the basis of AlexNet, which was a famous image recognition CNN.)
- Wikipedia, 2023 (accessed), AlexNet, https://en.wikipedia.org/wiki/AlexNet (AlexNet was a CNN using grouped convolutions.)
- Akhil Kasare, Dec 26, 2020, SqueezeBERT: What can computer vision teach NLP about efficient neural networks? Medium, https://akhil-kasare80.medium.com/squeezebert-what-can-computer-vision-teach-nlp-about-efficient-neural-networks-d8186579ca0d
General Research on Kernel Operator Fusion
Research papers on low-level operator fusion in the kernel of a neural network include:
- Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, Bin Ren, 2021, DNNFusion: accelerating deep neural networks execution with advanced operator fusion, PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, June 2021, Pages 883–898, https://doi.org/10.1145/3453483.3454083, https://dl.acm.org/doi/10.1145/3453483.3454083, https://arxiv.org/abs/2108.13342
- G Wang, YS Lin, W Yi, 2010, Kernel fusion: An effective method for better power efficiency on multithreaded GPU, 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, https://ieeexplore.ieee.org/abstract/document/5724850/
- J Filipovič, M Madzin, J Fousek, L Matyska, 2015, Optimizing CUDA code by kernel fusion: application on BLAS, The Journal of Supercomputing, volume 71, pages 3934–3957, 2015, https://link.springer.com/article/10.1007/s11227-015-1483-z, https://arxiv.org/pdf/1305.1183 (Fusion in a source-to-source compiler.)
- M Wahib, N Maruyama, 2014, Scalable kernel fusion for memory-bound GPU applications SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, https://ieeexplore.ieee.org/abstract/document/7013003/, PDF: https://scholar.archive.org/work/4xziwypwkjhmbgf7zqz677haym/access/wayback/http://conferences.computer.org/sc/2014/papers/5500a191.pdf
- X Wang, LL Zhang, Y Wang, M Yang, 2022, Towards efficient vision transformer inference: A first study of transformers on mobile devices, HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications, March 2022, Pages 1–7, https://doi.org/10.1145/3508396.3512869, https://dl.acm.org/doi/abs/10.1145/3508396.3512869
- J Zhao, X Gao, R Xia, Z Zhang, 2022, Apollo: Automatic partition-based operator fusion through layer by layer optimization, Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, 2022, https://proceedings.mlsys.org/paper_files/paper/2022/file/e175e8a86d28d935be4f43719651f86d-Paper.pdf
- Ofir Press, Noah A. Smith, Omer Levy, Apr 2020, Improving Transformer Models by Reordering their Sublayers, https://arxiv.org/abs/1911.03864 (Rearranging sublayer orderings is somewhat related to kernel fusion, if you squint and look obliquely.)
- Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Has a fused batch norm as part of using integer-only arithmetic.)
- Jirí Filipovic; Siegfried Benkner, 2015, OpenCL Kernel Fusion for GPU, Xeon Phi and CPU, 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), https://ieeexplore.ieee.org/document/7379839
- Stratton, J.A., Krishna V. S., J., Palanisamy, J., Chinnaraju, K., 2022, Kernel Fusion in OpenCL. In: Chaves, R., et al. Euro-Par 2021: Parallel Processing Workshops. Euro-Par 2021. Lecture Notes in Computer Science, vol 13098. Springer, Cham. https://doi.org/10.1007/978-3-031-06156-1_16, https://link.springer.com/chapter/10.1007/978-3-031-06156-1_16
- Wenchao Wu , Xuanhua Shi, 2023, TurboMGNN: Improving Concurrent GNN Training Tasks on GPU With Fine-Grained Kernel Fusion, IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6, June 2023, https://ieeexplore.ieee.org/document/10103627 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10103627
- Chellapilla, K., Puri, S., and Simard, P., 2006, High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, PDF: https://inria.hal.science/inria-00112631v1/document
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf (Fusion and fission optimizations with example algorithms on p.40 and p.45.)
- Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2019. From loop fusion to kernel fusion: A domain-specific approach to locality optimization. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 242–253, https://dl.acm.org/doi/10.5555/3314872.3314901
- Carlo Bertolli, Adam Betts, Paul HJ Kelly, Gihan R Mudalige, and Mike B Giles. 2012. Mesh independent loop fusion for unstructured mesh applications. Proceedings of the 9th conference on Computing Frontiers. 43–52. https://dl.acm.org/doi/10.1145/2212908.2212917
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (jul 2016), 11 pages. https://doi.org/10.1145/2897824.2925952, https://research.google/pubs/pub45525/, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952
- C Bastoul, Z Zhang, H Razanajato et al., 2022, 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Optimizing GPU deep learning operators with polyhedral scheduling constraint injection, https://ieeexplore.ieee.org/document/9741260
- X He, X Gao, J Zhao, C Gao, K Ye, 2022, Optimizing Cache Accesses with Tensor Memory Format Search for Transformers in TVM, 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), https://link.springer.com/chapter/10.1007/978-3-031-23498-9_4
- M Schuler, R Membarth, P Slusallek, 2022, Xengine: Optimal tensor rematerialization for neural networks in heterogeneous environments, ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 1, Article No. 17, pp 1–25, https://dl.acm.org/doi/10.1145/3568956, PDF: https://dl.acm.org/doi/pdf/10.1145/3568956, Code: https://github.com/dfki-asr/xengine
- Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin, FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads, 2020. https://arxiv.org/abs/2009.10924 (Calls it "stitching" of kernels.)
- Haicheng Wu, Gregory Diamos, Jin Wang, Srihari Cadambi, Sudhakar Yalamanchili, and Srimat Chakradhar. 2012. Optimizing data warehousing applications for GPUs using kernel fusion/fission. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 2433–2442. https://ieeexplore.ieee.org/document/6270615, PDF: http://www.istc-cc.cmu.edu/publications/papers/2012/optimizing-warehousing.pdf (Theoretical analysis of which algebraic operators can be merged in kernel fusion or split in kernel fission.)
- Mohamed Wahib and Naoya Maruyama. 2014. Scalable kernel fusion for memory-bound GPU applications. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 191–202. https://dl.acm.org/doi/10.1109/SC.2014.21, PDF: https://ia803204.us.archive.org/view_archive.php?archive=/20/items/acm-digital-library-2020_20200719/fulltext/343.zip&file=10.1109%2FSC.2014.21.pdf (Theoretical analysis of kernel operator fusion with examples.)
- J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012. https://web.cs.ucla.edu/~pouchet/doc/ics-article.12.pdf (Uses stencil-based "tiling" optimizations for GPUs.)
- Ashari, A., Tatikonda, S., Boehm, M., Reinwald, B., Campbell, K., Keenleyside, J., and Sadayappan, P., 2015, On optimizing machine learning workloads via kernel fusion. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pp. 173–182, New York, NY, USA. Association for Computing Machinery. ISBN 9781450332057. doi: 10.1145/2688500.2688521. https://dl.acm.org/doi/10.1145/2688500.2688521, PDF: https://mboehm7.github.io/resources/ppopp2015.pdf (Analysis of kernel fusion including for sparse and dense matrix multiplication.)
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Also calls kernel fusion by names of "kernel unification" and "kernel merge".)
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen, 21 May 2024, Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression, https://arxiv.org/abs/2405.12591
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163v1 (Implements "feature fusion" which is analogous to kernel fusion.) (Optimizes a deblurring Transformer with self-attention improvements and other optimizations.)
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Xiaoming (Jason) Cui, Ashraf Bhuiyan, 2023, Optimizing Transformer Model Inference on Intel® Processors, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html
- Mohamed Wahib, Naoya Maruyama, 2015, Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications, HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, June 2015, Pages 259–270 https://dl.acm.org/doi/10.1145/2749246.2749255
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and highly optimized kernel with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on Llama2 3B to 20B with 20-80ms latency per token.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1–12. https://doi.org/10.1145/3079856.3080246
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102
- Daniel Snider, Ruofan Liang, Jan 2023, Operator Fusion in XLA: Analysis and Evaluation, https://arxiv.org/pdf/2301.13062.pdf
- W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. Ahn, “Restructuring Batch Normalization to Accelerate CNN Training,” in The Conference on Systems and Machine Learning, 2019, https://arxiv.org/abs/1807.01702
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233 (Analysis of optimizations for DNNs and SNNs.)
- Zhai, Yujia, 2023, Ph.D. thesis, Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications, Computer Science, University of California, Riverside, https://escholarship.org/content/qt8s28g07q/qt8s28g07q.pdf (Includes examination of padding-free algorithms such as ByteTransformer.)
- Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, July 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, https://arxiv.org/abs/2208.00483
- Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and Krishnamurthy, A., 2018, TVM: An automated endto-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, Carlsbad, CA, https://arxiv.org/abs/1802.04799
- Lattner, C. and Adve, V., 2004, Llvm: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization. CGO 2004., pp. 75–86. IEEE, 2004, https://ieeexplore.ieee.org/abstract/document/1281665 https://www.llvm.org/pubs/2004-01-30-CGO-LLVM.pdf
- Lattner, C., Amini, M., Bondhugula, U., Cohen, A., Davis, A., Pienaar, J. A., Riddle, R., Shpeisman, T., Vasilache, N., and Zinenko, O., 2021, MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In Proceedings of the 19th ACM/IEEE International Symposium on Code Generation and Optimization, 2021. https://ieeexplore.ieee.org/abstract/document/9370308 https://storage.googleapis.com/pub-tools-public-publication-data/pdf/85bf23fe88bd5c7ff60365bd0c6882928562cbeb.pdf Code: https://mlir.llvm.org/
- Rotem, N., Fix, J., Abdulrasool, S., Deng, S., Dzhabarov, R., Hegeman, J., Levenstein, R., Maher, B., Nadathur, S., Olesen, J., Park, J., Rakhov, A., and Smelyanskiy, M., 2018, Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, https://arxiv.org/abs/1805.00907
- Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, Yida Wang, 2021, Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference, Proceedings of Machine Learning and Systems 3 (MLSys 2021), https://proceedings.mlsys.org/paper_files/paper/2021/hash/5b47430e24a5a1f9fe21f0e8eb814131-Abstract.html https://arxiv.org/abs/2006.03031
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Wei Niu, Gagan Agrawal, Bin Ren, 29 Feb 2024, SoD2: Statically Optimizing Dynamic Deep Neural Network, https://arxiv.org/abs/2403.00176 (Analysis of operator computation shapes and pathways with kernel fusion and memory planning.)
- Y Liang, Z Wang, X Xu, Y Tang, Z Jie, J Lu, Oct 2023, MCUFormer: Deploying Vision Tranformers on Microcontrollers with Limited Memory, arXiv preprint arXiv:2310.16898, https://arxiv.org/pdf/2310.16898.pdf
- NVIDIA, Developer Guide (CuDNN), https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html PDF: https://docs.nvidia.com/deeplearning/cudnn/pdf/cuDNN-Developer-Guide.pdf
- Francesco Ratto, Ángela Porras Máinez, Carlo Sau, Paolo Meloni, Gianfranco Deriu, Stefano Delucchi, Massimo Massa, Luigi Raffo, Francesca Palumbo, April 2023, An Automated Design Flow for Adaptive Neural Network Hardware Accelerators. Journal of Signal Processing Systems (2023): 1-23. https://link.springer.com/article/10.1007/s11265-023-01855-x (Adaptable inference for a CNN by dynamic modification of FPGA-accelerated hardware integrations.)
- Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele, July 2022, A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks, https://arxiv.org/abs/2111.04949 PDF: https://pssg.cs.umd.edu/assets/papers/2022-07-dl-survey-arxiv.pdf (Survey of frameworks from the theoretical perspective of parallelism.)
- Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In OSDI 2020. USENIX Association, 881–897. https://typeset.io/papers/rammer-enabling-holistic-deep-learning-compiler-12fbfi80ej
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- David Spuler, March 2024, Chapter 31. Kernel Fusion, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- RY Aminabadi, S Rajbhandari, AA Awan, 2022, DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale, https://arxiv.org/abs/2207.00032 PDF: https://arxiv.org/pdf/2207.00032
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu, 18 Jun 2024 (v4), FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion, https://arxiv.org/abs/2406.06858
- Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari, 18 Jun 2024, Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices, SAMOS2024 conference, https://arxiv.org/abs/2406.12478 Code: https://github.com/eml-eda/depthwise-separable-fusion
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
- Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
- Zihao Ye,, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
- A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 402–416, 2022. https://arxiv.org/abs/2105.05720
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
- David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar, 31 Oct 2024, Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance, https://arxiv.org/abs/2410.23668
- Z. Zhang, D. Yang, X. Zhou and D. Cheng, "MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators," in 2024 SC24: International Conference for High Performance Computing, Networking, Storage and Analysis SC, Atlanta, GA, United States, 2024, pp. 528-542, doi: 10.1109/SC41406.2024.00040. https://www.computer.org/csdl/proceedings-article/sc/2024/529100a528/21HUVuG3S8M
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
Transformer Component Kernel Operator Fusion
The operations performed by Transformer components can be fused, meaning that the code for two operations is combined into one, reducing temporary data and other overhead. Examples of possible fused operations include (one such fusion is sketched in code after this list):
- Fused multi-head attention (fused MHA)
- Fused Multiply-Add (FMA)
- Fused normalization (e.g. fused LayerNorm or fused BatchNorm)
- Fused SoftMax
- Fused activations (e.g. fused RELU, fused GELU, fused SwiGLU, etc.)
- Fused Add-Bias
- Fused matrix transpose
Note that most operator fusion changes are not approximations. The goal is to implement the exact same functionality as two operators, but do it faster in a single combined loop or function. It is, of course, sometimes possible to incorporate approximations as part of operator fusion, but that's really a separate method (see approximation optimizations).
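As one example from the list above, here is a hedged C++ sketch (illustrative only, not taken from any particular engine) of a fused add-bias-RELU operator, which merges the fused add-bias and fused activation items into a single elementwise pass:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Unfused: an add-bias pass followed by a separate RELU pass over the same vector.
void addbias_then_relu(std::vector<float>& v, const std::vector<float>& bias) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] += bias[i];                          // pass 1: add-bias
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = std::max(0.0f, v[i]);              // pass 2: RELU activation
}

// Fused add-bias-RELU: one pass, identical output, less loop overhead and data movement.
void fused_addbias_relu(std::vector<float>& v, const std::vector<float>& bias) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = std::max(0.0f, v[i] + bias[i]);    // both operations per element
}
```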
Research papers generally on kernel operator fusion specific to Transformers:
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102 (Includes multiple types of fusion of operations in the QKV attention head tensors, such as fusing the feedforward layer with the query projection matrix, and fusion of KV tensors.)
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- B Hagedorn, B Fan, H Chen, C Cecka, 2023, Graphene: An IR for Optimized Tensor Computations on GPUs, ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada, PDF: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018 (Includes various kernel fusions including fused MHA and fused LayerNorm.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Extensive analysis of fusion opportunities in Transformers, with evaluation of several of them, such as fused LayerNorm, and theory about various other operator fusions; see the main paper body and also a long list of fusion methods in the Appendix Section "A.2 Fusion Implementation". Various merged combinations are listed, involving the components: normalization/layernorm, Softmax, RELU, bias, and dropout.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Christian Sarofeen, 2022, The Next Generation of GPU Performance in PyTorch with nvFuser, NVIDIA On-Demand, GTC 2022, https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41958/ (Note: nvFuser seems to be deprecated.)
- Christian Sarofeen, 2021, NVIDIA On-Demand, Dynamic Shapes First: Advanced GPU Fusion in PyTorch, https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31952/ (Note: nvFuser seems to be deprecated.)
- Forrest Iandola, Albert Shaw, Ravi Krishna, and Kurt Keutzer. 2020. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 124–135. Association for Computational Linguistics. https://arxiv.org/abs/2006.11316 (Merging of self-attention with grouped convolutions is similar to kernel operator fusion.)
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that uses a custom CUDA kernel for WKV computations.)
- Lilit Yolyan, May 20, 2022, Inference Optimization for Convolutional Neural Networks, Quantization and fusion for faster inference, Towards Data Science, https://towardsdatascience.com/inference-optimization-for-convolutional-neural-networks-e63b51b0b519
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine)
- Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu, June 2023, Full parameter fine-tuning for large language models with limited resources, arXiv preprint arXiv:2306.09782, https://arxiv.org/abs/2306.09782 (Fused gradient computation and parameter update saves memory in training kernel by not saving the gradient tensor in memory.)
- Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163 (Implements "feature fusion" which is analogous to kernel fusion.)
- RY Aminabadi, S Rajbhandari, AA Awan, 2022, DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale, https://arxiv.org/abs/2207.00032 PDF: https://arxiv.org/pdf/2207.00032
Fused Normalization
Papers on fused normalization sublayers cover fusion of either LayerNorm or BatchNorm.
Fused LayerNorm: Papers on kernel fusion of layer normalization (a short sketch of AddBias-LayerNorm fusion follows this list):
- Moin Nadeem, 2022, Fused LayerNorm, MosaicML, https://docs.mosaicml.com/projects/composer/en/stable/method_cards/fused_layernorm.html (Fused LayerNorm in MosaicML Composer.)
- NVIDIA, apex.normalization.fused_layer_norm, Apex (A PyTorch Extension), 2018, https://nvidia.github.io/apex/layernorm.html (Fused LayerNorm in Apex.)
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- B Hagedorn, B Fan, H Chen, C Cecka, 2023, Graphene: An IR for Optimized Tensor Computations on GPUs, ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada, PDF: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018 (Includes various kernel fusions including fused MHA and fused LayerNorm.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Extensive analysis of fusion opportunities in Transformers, such as fused LayerNorm; Section "A.2 Fusion Implementation" lists various types of fusion, including normalization/layernorm, Softmax, RELU, bias, and dropout.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao, March 2023, PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing, Technical Report, https://arxiv.org/abs/2303.10845 (Method uses a fused layernorm.)
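To make the idea concrete, here is a minimal sketch of AddBias-LayerNorm fusion (my own illustration, not code from any of the papers above): the bias addition is folded into the pass that computes the normalization statistics, so the bias-added intermediate vector is never stored as a separate result. The function and parameter names are hypothetical.
#include <cmath>

// Hypothetical sketch of AddBias-LayerNorm fusion (illustrative names only).
void yapi_fused_addbias_layernorm_basic(
    const float* x,      // input vector, length n
    const float* bias,   // bias vector, length n
    const float* gamma,  // LayerNorm scale parameters, length n
    const float* beta,   // LayerNorm shift parameters, length n
    float* out,          // output vector, length n
    int n,
    float eps = 1e-5f)   // small epsilon for numerical stability
{
    // Pass 1: add the bias while accumulating the sum and sum of squares.
    float sum = 0.0f, sumsq = 0.0f;
    for (int i = 0; i < n; i++) {
        float xi = x[i] + bias[i];
        out[i] = xi;            // reuse the output buffer as scratch space
        sum += xi;
        sumsq += xi * xi;
    }
    float mean = sum / n;
    float var = sumsq / n - mean * mean;   // simple (less stable) variance formula
    float inv_std = 1.0f / std::sqrt(var + eps);

    // Pass 2: normalize, scale, and shift in place.
    for (int i = 0; i < n; i++) {
        out[i] = gamma[i] * (out[i] - mean) * inv_std + beta[i];
    }
}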
Fused batchnorm: Papers on kernel fusion of batch normalization (a sketch of folding batchnorm into the preceding layer's weights follows this list):
- S. Mehta and M. Rastegari, “MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer,” International Conference on Learning Representations, 2021. https://arxiv.org/abs/2110.02178 (Fusion of elements of the convolutional layers including batch normalization into convolutions.)
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
- Francesco Ratto, Ángela Porras Máinez, Carlo Sau, Paolo Meloni, Gianfranco Deriu, Stefano Delucchi, Massimo Massa, Luigi Raffo, Francesca Palumbo, April 2023, An Automated Design Flow for Adaptive Neural Network Hardware Accelerators. Journal of Signal Processing Systems (2023): 1-23. https://link.springer.com/article/10.1007/s11265-023-01855-x (Fused batchnorm with batch normalization merged into convolutions.)
- Michael Anderson, Evangelos Georganas, Sasikanth Avancha, Alexander Heinecke, 2018, Tensorfolding: Improving convolutional neural network performance with fused microkernels, SC18, November 11-16, 2018, Dallas, Texas, USA, PDF: https://sc18.supercomputing.org/proceedings/tech_poster/poster_files/post155s2-file3.pdf (Includes fused batch norm and fused RELU, along with a process called "tensor folding".)
- D. Jung, W. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, Restructuring batch normalization to accelerate CNN training, 2018. PDF: https://mlsys.org/Conferences/2019/doc/2019/18.pdf, Code: https://github.com/scale-snu/caffe-bn-restructuring, Code: https://github.com/scale-snu/mkldnn-bn-restructuring (Coverage of batch normalization, merging into prior and subsequent layers, and a technique called "batch norm fission".)
- E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke, “Anatomy of high-performance deep learning convolutions on simd architectures,” in Accepted to Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18. IEEE Press, 2018, https://arxiv.org/abs/1808.05567 (Investigates kernel fusion for RELU, bias, and normalization, although mostly calls it "layer fusion".)
- D. Das, N. Mellempudi, D. Mudigere, D. D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. O. Pirogov, “Mixed precision training of convolutional neural networks using integer operations,” CoRR, vol. abs/1802.00930, 2018. http://arxiv.org/abs/1802.00930 (Fused element-wise layers with RELU and batch normalization.)
- Mathilde Guillemot, Catherine Heusele, Rodolphe Korichi, Sylvianne Schnebert, Liming Chen, Feb 2020, Breaking batch normalization for better explainability of deep neural networks through layer-wise relevance propagation, https://arxiv.org/abs/2002.11018 (Focuses on explainability propagations through batch norm, but is also a type of fused batch norm.)
- Pytorch, 2023, Fusing Convolution and Batch Norm Using Custom Function, https://pytorch.org/tutorials/intermediate/custom_function_conv_bn_tutorial.html
- J Zhang, 2023, Quantization for High-dimensional Data and Neural Networks: Theory and Algorithms, Ph.D. Thesis, University of California, San Diego, https://escholarship.org/content/qt9bd2k7gf/qt9bd2k7gf.pdf (See section 4.7: Fusing Convolution and Batch Normalization Layers.)
- Xiaoming (Jason) Cui, Ashraf Bhuiyan, 2023, Optimizing Transformer Model Inference on Intel® Processors, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html
- S. R. Bulo, L. Porzi, and P. Kontschieder, In-place activated batchnorm for memory-optimized training of DNNs, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5639–5647. https://arxiv.org/abs/1712.02616 Code: https://github.com/mapillary/inplace_abn (Fused BatchNorm with activations in a single layer.)
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine with fused GroupNorm + GELU operator in NHWC tensor memory format)
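A common inference-time trick described in several of the papers above is to fold the (fixed) batch normalization parameters directly into the weights and bias of the preceding convolution or linear layer. Here is a minimal sketch of that folding; the function and parameter names are illustrative, not from any particular framework.
#include <cmath>

// Hypothetical sketch (illustrative names): folding fixed batch normalization
// parameters into the weights and bias of the preceding convolution or linear
// layer, so the batchnorm step vanishes from the inference path.
void yapi_fold_batchnorm_into_weights(
    float* W,            // weights, out_channels rows of k values each (row-major)
    float* bias,         // layer bias, length out_channels
    const float* gamma,  // batchnorm scale, length out_channels
    const float* beta,   // batchnorm shift, length out_channels
    const float* mean,   // batchnorm running mean, length out_channels
    const float* var,    // batchnorm running variance, length out_channels
    int out_channels, int k, float eps = 1e-5f)
{
    for (int c = 0; c < out_channels; c++) {
        float scale = gamma[c] / std::sqrt(var[c] + eps);
        for (int j = 0; j < k; j++) {
            W[c * k + j] *= scale;                        // fold scale into weights
        }
        bias[c] = scale * (bias[c] - mean[c]) + beta[c];  // fold mean/shift into bias
    }
}
After this offline folding, the batchnorm sublayer disappears entirely from the inference path, since its effect is baked into the adjusted weights and bias.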
Fused Multi-Head Attention
Research papers on fused multi-head attention ("fused MHA") include:
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- B Hagedorn, B Fan, H Chen, C Cecka, 2023, Graphene: An IR for Optimized Tensor Computations on GPUs, ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada, PDF: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018 (Includes various kernel fusions including fused MHA and fused LayerNorm.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine with fused MHA and other optimizations.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
Fused Softmax
Research papers on fused Softmax functions:
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Section "A.2 Fusion Implementation" lists various types of fusion, including: normalization/layernorm, Softmax, RELU, bias, and dropout.)
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
Fused Activation Functions
Common examples of activation function fusion include:
- Fused RELU
- Fused GELU
- Fused SwiGLU
Example: Fused RELU. It really seems to me that RELU should win the award for the most obscure name for a simple thing. I mean, it just makes all negatives into zero. Why isn't it called "make positive" or something? Oh, but that's wrong, because zero isn't positive. Okay, so let's name it: Make It Zero Unless It's Already Positive (MIZUIAP). See where I'm going with this?
The good news is that RELU is easy to code in C++:
float yapi_RELU_basic(float f)   // Basic RELU (inefficient)
{
    if (f <= 0.0) return 0.0;
    return f;
}
In fact, it's so simple that it doesn't even deserve to be a C++ function:
#define YAPI_RELU_MACRO(f) ( (f) <= 0.0 ? 0.0 : (f) ) // Macro RELU
So let's say we want a RELU after our MatMul. Here's how to do it at the level of the vector dot product (inside the MatMul), applying RELU to the result:
float yapi_nonfused_vecdot_RELU_basic(float v1[], float v2[], int n)   // Basic non-fused dot product + RELU
{
    float f = yapi_vecdot_basic(v1, v2, n);  // Basic vector dot product
    f = YAPI_RELU_MACRO(f);
    return f;
}
This is unnecessarily inefficient. So let's "fuse" the RELU into the vector dot product code, right at the end:
float yapi_fused_vecdot_RELU_basic(float v1[], float v2[], int n)   // Basic fused dot product + RELU
{
    float sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += v1[i] * v2[i];
    }
    return YAPI_RELU_MACRO(sum);
}
That example is so simplistic that it almost doesn't seem like an optimization. Surely that can't be all there is to kernel operator fusion? Well, RELU is very simple, and the example vector dot product above hasn't been optimized either. Believe me, once you look at the real code for fused operators, it gets squirrelly very fast.
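For a slightly fuller flavor, here is a sketch of a fused MatMul-add-bias-RELU over a matrix-vector multiply, in the naming style of the earlier examples (the function name is my own invention); a real kernel would add vectorization, tiling, and parallelization on top of this structure.
// Hypothetical sketch of a fused MatMul-add-bias-RELU (matrix-vector version).
void yapi_fused_matvec_addbias_RELU_basic(
    const float* W,      // weights matrix, m x n, row-major
    const float* v,      // input vector, length n
    const float* bias,   // bias vector, length m
    float* out,          // output vector, length m
    int m, int n)
{
    for (int i = 0; i < m; i++) {
        const float* row = &W[i * n];
        float sum = 0.0f;
        for (int j = 0; j < n; j++) {
            sum += row[j] * v[j];                 // dot product of row i with v
        }
        sum += bias[i];                           // fused add-bias (no intermediate vector)
        out[i] = (sum <= 0.0f) ? 0.0f : sum;      // fused RELU on the result
    }
}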
General Research Papers on Fused Activation Functions
Some of the research papers on fused activation functions include:
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
Research on Fused RELU
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Section "A.2 Fusion Implementation" lists various types of fusion, including: normalization/layernorm, Softmax, RELU, bias, and dropout.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
- D. Das, N. Mellempudi, D. Mudigere, D. D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. O. Pirogov, “Mixed precision training of convolutional neural networks using integer operations,” CoRR, vol. abs/1802.00930, 2018. http://arxiv.org/abs/1802.00930 (Fused element-wise layers with RELU and batch normalization.)
Research on Fused GELU
- Wenxuan Zeng, Meng Li, Wenjie Xiong, Tong Tong, Wen-jie Lu, Jin Tan, Runsheng Wang, Ru Huang, Aug 2023, MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention, https://arxiv.org/abs/2211.13955, PDF: https://openaccess.thecvf.com/content/ICCV2023/papers/Zeng_MPCViT_Searching_for_Accurate_and_Efficient_MPC-Friendly_Vision_Transformer_with_ICCV_2023_paper.pdf, Code: https://github.com/PKU-SEC-Lab/mpcvit (Optimizes Softmax, GELU, and MatMul. Fuses two linear layers with an approximated linear version of GELU.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine with fused GroupNorm + GELU operator in NHWC tensor memory format)
Research on Fused SwiGLU
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989, Code: http://github.com/linkedin/Liger-Kernel
Fused Add-Bias
Research papers on fused add-bias functions:
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
- Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
- Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Examines benefit from add-bias fusion, as well as many other fusion opportunities. Section "A.2 Fusion Implementation" lists various types of fusion of bias operations.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
- David Spuler, March 2024, Example: Fused VMM-add-bias, in Generative AI in C++, https://www.aussieai.com/book/ch31-fused-vmm-addbias
Fused Multiply-Add (FMA)
Research papers on fused multiply-add (FMA) include the following (a short FMA dot-product example appears after this list):
- E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke, “Anatomy of high-performance deep learning convolutions on simd architectures,” in Accepted to Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18. IEEE Press, 2018, https://arxiv.org/abs/1808.05567
- D. Das, N. Mellempudi, D. Mudigere, D. D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. O. Pirogov, “Mixed precision training of convolutional neural networks using integer operations,” CoRR, vol. abs/1802.00930, 2018. http://arxiv.org/abs/1802.00930 (Analyses fused multiply-accumulate, and fuses element-wise layers with RELU and batch normalization.)
- Y. Nievergelt, Scalar fused multiply-add instructions produce floating-point matrix arithmetic provably accurate to the penultimate digit, ACM Trans. Math. Softw., 29 (2003), pp. 27–48, https://dl.acm.org/doi/10.1145/641876.641878
- S. Boldo, J.-M. Muller, 2005, Some functions computable with a Fused-mac, Proceedings of the 17th Symposium on Computer Arithmetic, Cape Cod, USA, 2005. https://ieeexplore.ieee.org/document/1467622, PDF: https://perso.ens-lyon.fr/jean-michel.muller/FmacArith.pdf
- S Graillat, V Ménissier-Morain, 2012, Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic, Information and Computation, Volume 216, July 2012, Pages 57-71, https://www.sciencedirect.com/science/article/pii/S0890540112000715
- S Graillat, P Langlois, N Louvet, G Hanrot, 2006, Accurate dot products with FMA, https://inria.hal.science/inria-00107213/file/rnc7-proceedings-1.pdf#page=148
- Christoph Peters, 2021, fma: A faster, more accurate instruction, Moments in Graphics (Blog), https://momentsingraphics.de/FMA.html
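As a quick illustration of the references above, here is a minimal sketch of a dot product accumulated with the C++ standard library's std::fma, which computes a*b + c with a single rounding and typically compiles to a hardware FMA instruction. The function name follows the style of the earlier examples and is my own.
#include <cmath>

// Minimal sketch: dot product accumulation using std::fma.
float yapi_vecdot_fma(const float v1[], const float v2[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum = std::fma(v1[i], v2[i], sum);  // sum += v1[i] * v2[i], with a single rounding
    }
    return sum;
}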
Fused Matrix Transpose
Research papers on fused matrix transpose operations:
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, Bin Ren, 2021, DNNFusion: accelerating deep neural networks execution with advanced operator fusion, PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, June 2021, Pages 883–898, https://doi.org/10.1145/3453483.3454083, https://dl.acm.org/doi/10.1145/3453483.3454083, https://arxiv.org/abs/2108.13342 (Includes some discussion of fusing matrix transposition.)
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Mentions of optimizations of the transpose operation, with numerous other optimizations.)
Kernel Fission
Kernel fission is somewhat the opposite of kernel fusion. Whereas kernel fusion merges two operators into one kernel, kernel fission splits a single kernel into two simpler kernels. The approach is analogous to the loop transformations of loop fusion (merging two loops) versus loop fission (splitting a loop in two).
The optimization goal of kernel fission is usually to create simpler kernels that can be parallelized more efficiently. Another goal may be to run the two resulting kernels in parallel with each other, rather than merged. Data locality and cache access speed can occasionally improve, but more often they get worse, because two kernel loops now scan the same data twice. At least one of the split-out pair of simpler kernels must run much faster on its own, usually by accessing hardware acceleration; otherwise we've simply added extra overhead and worsened the overall performance.
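As a toy illustration (my own sketch, not drawn from the papers below), kernel fission could split a fused add-bias-RELU kernel back into two simpler element-wise kernels:
// Before fission: one fused pass over the data.
void yapi_addbias_relu_fused(float v[], const float bias[], int n)
{
    for (int i = 0; i < n; i++) {
        float x = v[i] + bias[i];
        v[i] = (x <= 0.0f) ? 0.0f : x;
    }
}

// After fission: two simpler kernels, each trivially parallelizable
// (e.g. one might map onto an accelerated vector-add primitive).
void yapi_addbias_kernel(float v[], const float bias[], int n)
{
    for (int i = 0; i < n; i++) v[i] += bias[i];                      // element-wise add
}

void yapi_relu_kernel(float v[], int n)
{
    for (int i = 0; i < n; i++) v[i] = (v[i] <= 0.0f) ? 0.0f : v[i];  // element-wise RELU
}
The split version scans the vector twice, so it only pays off if at least one of the simpler kernels runs on a faster hardware path or can be overlapped with other work.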
Research on Kernel Fission
Research papers on kernel fission are below; see also loop fission.
- P Gibson, J Cano, J Turner, EJ Crowley, 2020, Optimizing grouped convolutions on edge devices, 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), https://ieeexplore.ieee.org/abstract/document/9153227/, PDF: https://arxiv.org/pdf/2006.09791 (Examines loop fission amongst various other convolution optimizations.)
- D. Jung, W. Jung, , B. Kim, S. Lee, W. Rhee, and J. H. Ahn, Restructuring batch normalization to accelerate CNN training, 2018. PDF: https://mlsys.org/Conferences/2019/doc/2019/18.pdf, Code: https://github.com/scale-snu/caffe-bn-restructuring, Code: https://github.com/scale-snu/mkldnn-bn-restructuring (Coverage of a technique called "batch norm fission".)
- William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko, 2023, High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023, Pages 119–134, https://dl.acm.org/doi/abs/10.1145/3572848.3577475, PDF: https://dl.acm.org/doi/pdf/10.1145/3572848.3577475 (Parallel loop splitting is used.)
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf (Fusion and fission optimizations with example algorithms on p.40 and p.45.)
- Haicheng Wu, Gregory Diamos, Jin Wang, Srihari Cadambi, Sudhakar Yalamanchili, and Srimat Chakradhar. 2012. Optimizing data warehousing applications for GPUs using kernel fusion/fission. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 2433–2442. https://ieeexplore.ieee.org/document/6270615, PDF: http://www.istc-cc.cmu.edu/publications/papers/2012/optimizing-warehousing.pdf (Theoretical analysis of which algebraic operators can be merged in kernel fusion or split in kernel fission.)
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- David Spuler, March 2024, Chapter 31. Kernel Fusion, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- David Spuler, March 2024, Kernel Fission, in Generative AI in C++, https://www.aussieai.com/book/ch31-kernel-fission
More AI Research
Read more about:
- Layer fusion
- Layer pruning
- Memory optimization
- Inference Optimizations
- Loop Optimizations
- Code Optimizations