Aussie AI

Kernel Operator Fusion

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

What is Kernel Fusion?

Kernel fusion, also called kernel operator fusion, refers to the optimization of merging different algorithmic components in an AI inference engine. Combining two operations is often faster than running them sequentially. In neural network papers, the technique is often shortened to "kernel fusion", "operator fusion", or just "fusion". The opposite of kernel fusion is "kernel fission", which splits a kernel operator into two operators. Both are closely related to the loop transformation optimizations of loop fusion and loop fission.

Both kernel fusion and kernel fission involve changes to the kernel code doing the inference rather than the model's data, and are a type of dynamic inference optimization. Many kernel optimizations can apply to both inference and training.

The primary goal of kernel fusion is often to reduce data movement and memory usage, rather than to reduce computation, but the two metrics are closely linked and often both goals are achieved. Merging two operations into one avoids the cost of storing intermediate results to memory. Hence, kernel fusion is a type of memory optimization technique for AI inference engines.

Kernel fusion is distinct from layer fusion, which usually refers to weight sharing across multiple layers of a model. Operator fusion merges the algorithmic steps, but does not reduce the number of weights or share parameters. Also, operator fusion does not usually skip any model steps, making it different from techniques such as pruning (e.g. layer pruning). Kernel fusion works by merging two major steps so that the intermediate data does not have to be stored and re-read for the subsequent computation, but both steps are still effectively performed.

Operator fusion has been an active area of research for deep learning compilers for years. These compiler frameworks can apply many optimizations on the low-level final execution version of the model. However, operator fusion can also be used in the higher-level kernel of Transformer inference engines, without resorting to a model compiler framework. Researchers have found various ways to speed up a vanilla Transformer using kernel fusion techniques. Several of the major component steps in the Transformer algorithm (e.g. attention heads, normalization, FFN, Softmax, etc.) can be merged using kernel fusion ideas.

Overall, kernel fusion is one of many possible inference optimization techniques. Other comparable techniques include memory optimizations, skipping methods, conditional computation, and caching.

Example of Kernel Fusion: MatMul-add-bias

Kernel fusion works by combining the code for two operations into a single merged operation. For example, a typical linear layer will do a matrix multiplication ("MatMul"), followed by adding the bias vector ("add-bias").

The same operands are used as without fusion. The MatMul is a matrix-vector multiply that takes two operands: a weights matrix and a vector of calculated internal data. The add-bias is a vector addition that takes two operands: a bias vector and the calculated vector (from the MatMul). In the non-fused method, the MatMul first multiplies the weights matrix by the input vector, creating a new vector. The add-bias operation then adds the bias vector to that vector, outputting the final resulting vector. Both MatMul and add-bias are working on the same vector of propagated internal data, so these two operations can be "fused" to create a combined "MatMul-add-bias" operator. The result is a "three-way" fused operator, which has three inputs (the weights matrix, the bias vector, and the incoming calculated vector), and a single vector output.

Kernel fusion is an exact calculation done faster, not an approximation. The result of the merged MatMul-add-bias combined operation is not approximated. The output is exactly the same vector that would result from sequentially doing a MatMul and then an add-bias.

How is this any faster? It's true that the fused operator does the same number of calculations in terms of multiplications and additions (i.e. the same FLOPs, assuming it's floating-point data). The speedup comes from "fusing" the add-bias calculations into the same code as the MatMul, which avoids some computation overhead, such as scanning through the vector twice. An entire vector of temporary data is also eliminated (the intermediate MatMul result that would otherwise feed into the add-bias), which reduces data transfers and improves memory usage and dataflow, also benefiting wall-clock speed.
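
As a simple illustration of the idea, here is a minimal sketch of a fused MatMul-add-bias operator in C++ (illustrative code and naming only, not from any particular library). Each output element is a dot product of one matrix row with the input vector, with the bias added before the result is stored, so the intermediate unbiased vector is never written to memory:

    void fused_matmul_addbias(
        const float* W,      // weights matrix, m x n, row-major
        const float* x,      // input vector, length n
        const float* bias,   // bias vector, length m
        float* out,          // output vector, length m
        int m, int n)
    {
        for (int row = 0; row < m; row++) {
            float sum = 0.0f;
            const float* wrow = &W[row * n];
            for (int col = 0; col < n; col++) {
                sum += wrow[col] * x[col];   // matrix-vector multiply
            }
            out[row] = sum + bias[row];      // add-bias fused into the same pass
        }
    }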

Compiler-Optimized Kernel Operator Fusion

Many of the machine learning compilers have various operator fusion methods. This is a long-standing area of research with many papers.

Convolution Optimizations

Convolutional Neural Networks (CNNs) use "convolution" operations, which are conceptually similar to simplified feed-forward networks, except they are not fully interconnected. Various further optimizations of convolutions have been researched over the years, many of which are an early type of "kernel fusion" optimization, although they don't always call it by that term. Examples of convolution optimizations include:

  • Depth-wise separable convolutions
  • Grouped convolutions

Depthwise-separable convolutions: Papers on depthwise-separable convolutional neural network optimizations:

Grouped Convolutions: Research papers on grouped convolutions:

General Research on Kernel Operator Fusion

Research papers on low-level operator fusion in the kernel of a neural network include:

Transformer Component Kernel Operator Fusion

The operations performed by Transformer components can be fused, which means combining the code of two operations into one, reducing the overhead of temporary data and other costs. Examples of what is possible for fused operations appear in the component-specific sections below, such as fused normalization, fused multi-head attention, fused Softmax, fused activation functions, and fused add-bias.

Note that most operator fusion changes are not approximations. The goal is to implement the exact same functionality as two operators, but do it faster in a single combined loop or function. It is, of course, sometimes possible to incorporate approximations as part of operator fusion, but that's really a separate method (see approximation optimizations).

Research papers generally on kernel operator fusion specific to Transformers:

  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
  • Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102 (Includes multiple types of fusion of operations in the QKV attention head tensors, such as fusing the feedforward layer with the query projection matrix, and fusion of KV tensors.)
  • J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)
  • Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
  • B Hagedorn, B Fan, H Chen, C Cecka, 2023, Graphene: An IR for Optimized Tensor Computations on GPUs, ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada, PDF: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018 (Includes various kernel fusions including fused MHA and fused LayerNorm.)
  • Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
  • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Extensive analysis of fusion opportunities in Transformers, with evaluation of several of them, such as fused LayerNorm, and theory about various other operator fusions; see the main paper body and also a long list of fusion methods in the Appendix Section "A.2 Fusion Implementation". Various merged combinations are listed, involving the components: normalization/layernorm, Softmax, RELU, bias, and dropout.)
  • DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
  • Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
  • Christian Sarofeen, 2022, The Next Generation of GPU Performance in PyTorch with nvFuser, NVIDIA On-Demand, GTC 2022, https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41958/ (Note: nvFuser seems to be deprecated.)
  • Christian Sarofeen, 2021, NVIDIA On-Demand, Dynamic Shapes First: Advanced GPU Fusion in PyTorch, https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31952/ (Note: nvFuser seems to be deprecated.)
  • Forrest Iandola, Albert Shaw, Ravi Krishna, and Kurt Keutzer. 2020. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 124–135. Association for Computational Linguistics. https://arxiv.org/abs/2006.11316 (Merging of self-attention with grouped convolutions is similar to kernel operator fusion.)
  • Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that uses a custom CUDA kernel for WKV computations.)
  • Lilit Yolyan, May 20, 2022, Inference Optimization for Convolutional Neural Networks, Quantization and fusion for faster inference, Towards Data Science, https://towardsdatascience.com/inference-optimization-for-convolutional-neural-networks-e63b51b0b519
  • chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine)
  • Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu, June 2023, Full parameter fine-tuning for large language models with limited resources, arXiv preprint arXiv:2306.09782, https://arxiv.org/abs/2306.09782 (Fused gradient computation and parameter update saves memory in training kernel by not saving the gradient tensor in memory.)
  • Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163 (Implements "feature fusion" which is analogous to kernel fusion.)
  • RY Aminabadi, S Rajbhandari, AA Awan, 2022, DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale, https://arxiv.org/abs/2207.00032 PDF: https://arxiv.org/pdf/2207.00032

Fused Normalization

Papers on fused normalization sublayers are available for fusion of LayerNorm, BatchNorm, and GroupNorm.
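
As a concrete illustration of what normalization fusion can mean, here is a minimal C++ sketch of a fused add-bias-LayerNorm operation (a hypothetical example of the AddBias-LayerNorm combination mentioned in the SAMP paper above; the names and structure are illustrative, not from any library). The bias add is folded into the first pass of the LayerNorm, so the biased vector never needs to be stored as a separate intermediate result:

    #include <cmath>

    void fused_addbias_layernorm(
        float v[],            // input vector, overwritten with the result
        const float bias[],   // bias vector added before normalizing
        const float gamma[],  // LayerNorm scale parameters
        const float beta[],   // LayerNorm shift parameters
        int n,
        float eps = 1e-5f)
    {
        // Pass 1: add the bias and accumulate the mean in the same loop.
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] += bias[i];
            sum += v[i];
        }
        float mean = sum / n;

        // Pass 2: accumulate the variance.
        float var = 0.0f;
        for (int i = 0; i < n; i++) {
            float d = v[i] - mean;
            var += d * d;
        }
        float inv_std = 1.0f / sqrtf(var / n + eps);

        // Pass 3: normalize, scale, and shift in one final pass.
        for (int i = 0; i < n; i++) {
            v[i] = gamma[i] * ((v[i] - mean) * inv_std) + beta[i];
        }
    }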

Fused LayerNorm: Papers on kernel fusion of layer normalization:

Fused BatchNorm: Papers on kernel fusion of batch normalization:

Fused GroupNorm: Papers on kernel fusion of group normalization:

Fused Multi-Head Attention
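
Fused MHA kernels combine the attention score computation, the Softmax, and the weighted sum over the values into a single kernel, rather than launching separate operations and writing the full attention matrix out to memory in between. The sketch below shows the idea for a single query vector on the CPU (illustrative code only; real fused-MHA kernels, such as those mentioned in the Transformer papers above, operate on whole batches of heads inside one GPU kernel):

    #include <cmath>
    #include <cfloat>
    #include <vector>

    void fused_attention_one_query(
        const float* q,      // query vector, length d
        const float* K,      // keys, n x d, row-major
        const float* V,      // values, n x d, row-major
        float* out,          // output vector, length d
        int n, int d)
    {
        std::vector<float> score(n);   // small per-query buffer, not a full n x n attention matrix
        float scale = 1.0f / sqrtf((float)d);

        // Scores: scaled dot product of the query with each key, tracking the maximum.
        float smax = -FLT_MAX;
        for (int i = 0; i < n; i++) {
            float s = 0.0f;
            for (int j = 0; j < d; j++) s += q[j] * K[i * d + j];
            score[i] = s * scale;
            if (score[i] > smax) smax = score[i];
        }

        // Softmax numerators and their sum (max-subtraction for numerical stability).
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            score[i] = expf(score[i] - smax);
            sum += score[i];
        }

        // Weighted sum over V, with the Softmax normalization fused into the final loops.
        float inv = 1.0f / sum;
        for (int j = 0; j < d; j++) out[j] = 0.0f;
        for (int i = 0; i < n; i++) {
            float w = score[i] * inv;
            for (int j = 0; j < d; j++) out[j] += w * V[i * d + j];
        }
    }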

Research papers on fused multi-head attention ("fused MHA") include:

Fused Softmax
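
One common form of Softmax fusion is to fold the preceding scaling of the attention scores into the Softmax itself, so the scores do not need a separate scaling pass. Here is a minimal C++ sketch of a fused scale-and-Softmax (illustrative code, not from any particular engine), using the standard max-subtraction trick; folding the scale into the exponent keeps the usual numerical-stability guarantee only for a positive scale factor, which is the case for the usual 1/sqrt(d) attention scaling:

    #include <cmath>
    #include <cfloat>

    void fused_scaled_softmax(float v[], int n, float scale)   // scale must be positive
    {
        // Pass 1: find the maximum (for numerical stability).
        float vmax = -FLT_MAX;
        for (int i = 0; i < n; i++) {
            if (v[i] > vmax) vmax = v[i];
        }

        // Pass 2: exponentiate with the scale fused into the exponent,
        // accumulating the sum at the same time.
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(scale * (v[i] - vmax));
            sum += v[i];
        }

        // Pass 3: normalize.
        float inv = 1.0f / sum;
        for (int i = 0; i < n; i++) {
            v[i] *= inv;
        }
    }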

Research papers on fused Softmax functions:

Fused Activation Functions

Common examples of activation function fusion include:

  • Fused RELU
  • Fused GELU
  • Fused SwiGLU

Example: Fused RELU. It really seems to me that RELU should win the award for the most obscure name for a simple thing. I mean, it just makes all negatives into zero. Why isn't it called "make positive" or something? Oh, but that's wrong, because zero isn't positive. Okay, so let's name it: Make It Zero Unless It's Already Positive (MIZUIAP). See where I'm going with this?

The good news is that it's easy to code RELU in C++:

    float yapi_RELU_basic(float f)   // Basic RELU (inefficient)
    {
        if (f <= 0.0f) return 0.0f;
        return f;
    }

In fact, it's so simple that it doesn't even deserve to be a C++ function:

    #define YAPI_RELU_MACRO(f)  ( (f) <= 0.0f ? 0.0f : (f) )   // Macro RELU

So let's say we want a RELU after our MatMul. Here's how to do it at the level of the vector dot product (inside a MatMul), applying RELU to the result:

    float yapi_nonfused_vecdot_RELU_basic(float v1[], float v2[], int n)   // Basic non-fused dot product + RELU
    {
        float f = yapi_vecdot_basic(v1, v2, n);  // Basic vector dot product (defined elsewhere)
        f = YAPI_RELU_MACRO(f);
        return f;
    }

This is unnecessarily inefficient. So let's "fuse" the RELU inside of the vector dot product code, right at the end:

    float yapi_fused_vecdot_RELU_basic(float v1[], float v2[], int n)   // Basic fused dot product + RELU
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += v1[i] * v2[i];
        }
        return YAPI_RELU_MACRO(sum);
    }

That example is so simplistic it almost seems like it's not an optimization. Surely that can't be all there is to kernel operator fusion? Well, sure, RELU is very simple, and the above example vector dot product hasn't been optimized. Believe me, if you look at the real code for fused operators, it gets squirrelly very fast.

General Research Papers on Fused Activation Functions

Some of the research papers on fused activation functions include:

  • Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
  • J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578 (Turbotransformers uses various kernel fusions, such as fused LayerNorm, fused activation functions and fused transpose operations.)

Research on Fused RELU

  • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Section "A.2 Fusion Implementation" lists various types of fusion, including: normalization/layernorm, Softmax, RELU, bias, and dropout.)
  • N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
  • D. Das, N. Mellempudi, D. Mudigere, D. D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. O. Pirogov, “Mixed precision training of convolutional neural networks using integer operations,” CoRR, vol. abs/1802.00930, 2018. http://arxiv.org/abs/1802.00930 (Fused element-wise layers with RELU and batch normalization.)

Research on Fused GELU
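
As an illustration, here is a minimal C++ sketch of a fused add-bias-GELU operation, using the common tanh approximation of GELU (illustrative code, not taken from any particular library; fused GELU and fused bias kernels are mentioned in several of the Transformer papers cited earlier). The bias add and the activation are applied in a single pass over the vector, avoiding an intermediate biased vector:

    #include <cmath>

    void fused_addbias_gelu(float v[], const float bias[], int n)
    {
        const float k = 0.7978845608f;   // sqrt(2/pi)
        for (int i = 0; i < n; i++) {
            float x = v[i] + bias[i];    // add-bias
            float t = tanhf(k * (x + 0.044715f * x * x * x));
            v[i] = 0.5f * x * (1.0f + t);   // GELU (tanh approximation)
        }
    }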

Research on Fused SwiGLU

  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, 2022, xFormers: A modular and hackable Transformer modelling library, Facebook Research, Code: https://github.com/facebookresearch/xformers (Supports several kernel fusion operations such as: fused softmax, fused linear layer, fused LayerNorm, fused dropout, fused SwiGLU.)
  • Byron (Pin-Lun)Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel

Fused Add-Bias

Research papers on fused add-bias functions:

  • Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018. https://arxiv.org/abs/1712.05877 (Fuses bias-addition into MatMul, and fuses activation functions and batch normalization with convolutional layers.)
  • Y Zhai, C Jiang, L Wang, X Jia, S Zhang, 2023, ByteTransformer: A high-performance transformer boosted for variable-length inputs, https://ieeexplore.ieee.org/abstract/document/10177488/, https://arxiv.org/abs/2210.03052 (ByteTransformer uses fused MHA and kernel operator fusion such as fused LayerNorm, fused add-bias, fused GELU, and Softmax fusion.)
  • Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission included kernel fusions such as fused MHA, fused bias, and fused GELU.)
  • Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, Sep 2022, SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision, https://arxiv.org/pdf/2209.09130.pdf (Mixed-precision quantization combined with kernel fusion, including QKV tensor operation fusion and AddBias-LayerNorm fusion.)
  • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072, Code: https://github.com/spcl/substation (Examines benefit from add-bias fusion, as well as many other fusion opportunities. Section "A.2 Fusion Implementation" lists various types of fusion of bias operations.)
  • N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
  • David Spuler, March 2024, Example: Fused VMM-add-bias, in Generative AI in C++, https://www.aussieai.com/book/ch31-fused-vmm-addbias

Fused Multiply-Add (FMA)
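
Note that fused multiply-add usually refers to fusion at the hardware instruction level rather than at the kernel operator level: an FMA instruction computes a*b+c in one step with a single rounding. Compilers will often emit FMA instructions automatically at higher optimization levels on targets that support them, but it can also be requested explicitly, as in this small hypothetical sketch of a dot product using std::fma:

    #include <cmath>

    float fma_dot_product(const float v1[], const float v2[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum = std::fma(v1[i], v2[i], sum);   // sum += v1[i] * v2[i], with a single rounding
        }
        return sum;
    }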

Research papers on fused multiply-add (FMA) include:

Fused Matrix Transpose
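
Transpose fusion is mentioned in several of the papers above (e.g. TurboTransformers and DeepSpeed). The idea is that, instead of materializing a transposed matrix in memory and then using it, the consuming operation simply reads the matrix with its indices swapped, so the transpose costs no extra pass and no extra buffer. Here is a minimal illustrative sketch for a matrix-vector multiply against a transposed matrix (a real kernel would tile these loops for better cache behavior):

    // Computes out = W^T * x without ever materializing W^T.
    void fused_transpose_matvec(
        const float* W,   // m x n matrix, row-major
        const float* x,   // input vector, length m
        float* out,       // output vector, length n
        int m, int n)
    {
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            for (int row = 0; row < m; row++) {
                sum += W[row * n + col] * x[row];   // reads W as if it were transposed
            }
            out[col] = sum;
        }
    }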

Research papers on fused matrix transpose operations:

Kernel Fission

Kernel fission is somewhat the opposite of kernel fusion. Whereas kernel fusion merges two operators into one kernel, kernel fission splits a single kernel into two simpler kernels. The approach is analogous to the loop transformations of loop fusion (merging two loops) versus loop fission (splitting a loop in two).

The optimization goal of kernel fission is usually to create simpler kernels that can be more efficiently parallelized. Another goal may be to run the two kernels in parallel with each other, rather than merged. There can sometimes be an advantage in terms of data locality and cache access speed, but more often locality is worsened, because two kernel loops now scan through the same data twice. At least one of the two split-out kernels must run much faster on its own, usually by accessing hardware acceleration; otherwise we've simply added extra overhead and worsened the overall performance.
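
Here is a tiny illustrative C++ sketch of the idea (a hypothetical example, not from any particular engine): a single loop that both scales a vector and exponentiates it is split into two simpler kernels, so that the pure scaling loop can be handed off to a vectorized or hardware-accelerated routine, while the elementwise exp loop remains separate:

    #include <cmath>

    // Fused (single-kernel) version: one pass over the data, but the mixed
    // operations make it harder to hand off to an accelerated library routine.
    void scale_exp_fused(float v[], int n, float c)
    {
        for (int i = 0; i < n; i++) {
            v[i] = expf(c * v[i]);
        }
    }

    // Fission version: two simpler kernels that scan the data twice.
    void scale_kernel(float v[], int n, float c)   // candidate for hardware acceleration
    {
        for (int i = 0; i < n; i++) v[i] *= c;
    }

    void exp_kernel(float v[], int n)              // separate elementwise pass
    {
        for (int i = 0; i < n; i++) v[i] = expf(v[i]);
    }

    void scale_exp_fission(float v[], int n, float c)
    {
        scale_kernel(v, n, c);
        exp_kernel(v, n);
    }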

Research on Kernel Fission

Research papers on kernel fission are below; see also loop fission.
