Aussie AI

Normalization Optimizations

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Research has suggested various ways to speed up the normalization component. Examples of normalization improvements include:

  • Normalization alternatives
  • Normalization approximations
  • Removing normalization ("norm pruning")
  • Placement of normalization blocks (i.e. "pre-norm" vs "post-norm")
  • Fused normalization (see kernel operator fusion)

Normalization Implementation and Optimization

Normalization functions are not usually as significant as MatMul in terms of time cost, but they can still be significant. A typical normalization requires multiple scans over all of the elements of the output vector, and this is done multiple times per token throughout each inference phase.

Example: BatchNorm in C++: The batch normalization operation involves scanning the full vector and modifying each element so that it is re-centered to a zero mean and re-scaled to unit variance. A naive, non-optimized C++ version of BatchNorm looks like this:

    void yapi_vector_batch_normalize_basic(    // Basic normalization (BatchNorm)
	float v[], int n, 
	float epsilon, // Smoothing term -- usually 1e-5 (0.00001)
	float lambda, // Scaling term hyper-parameter (multiplication)
	float beta    // Bias/shift term hyper-parameter (addition)
    ) 
    {
	float fmean = yapi_vector_mean(v, n);  // Calculate "mean" (aka average)
	float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
	
	float denom = sqrtf(variance + epsilon);  // like std. deviation, but smoothed by epsilon
	for (int i = 0; i < n; i++) {
		v[i] = (v[i] - fmean) / denom; // Normalize all elements to re-center and scale
	}
	yapi_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
	yapi_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 
    }

This version is very inefficient, with literally five scans of the entire vector. Loop fusion can obviously improve this: the loops that multiply by lambda and add beta can be merged into the prior for loop. Another optimization is to replace the division by "denom" with a multiplication by its reciprocal, since division is often an order of magnitude slower than multiplication.
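
For illustration, a single fused loop with the reciprocal multiplication might look like this sketch, reusing the fmean and denom variables computed in the function above:

	float recip = 1.0f / denom;  // Reciprocal replaces the per-element division
	for (int i = 0; i < n; i++) {
		v[i] = (v[i] - fmean) * recip * lambda + beta;  // One pass: re-center, scale, lambda, beta
	}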

Further optimizations become clear once we notice that each element of the vector has four operations being performed on it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta. We can use a loop fission optimization to split the first two operations into separate loops, since simpler loops are easier to hardware-accelerate. We then notice that division and multiplication are two versions of the same operation, so we can use the loop fusion technique to merge the division-by-denom and the multiplication-by-lambda into a single multiplication by a combined scaling factor. These changes result in faster C++ code that has one fewer loop and calls only atomic vector operations (which are easier to hardware-accelerate):

	yapi_vector_add_scalar(v, n, -fmean);  // Subtract the mean
	float scalef = lambda / denom;  // Combined scale factor
	yapi_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda 
	yapi_vector_add_scalar(v, n, beta);  // Add beta hyper-param to all values 

Another way to optimize this code is simply to remove the lambda and beta variables. Choosing lambda=1 and beta=0 means that the final scalar-multiplication and scalar-addition loops can be avoided. However, in the merged code above there is now little benefit to removing lambda, although the add-beta loop can still be removed. In any case, whether we can remove these parameters is not a speed decision; it depends on whether these two learned parameters are important to the overall model's capability. Note that there is also little value in trying to remove epsilon, as it is only used once in total.
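
For example, if lambda and beta are removed (i.e., lambda=1 and beta=0), the merged code above reduces to a sketch like this:

	yapi_vector_add_scalar(v, n, -fmean);             // Subtract the mean
	yapi_vector_multiply_scalar(v, n, 1.0f / denom);  // Scale by the reciprocal of denom (lambda=1)
	// No final add-scalar loop is needed, because beta=0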

Research on Optimizing Normalization: Research papers on fast versions of normalization functions:

Approximating Normalization

Research on approximate normalization functions:

Integer-Only Normalization

One approximation for normalization is to use integer-only arithmetic (see also the overview of integers in inference). Research on integer-only normalization algorithms (an illustrative sketch follows this list):

  • Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI-PRICAI 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Integer-only quantized weights and activations with INT4 or INT8, but also uses integers for batch normalization and residual connection components, too.)
  • A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
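
As a rough illustration of the general idea (not the specific algorithm from any of the papers above), the sketch below normalizes an integer activation vector using only integer arithmetic: an integer square root replaces sqrtf, and a fixed-point reciprocal replaces the floating-point division. The function names are hypothetical, and the code assumes the activations are small quantized integers (e.g., int8 values held in int32) so the 64-bit accumulations do not overflow; a real kernel would also fold in requantization scales and the learned scaling/bias parameters.

    #include <stdint.h>

    // Integer square root via Newton's method (iterate while strictly decreasing)
    static uint64_t isqrt64(uint64_t x)
    {
	if (x < 2) return x;
	uint64_t r = x, prev;
	do { prev = r; r = (r + x / r) / 2; } while (r < prev);
	return prev;
    }

    void integer_normalize_sketch(int32_t v[], int n, int shift /* fixed-point bits, e.g. 16 */)
    {
	int64_t sum = 0;
	for (int i = 0; i < n; i++) sum += v[i];
	int64_t mean = sum / n;   // Integer mean
	int64_t var = 0;
	for (int i = 0; i < n; i++) {
		int64_t d = v[i] - mean;
		var += d * d;
	}
	var /= n;   // Integer variance
	int64_t stddev = (int64_t)isqrt64((uint64_t)var) + 1;  // The +1 plays the role of epsilon
	int64_t recip = ((int64_t)1 << shift) / stddev;  // Fixed-point reciprocal of the std. deviation
	for (int i = 0; i < n; i++) {
		// Normalized value, still an integer (arithmetic right shift assumed for negatives)
		v[i] = (int32_t)(((v[i] - mean) * recip) >> shift);
	}
    }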

Layer Normalization Placement Reordering (Pre-Norm/Post-Norm)

The original 2017 vanilla Transformer architecture (Vaswani et al., 2017) had a "post-norm" architecture. Subsequently, researchers found that switching to a "pre-norm" architecture, instead of post-norm, could fix one of the major problems with the original Transformer: it was initially unstable in training, requiring a "warmup" phase. Pre-norm was found to stabilize early training and remove the need for any special handling during warmup.

Since then, several researchers have explored where to place the layer normalization submodule. The general consensus seems to be that placing it before the computations ("pre-norm") is better than after the calculations ("post-norm"). However, there are papers arguing both ways, so there is still room for more definitive research.
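
As a rough sketch of the difference, the two placements vary only in whether normalization is applied to the sublayer's input or after the residual addition. The helper functions below (layer_norm, sublayer, and vector_add, where sublayer stands for attention or the FFN) are hypothetical placeholders, not a particular library's API:

    #include <string.h>

    // Hypothetical helpers (declarations only): in-place layer normalization,
    // an attention/FFN sublayer, and an elementwise vector addition.
    void layer_norm(float v[], int n);
    void sublayer(float v[], int n);
    void vector_add(float dst[], const float src[], int n);

    // Post-norm (original 2017 Transformer): x = layer_norm(x + sublayer(x))
    void post_norm_block(float x[], float tmp[], int n)
    {
	memcpy(tmp, x, n * sizeof(float));  // Copy for the sublayer path
	sublayer(tmp, n);                   // Attention or FFN
	vector_add(x, tmp, n);              // Residual connection: x += sublayer(x)
	layer_norm(x, n);                   // Normalize AFTER the residual addition
    }

    // Pre-norm: x = x + sublayer(layer_norm(x))
    void pre_norm_block(float x[], float tmp[], int n)
    {
	memcpy(tmp, x, n * sizeof(float));  // Copy for the sublayer path
	layer_norm(tmp, n);                 // Normalize BEFORE the sublayer
	sublayer(tmp, n);                   // Attention or FFN on the normalized input
	vector_add(x, tmp, n);              // Residual path itself stays un-normalized
    }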

Research on Normalization Alternatives and Optimization

Research papers on different types of normalization or other alternatives:

General Research on Normalization

Research papers on normalization issues in general:

RMSNorm Research

RMSNorm is based on the Root Mean Square (RMS) calculation: each element is scaled by the reciprocal of the RMS of the vector, skipping LayerNorm's mean-subtraction step (a minimal sketch follows this list). Research papers on RMSNorm include:

  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Biao Zhang, Rico Sennrich, 2019, Root Mean Square Layer Normalization, https://arxiv.org/abs/1910.07467
  • Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
  • David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
  • Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
  • David Spuler, March 2024, Root Mean Square Normalization, in Generative AI in C++, https://www.aussieai.com/book/ch24-rmsnorm-root-mean-square
  • Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
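
A minimal C++ sketch of RMSNorm, written in the style of the code above, is shown below; the function name and the learned per-element weight array g[] are illustrative, not taken from any particular library:

    #include <math.h>

    // Minimal RMSNorm sketch: v[i] = g[i] * v[i] / RMS(v),
    // where RMS(v) = sqrtf(mean(v[i]^2) + epsilon)
    void rmsnorm_basic(float v[], int n, const float g[], float epsilon)
    {
	float sumsq = 0.0f;
	for (int i = 0; i < n; i++) sumsq += v[i] * v[i];  // Sum of squares
	float recip = 1.0f / sqrtf(sumsq / n + epsilon);   // Reciprocal of the RMS
	for (int i = 0; i < n; i++) {
		v[i] = v[i] * recip * g[i];  // Scale and apply the learned weight
	}
    }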

LayerNorm Optimizations

Research papers on optimizing LayerNorm:

  • Jingyu Wang, Lu Zhang, Xueqing Li, Huazhong Yang, Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
  • Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik, 14 Oct 2024, SLaNC: Static LayerNorm Calibration, https://arxiv.org/abs/2410.10553
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 29 May 2024, On the Role of Attention Masks and LayerNorm in Transformers, https://arxiv.org/abs/2405.18781
  • David Spuler, March 2024, Layer Normalization, in Generative AI in C++, https://www.aussieai.com/book/ch24-layer-normalization
  • ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong, 6 Dec 2024, IterNorm: Fast Iterative Normalization, https://arxiv.org/abs/2412.04778

More AI Research

Read more about: