Aussie AI
Normalization Optimizations
Last Updated 12 December, 2024
by David Spuler, Ph.D.
Research has suggested various ways to speed up the normalization component. Examples of normalization improvements include:
- Normalization alternatives
- Normalization approximations
- Removing normalization ("norm pruning")
- Placement of normalization blocks (i.e. "pre-norm" vs "post-norm")
- Fused normalization (see kernel operator fusion)
Normalization Implementation and Optimization
Normalization functions are usually not as costly as MatMul, but their time cost can still be significant. A typical normalization requires multiple scans over all of the elements of the output vector, and this is done multiple times per token throughout each inference phase.
Example: BatchNorm in C++: The batch normalization operation involves scanning the full vector, modifying each element so that it is re-centered to a zero mean and re-scaled to unit variance. A naive, non-optimized C++ version of BatchNorm looks like this:
void yapi_vector_batch_normalize_basic(  // Basic normalization (BatchNorm)
    float v[], int n,
    float epsilon,  // Smoothing term -- usually 1e-5 (0.00001)
    float lambda,   // Scaling term hyper-parameter (multiplication)
    float beta      // Bias/shift term hyper-parameter (addition)
) {
    float fmean = yapi_vector_mean(v, n);  // Calculate "mean" (aka average)
    float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Variance (sum-of-diffs-squared)
    float denom = sqrtf(variance + epsilon);  // Like std. deviation, but smoothed by epsilon
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmean) / denom;  // Normalize all elements to re-center and scale
    }
    yapi_vector_multiply_scalar(v, n, lambda);  // Scale all values by lambda hyper-param
    yapi_vector_add_scalar(v, n, beta);         // Add beta hyper-param to all values
}
This version is very inefficient, with literally five scans of the entire vector. Loop fusion can obviously improve this: the loops that multiply by lambda and add beta can be merged into the prior for loop. Another optimization is to replace the division by "denom" with a multiplication by its reciprocal. Division is often an order of magnitude slower than multiplication.
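As an illustrative sketch of these two changes (assuming the same yapi_vector_* helper functions as above, and not intended as the definitive implementation), the three element-wise loops collapse into a single fused pass, with one division replaced by a reciprocal multiplication:

// Sketch only: fused BatchNorm element-wise loop with reciprocal multiplication.
void yapi_vector_batch_normalize_fused(
    float v[], int n, float epsilon, float lambda, float beta
) {
    float fmean = yapi_vector_mean(v, n);                        // Scan 1: mean
    float variance = yapi_vector_variance_of_mean(v, n, fmean);  // Scan 2: variance
    float recip = 1.0f / sqrtf(variance + epsilon);  // One division instead of n divisions
    for (int i = 0; i < n; i++) {
        // Scan 3: re-center, scale, and shift fused into a single loop
        v[i] = (v[i] - fmean) * recip * lambda + beta;
    }
}

Note that this is still three scans in total, because the mean and variance statistics must be known before any element can be normalized.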
Further optimizations become clear once we notice that each element of the vector has four operations performed on it: subtracting the mean, dividing by the denominator, multiplying by lambda, and adding beta. We can use a loop fission optimization to split the first two operations into separate loops, since simpler operations are more easily hardware-accelerated. Then we notice that division and multiplication are two versions of the same operation, so we can use loop fusion to merge the division-by-denom and the multiplication-by-lambda into a single multiplication by a combined scaling factor. These changes produce faster C++ code with one less loop, which also calls atomic vector operations (easier to hardware accelerate):
yapi_vector_add_scalar(v, n, -fmean);       // Subtract the mean
float scalef = lambda / denom;              // Combined scale factor
yapi_vector_multiply_scalar(v, n, scalef);  // Scale by both denom and lambda
yapi_vector_add_scalar(v, n, beta);         // Add beta hyper-param to all values
Another way to optimize this code is simply to remove the lambda and beta parameters. Choosing lambda=1 and beta=0 means that the final scalar multiplication and scalar addition loops can be avoided. However, there is now little benefit to removing lambda in the merged code above, although the add-beta loop can still be removed. In any case, whether we can remove these parameters is not a speed decision; it depends on whether these two learned parameters are important to the overall model's capability. Note that there is also little value in trying to remove epsilon, since it is only used once in total.
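For illustration, if the model is trained with beta fixed at zero (a modeling assumption, not something the inference code can decide), the merged version above loses its final loop, leaving only two element-wise vector passes after the statistics are computed:

// Sketch: merged BatchNorm tail with beta assumed to be zero.
// Lambda stays folded into the combined scale factor, so removing it separately gains little.
yapi_vector_add_scalar(v, n, -fmean);       // Subtract the mean
float scalef = lambda / denom;              // Combined scale factor (denom and lambda)
yapi_vector_multiply_scalar(v, n, scalef);  // One scaling pass; no final add-beta loop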
Research on Optimizing Normalization: Research papers on fast versions of normalization functions:
- OneFlow, Dec 24, 2021 How to Implement an Efficient LayerNorm CUDA Kernel — OneFlow Performance Optimization, https://oneflow2020.medium.com/how-to-implement-an-efficient-layernorm-cuda-kernel-oneflow-performance-optimization-731e91a285b8, Code: https://github.com/Oneflow-Inc/oneflow (Efficient one-pass LayerNorm and efficient variance calculations.)
- J. Lei Ba, J. R. Kiros, and G. E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450, https://arxiv.org/abs/1607.06450
- Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
- Sam Shleifer, Jason Weston, and Myle Ott. NormFormer: Improved Transformer Pretraining with Extra Normalization. arXiv:2110.09456 [cs], November 2021. http://arxiv.org/abs/2110.09456
- David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong, 6 Dec 2024, IterNorm: Fast Iterative Normalization, https://arxiv.org/abs/2412.04778
Approximating Normalization
Research on approximate normalization functions:
- Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In Proc. Int. Conf. on Machine Learning (ICML), pages 4475-4483, https://proceedings.mlr.press/v119/huang20f.html, Code: https://github.com/layer6ai-labs/T-Fixup (Pruning of layer normalization.)
- Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, July 2016, Layer Normalization, https://arxiv.org/abs/1607.06450
- Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, Fixup Initialization: Residual Learning Without Normalization, Mar 2019, https://arxiv.org/abs/1901.09321 (Bye bye, normalization.)
- Nguyen, T. and Salazar, J., Transformers without tears: Improving the normalization of self-attention. In arXiv:1910.05895, 2019. https://arxiv.org/abs/1910.05895
- J Zhong, Z Liu, X Chen, Apr 2023, Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, https://arxiv.org/abs/2304.10891 (Has a section on approximating normalization.)
- Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda, Dec 2022, Training Integer-Only Deep Recurrent Neural Networks https://arxiv.org/abs/2212.11791 (Integer-only version of RNNs called iRNN, with integer-only layer normalization, integer-only attention, and piecewise linear approximation for integer-only activation functions such as tanh and sigmoid.)
- Z Zhang, B He, Z Zhang, 2023, Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization, Proceedings of Machine Learning and Systems 5 pre-proceedings (MLSys 2023) mlsys2023, https://proceedings.mlsys.org/paper_files/paper/2023/hash/023560744aae353c03f7ae787f2998dd-Abstract-mlsys2023.html, PDF: https://proceedings.mlsys.org/paper_files/paper/2023/file/023560744aae353c03f7ae787f2998dd-Paper-mlsys2023.pdf (Integer-only-arithmetic quantization with integer-only versions of Softmax, LayerNorm, and GELU.)
- Wenjie Li, Dongxu Lyu, Gang Wang, Aokun Hu, Ningyi Xu, Guanghui He, October 2024, Hardware-oriented algorithms for softmax and layer normalization of large language models, Science China, Vol. 67, Iss. 10, 200404:1–200404:15, https://doi.org/10.1007/s11432-024-4137-4 http://scis.scichina.com/en/2024/200404.pdf
- W. Wang, W. Sun and Y. Liu, "Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3488572. https://ieeexplore.ieee.org/abstract/document/10738457
- ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong, 6 Dec 2024, IterNorm: Fast Iterative Normalization, https://arxiv.org/abs/2412.04778
Integer-Only Normalization
One approximation for normalization is to use integer-only arithmetic (see also the overview of integers in inference); a minimal sketch follows the paper list below. Research on integer-only normalization algorithms:
- Y. Lin, Y. Li, T. Liu et al., “Towards fully 8-bit integer inference for the transformer model,” in Proc. of IJCAI, 2020, pp. 3759–3765. https://arxiv.org/abs/2009.08034 (Integers for weights, but also for Softmax, layer normalization, and other components, by replacing or approximating non-linear functions such as exponential and square-root.)
- Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680 (Integer-only quantized weights and activations with INT4 or INT8, but also uses integers for batch normalization and residual connection components, too.)
- A. Rock, A. Untether, O. Khalil, O. Shai, and P. Grouchy, 2022, INT8 Transformers for Inference Acceleration, 36th Conference on Neural Information Processing Systems (NeurIPS), PDF: https://neurips2022-enlsp.github.io/papers/paper_52.pdf
- Y. Lin, Y. Li, T. Liu, T. Xiao, T. Liu, and J. Zhu, “Towards fully 8-bit integer inference for the transformer model,” the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI2020), 2020. https://arxiv.org/abs/2009.08034
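To make the idea concrete, here is a hedged sketch of integer-only normalization in the general spirit of the papers above, not taken from any one of them: the square root is computed with an integer Newton's-method routine, and a fixed-point shift preserves precision. The function and parameter names are illustrative only.

#include <cstdint>

// Integer square root via Newton's method (returns floor of sqrt(x)).
static uint32_t isqrt32(uint32_t x) {
    uint32_t r = x, y = (x + 1) / 2;
    while (y < r) {
        r = y;
        y = (r + x / r) / 2;
    }
    return r;
}

// Sketch: integer-only re-centering and scaling of an INT32 activation vector.
// Assumes moderate activation magnitudes so the averaged sum-of-squares fits in 32 bits.
void int_only_normalize(int32_t v[], int n, int scale_shift) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++) sum += v[i];
    int32_t mean = (int32_t)(sum / n);          // Integer mean

    int64_t ss = 0;
    for (int i = 0; i < n; i++) {
        int64_t d = (int64_t)v[i] - mean;
        ss += d * d;                            // Sum of squared differences
    }
    uint32_t stddev = isqrt32((uint32_t)(ss / n)) + 1;  // +1 acts like epsilon (avoids divide-by-zero)

    for (int i = 0; i < n; i++) {
        // Fixed-point normalize: scale up by 2^scale_shift before the integer division.
        v[i] = (int32_t)((((int64_t)v[i] - mean) << scale_shift) / stddev);
    }
}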
Layer Normalization Placement Reordering (Pre-Norm/Post-Norm)
The original vanilla Transformer (Vaswani et al., 2017) used a "post-norm" architecture. Subsequently, researchers found that switching to a "pre-norm" architecture could fix one of the major problems with the original Transformer, namely that it was initially unstable in training, requiring a "warmup" phase. Pre-norm was found to stabilize early training and remove the need for any special handling during warmup.
Since then, several researchers have explored where to place the layer normalization submodule. The general consensus seems to be that placing it before the computations ("pre-norm") is better than after them ("post-norm"). However, there are papers going either way, so there is still room for definitive research.
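For concreteness, here is a hedged structural sketch of the two arrangements for a single Transformer layer; Vector and the attention, ffn, layer_norm, and add functions are hypothetical placeholders for illustration only, not a real API.

#include <vector>
using Vector = std::vector<float>;

// Hypothetical sublayer functions (placeholders for illustration only):
Vector attention(const Vector &x);
Vector ffn(const Vector &x);
Vector layer_norm(const Vector &x);
Vector add(const Vector &a, const Vector &b);

// Post-norm (original 2017 Transformer): normalize AFTER each residual addition.
Vector post_norm_layer(Vector x) {
    x = layer_norm(add(x, attention(x)));   // Attention sublayer, then norm
    x = layer_norm(add(x, ffn(x)));         // FFN sublayer, then norm
    return x;
}

// Pre-norm: normalize the INPUT of each sublayer; the residual path stays
// un-normalized, which is what stabilizes early training.
Vector pre_norm_layer(Vector x) {
    x = add(x, attention(layer_norm(x)));   // Norm first, then attention plus residual
    x = add(x, ffn(layer_norm(x)));         // Norm first, then FFN plus residual
    return x;
}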
- He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016, https://arxiv.org/abs/1603.05027 Code: https://github.com/KaimingHe/resnet-1k-layers (Only uses layer normalization on the input streams.)
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I., Language models are unsupervised multitask learners. 2019, PDF: https://cs.brown.edu/courses/cs146/assets/papers/language_models_are_unsupervised_multitask_learners.pdf (Layer normalization is moved to layer inputs.)
- Baevski, A. and Auli, M., Adaptive input representations for neural language modeling. Int. Conf. Learn. Represent., 2019. https://arxiv.org/abs/1809.10853 (Has layer normalization before the self-attention and FFN blocks.)
- Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M. Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, Raia Hadsell, 2020, Stabilizing Transformers for Reinforcement Learning, https://arxiv.org/abs/1910.06764, PDF: http://proceedings.mlr.press/v119/parisotto20a/parisotto20a.pdf (Has normalization at the inputs of layers.)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models that vary "normalization placement", i.e. as "pre-LN" or "post-LN". Also examines various alternatives and substitutes for normalization.)
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Found evidence that pre-norm was better than post-norm.)
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of HLT-NAACL. Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423, https://arxiv.org/abs/1810.04805 (Post-norm.)
- Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the Difficulty of Training Transformers. In Proceedings of EMNLP. 5747–5763. https://doi.org/10.18653/v1/2020.emnlp-main.463, https://arxiv.org/abs/2004.08249 (Post-norm)
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NeurIPS. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.htm, https://arxiv.org/abs/1706.03762 (Post-norm was used in the original 2017 Transformer paper.)
- Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for Neural Machine Translation. In Proceedings of AMTA. 193–199. https://www.aclweb.org/anthology/W18-1819, PDF: https://aclanthology.org/W18-1819.pdf, Code: https://github.com/tensorflow/tensor2tensor (Contains a reference implementation of the vanilla Transformer; the original model was post-norm, but this implementation uses pre-norm.)
- Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning Deep Transformer Models for Machine Translation. In Proceedings of ACL. 1810–1822. https://doi.org/10.18653/v1/p19-1176, https://arxiv.org/abs/1906.01787, Code: https://github.com/wangqiangneu/dlcl (Researching pre-norm vs post-norm for Transformers.)
- Tobias Domhan. 2018. How much attention do you need? a granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1799–1808. PDF: https://aclanthology.org/P18-1167.pdf (Uses a pre-norm architecture, based on Tensor2Tensor.)
- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL. 67–72. https://www.aclweb.org/anthology/P17-4012, https://arxiv.org/abs/1701.02810 (Pre-norm architecture.)
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. arXiv:1904.10509 [cs.LG] https://arxiv.org/abs/1904.10509 (Pre-norm)
- Alexei Baevski and Michael Auli. 2019. Adaptive Input Representations for Neural Language Modeling. In Proceedings of ICLR. https://openreview.net/forum?id=ByxZX20qFQ, https://arxiv.org/abs/1809.10853 (Uses pre-norm with normalization at the inputs.)
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Tested both standard pre-norm and RMSNorm architectures.)
- David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Seungrok Jung, 15 Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu, 2020, On Layer Normalization in the Transformer Architecture. arXiv:2002.04745 [cs, stat], June 2020. http://arxiv.org/abs/2002.04745 (Pre-norm versus post-norm.)
- Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han, 2020, Understanding the Difficulty of Training Transformers. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.463. https://aclanthology.org/2020.emnlp-main.463 (Pre-norm versus post-norm.)
Research on Normalization Alternatives and Optimization
Research papers on different types of normalization or other alternatives:
- Biao Zhang and Rico Sennrich, 2019, Root Mean Square Layer Normalization. arXiv:1910.07467 [cs, stat], October 2019. http://arxiv.org/abs/1910.07467 (RMS normalization paper.)
- Toan Q. Nguyen and Julian Salazar. 2019. Transformers without Tears: Improving the Normalization of Self-Attention. CoRR abs/1910.05895 (2019). arXiv:1910.05895, https://arxiv.org/abs/1910.05895
- Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. PowerNorm: Rethinking Batch Normalization in Transformers. In Proceedings of ICML. 8741–8751. http://proceedings.mlr.press/v119/shen20e.html
- Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and Improving Layer Normalization. In Proceedings of NeurIPS. 4383–4393. https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html, https://arxiv.org/abs/1911.07013
- Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin, Nov 2019, Understanding and Improving Layer Normalization, https://arxiv.org/abs/1911.07013
- Richmond Alake, Apr 22, 2020 Batch Normalization In Neural Networks Explained (Algorithm Breakdown), Towards Data Science, https://towardsdatascience.com/batch-normalization-explained-algorithm-breakdown-23d2794511c (BatchNorm explained quite well.)
- Sergey Ioffe, Christian Szegedy, March 2015, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, https://arxiv.org/abs/1502.03167 (Original BatchNorm paper.)
General Research on Normalization
Research papers on normalization issues in general:
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Biao Zhang, Rico Sennrich, 2019, Root Mean Square Layer Normalization, https://arxiv.org/abs/1910.07467
- Shaked Brody, Uri Alon, Eran Yahav, May 2023, On the Expressivity Role of LayerNorm in Transformers' Attention, https://arxiv.org/abs/2305.02582 Code: https://github.com/tech-srl/layer_norm_expressivity_role
- Toan Q. Nguyen, Julian Salazar, 2019, Transformers without Tears: Improving the Normalization of Self-Attention, https://arxiv.org/abs/1910.05895
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 448–456. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15.html
- David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei, 1 Mar 2022, DeepNet: Scaling Transformers to 1,000 Layers, https://arxiv.org/abs/2203.00555 https://arxiv.org/pdf/2203.00555.pdf (New normalization function DeepNorm)
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
RMSNorm Research
RMSNorm normalizes a vector by its Root Mean Square (RMS), skipping the mean subtraction and variance calculation used in LayerNorm (see the sketch after this list). Research papers on RMSNorm include:
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Biao Zhang, Rico Sennrich, 2019, Root Mean Square Layer Normalization, https://arxiv.org/abs/1910.07467
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
- David Spuler, March 2024, Root Mean Square Normalization, in Generative AI in C++, https://www.aussieai.com/book/ch24-rmsnorm-root-mean-square
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
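As a rough illustration of why RMSNorm is cheaper than LayerNorm, here is a minimal C++ sketch (not taken from any of the above papers): there is no mean or variance to compute, only a single sum of squares, so one statistics scan and one element-wise scaling pass suffice.

#include <math.h>

// Sketch: RMSNorm over a vector, with a learned per-element gain g[].
// RMS(x) = sqrt( (1/n) * sum(x[i]^2) + epsilon ), then x[i] = g[i] * x[i] / RMS(x).
void rmsnorm_basic(float v[], const float g[], int n, float epsilon) {
    float sum_squares = 0.0f;
    for (int i = 0; i < n; i++) {
        sum_squares += v[i] * v[i];     // Single scan: accumulate sum of squares
    }
    float rms = sqrtf(sum_squares / n + epsilon);
    float recip = 1.0f / rms;           // Reciprocal to avoid n divisions
    for (int i = 0; i < n; i++) {
        v[i] = v[i] * recip * g[i];     // Scale by 1/RMS and the learned gain
    }
}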
LayerNorm Optimizations
Research papers on optimizing LayerNorm (a sketch of a fused one-pass mean-and-variance calculation follows this list):
- Jingyu Wang; Lu Zhang; Xueqing Li; Huazhong Yang; Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik, 14 Oct 2024, SLaNC: Static LayerNorm Calibration, https://arxiv.org/abs/2410.10553
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 29 May 2024, On the Role of Attention Masks and LayerNorm in Transformers, https://arxiv.org/abs/2405.18781
- David Spuler, March 2024, Layer Normalization, in Generative AI in C++, https://www.aussieai.com/book/ch24-layer-normalization
- ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong, 6 Dec 2024, IterNorm: Fast Iterative Normalization, https://arxiv.org/abs/2412.04778
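A common theme in LayerNorm optimization is reducing the number of passes over the vector (e.g., the one-pass LayerNorm kernel in the OneFlow article listed under Research on Optimizing Normalization above). Here is a hedged sketch of the general technique, not code from any specific paper: the mean and variance are computed together in a single scan using the identity Var(x) = E[x^2] - (E[x])^2, trading a pass for some numerical robustness compared to the two-pass method.

#include <math.h>

// Sketch: one-pass LayerNorm. Mean and variance come from a single scan via
// sum and sum-of-squares; a second scan applies the normalization with
// learned gain (gamma) and bias (beta) parameters.
void layernorm_one_pass(float v[], int n, float epsilon,
                        const float gamma[], const float beta[]) {
    float sum = 0.0f, sum_sq = 0.0f;
    for (int i = 0; i < n; i++) {       // Single statistics scan
        sum += v[i];
        sum_sq += v[i] * v[i];
    }
    float mean = sum / n;
    float variance = sum_sq / n - mean * mean;   // Var(x) = E[x^2] - (E[x])^2
    if (variance < 0.0f) variance = 0.0f;        // Guard against rounding error
    float recip = 1.0f / sqrtf(variance + epsilon);
    for (int i = 0; i < n; i++) {       // Single normalization scan
        v[i] = (v[i] - mean) * recip * gamma[i] + beta[i];
    }
}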
More AI Research
Read more about:
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Length pruning
- Width pruning
- Channel pruning