Aussie AI

Normalization Pruning

  • Last Updated 3 November, 2024
  • by David Spuler, Ph.D.


Some research has suggested that the "normalization" layers in a Transformer can be pruned without a major loss in model accuracy. Norm pruning is similar to other Transformer optimization techniques that make architectural changes, such as FFN pruning and shallow decoder architectures.
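
As a concrete illustration, here is a minimal sketch of the idea in PyTorch. It is a hypothetical example, not code from any of the papers cited below: the class name PrunableNormBlock and the use_norm flag are illustrative inventions. The block is a standard pre-norm Transformer layer where each LayerNorm can be swapped for an identity operation, which is the simplest way to remove the normalization sublayers.

```python
# Hypothetical sketch: a pre-norm Transformer block where the normalization
# sublayers can be "pruned" by replacing each LayerNorm with an identity op.
import torch
import torch.nn as nn

class PrunableNormBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ffn=2048, use_norm=True):
        super().__init__()
        # Norm pruning: swap LayerNorm for Identity (a no-op) when disabled.
        norm = nn.LayerNorm if use_norm else (lambda d: nn.Identity())
        self.norm1 = norm(d_model)
        self.norm2 = norm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.ReLU(),
            nn.Linear(d_ffn, d_model),
        )

    def forward(self, x):
        # Pre-norm residual sublayers; with use_norm=False the norms vanish
        # and only the attention and FFN computations remain.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x

# Usage: the pruned block accepts the same inputs and produces the same shapes.
x = torch.randn(2, 16, 512)
print(PrunableNormBlock(use_norm=True)(x).shape)   # torch.Size([2, 16, 512])
print(PrunableNormBlock(use_norm=False)(x).shape)  # torch.Size([2, 16, 512])
```

In practice, the research below examines when this removal is safe, since normalization mainly affects training stability rather than the expressiveness of the block.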

Several research papers have investigated how necessary the "normalization" components in Transformers really are, and whether they can be removed:

  • Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018, https://arxiv.org/abs/1804.09849
  • Ma, J. and Yarats, D. On the adequacy of untuned warmup for adaptive optimization. arXiv:1910.04209, 2019. https://arxiv.org/abs/1910.04209
  • Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, Fixup Initialization: Residual Learning Without Normalization, Mar 2019, https://arxiv.org/abs/1901.09321
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470
  • Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In Proc. Int. Conf. on Machine Learning (ICML), pages 4475-4483, https://proceedings.mlr.press/v119/huang20f.html Code: https://github.com/layer6ai-labs/T-Fixup
  • Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D., and Chao, L. Learning deep transformer models for machine translation. In ACL, 2019. https://arxiv.org/abs/1906.01787
  • Nguyen, T. and Salazar, J., Transformers without tears: Improving the normalization of self-attention. In arXiv:1910.05895, 2019. https://arxiv.org/abs/1910.05895
  • Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In ICLR, 2020 https://arxiv.org/abs/1908.03265 Code: https://github.com/LiyuanLucasLiu/RAdam
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of the needs for various Transformer components including normalization; see "Section 5.2.3 Normalization-free Transformer".)
  • Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. ReZero is All You Need: Fast Convergence at Large Depth. CoRR abs/2003.04887 (2020). arXiv:2003.04887 https://arxiv.org/abs/2003.04887
  • Sharath Nittur Sridhar, Anthony Sarah, and Sairam Sundaresan. TrimBERT: Tailoring BERT for Trade-offs. arXiv:2202.12411 [cs], February 2022. http://arxiv.org/abs/2202.12411 (Optimizations include softmax replacement and removing half of all LayerNorms.)
  • David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Bobby He, Thomas Hofmann, 31 May 2024 (v2), Simplifying Transformer Blocks, https://arxiv.org/abs/2311.01906 (Examines the removal of various Transformer sublayer components including skip connections, projection/value parameters, and normalization.)
  • David Spuler, March 2024, Norm Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch24-norm-pruning
  • James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz, 5 Oct 2021, Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping, https://arxiv.org/abs/2110.01765
  • Nandan Kumar Jha, Brandon Reagen, 12 Oct 2024, ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models, https://arxiv.org/abs/2410.09637
  • Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
