Aussie AI
Normalization Pruning
Last Updated 3 November, 2024
by David Spuler, Ph.D.
Some research suggests that the normalization sublayers in Transformers can be pruned without a major loss in model accuracy. This is similar to other Transformer optimization techniques involving architectural changes, such as FFN pruning and shallow decoder architectures.
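As a rough illustration of the idea, the sketch below removes a subset of normalization sublayers from an existing PyTorch model by swapping nn.LayerNorm modules for nn.Identity. The function name prune_layernorms and the keep_every parameter are illustrative choices, not taken from any of the papers cited below; note that much of the research listed here instead trains normalization-free models from scratch with modified initialization (e.g., Fixup, ReZero) rather than editing an already-trained model.

    # Minimal sketch of post-hoc normalization pruning in PyTorch.
    # Assumes a model whose blocks register their norms as nn.LayerNorm
    # children (true of most common Transformer implementations).
    from torch import nn

    def prune_layernorms(model: nn.Module, keep_every: int = 2) -> int:
        """Replace all but every `keep_every`-th LayerNorm with an identity op.

        Returns the number of LayerNorms removed. Accuracy after this kind
        of surgery must be re-validated, and fine-tuning may be needed.
        """
        removed = 0
        norm_index = 0
        for parent in model.modules():
            # Snapshot the children so replacement is safe during iteration.
            for name, child in list(parent.named_children()):
                if isinstance(child, nn.LayerNorm):
                    if norm_index % keep_every != 0:
                        setattr(parent, name, nn.Identity())
                        removed += 1
                    norm_index += 1
        return removed

For example, calling prune_layernorms(model, keep_every=2) on a loaded Transformer checkpoint would drop every second LayerNorm, in the spirit of TrimBERT's removal of half of all LayerNorms; whether the resulting accuracy is acceptable has to be measured empirically.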
Several research papers have investigated the need for the normalization components in Transformers and whether they can be removed:
- Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In ACL, 2018, https://arxiv.org/abs/1804.09849
- Ma, J. and Yarats, D. On the adequacy of untuned warmup for adaptive optimization. arXiv:1910.04209, 2019. https://arxiv.org/abs/1910.04209
- Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, Fixup Initialization: Residual Learning Without Normalization, Mar 2019, https://arxiv.org/abs/1901.09321
- Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470
- Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In Proc. Int. Conf. on Machine Learning (ICML), pages 4475-4483, https://proceedings.mlr.press/v119/huang20f.html Code: https://github.com/layer6ai-labs/T-Fixup
- Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D., and Chao, L. Learning deep transformer models for machine translation. In ACL, 2019. https://arxiv.org/abs/1906.01787
- Nguyen, T. and Salazar, J., Transformers without tears: Improving the normalization of self-attention. arXiv:1910.05895, 2019. https://arxiv.org/abs/1910.05895
- Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In ICLR, 2020 https://arxiv.org/abs/1908.03265 Code: https://github.com/LiyuanLucasLiu/RAdam
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of the needs for various Transformer components including normalization; see "Section 5.2.3 Normalization-free Transformer".)
- Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. ReZero is All You Need: Fast Convergence at Large Depth. arXiv:2003.04887, 2020. https://arxiv.org/abs/2003.04887
- Sharath Nittur Sridhar, Anthony Sarah, and Sairam Sundaresan. TrimBERT: Tailoring BERT for Trade-offs. arXiv:2202.12411 [cs], February 2022. http://arxiv.org/abs/2202.12411 (Optimizations include softmax replacement and removing half of all LayerNorms.)
- David Spuler, March 2024, Chapter 24. Normalization, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Bobby He, Thomas Hofmann, 31 May 2024 (v2), Simplifying Transformer Blocks, https://arxiv.org/abs/2311.01906 (Examines the removal of various Transformer sublayer components including skip connections, projection/value parameters, and normalization.)
- David Spuler, March 2024, Norm Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch24-norm-pruning
- James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz, 5 Oct 2021, Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping, https://arxiv.org/abs/2110.01765
- Nandan Kumar Jha, Brandon Reagen, 12 Oct 2024, ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models, https://arxiv.org/abs/2410.09637
- Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end)
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about:
- Normalization optimizations
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Length pruning
- Width pruning
- Channel pruning