Attention Head Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Attention head pruning, often simply abbreviated to “head pruning”, is structured pruning that removes attention heads. It is a type of width pruning that makes the Transformer “thinner” on that dimension.
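As a rough illustration of that "thinner" dimension, the sketch below shows how the per-layer attention weight matrices shrink when heads are removed statically. It is a minimal C++ sketch under assumed shapes; the struct name and the BERT-base-like numbers are illustrative assumptions, not code from this book or any particular library.

// Hypothetical sketch: parameter count of one attention layer's Q/K/V/output
// projections, before and after statically removing 4 of 12 heads.
#include <cstdio>

struct AttentionShape {          // illustrative name, not a real library type
    int d_model;                 // embedding dimension
    int num_heads;               // number of (kept) attention heads
    int head_dim;                // dimension per head

    long long projection_params() const {
        // Each of Q, K, V and the output projection is d_model x (num_heads * head_dim).
        return 4LL * d_model * num_heads * head_dim;
    }
};

int main() {
    AttentionShape full   {768, 12, 64};   // BERT-base-like attention layer
    AttentionShape pruned {768,  8, 64};   // same layer with 4 heads pruned away
    std::printf("full:   %lld weights\n", full.projection_params());    // 2,359,296
    std::printf("pruned: %lld weights\n", pruned.projection_params());  // 1,572,864 (one third fewer)
}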
Multi-head attention was one of the main advances in the seminal 2017 Transformer paper, so much so that the paper was titled "Attention Is All You Need" (Vaswani et al., 2017). However, subsequent research has shown that the attention mechanism is computationally expensive and that there are various ways to optimize its efficiency, including removing redundant attention heads.
In addition to head pruning techniques that remove redundant or under-utilized attention heads, there is also research into using simpler attention heads (see approximate attention heads) and simplifying the cost of attention on long sequences (see non-autoregressive architectures).
Head pruning can be combined with various other optimization techniques such as quantization. It is also orthogonal to “depth pruning” such as “layer pruning” and “early exit”, and combined depth/width pruning is possible.
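As a concrete (hypothetical) sketch of what the pruned computation looks like at inference time, the C++ below runs multi-head attention for a single query vector with a per-head keep mask, skipping pruned heads entirely. Function and parameter names are assumptions for illustration only; a fully static implementation would instead delete the pruned heads' rows from the projection weights, and the per-head loop could equally run over quantized weights, which is why head pruning composes with quantization.

// Minimal sketch of masked multi-head attention over a single query vector.
// Pruned heads (head_mask[h] == false) are skipped entirely; their output slots
// stay zero, mirroring the structured removal of those heads' contribution.
#include <cmath>
#include <vector>

// Attention for one head: softmax(q.K / sqrt(d)) weighted sum over V.
void one_head_attention(const float* q, const float* K, const float* V,
                        int seq_len, int head_dim, float* out) {
    std::vector<float> scores(seq_len);
    float max_s = -1e30f, denom = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d] * K[t * head_dim + d];
        scores[t] = s / std::sqrt((float)head_dim);
        if (scores[t] > max_s) max_s = scores[t];
    }
    for (int t = 0; t < seq_len; ++t) { scores[t] = std::exp(scores[t] - max_s); denom += scores[t]; }
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;
    for (int t = 0; t < seq_len; ++t)
        for (int d = 0; d < head_dim; ++d)
            out[d] += (scores[t] / denom) * V[t * head_dim + d];
}

// Multi-head attention where head_mask[h] == false means the head is pruned.
void masked_multi_head_attention(const float* q, const float* K, const float* V,
                                 const std::vector<bool>& head_mask,
                                 int num_heads, int seq_len, int head_dim,
                                 float* out /* size num_heads * head_dim */) {
    for (int h = 0; h < num_heads; ++h) {
        float* head_out = out + h * head_dim;
        if (!head_mask[h]) {                       // pruned head: skip all its work
            for (int d = 0; d < head_dim; ++d) head_out[d] = 0.0f;
            continue;
        }
        const float* qh = q + h * head_dim;
        const float* Kh = K + h * seq_len * head_dim;
        const float* Vh = V + h * seq_len * head_dim;
        one_head_attention(qh, Kh, Vh, seq_len, head_dim, head_out);
    }
}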
Research papers on attention head pruning:
- Hanrui Wang, Zhekai Zhang, and Song Han, SpAtten: Efficient sparse attention architecture with cascade token and head pruning, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021, https://arxiv.org/abs/2012.09852
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp. 5797–5808, 2019, https://arxiv.org/abs/1905.09418
- Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime Neural Pruning, In Advances in Neural Information Processing Systems (NeurIPS). https://dl.acm.org/doi/10.5555/3294771.3294979, https://papers.nips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html
- J. S. McCarley, Rishav Chakravarti, and Avirup Sil. 2020. Structured Pruning of a BERT-based Question Answering Model, arXiv:1910.06360, https://arxiv.org/abs/1910.06360
- Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning, In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561, https://arxiv.org/abs/2005.00561
- Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth, Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, Oct 2020, https://arxiv.org/abs/2010.01791
- Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan, 2021, Differentiable Subset Pruning of Transformer Heads, revised July 2023, https://arxiv.org/abs/2108.04657
- William Held, Diyi Yang, 2022, Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers, arXiv preprint arXiv:2210.05709, Oct 2022, https://arxiv.org/abs/2210.05709
- Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, Maosong Sun, 2021, Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads, AI Open, 2021, Elsevier, https://arxiv.org/abs/2011.03770v1
- Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi, 2021, Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling, 18th International SoC Design Conference (ISOCC) 2021, https://arxiv.org/abs/2110.03252
- Haoyu He, Jing Liu, Zizheng Pan, Jianfei Cai, Jing Zhang, Dacheng Tao, Bohan Zhuang, 2021 (revised Aug 2022), Pruning Self-attentions into Convolutional Layers in Single Path, https://arxiv.org/abs/2111.11802, Code: https://github.com/ziplab/SPViT
- Zuzana Jelčicová, Marian Verhelst, 2022, Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention, arXiv preprint arXiv:2204.03479, March 2022, https://arxiv.org/abs/2204.03479v1
- Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui, 2022, Width & Depth Pruning for Vision Transformers, Vol. 36 No. 3: AAAI-22 Technical Tracks 3 / AAAI Technical Track on Computer Vision III, DOI: https://doi.org/10.1609/aaai.v36i3.20222, https://ojs.aaai.org/index.php/AAAI/article/view/20222, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/20222/19981
- Z. Liu, F. Li, G. Li, and J. Cheng, 2021, EBERT: Efficient BERT Inference with Dynamic Structured Pruning, in Findings of the Association for Computational Linguistics: ACL-IJCNLP, 2021, pp. 4814–4823, https://aclanthology.org/2021.findings-acl.425/
- Guorun Wang, Jun Yang, Yaoru Sun, 2023, Task-oriented Memory-efficient Pruning-Adapter, arXiv preprint arXiv:2303.14704, Apr 2023, https://arxiv.org/abs/2303.14704
- Archit Parnami, Rahul Singh, Tarun Joshi, 2021, Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures, arXiv preprint arXiv:2110.15225, Nov 2021, https://arxiv.org/abs/2110.15225
- Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019, AutoInt: Automatic feature interaction learning via self-attentive neural networks, In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019, https://arxiv.org/abs/1810.11921, Code: https://github.com/DeepGraphLearning/RecommenderSystems
- Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, 2021, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
- Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation, CoRR, abs/2010.02416. https://arxiv.org/abs/2010.02416
- Maximiliana Behnke and Kenneth Heafield. 2020. Losing heads in the lottery: Pruning transformer attention in neural machine translation, In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2664–2674. Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-main.211/
- Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components, In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6019–6029. International Committee on Computational Linguistics. https://arxiv.org/abs/2011.03803v1
- Tobias Domhan. July 2018. How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2018, Melbourne, Australia, https://aclanthology.org/P18-1167/
- Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1789–1798. Association for Computational Linguistics, https://arxiv.org/abs/1805.00631 (This paper proposes a simpler version of the attention heads, rather than just pruning them.)
- Noam Shazeer, 2019, Fast Transformer Decoding: One Write-Head is All You Need, arXiv:1911.02150, https://arxiv.org/abs/1911.02150
- Ankur Bapna, Naveen Arivazhagan, Orhan Firat, 2020, Controlling Computation versus Quality for Neural Sequence Models, arXiv:2002.07106, https://arxiv.org/abs/2002.07106 (Conditionally controls attention heads and FFN units.)
- Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, Token and Head Adaptive Transformers for Efficient Natural Language Processing, https://aclanthology.org/2022.coling-1.404/
- Zejiang Hou, Sun-Yuan Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
- François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M. Rush, 2021, Block Pruning For Faster Transformers, arXiv preprint arXiv:2109.04838, https://arxiv.org/abs/2109.04838, Code: https://github.com/huggingface/nn_pruning
- Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving Transformer Models by Reordering their Sublayers, In Proceedings of ACL. Online, 2996–3005. https://doi.org/10.18653/v1/2020.acl-main.270, https://arxiv.org/abs/1911.03864 (Alternates layers of attention heads and FFN units, effectively pruning attention heads and FFN components from some layers.)
- Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer, arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Different Transformer architecture that includes removing attention heads and simplifies the FFN.)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022, A survey of transformers, AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of attention head components.)
- Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer, In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads, which is a lighter attention mode than full head pruning.)
- E Youn, S Prabhu, S Chen, 2023, Compressing Vision Transformers for Low-Resource Visual Learning, arXiv preprint arXiv:2309.02617, PDF: https://arxiv.org/pdf/2309.02617.pdf
- T Ding, X Zhang, C Hu, Y Shan, B Zhao, Y Lin, S Hu, 2023, DH-Net: Dynamic Head for Object Detection Network, In book: New Materials, Machinery and Vehicle Engineering, https://www.researchgate.net/publication/374183525_DH-Net_Dynamic_Head_for_Object_Detection_Network, PDF: https://ebooks.iospress.nl/pdf/doi/10.3233/ATDE230123 (Dynamic head pruning for image analysis.)
- Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, Lei Zhang, 2021, Dynamic Head: Unifying Object Detection Heads with Attentions, June 2021, https://arxiv.org/abs/2106.08322, Code: https://github.com/microsoft/DynamicHead (Combining heads, which is similar to removing attention heads in head pruning.)
For additional research papers on attention head pruning, see https://www.aussieai.com/research/head-pruning.