Aussie AI

Width Pruning

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Width pruning is a type of structured pruning that reduces the "width" of the internal layers of a neural network. In early models, the "width" was the number of internal neurons in the hidden layers, although in newer Transformers, it often refers to the number of attention heads. Like all types of pruning, the goal is model compression and faster inference by avoiding computation associated with the pruned parts.
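
As a concrete illustration, below is a minimal NumPy sketch of attention-head pruning, one common form of width pruning in Transformers. The layer sizes, the choice of heads to keep, and the plain-matrix layout are illustrative assumptions, not any particular method from the literature.

    import numpy as np

    # Toy multi-head attention projections: d_model=8, 4 heads of size 2.
    d_model, n_heads, d_head = 8, 4, 2
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((d_model, n_heads * d_head))
    W_k = rng.standard_normal((d_model, n_heads * d_head))
    W_v = rng.standard_normal((d_model, n_heads * d_head))
    W_o = rng.standard_normal((n_heads * d_head, d_model))

    # Keep only heads 0 and 2 (the choice is illustrative; in practice it
    # would come from an importance score computed per head).
    keep_heads = [0, 2]
    cols = np.concatenate([np.arange(h * d_head, (h + 1) * d_head) for h in keep_heads])

    # Pruning a head deletes its slice of the Q/K/V projections and the
    # matching rows of the output projection, shrinking all four matrices.
    W_q, W_k, W_v = W_q[:, cols], W_k[:, cols], W_v[:, cols]
    W_o = W_o[cols, :]

    print(W_q.shape, W_o.shape)   # (8, 4) (4, 8) -- two heads pruned away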

Width pruning is orthogonal to "layer pruning" (depth pruning), in which whole internal layers are removed. There is also "length pruning", such as token pruning and embeddings pruning.

Width pruning can mean several different things that all affect the fan-out of data as it propagates through the hidden layers. The types and names include channel pruning, filter pruning, attention head pruning, and slimmable networks, each covered in the sections below.

Like all pruning methods, width pruning can be static or dynamic. Static width pruning removes channels or attention heads from the model during training or shortly afterwards; it is conceptually equivalent to choosing a smaller width hyper-parameter as part of neural architecture search. Dynamic width pruning creates a selective inference path by reducing the width at runtime, blocking or bypassing some elements of the model, usually depending on the input sequence.
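
As a rough sketch of the static case, the example below removes hidden neurons from a two-layer feed-forward block after training. The weight-magnitude importance score and the 50% keep ratio are illustrative assumptions; published methods use a variety of importance criteria.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 32
    W1 = rng.standard_normal((d_model, d_ff))   # up-projection into the hidden layer
    W2 = rng.standard_normal((d_ff, d_model))   # down-projection back out

    # Score each hidden neuron by the magnitude of its weights (a simple
    # heuristic; real methods also use activation or gradient statistics).
    scores = np.linalg.norm(W1, axis=0) * np.linalg.norm(W2, axis=1)

    # Keep the top half of the neurons -- this becomes the new, smaller width.
    keep = np.sort(np.argsort(scores)[-d_ff // 2:])
    W1_pruned = W1[:, keep]   # drop the pruned neurons' columns from W1
    W2_pruned = W2[keep, :]   # drop the matching rows from W2

    print(W1_pruned.shape, W2_pruned.shape)   # (8, 16) (16, 8)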

Slimmable Networks (Dynamic Width Pruning)
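
Slimmable networks train a single set of weights that can be executed at several different widths, so the model can be "slimmed" at inference time to trade accuracy for speed. Below is a minimal sketch of the inference-time idea only; the layer sizes and width fractions are illustrative assumptions, and real slimmable networks also need extra machinery such as width-specific normalization statistics, as described in the papers listed.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out = 16, 64, 16
    W1 = rng.standard_normal((d_in, d_hidden))
    W2 = rng.standard_normal((d_hidden, d_out))

    def forward(x, width_fraction=1.0):
        """Run the block using only the first fraction of the hidden units."""
        k = max(1, int(d_hidden * width_fraction))
        h = np.maximum(x @ W1[:, :k], 0.0)   # ReLU over the sliced hidden layer
        return h @ W2[:k, :]

    x = rng.standard_normal((1, d_in))
    y_full = forward(x, 1.0)    # full width: all 64 hidden units
    y_slim = forward(x, 0.25)   # slimmed: only the first 16 hidden units
    print(y_full.shape, y_slim.shape)   # (1, 16) (1, 16)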

Some of the papers on "slimmer" or "thin" models include:

Channel Pruning (Width Pruning)

Various research papers examine the method of "channel pruning", which is a type of width pruning.
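
As a minimal sketch, channel pruning of a convolution layer can be viewed as deleting slices of the weight tensors: removing a channel shrinks both the layer that produces it and the layer that consumes it. The L1-norm ranking and the number of channels kept below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Convolution weights laid out as (out_channels, in_channels, kH, kW).
    conv1 = rng.standard_normal((16, 3, 3, 3))
    conv2 = rng.standard_normal((32, 16, 3, 3))   # consumes conv1's 16 output channels

    # Rank conv1's channels (filters) by L1 norm and keep the strongest 8.
    importance = np.abs(conv1).sum(axis=(1, 2, 3))
    keep = np.sort(np.argsort(importance)[-8:])

    # Removing a channel deletes a whole filter from conv1 and the matching
    # input slice of conv2, so both layers shrink.
    conv1_pruned = conv1[keep]
    conv2_pruned = conv2[:, keep]

    print(conv1_pruned.shape, conv2_pruned.shape)   # (8, 3, 3, 3) (32, 8, 3, 3)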

Filter Pruning

Filter pruning is another type of width pruning: removing a filter from a convolutional layer deletes one output channel, so it has much the same effect as the channel pruning sketch above. Research papers on filter pruning:

Dynamic Channel Pruning (Width Pruning)

Whereas layer pruning reduces the "depth" of a model, it is also possible to dynamically prune its "width" by deactivating channels during inference.
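
A rough sketch of the idea follows: a small gating function decides, per input, which hidden channels to compute. The gating matrix, threshold, and layer sizes are illustrative assumptions rather than the mechanism of any specific paper listed below.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 16, 64
    W1 = rng.standard_normal((d_in, d_hidden))
    W2 = rng.standard_normal((d_hidden, d_in))
    W_gate = rng.standard_normal((d_in, d_hidden)) * 0.1   # tiny gating predictor

    def forward_dynamic(x, threshold=0.0):
        """Compute only the hidden channels whose gate score clears the threshold."""
        active = (x @ W_gate).ravel() > threshold    # per-input boolean channel mask
        h = np.maximum(x @ W1[:, active], 0.0)       # inactive channels are never computed
        return h @ W2[active, :], int(active.sum())

    x = rng.standard_normal((1, d_in))
    y, n_active = forward_dynamic(x)
    print("active channels:", n_active, "of", d_hidden)   # typically around half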

  • Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, Yulin Wang, Dynamic Neural Networks: A Survey, Dec 2021, https://arxiv.org/abs/2102.04906
  • Y Sun, J Li, X Xu, 2022, Meta-GF: Training Dynamic-Depth Neural Networks Harmoniously, European Conference on Computer Vision, https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136710691.pdf (Code: https://github.com/SYVAE/MetaGF)
  • Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning, 2014, arXiv preprint arXiv:1406.7362, https://arxiv.org/abs/1406.7362
  • Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup, Conditional computation in neural networks for faster models, 2016, ICLR Workshop 2016, https://arxiv.org/abs/1511.06297
  • Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013, arXiv preprint arXiv:1308.3432, https://arxiv.org/abs/1308.3432
  • Andreas Veit and Serge Belongie, Convolutional networks with adaptive inference graphs, In ECCV, 2018, https://arxiv.org/abs/1711.11503
  • Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In ICLR, 2019, https://arxiv.org/abs/1810.05331
  • Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert, Multi-level residual networks from dynamical systems view, In International Conference on Learning Representations, 2018, https://openreview.net/forum?id=SyJS-OgR-
  • Charles Herrmann, Richard Strong Bowen, and Ramin Zabih. Channel selection using Gumbel Softmax. In ECCV, 2020, https://arxiv.org/abs/1812.04180
  • Babak Ehteshami Bejnordi, Tijmen Blankevoort, and Max Welling, Batch-shaping for learning conditional channel gated networks, In ICLR, 2020, https://arxiv.org/abs/1907.06627
  • J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, Slimmable neural networks, in ICLR, 2019, https://arxiv.org/abs/1812.08928
  • Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016, https://arxiv.org/abs/1611.06188
  • Y. Guo, A. Yao, and Y. Chen., Dynamic network surgery for efficient DNNs, 2016, Advances In Neural Information Processing Systems, pages 1379–1387, https://arxiv.org/abs/1608.04493 (Code: https://github.com/yiwenguo/Dynamic-Network-Surgery)
  • Q. Huang, K. Zhou, S. You, and U. Neumann, “Learning to prune filters in convolutional neural networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 709–718, https://arxiv.org/abs/1801.07365
  • Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang, Nicholas Synovic, Xuecen Zhang, Vipin Chaudhary, George K. Thiruvathukal, Yung-Hsiang Lu, Oct 2023, An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions, https://arxiv.org/pdf/2310.07782.pdf, Code: https://github.com/PurdueCAM2Project/focused-convolutions
  • Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of an input image, which is analogous to channel pruning or width pruning.)
  • Dieter Verbruggen, Sofie Pollin, Hazem Sallouha, 6 May 2024, Computational Efficient Width-Wise Early Exits in Modulation Classification, https://arxiv.org/abs/2405.03222
  • Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
  • Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Snehasis Dey, 2024, Differentiable Slimming for Transformers with Improved Memory Efficiency, College of Engineering Bhubaneswar, Odisha, https://ijte.in/pdf/EE14.pdf (Dual pruning by attention head pruning for slimmable networks combined with layer pruning.)
  • Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
  • Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann, 2024, Navigating Scaling Laws: Compute Optimality in Adaptive Model Training https://openreview.net/pdf?id=3KxPo62PYn (Evaluates some model properties, such as width, on vision Transformers from the point of view of the scaling laws.)
  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Computes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
  • Weijie Chen, Yuan Zhang, Di Xie, and Shiliang Pu. 2019. A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3355–3362. https://arxiv.org/abs/1812.06611 (Layerwise dynamic structural pruning of unimportant neurons.)
  • Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
  • Jiahui Yu and Thomas S Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pages 1803–1811. https://arxiv.org/abs/1903.05134 Code: https://github.com/JiahuiYu/slimmable_networks
  • Hankook Lee and Jinwoo Shin. 2018. Anytime neural prediction via slicing networks vertically. arXiv preprint arXiv:1807.02609. https://arxiv.org/abs/1807.02609 (Training multiple thin networks.)
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
  • Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, July 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, https://arxiv.org/abs/2208.00483
  • Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush, 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT 2018, https://www.aclweb.org/anthology/W18-2715
  • Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song, 5 Feb 2024, Shortened LLaMA: A Simple Depth Pruning for Large Language Models, https://arxiv.org/abs/2402.02834 (Analysis of dual pruning with combined depth and width pruning.)
  • Peter Belcak, Roger Wattenhofer, 21 Nov 2023, Exponentially Faster Language Modelling, https://arxiv.org/abs/2311.10770 (Conditionally executes only a fraction of active neurons, which is effectively width pruning.)
  • M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023 Sharcs: Efficient transformers through routing with dynamic width sub-networks https://arxiv.org/pdf/2310.12126.pdf (Direct queries to subnetworks with different widths.)
  • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Note: the code uses the PyTorch nvFuser deep learning compiler, which is now deprecated.)
  • Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical Transformers Are More Efficient Language Models. arxiv:2110.13711[cs], April 2022. URL http://arxiv.org/abs/2110.13711
  • Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale Efficiently: Insights from Pretraining and Finetuning Transformers. In International Conference on Learning Representations, September 2021. URL https://openreview.net/forum?id=f2OYVD
  • Noam Wies, Yoav Levine, Daniel Jannai, and Amnon Shashua. Which transformer architecture fits my data? A vocabulary bottleneck in self-attention. In Proceedings of the 38th International Conference on Machine Learning, pp. 11170–11181. PMLR, July 2021. URL https://proceedings.mlr.press/v139/wies21a.html
  • Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
  • T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width-and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
  • David Spuler, March 2024, Chapter 48. Width Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Basar Kutukcu, Sabur Baidya, Sujit Dey, 2024, SLEXNet: Adaptive Inference Using Slimmable Early Exit Neural Networks, https://doi.org/10.1145/3689632 https://dl.acm.org/doi/pdf/10.1145/3689632 (Combined width and depth pruning with slimmable and early exit networks.)
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 8 Jul 2024 (v2), Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965
  • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
  • Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
  • Tuowei Wang, Ruwen Fan, Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren, 29 Oct 2024 (v2), Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management, https://arxiv.org/abs/2410.19274
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
