Aussie AI
48. Width Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Width of life is more important than length of life.”
— Avicenna.
What is Width Pruning?
Width pruning is a type of structured pruning that reduces the “width” of the internal layers of a neural network. In early neural networks, the “width” was the number of internal neurons in the hidden layers. In modern Transformer architectures, width pruning usually refers to reducing the number of attention heads.
Like all types of pruning, the goal is both smaller models and faster inference. Model compression by width pruning reduces the size of the model and thereby the memory needed to store it. It also reduces inference time by avoiding the computation associated with the pruned structures, and by avoiding the cost of transferring their data from memory to cache.
Width pruning can mean several different things that all affect the fan-out of data as it propagates through the hidden layers. The types and names include:
- Attention head pruning (in Transformers)
- Thin networks (static)
- “Slimmable” neural networks (dynamic)
- Channel pruning
- Filter pruning
Width pruning techniques are orthogonal to “layer pruning” (depth pruning), in which some of the internal layers are removed. Another orthogonal type of pruning is “length pruning”, such as token pruning and embeddings pruning.
Like all pruning methods, width pruning can be static or dynamic. Static width pruning removes channels or attention heads from the model as it is trained, or shortly after training. It is conceptually similar to choosing a smaller width hyper-parameter for the model as part of Neural Architecture Search (NAS).
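As an illustration of the static case, here is a minimal C++ sketch that physically shrinks a weight matrix by copying across only the columns for the neurons that survive pruning. The row-major flat-array layout and the prune_columns name are assumptions for illustration, not code from any particular framework.

    // Static width pruning sketch: shrink a row-major [rows x cols] weight
    // matrix by keeping only the selected output neurons (columns).
    #include <cstddef>
    #include <vector>

    std::vector<float> prune_columns(
            const std::vector<float>& weights,     // rows x cols, row-major
            std::size_t rows, std::size_t cols,
            const std::vector<std::size_t>& keep)  // column indices to keep
    {
        std::vector<float> pruned;
        pruned.reserve(rows * keep.size());
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t k : keep) {
                pruned.push_back(weights[r * cols + k]);  // copy surviving column
            }
        }
        return pruned;  // new shape: rows x keep.size()
    }

The same slicing must also be applied to the matching rows of the next layer's weights (and any bias vector) so that the tensor shapes stay consistent.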
Dynamic width pruning creates a selective inference path by removing width at runtime, blocking or bypassing some elements of the model, usually depending on the input sequence. In dynamic attention head pruning, a Transformer chooses at runtime which attention heads to activate or suppress. It is also possible to dynamically prune the width of a non-Transformer model by deactivating channels or filters during inference.
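A minimal sketch of the dynamic case follows, assuming a simple fully-connected layer: a per-request gate vector switches off some neurons, so their dot products are never computed. The gating policy (how the active mask is chosen for each input) is left out, since that is where the various research methods differ.

    // Dynamic width pruning sketch: neurons with their gate switched off are
    // skipped at inference time, saving their dot product and activation.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void dense_layer_gated(
            const std::vector<float>& x,      // input, size n_in
            const std::vector<float>& W,      // n_out x n_in, row-major
            const std::vector<float>& bias,   // size n_out
            const std::vector<bool>& active,  // per-neuron gate, size n_out
            std::vector<float>& y)            // output, size n_out
    {
        const std::size_t n_in = x.size();
        const std::size_t n_out = bias.size();
        y.assign(n_out, 0.0f);                // gated-off neurons output zero
        for (std::size_t j = 0; j < n_out; ++j) {
            if (!active[j]) continue;         // dynamically pruned neuron: skip
            float sum = bias[j];
            for (std::size_t i = 0; i < n_in; ++i)
                sum += W[j * n_in + i] * x[i];
            y[j] = std::max(0.0f, sum);       // ReLU activation
        }
    }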
Attention Head Pruning
Attention head pruning, often simply abbreviated to “head pruning”, is structured pruning that removes attention heads. It is a type of width pruning that makes the Transformer “thinner” on that dimension.
The attention heads were one of the main advances in the seminal 2017 Transformer paper, so much so that the paper was titled "Attention Is All You Need" (Vaswani et al., 2017). However, research has shown that the attention mechanism is expensive, and there are various ways to optimize its efficiency, including removing some redundant attention heads.
In addition to head pruning techniques that remove redundant or under-utilized attention heads, there is also research into using simpler attention heads (see approximate attention heads) and simplifying the cost of attention on long sequences (see non-autoregressive architectures).
Head pruning can be combined with various other optimization techniques such as quantization. It is also orthogonal to “depth pruning” such as “layer pruning” and “early exit”, and combined depth/width pruning is possible.
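The core idea can be sketched in C++ as a multi-head attention loop that simply skips heads flagged as pruned, leaving their slice of the concatenated output at zero. This is only an illustrative skeleton under assumed shapes; the per-head computation is passed in as a callable rather than implemented here.

    // Attention head pruning sketch: pruned heads are skipped, so none of their
    // Q/K/V projections or softmax work is done; their output slice stays zero.
    #include <cstddef>
    #include <functional>
    #include <vector>

    void multi_head_attention_pruned(
            std::size_t n_heads, std::size_t d_head,
            const std::vector<bool>& pruned,   // size n_heads, true = removed
            const std::function<std::vector<float>(std::size_t)>& compute_head,
            std::vector<float>& concat_out)    // size n_heads * d_head
    {
        concat_out.assign(n_heads * d_head, 0.0f);
        for (std::size_t h = 0; h < n_heads; ++h) {
            if (pruned[h]) continue;           // skip all of this head's work
            const std::vector<float> head_out = compute_head(h);  // d_head values
            for (std::size_t i = 0; i < d_head; ++i)
                concat_out[h * d_head + i] = head_out[i];
        }
    }

In a fully static version, the pruned heads' weights and the matching rows of the output projection would be deleted from the model file, rather than masked at runtime.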
Research papers on attention head pruning:
- Hanrui Wang, Zhekai Zhang, and Song Han, SpAtten: Efficient sparse attention architecture with cascade token and head pruning, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021, https://arxiv.org/abs/2012.09852
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 5797–5808, 2019, https://arxiv.org/abs/1905.09418
- Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime Neural Pruning, In Advances in Neural Information Processing Systems (NeurIPS). https://dl.acm.org/doi/10.5555/3294771.3294979, https://papers.nips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html
- J. S. McCarley, Rishav Chakravarti, and Avirup Sil. 2020. Structured Pruning of a BERT-based Question Answering Model, arXiv:1910.06360, https://arxiv.org/abs/1910.06360
- Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning, In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561, https://arxiv.org/abs/2005.00561
- Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth, Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, Oct 2020 https://arxiv.org/abs/2010.01791
- Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan, 2023, Differentiable Subset Pruning of Transformer Heads, revised July 2023, https://arxiv.org/abs/2108.04657
- William Held, Diyi Yang, 2022, Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers, arXiv preprint arXiv:2210.05709, Oct 2022, https://arxiv.org/abs/2210.05709
- Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, Maosong Sun, 2021, Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads, AI Open, 2021, Elsevier, https://arxiv.org/abs/2011.03770v1
- Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi, 2021, Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling, 2021, 18th International SoC 2021, https://arxiv.org/abs/2110.03252
- Haoyu He, Jing Liu, Zizheng Pan, Jianfei Cai, Jing Zhang, Dacheng Tao, Bohan Zhuang, 2021 (revised Aug 2022), Pruning Self-attentions into Convolutional Layers in Single Path, https://arxiv.org/abs/2111.11802, Code: https://github.com/ziplab/SPViT
- Zuzana Jelčicová, Marian Verhelst, 2022, Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention, arXiv preprint arXiv:2204.03479, March 2022, https://arxiv.org/abs/2204.03479v1
- Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui, 2022, Width & Depth Pruning for Vision Transformers, Vol. 36 No. 3: AAAI-22 Technical Tracks 3 / AAAI Technical Track on Computer Vision III, DOI: https://doi.org/10.1609/aaai.v36i3.20222, https://ojs.aaai.org/index.php/AAAI/article/view/20222, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/20222/19981
- Z. Liu, F. Li, G. Li, and J. Cheng, 2021, EBERT: Efficient BERT Inference with Dynamic Structured Pruning, in Findings of the Association for Computational Linguistics: ACL-IJCNLP, 2021, pp. 4814–4823, https://aclanthology.org/2021.findings-acl.425/
- Guorun Wang, Jun Yang, Yaoru Sun, 2023, Task-oriented Memory-efficient Pruning-Adapter, arXiv preprint arXiv:2303.14704, Apr 2023, https://arxiv.org/abs/2303.14704
- Archit Parnami, Rahul Singh, Tarun Joshi, 2021, Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures, arXiv preprint arXiv:2110.15225, Nov 2021, https://arxiv.org/abs/2110.15225
- Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019, AutoInt: Automatic feature interaction learning via self-attentive neural networks, In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019, https://arxiv.org/abs/1810.11921, Code: https://github.com/DeepGraphLearning/RecommenderSystems
- Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, 2021, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
- Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation, CoRR, abs/2010.02416. https://arxiv.org/abs/2010.02416
- Maximiliana Behnke and Kenneth Heafield. 2020. Losing heads in the lottery: Pruning transformer attention in neural machine translation, In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2664–2674. Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-main.211/
- Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components, In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6019–6029. International Committee on Computational Linguistics. https://arxiv.org/abs/2011.03803v1
- Tobias Domhan. July 2018. How much attention do you need? a granular analysis of neural machine translation architectures, In ACL. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, https://aclanthology.org/P18-1167/
- Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1789–1798. Association for Computational Linguistics, https://arxiv.org/abs/1805.00631 (This paper proposes a simpler version of the attention heads, rather than just pruning them.)
- Shazeer, N. M., 2019, Fast transformer decoding: One write-head is all you need, ArXiv, abs/1911.02150, 2019, https://arxiv.org/abs/1911.02150
- Bapna, A., Arivazhagan, N., and Firat, O. 2020, Controlling computation versus quality for neural sequence models, ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls attention heads and FFN units.)
- Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, Token and Head Adaptive Transformers for Efficient Natural Language Processing, https://aclanthology.org/2022.coling-1.404/
- Zejiang Hou, Sun-Yuan Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
- Francois Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. 2021, Block pruning for faster transformers, arXiv preprint arXiv:2109.04838, 2021. https://arxiv.org/abs/2109.04838, Code: https://github.com/huggingface/nn_pruning
- Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving Transformer Models by Reordering their Sublayers, In Proceedings of ACL. Online, 2996–3005. https://doi.org/10.18653/v1/2020.acl-main.270, https://arxiv.org/abs/1911.03864 (Alternates layers of attention heads and FFN units, effectively pruning attention heads and FFN components from some layers.)
- Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer, arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Different Transformer architecture that includes removing attention heads and simplifies the FFN.)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022, A survey of transformers, AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of attention head components.)
- Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer, In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads, which is a lighter attention mode than full head pruning.)
- E Youn, S Prabhu, S Chen, 2023, Compressing Vision Transformers for Low-Resource Visual Learning, arXiv preprint arXiv:2309.02617, PDF: https://arxiv.org/pdf/2309.02617.pdf
- T Ding, X Zhang, C Hu, Y Shan, B Zhao, Y Lin, S Hu, 2023, DH-Net: Dynamic Head for Object Detection Network, In book: New Materials, Machinery and Vehicle Engineering, https://www.researchgate.net/publication/374183525_DH-Net_Dynamic_Head_for_Object_Detection_Network, PDF: https://ebooks.iospress.nl/pdf/doi/10.3233/ATDE230123 (Dynamic head pruning for image analysis.)
- Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, Lei Zhang, 2021, Dynamic Head: Unifying Object Detection Heads with Attentions, June 2021, https://arxiv.org/abs/2106.08322, Code: https://github.com/microsoft/DynamicHead (Combining heads, which is similar to removing attention heads in head pruning.)
For additional research papers on attention head pruning, see https://www.aussieai.com/research/head-pruning.
Slimmable Neural Networks
Slimmable neural networks use a type of dynamic width pruning: they can adjust their width at runtime, and are thus a form of adaptive inference. There are various research papers on these methods, but they have not achieved widespread use in industry models.
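A rough C++ sketch of the idea appears below, under the assumption of a single fully-connected layer and a runtime width multiplier (e.g. 0.25x, 0.5x, 1.0x); slimmable training details such as switchable batch normalization and joint training of the widths are not shown.

    // Slimmable layer sketch: the full-width weights are stored once, and a
    // runtime width multiplier selects how many output neurons to compute.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void slimmable_dense(
            const std::vector<float>& x,     // input, size n_in
            const std::vector<float>& W,     // n_out x n_in, row-major (full width)
            const std::vector<float>& bias,  // size n_out (full width)
            float width_multiplier,          // e.g. 0.25f, 0.5f, 1.0f
            std::vector<float>& y)           // resized to the active width
    {
        const std::size_t n_in = x.size();
        const std::size_t n_out = bias.size();
        const std::size_t active = std::max<std::size_t>(
            1, static_cast<std::size_t>(width_multiplier * n_out));
        y.assign(active, 0.0f);
        for (std::size_t j = 0; j < active; ++j) {  // only the first 'active' neurons
            float sum = bias[j];
            for (std::size_t i = 0; i < n_in; ++i)
                sum += W[j * n_in + i] * x[i];
            y[j] = sum;
        }
    }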
Some of the papers on “slimmer” or “thin” models via width pruning include:
- J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang. 2018, Slimmable neural networks, In International Conference on Learning Representations, 2018. https://arxiv.org/abs/1812.08928, Code: https://github.com/JiahuiYu/slimmable_networks (The earliest paper that coined the term “slimmable networks”.)
- J. Yu and T. S. Huang. 2019, Universally slimmable networks and improved training techniques, In IEEE International Conference on Computer Vision, pages 1803–1811, 2019. https://arxiv.org/abs/1903.05134, Code: https://github.com/JiahuiYu/slimmable_networks
- Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017, Learning efficient convolutional networks through network slimming, In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2755–2763. IEEE, 2017. https://arxiv.org/abs/1708.06519
- Sergio Guadarrama, Nathan Silberman, 2016, TensorFlow-Slim: A lightweight library for defining, training and evaluating complex models in TensorFlow, Google Research, Code: https://github.com/google-research/tf-slim
- J. Yu and T. S. Huang. 2019, Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers, Preprint arXiv:1903.11728, 2019. PDF: https://arxiv.org/pdf/1903.11728v1.pdf
- Jiahui Yu, Thomas Huang, June 2019, AutoSlim: Towards One-Shot Architecture Search for Channel Numbers, https://arxiv.org/abs/1903.11728, Code: https://github.com/JiahuiYu/slimmable_networks
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014. https://arxiv.org/abs/1412.6550
- Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen, 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks, Mar 2019, https://arxiv.org/abs/1801.04381, Code: https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet (MobileNetV2 uses some slimming techniques with TensorFlow-Slim.)
- Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, Xiaojun Chang, 2021, Dynamic Slimmable Network, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8607-8617, https://arxiv.org/abs/2103.13258, http://openaccess.thecvf.com/content/CVPR2021/html/Li_Dynamic_Slimmable_Network_CVPR_2021_paper.html, Code: https://github.com/changlin31/DS-Net
- Ting-Wu Chin, Ari S. Morcos, Diana Marculescu, 2020, Pareco: Pareto-aware channel optimization for slimmable neural networks, https://arxiv.org/abs/2007.11752v2, https://openreview.net/forum?id=SPyxaz_h9Nd
- Won Joon Yun; Yunseok Kwak; Hankyul Baek; Soyi Jung; Mingyue Ji; Mehdi Bennis; Jihong Park, 2023, SlimFL: Federated learning with superposition coding over slimmable neural networks, IEEE/ACM Transactions on Networking (Early Access), DOI: 10.1109/TNET.2022.3231864 https://ieeexplore.ieee.org/document/10004844 https://arxiv.org/abs/2203.14094
- Ting-Wu Chin, Ari S. Morcos & Diana Marculescu, 2021, Joslim: Joint Widths and Weights Optimization for Slimmable Neural Networks, Lecture Notes in Computer Science book series (LNAI,volume 12977) Springer, https://link.springer.com/chapter/10.1007/978-3-030-86523-8_8, https://arxiv.org/pdf/2007.11752, Code: https://github.com/cmu-enyac/Joslim
- Hideaki Kuratsu, Atsuyoshi Nakamura, 2022, Slimmable Pruned Neural Networks, arXiv preprint arXiv:2212.03415, https://arxiv.org/abs/2212.03415 Code: https://github.com/hideakikuratsu/SP-Net
- A Ozerov, A Lambert, SK Kumaraswamy, 2021, ParaDiS: Parallelly Distributable Slimmable Neural Networks, arXiv preprint arXiv:2110.02724, https://arxiv.org/abs/2110.02724
- L Hou, Z Yuan, L Huang, H Shen, X Cheng, 2021, Slimmable generative adversarial networks, Proceedings of the AAAI Conference on Artificial Intelligence, 35, No. 9: AAAI-21 Technical Tracks 9, https://ojs.aaai.org/index.php/AAAI/article/view/16946, https://arxiv.org/abs/2012.05660
- F Yang, L Herranz, Y Cheng, 2021, Slimmable compressive autoencoders for practical neural image compression, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4998-5007, http://openaccess.thecvf.com/content/CVPR2021/html/Yang_Slimmable_Compressive_Autoencoders_for_Practical_Neural_Image_Compression_CVPR_2021_paper.html , https://arxiv.org/abs/2103.15726
- Zuhaib Akhtar, Mohammad Omar Khursheed, Dongsu Du, Yuzong Liu, Apr 2023, Small-footprint slimmable networks for keyword spotting, ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://arxiv.org/abs/2304.12183
- Yu, Jiahui, 2019, Slimmable neural networks for edge devices, Masters Thesis, Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, https://www.ideals.illinois.edu/items/112252
- Rang Meng, Weijie Chen, Shicai Yang, Jie Song, Luojun Lin, Di Xie, Shiliang Pu, Xinchao Wang, Mingli Song, Yueting Zhuang, June 2022, Slimmable domain adaptation, http://openaccess.thecvf.com/content/CVPR2022/html/Meng_Slimmable_Domain_Adaptation_CVPR_2022_paper.html, https://arxiv.org/abs/2206.06620, Code: https://github.com/hikvision-research/SlimDA
- Juliano S. Assine, J. C. S. Santos Filho, Eduardo Valle, Marco Levorato, 2023, Slimmable Encoders for Flexible Split DNNs in Bandwidth and Resource Constrained IoT Systems, arXiv preprint arXiv:2306.12691, June 2023, https://arxiv.org/abs/2306.12691
- Li Yang, Zhezhi He, Yu Cao, Deliang Fan, Sep 2020, A Progressive Sub-Network Searching Framework for Dynamic Inference, https://arxiv.org/abs/2009.05681
- Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices, In AAAI’18, https://arxiv.org/abs/1708.04728
- Hankook Lee and Jinwoo Shin. 2018. Anytime neural prediction via slicing networks vertically, arXiv preprint arXiv:1807.02609. https://arxiv.org/abs/1807.02609
- Junjie He, Yinzhang Ding, Ming Zhang, Dongxiao Li, 2022, Towards efficient network compression via Few-Shot Slimming, Neural Networks, Volume 147, 2022, pp. 113-125, https://doi.org/10.1016/j.neunet.2021.12.011, https://pubmed.ncbi.nlm.nih.gov/34999388/
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks, https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017, Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017), https://doi.org/10.48550/ARXIV.1704.04861, https://arxiv.org/abs/1704.04861 (Combines depthwise separable convolutions and thinning at each layer.)
- N Penkov, K Balaskas, M Rapp, J Henkel, 2023, Differentiable Slimming for Memory-Efficient Transformers, IEEE Embedded Systems Letters (Early Access), DOI: 10.1109/LES.2023.3299638, https://ieeexplore.ieee.org/abstract/document/10261943
For additional research papers on slimmable networks, see https://www.aussieai.com/research/width-pruning#slimmable.
Filter Pruning
Filter pruning is a type of structured pruning for Convolutional Neural Networks (CNNs); it does not apply to Transformer architectures. It narrows the width of a CNN and thereby reduces the required computation.
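As a concrete illustration, here is a minimal C++ sketch of magnitude-based filter pruning: filters (output channels) with the smallest L1 norms are deleted from a convolution layer's weight tensor. The flattened [out_ch][in_ch][k][k] layout and the L1 ranking heuristic are assumptions for illustration; the research methods differ mainly in how they score filters.

    // Filter pruning sketch: rank filters by L1 norm and keep the largest ones,
    // producing a smaller weight tensor with fewer output channels.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <utility>
    #include <vector>

    std::vector<float> prune_filters_by_l1(
            const std::vector<float>& W,        // out_ch * in_ch * k * k, flattened
            std::size_t out_ch, std::size_t in_ch, std::size_t k,
            std::size_t keep_count,             // number of filters to keep
            std::vector<std::size_t>& kept)     // returns surviving filter indices
    {
        const std::size_t filter_size = in_ch * k * k;
        // Score each filter (output channel) by the L1 norm of its weights.
        std::vector<std::pair<float, std::size_t>> scores;
        for (std::size_t f = 0; f < out_ch; ++f) {
            float l1 = 0.0f;
            for (std::size_t i = 0; i < filter_size; ++i)
                l1 += std::fabs(W[f * filter_size + i]);
            scores.push_back({l1, f});
        }
        std::sort(scores.rbegin(), scores.rend());  // largest norms first
        kept.clear();
        for (std::size_t i = 0; i < keep_count && i < out_ch; ++i)
            kept.push_back(scores[i].second);
        std::sort(kept.begin(), kept.end());        // restore original filter order
        // Copy the surviving filters into a smaller weight tensor.
        std::vector<float> pruned;
        pruned.reserve(kept.size() * filter_size);
        for (std::size_t f : kept)
            pruned.insert(pruned.end(),
                          W.begin() + f * filter_size,
                          W.begin() + (f + 1) * filter_size);
        return pruned;   // new shape: kept.size() x in_ch x k x k
    }

After pruning, the next layer's kernels must also drop the corresponding input-channel slices so that the shapes still match.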
Research papers on filter pruning:
- Q. Huang, K. Zhou, S. You, and U. Neumann, 2018, Learning to prune filters in convolutional neural networks, in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 709–718, https://arxiv.org/abs/1801.07365
- Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, Thinet: A filter level pruning method for deep neural network compression, arXiv preprint arXiv:1707.06342, July 2017. https://arxiv.org/abs/1707.06342 (Filter pruning.)
- Z. You, K. Yan, J. Ye, M. Ma and P. Wang, 2019, Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks, arXiv:1909.08174, 2019. https://arxiv.org/abs/1909.08174
- Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., and Hu, X. 2020, Pruning from scratch, In AAAI, 2020. https://arxiv.org/abs/1909.12579
- He, Y., Liu, P., Wang, Z., Hu, Z., and Yang, Y., 2019, Filter pruning via geometric median for deep convolutional neural networks acceleration, In CVPR, 2019. https://arxiv.org/abs/1811.00250, Code: https://github.com/he-y/filter-pruning-geometric-median
- Wang, W., Fu, C., Guo, J., Cai, D., and He, X., 2019, COP: customized deep model compression via regularized correlation-based filter-level pruning, In IJCAI, 2019. https://arxiv.org/abs/1906.10337
- Luo, J., Zhang, H., Zhou, H., Xie, C., Wu, J., and Lin, W., 2019, Thinet: Pruning CNN filters for a thinner net, TPAMI, 2019. https://ieeexplore.ieee.org/document/8416559
- Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J., 2017, Pruning convolutional neural networks for resource efficient inference, In ICLR, 2017. https://arxiv.org/abs/1611.06440
- Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf, 2017, Pruning Filters for Efficient ConvNets, In ICLR, 2017. https://arxiv.org/abs/1608.08710
- Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, Ling Shao, 2020, HRank: Filter pruning using high-rank feature map, in CVPR, 2020. https://arxiv.org/abs/2002.10179, Code: https://github.com/lmbxmu/HRank
- Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, Jan Kautz, June 2019, Importance Estimation for Neural Network Pruning, in CVPR, 2019. https://arxiv.org/abs/1906.10771
- Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, David Doermann, Mar 2019, Towards Optimal Structured CNN Pruning via Generative Adversarial Learning, https://arxiv.org/abs/1903.09291
- Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang Zhang. 2018, Accelerating convolutional networks via global & dynamic filter pruning, In IJCAI, pages 2425–2432, 2018. https://typeset.io/papers/accelerating-convolutional-networks-via-global-dynamic-38cal1vpjb
- Lucas Liebenwein, Cenk Baykal, Harry Lang, Dan Feldman, Daniela Rus, Mar 2020, Provable Filter Pruning for Efficient Neural Networks, in ICLR, 2020 https://arxiv.org/abs/1911.07412
- B Wang, C Ma, B Liu, N Liu, J Zhu, Oct 2023, Filter Pruning For CNN With Enhanced Linear Representation Redundancy, arXiv preprint arXiv:2310.06344, https://arxiv.org/pdf/2310.06344.pdf
For additional research papers on filter pruning, see https://www.aussieai.com/research/width-pruning#filter.
Channel Pruning
Channel pruning is a structured pruning method for CNNs. By reducing the number of channels in a feature map, the “width” of the network is reduced. This method does not apply to Transformer architectures, but is analogous to attention head pruning.
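For illustration, a minimal C++ sketch follows: a feature map is narrowed by keeping only the listed channels, which is the data-side half of channel pruning (the next layer's kernels shed the matching input-channel slices offline). The channel-major [C][H][W] flattened layout is an assumption for illustration.

    // Channel pruning sketch: drop feature-map channels not in the keep-list,
    // narrowing the tensor that the next layer must process.
    #include <cassert>
    #include <cstddef>
    #include <vector>

    std::vector<float> prune_channels(
            const std::vector<float>& feature_map,  // C * H * W, channel-major
            std::size_t C, std::size_t H, std::size_t W,
            const std::vector<std::size_t>& keep)   // surviving channel indices
    {
        const std::size_t plane = H * W;
        std::vector<float> out;
        out.reserve(keep.size() * plane);
        for (std::size_t c : keep) {
            assert(c < C);  // keep-list must reference valid channels
            out.insert(out.end(),
                       feature_map.begin() + c * plane,
                       feature_map.begin() + (c + 1) * plane);
        }
        return out;  // new shape: keep.size() x H x W
    }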
Channel pruning is a type of width pruning. Research papers on channel pruning:
- Y. He, X. Zhang, and J. Sun, 2017, Channel pruning for accelerating very deep neural networks, in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 1398–1406, https://arxiv.org/abs/1707.06168
- J. Ye, X. Lu, Z. Lin, and J. Z. Wang, 2018, Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers, arXiv preprint arXiv:1802.00124, 2018. https://arxiv.org/abs/1802.00124
- N. Lee, T. Ajanthan, and P. H. Torr, 2018, Snip: Single-shot network pruning based on connection sensitivity, arXiv preprint arXiv:1810.02340, 2018. https://arxiv.org/abs/1810.02340
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017, Shufflenet: An extremely efficient convolutional neural network for mobile devices, CoRR, abs/1707.01083, 2017. https://arxiv.org/abs/1707.01083 (Uses “channel shuffle” which is similar to channel pruning.)
- W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. 2016, Learning structured sparsity in deep neural networks, In Advances in Neural Information Processing Systems, pages 2074–2082, 2016, https://arxiv.org/abs/1608.03665 Code: https://github.com/wenwei202/caffe/tree/scnn
- Min Wang, Baoyuan Liu, and Hassan Foroosh. 2016, Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial ”bottleneck” structure, CoRR, abs/1608.04337, 2016, https://arxiv.org/abs/1608.04337
- Mingbao Lin, Rongrong Ji, Yuxin Zhang, Baochang Zhang, Yongjian Wu, Yonghong Tian, June 2020, Channel Pruning via Automatic Structure Search, arXiv preprint, https://arxiv.org/abs/2001.08565
- Jing Liu; Bohan Zhuang; Zhuangwei Zhuang; Yong Guo; Junzhou Hua, 2018, Discrimination-aware channel pruning for deep neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume 44, Issue 8, 01 August 2022), https://ieeexplore.ieee.org/abstract/document/9384353
- Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, Cheng-zhong Xu, Jan 2019, Dynamic Channel Pruning: Feature Boosting and Suppression, https://arxiv.org/abs/1810.05331
- Hanyu Peng, Jiaxiang Wu, Shifeng Chen, Junzhou Huang, 2019, Collaborative channel pruning for deep networks, Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5113-5122, 2019. http://proceedings.mlr.press/v97/peng19c/peng19c.pdf
- Jinyang Guo; Weichen Zhang; Wanli Ouyang; Dong Xu, 2021, Model compression using progressive channel pruning, IEEE Transactions on Circuits and Systems for Video Technology (Volume 31, Issue 3, March 2021), https://ieeexplore.ieee.org/document/9097925
- J Guo, W Ouyang, D Xu, 2020, Channel pruning guided by classification loss and feature importance, Proceedings of the AAAI Conference, Association for the Advancement of Artificial Intelligence, https://aaai.org/ojs/index.php/AAAI/article/view/6720/6574
- Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, Jian Sun, Aug 2019, MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning, https://arxiv.org/abs/1903.10258, http://openaccess.thecvf.com/content_ICCV_2019/papers/Liu_MetaPruning_Meta_Learning_for_Automatic_Neural_Network_Channel_Pruning_ICCV_2019_paper.pdf
- Yaomin Huang, Ning Liu, Zhengping Che, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Guixu Zhang, Xinmei Liu, Feifei Feng, Jian Tang, 2023, CP3: Channel Pruning Plug-In for Point-Based Networks, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://ieeexplore.ieee.org/document/10203696, https://arxiv.org/abs/2303.13097, http://openaccess.thecvf.com/content/CVPR2023/papers/Huang_CP3_Channel_Pruning_Plug-In_for_Point-Based_Networks_CVPR_2023_paper.pdf
- Y Liu, D Wu, W Zhou, K Fan, Z Zhou, 2023, EACP: An effective automatic channel pruning for neural networks, Neurocomputing, Volume 526, 14 March 2023, Pages 131-142, https://www.sciencedirect.com/science/article/pii/S0925231223000255
- Hancheng Ye, Bo Zhang, Tao Chen, Jiayuan Fan, Bin Wang, 2023, Performance-aware Approximation of Global Channel Pruning for Multitask CNNs, IEEE Transactions on Pattern Analysis and Machine Intelligence ( Volume: 45, Issue: 8, August 2023), https://ieeexplore.ieee.org/document/10083285, https://arxiv.org/abs/2303.11923, Code: http://www.github.com/HankYe/PAGCP.git
- Kang, M. and Han, B., 2020, Operation-aware soft channel pruning using differentiable masks, In ICML, 2020, https://arxiv.org/abs/2007.03938
- Manoj Alwani, Yang Wang, Vashisht Madhavan, Feb 2022, DECORE: Deep Compression with Reinforcement Learning, in CVPR, 2022, https://arxiv.org/abs/2106.06091
- Shixing Yu, Zhewei Yao, Amir Gholami, Zhen Dong, Sehoon Kim, Michael W Mahoney, Kurt Keutzer, 2022, Hessian-Aware Pruning and Optimal Neural Implant, in WACV, 2022. https://arxiv.org/abs/2101.08940
- Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, and G Edward Suh. 2019, Channel gating neural networks, NeurIPS, pages 1884–1894, 2019, https://arxiv.org/abs/1805.12549
- Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, and Stephen Lin. 2020. Spatially adaptive inference with stochastic feature sampling and interpolation, arXiv preprint arXiv:2003.08866, https://arxiv.org/abs/2003.08866
- Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017, Runtime neural pruning, In NeurIPS, pages 2181–2191, 2017. PDF: https://dl.acm.org/doi/pdf/10.5555/3294771.3294979
For additional research papers on channel pruning, see https://www.aussieai.com/research/width-pruning#channel.