Aussie AI
Dual Pruning (Width + Depth)
Last Updated 14 October, 2024
by David Spuler, Ph.D.
Dual pruning is the combination of width pruning and depth pruning. Depth pruning removes or skips layers of the model, as in layer pruning, early exiting, or the shallow decoder architecture. Width pruning techniques include attention head pruning, slimmable networks, channel pruning, filter pruning, and other strategies. Some papers also describe combinations of multiple pruning strategies as "hybrid pruning".
Depth and width pruning are orthogonal strategies, and so can be combined in various ways. The obvious extension is "triple axis pruning", which also adds length pruning (such as token pruning or embeddings pruning); see Triple pruning.
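As a concrete illustration of how the two axes compose, here is a minimal toy sketch in plain NumPy (not any real model or library; the dimensions, random weights, head count, and exit threshold are all illustrative assumptions). The active_heads set restricts which attention heads are evaluated in each layer (width pruning), while a simple hidden-state saturation test triggers an early exit before all layers have run (depth pruning).

    import numpy as np

    rng = np.random.default_rng(0)

    NUM_LAYERS, NUM_HEADS, D_MODEL = 6, 8, 64
    D_HEAD = D_MODEL // NUM_HEADS

    # Random weights standing in for a trained model (illustration only).
    W_heads = rng.standard_normal((NUM_LAYERS, NUM_HEADS, D_MODEL, D_HEAD)) * 0.05
    W_out = rng.standard_normal((NUM_LAYERS, D_MODEL, D_MODEL)) * 0.05

    def layer_forward(x, layer, active_heads):
        # Width pruning: only the unpruned heads are computed.
        head_outputs = np.zeros(D_MODEL)
        for h in active_heads:
            head_outputs[h * D_HEAD:(h + 1) * D_HEAD] = x @ W_heads[layer, h]
        return x + head_outputs @ W_out[layer]  # residual connection

    def dual_pruned_forward(x, active_heads, exit_threshold):
        # Depth pruning: exit early once a layer barely changes the hidden state.
        for layer in range(NUM_LAYERS):
            new_x = layer_forward(x, layer, active_heads)
            if np.linalg.norm(new_x - x) < exit_threshold:
                return new_x, layer + 1
            x = new_x
        return x, NUM_LAYERS

    x = rng.standard_normal(D_MODEL)
    out, layers_used = dual_pruned_forward(x, active_heads=range(4), exit_threshold=0.5)
    print(f"Evaluated {layers_used}/{NUM_LAYERS} layers and 4/{NUM_HEADS} heads per layer")

In a real Transformer, the head subset would typically come from learned gates or importance scores rather than a fixed range, and the exit decision from a trained classifier or confidence estimate rather than a raw norm test, but the structure is the same: one mechanism narrows each layer, another shortens the stack.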
Research on Dual Pruning
Research papers on two-dimensional model pruning, combining the width and depth dimensions:
- Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, 2020, JSTSP, https://arxiv.org/abs/1907.04523
- X. Xu, M. S. Park, and C. Brick, Hybrid pruning: Thinner sparse networks for fast inference on edge devices, in ICLR, 2018, https://arxiv.org/abs/1811.00482
- Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K Jha, Fully dynamic inference with deep neural networks, 2021, IEEE Transactions on Emerging Topics in Computing, https://arxiv.org/abs/2007.15151
- Ali Ehteshami Bejnordi and Ralf Krestel, Dynamic channel and layer gating in convolutional neural networks, 2020, In KI, https://dl.acm.org/doi/10.1007/978-3-030-58285-2_3
- Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui, Width & Depth Pruning for Vision Transformers, Vol. 36 No. 3: AAAI-22 Technical Tracks 3 / AAAI Technical Track on Computer Vision III, DOI: https://doi.org/10.1609/aaai.v36i3.20222, https://ojs.aaai.org/index.php/AAAI/article/view/20222, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/20222/19981
- Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with Adaptive Width and Depth. arXiv preprint arXiv:2004.04037 (Oct 2020), https://arxiv.org/abs/2004.04037, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT
- H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/1908.09791
- H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Annual Meeting of the Association for Computational Linguistics, 2020. https://aclanthology.org/2020.acl-main.686/
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014. https://arxiv.org/abs/1412.6550
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width- and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
- Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
- Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html (Early exit of layers and network selection.)
- Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017). https://doi.org/10.48550/ARXIV.1704.04861, https://arxiv.org/abs/1704.04861 (Combines depthwise separable convolutions and thinning at each layer.)
- Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In NIPS, 2016. https://github.com/wenwei202/caffe/tree/scnn (Combined filter and layer pruning.)
- Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., and Doermann, D. S. Towards optimal structured CNN pruning via generative adversarial learning. In CVPR, 2019. https://arxiv.org/abs/1903.09291 (Similar to a combined filter and layer pruning algorithm.)
- Zehao Huang and Naiyan Wang, “Data-driven sparse structure selection for deep neural networks,” in ECCV, 2018. https://arxiv.org/abs/1707.01213, Code: https://github.com/huangzehao/sparse-structure-selection (Not typical width-depth pruning, but a combined pruning that uses sparsification to force weight structures to zero, allowing pruning of whole structures.)
- Tianli Zhao, Xi Sheryl Zhang, Wentao Zhu, Jiaxing Wang, Sen Yang, Ji Liu, Jian Cheng, Nov 2021, Joint Channel and Weight Pruning for Model Acceleration on Mobile Devices, https://arxiv.org/abs/2110.08013
- H Litao, X Fei, G Xiaoyang, Y Tingting, June 2023, Research on Model Compression Method Based on Deep Separable Convolutional and Pruning, Available at SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4478190 (Somewhat related to dual pruning; also adds quantization.)
- Xiaoying Zhi, Varun Babbar, Pheobe Sun, Fran Silavong, Ruibo Shi, Sean Moran, Sep 2023, A New Baseline for GreenAI: Finding the Optimal Sub-Network via Layer and Channel Pruning, https://arxiv.org/abs/2302.10798 (Layer and channel pruning combined.)
- Dieter Verbruggen, Sofie Pollin, Hazem Sallouha, 6 May 2024, Computational Efficient Width-Wise Early Exits in Modulation Classification, https://arxiv.org/abs/2405.03222
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Snehasis Dey, 2024, Differentiable Slimming for Transformers with Improved Memory Efficiency, College of Engineering Bhubaneswar, Odisha, https://ijte.in/pdf/EE14.pdf (Dual pruning by attention head pruning for slimmable networks combined with layer pruning.)
- Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computations on two dimensions earlier in the inference computations.)
- Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
- David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793, Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Ting Hu, Christoph Meinel, Haojin Yang, 2024, A flexible BERT model enabling width- and depth-dynamic inference, Computer Speech & Language 4 April 2024, 101646, https://www.sciencedirect.com/science/article/pii/S0885230824000299 (Dual pruning method with layerwise "neural grafting" that gives dynamic width models, and combined with early exit on the depth dimension.)
- Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song, 5 Feb 2024, Shortened LLaMA: A Simple Depth Pruning for Large Language Models, https://arxiv.org/abs/2402.02834 (Analysis of dual pruning with combined depth and width pruning.)
- Basar Kutukcu, Sabur Baidya, Sujit Dey, 2024, SLEXNet: Adaptive Inference Using Slimmable Early Exit Neural Networks, https://doi.org/10.1145/3689632 https://dl.acm.org/doi/pdf/10.1145/3689632 (Combined width and depth pruning with slimmable and early exit networks.)
- Dhananjay Saikumar, Blesson Varghese, 1 Apr 2024, DRIVE: Dual Gradient-Based Rapid Iterative Pruning, https://arxiv.org/abs/2404.03687
- Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
- Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
- Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein, 9 Oct 2024, LLM Compression with Neural Architecture Search, https://arxiv.org/abs/2410.06479 (NAS with width/attention head and layer pruning.)
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Pruning Research
Read more about:
- Triple pruning
- Model pruning overview
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning