Aussie AI
Dual Pruning (Width + Depth)
Last Updated 14 October, 2024
by David Spuler, Ph.D.
Dual pruning is the combination of width pruning and depth pruning. Depth pruning removes or skips layers of the model, as in layer pruning, early exiting, or the shallow decoder architecture. Width pruning techniques include attention head pruning, slimmable networks, channel pruning, filter pruning, and other strategies. Some papers also describe combinations of multiple pruning strategies as "hybrid pruning".
Depth and width pruning are orthogonal strategies, and so can be combined in various ways. The obvious extension is "triple axis pruning", which also adds length pruning (such as token pruning or embeddings pruning); see Triple pruning.
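As a concrete illustration of how the two axes compose, here is a minimal toy sketch in plain NumPy (not any real model or library; the dimensions, random weights, head count, and exit threshold are all illustrative assumptions). The active_heads set restricts which attention heads are evaluated in each layer (width pruning), while a simple hidden-state saturation test triggers an early exit before all layers have run (depth pruning).

    import numpy as np

    rng = np.random.default_rng(0)

    NUM_LAYERS, NUM_HEADS, D_MODEL = 6, 8, 64
    D_HEAD = D_MODEL // NUM_HEADS

    # Random weights standing in for a trained model (illustration only).
    W_heads = rng.standard_normal((NUM_LAYERS, NUM_HEADS, D_MODEL, D_HEAD)) * 0.05
    W_out = rng.standard_normal((NUM_LAYERS, D_MODEL, D_MODEL)) * 0.05

    def layer_forward(x, layer, active_heads):
        # Width pruning: only the unpruned heads are computed.
        head_outputs = np.zeros(D_MODEL)
        for h in active_heads:
            head_outputs[h * D_HEAD:(h + 1) * D_HEAD] = x @ W_heads[layer, h]
        return x + head_outputs @ W_out[layer]  # residual connection

    def dual_pruned_forward(x, active_heads, exit_threshold):
        # Depth pruning: exit early once a layer barely changes the hidden state.
        for layer in range(NUM_LAYERS):
            new_x = layer_forward(x, layer, active_heads)
            if np.linalg.norm(new_x - x) < exit_threshold:
                return new_x, layer + 1
            x = new_x
        return x, NUM_LAYERS

    x = rng.standard_normal(D_MODEL)
    out, layers_used = dual_pruned_forward(x, active_heads=range(4), exit_threshold=0.5)
    print(f"Evaluated {layers_used}/{NUM_LAYERS} layers and 4/{NUM_HEADS} heads per layer")

In a real Transformer, the head subset would typically come from learned gates or importance scores rather than a fixed range, and the exit decision from a trained classifier or confidence estimate rather than a raw norm test, but the structure is the same: one mechanism narrows each layer, another shortens the stack.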
Research on Dual Pruning
Research papers on two-dimensional model pruning, combining the width and depth dimensions:
- Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, 2020, JSTSP, https://arxiv.org/abs/1907.04523
- X. Xu, M. S. Park, and C. Brick, Hybrid pruning: Thinner sparse networks for fast inference on edge devices, in ICLR, 2018, https://arxiv.org/abs/1811.00482
- Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K Jha, Fully dynamic inference with deep neural networks, 2021, IEEE Transactions on Emerging Topics in Computing, https://arxiv.org/abs/2007.15151
- Ali Ehteshami Bejnordi and Ralf Krestel, Dynamic channel and layer gating in convolutional neural networks, 2020, In KI, https://dl.acm.org/doi/10.1007/978-3-030-58285-2_3
- Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui, Width & Depth Pruning for Vision Transformers, Vol. 36 No. 3: AAAI-22 Technical Tracks 3 / AAAI Technical Track on Computer Vision III, DOI: https://doi.org/10.1609/aaai.v36i3.20222, https://ojs.aaai.org/index.php/AAAI/article/view/20222, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/20222/19981
- Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with Adaptive Width and Depth. arXiv preprint arXiv:2004.04037 (Oct 2020), https://arxiv.org/abs/2004.04037, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT
- H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/1908.09791
- H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Annual Meeting of the Association for Computational Linguistics, 2020. https://aclanthology.org/2020.acl-main.686/
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014. https://arxiv.org/abs/1412.6550
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width- and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
- Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
- Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html (Early exit of layers and network selection.)
- Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017). https://doi.org/10.48550/ARXIV.1704.04861, https://arxiv.org/abs/1704.04861 (Combines depthwise separable convolutions and thinning at each layer.)
- Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In NIPS, 2016. https://github.com/wenwei202/caffe/tree/scnn (Combined filter and layer pruning.)
- Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., and Doermann, D. S. Towards optimal structured CNN pruning via generative adversarial learning. In CVPR, 2019. https://arxiv.org/abs/1903.09291 (Similar to a combined filter and layer pruning algorithm.)
- Zehao Huang and Naiyan Wang, “Data-driven sparse structure selection for deep neural networks,” in ECCV, 2018. https://arxiv.org/abs/1707.01213, Code: https://github.com/huangzehao/sparse-structure-selection (Not typical width-depth pruning, but a combined pruning that uses sparsification to force weight structures to zero, allowing pruning of whole structures.)
- Tianli Zhao, Xi Sheryl Zhang, Wentao Zhu, Jiaxing Wang, Sen Yang, Ji Liu, Jian Cheng, Nov 2021, Joint Channel and Weight Pruning for Model Acceleration on Mobile Devices, https://arxiv.org/abs/2110.08013
- H Litao, X Fei, G Xiaoyang, Y Tingting, June 2023, Research on Model Compression Method Based on Deep Separable Convolutional and Pruning, Available at SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4478190 (Somewhat related to dual pruning; also adds quantization.)
- Xiaoying Zhi, Varun Babbar, Pheobe Sun, Fran Silavong, Ruibo Shi, Sean Moran, Sep 2023, A New Baseline for GreenAI: Finding the Optimal Sub-Network via Layer and Channel Pruning, https://arxiv.org/abs/2302.10798 (Layer and channel pruning combined.)
- Dieter Verbruggen, Sofie Pollin, Hazem Sallouha, 6 May 2024, Computational Efficient Width-Wise Early Exits in Modulation Classification, https://arxiv.org/abs/2405.03222
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Snehasis Dey, 2024, Differentiable Slimming for Transformers with Improved Memory Efficiency, College of Engineering Bhubaneswar, Odisha, https://ijte.in/pdf/EE14.pdf (Dual pruning by attention head pruning for slimmable networks combined with layer pruning.)
- Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computations on two dimensions earlier in the inference computations.)
- Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
- David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793, Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Ting Hu, Christoph Meinel, Haojin Yang, 2024, A flexible BERT model enabling width- and depth-dynamic inference, Computer Speech & Language 4 April 2024, 101646, https://www.sciencedirect.com/science/article/pii/S0885230824000299 (Dual pruning method with layerwise "neural grafting" that gives dynamic width models, and combined with early exit on the depth dimension.)
- Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song, 5 Feb 2024, Shortened LLaMA: A Simple Depth Pruning for Large Language Models, https://arxiv.org/abs/2402.02834 (Analysis of dual pruning with combined depth and width pruning.)
- Basar Kutukcu, Sabur Baidya, Sujit Dey, 2024, SLEXNet: Adaptive Inference Using Slimmable Early Exit Neural Networks, https://doi.org/10.1145/3689632 https://dl.acm.org/doi/pdf/10.1145/3689632 (Combined width and depth pruning with slimmable and early exit networks.)
- Dhananjay Saikumar, Blesson Varghese, 1 Apr 2024, DRIVE: Dual Gradient-Based Rapid Iterative Pruning, https://arxiv.org/abs/2404.03687
- Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
- Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
- Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein, 9 Oct 2024, LLM Compression with Neural Architecture Search, https://arxiv.org/abs/2410.06479 (NAS with width/attention head and layer pruning.)
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Pruning Research
Read more about:
- Triple pruning
- Model pruning overview
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning