Aussie AI

Width Pruning

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Width pruning is a type of structured pruning that reduces the "width" of the internal layers of a neural network. In early models, the "width" was the number of internal neurons in the hidden layers, although in newer Transformers, it often refers to the number of attention heads. Like all types of pruning, the goal is model compression and faster inference by avoiding computation associated with the pruned parts.
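
As a concrete illustration, below is a minimal NumPy sketch of attention-head pruning, one common form of width pruning in Transformers. The layer sizes, the choice of heads to keep, and the plain-matrix layout are illustrative assumptions, not any particular method from the literature.

    import numpy as np

    # Toy multi-head attention projections: d_model=8, 4 heads of size 2.
    d_model, n_heads, d_head = 8, 4, 2
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((d_model, n_heads * d_head))
    W_k = rng.standard_normal((d_model, n_heads * d_head))
    W_v = rng.standard_normal((d_model, n_heads * d_head))
    W_o = rng.standard_normal((n_heads * d_head, d_model))

    # Keep only heads 0 and 2 (the choice is illustrative; in practice it
    # would come from an importance score computed per head).
    keep_heads = [0, 2]
    cols = np.concatenate([np.arange(h * d_head, (h + 1) * d_head) for h in keep_heads])

    # Pruning a head deletes its slice of the Q/K/V projections and the
    # matching rows of the output projection, shrinking all four matrices.
    W_q, W_k, W_v = W_q[:, cols], W_k[:, cols], W_v[:, cols]
    W_o = W_o[cols, :]

    print(W_q.shape, W_o.shape)   # (8, 4) (4, 8) -- two heads pruned away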

Width pruning is orthogonal to "layer pruning" (depth pruning), in which whole internal layers are removed. There is also "length pruning", such as token pruning and embeddings pruning.

Width pruning can mean several different things that all affect the fan-out of data as it propagates through the hidden layers. The types and names include channel pruning, filter pruning, attention head pruning, and slimmable networks, each covered in the sections below.

Like all pruning methods, width pruning can be static or dynamic. Static width pruning removes channels or attention heads from the model during training or shortly afterwards; it is conceptually equivalent to choosing a smaller width hyper-parameter as part of neural architecture search. Dynamic width pruning creates a selective inference path by reducing the width at runtime, blocking or bypassing some elements of the model, usually depending on the input sequence.
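
As a rough sketch of the static case, the example below removes hidden neurons from a two-layer feed-forward block after training. The weight-magnitude importance score and the 50% keep ratio are illustrative assumptions; published methods use a variety of importance criteria.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 32
    W1 = rng.standard_normal((d_model, d_ff))   # up-projection into the hidden layer
    W2 = rng.standard_normal((d_ff, d_model))   # down-projection back out

    # Score each hidden neuron by the magnitude of its weights (a simple
    # heuristic; real methods also use activation or gradient statistics).
    scores = np.linalg.norm(W1, axis=0) * np.linalg.norm(W2, axis=1)

    # Keep the top half of the neurons -- this becomes the new, smaller width.
    keep = np.sort(np.argsort(scores)[-d_ff // 2:])
    W1_pruned = W1[:, keep]   # drop the pruned neurons' columns from W1
    W2_pruned = W2[keep, :]   # drop the matching rows from W2

    print(W1_pruned.shape, W2_pruned.shape)   # (8, 16) (16, 8)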

Slimmable Networks (Dynamic Width Pruning)
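
Slimmable networks train a single set of weights that can be executed at several different widths, so the model can be "slimmed" at inference time to trade accuracy for speed. Below is a minimal sketch of the inference-time idea only; the layer sizes and width fractions are illustrative assumptions, and real slimmable networks also need extra machinery such as width-specific normalization statistics, as described in the papers listed.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out = 16, 64, 16
    W1 = rng.standard_normal((d_in, d_hidden))
    W2 = rng.standard_normal((d_hidden, d_out))

    def forward(x, width_fraction=1.0):
        """Run the block using only the first fraction of the hidden units."""
        k = max(1, int(d_hidden * width_fraction))
        h = np.maximum(x @ W1[:, :k], 0.0)   # ReLU over the sliced hidden layer
        return h @ W2[:k, :]

    x = rng.standard_normal((1, d_in))
    y_full = forward(x, 1.0)    # full width: all 64 hidden units
    y_slim = forward(x, 0.25)   # slimmed: only the first 16 hidden units
    print(y_full.shape, y_slim.shape)   # (1, 16) (1, 16)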

Some of the papers on "slimmer" or "thin" models include:

Channel Pruning (Width Pruning)

Various research papers examine the method of "channel pruning", which is a type of width pruning.
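
As a minimal sketch, channel pruning of a convolution layer can be viewed as deleting slices of the weight tensors: removing a channel shrinks both the layer that produces it and the layer that consumes it. The L1-norm ranking and the number of channels kept below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Convolution weights laid out as (out_channels, in_channels, kH, kW).
    conv1 = rng.standard_normal((16, 3, 3, 3))
    conv2 = rng.standard_normal((32, 16, 3, 3))   # consumes conv1's 16 output channels

    # Rank conv1's channels (filters) by L1 norm and keep the strongest 8.
    importance = np.abs(conv1).sum(axis=(1, 2, 3))
    keep = np.sort(np.argsort(importance)[-8:])

    # Removing a channel deletes a whole filter from conv1 and the matching
    # input slice of conv2, so both layers shrink.
    conv1_pruned = conv1[keep]
    conv2_pruned = conv2[:, keep]

    print(conv1_pruned.shape, conv2_pruned.shape)   # (8, 3, 3, 3) (32, 8, 3, 3)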

Filter Pruning

Filter pruning is another type of width pruning: removing a filter from a convolutional layer deletes one output channel, so it has much the same effect as the channel pruning sketch above. Research papers on filter pruning:

Dynamic Channel Pruning (Width Pruning)

Whereas layer pruning reduces the "depth" of a model, it is also possible to dynamically prune its "width" by deactivating channels during inference.
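
A rough sketch of the idea follows: a small gating function decides, per input, which hidden channels to compute. The gating matrix, threshold, and layer sizes are illustrative assumptions rather than the mechanism of any specific paper listed below.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 16, 64
    W1 = rng.standard_normal((d_in, d_hidden))
    W2 = rng.standard_normal((d_hidden, d_in))
    W_gate = rng.standard_normal((d_in, d_hidden)) * 0.1   # tiny gating predictor

    def forward_dynamic(x, threshold=0.0):
        """Compute only the hidden channels whose gate score clears the threshold."""
        active = (x @ W_gate).ravel() > threshold    # per-input boolean channel mask
        h = np.maximum(x @ W1[:, active], 0.0)       # inactive channels are never computed
        return h @ W2[active, :], int(active.sum())

    x = rng.standard_normal((1, d_in))
    y, n_active = forward_dynamic(x)
    print("active channels:", n_active, "of", d_hidden)   # typically around half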

  • Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, Yulin Wang, Dynamic Neural Networks: A Survey, Dec 2021, https://arxiv.org/abs/2102.04906
  • Y Sun, J Li, X Xu, 2022, Meta-GF: Training Dynamic-Depth Neural Networks Harmoniously, European Conference on Computer Vision, https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136710691.pdf (Code: https://github.com/SYVAE/MetaGF)
  • Kyunghyun Cho and Yoshua Bengio. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning, 2014, arXiv preprint arXiv:1406.7362, https://arxiv.org/abs/1406.7362
  • Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup, Conditional computation in neural networks for faster models, 2016, ICLR Workshop 2016, https://arxiv.org/abs/1511.06297
  • Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013, arXiv preprint arXiv:1308.3432, https://arxiv.org/abs/1308.3432
  • Andreas Veit and Serge Belongie, Convolutional networks with adaptive inference graphs, In ECCV, 2018, https://arxiv.org/abs/1711.11503
  • Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In ICLR, 2019, https://arxiv.org/abs/1810.05331
  • Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert, Multi-level residual networks from dynamical systems view, In International Conference on Learning Representations, 2018, https://openreview.net/forum?id=SyJS-OgR-
  • Charles Herrmann, Richard Strong Bowen, and Ramin Zabih. Channel selection using Gumbel Softmax. In ECCV, 2020, https://arxiv.org/abs/1812.04180
  • Babak Ehteshami Bejnordi, Tijmen Blankevoort, and Max Welling, Batch-shaping for learning conditional channel gated networks, In ICLR, 2020, https://arxiv.org/abs/1907.06627
  • J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, Slimmable neural networks, in ICLR, 2019, https://arxiv.org/abs/1812.08928
  • Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016, https://arxiv.org/abs/1611.06188
  • Y. Guo, A. Yao, and Y. Chen., Dynamic network surgery for efficient DNNs, 2016, Advances In Neural Information Processing Systems, pages 1379–1387, https://arxiv.org/abs/1608.04493 (Code: https://github.com/yiwenguo/Dynamic-Network-Surgery)
  • Q. Huang, K. Zhou, S. You, and U. Neumann, “Learning to prune filters in convolutional neural networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 709–718, https://arxiv.org/abs/1801.07365
  • Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang, Nicholas Synovic, Xuecen Zhang, Vipin Chaudhary, George K. Thiruvathukal, Yung-Hsiang Lu, Oct 2023, An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions, https://arxiv.org/pdf/2310.07782.pdf, Code: https://github.com/PurdueCAM2Project/focused-convolutions
  • Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of an input image, which is analogous to channel pruning or width pruning.)
  • Dieter Verbruggen, Sofie Pollin, Hazem Sallouha, 6 May 2024, Computational Efficient Width-Wise Early Exits in Modulation Classification, https://arxiv.org/abs/2405.03222
  • Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, so thereby pruning a lot of the V cache memory loads and corresponding calculations.)
  • Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Snehasis Dey, 2024, Differentiable Slimming for Transformers with Improved Memory Efficiency, College of Engineering Bhubaneswar, Odisha, https://ijte.in/pdf/EE14.pdf (Dual pruning by attention head pruning for slimmable networks combined with layer pruning.)
  • Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
  • Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann, 2024, Navigating Scaling Laws: Compute Optimality in Adaptive Model Training https://openreview.net/pdf?id=3KxPo62PYn (Evaluates some model properties, such as width, on vision Transformers from the point of view of the scaling laws.)
  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Computes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
  • Weijie Chen, Yuan Zhang, Di Xie, and Shiliang Pu. 2019. A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3355–3362. https://arxiv.org/abs/1812.06611 (Layerwise dynamic structural pruning of unimportant neurons.)
  • Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
  • Jiahui Yu and Thomas S Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pages 1803–1811. https://arxiv.org/abs/1903.05134 Code: https://github.com/JiahuiYu/slimmable_networks
  • Hankook Lee and Jinwoo Shin. 2018. Anytime neural prediction via slicing networks vertically. arXiv preprint arXiv:1807.02609. https://arxiv.org/abs/1807.02609 (Training multiple thin networks.)
  • Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014 https://arxiv.org/abs/1412.6550
  • Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, July 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, https://arxiv.org/abs/2208.00483
  • Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush, 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT 2018, https://www.aclweb.org/anthology/W18-2715
  • Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song, 5 Feb 2024, Shortened LLaMA: A Simple Depth Pruning for Large Language Models, https://arxiv.org/abs/2402.02834 (Analysis of dual pruning with combined depth and width pruning.)
  • Peter Belcak, Roger Wattenhofer, 21 Nov 2023, Exponentially Faster Language Modelling, https://arxiv.org/abs/2311.10770 (Conditionally executes only a fraction of active neurons, which is effectively width pruning.)
  • M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023 Sharcs: Efficient transformers through routing with dynamic width sub-networks https://arxiv.org/pdf/2310.12126.pdf (Direct queries to subnetworks with different widths.)
  • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Note: the code uses the PyTorch nvFuser deep learning compiler, which is now deprecated.)
  • Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical Transformers Are More Efficient Language Models. arxiv:2110.13711[cs], April 2022. URL http://arxiv.org/abs/2110.13711
  • Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale Efficiently: Insights from Pretraining and Finetuning Transformers. In International Conference on Learning Representations, September 2021. URL https://openreview.net/forum?id=f2OYVD
  • Noam Wies, Yoav Levine, Daniel Jannai, and Amnon Shashua. Which transformer architecture fits my data? A vocabulary bottleneck in self-attention. In Proceedings of the 38th International Conference on Machine Learning, pp. 11170–11181. PMLR, July 2021. URL https://proceedings.mlr.press/v139/wies21a.html
  • Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
  • T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width-and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
  • David Spuler, March 2024, Chapter 48. Width Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Basar Kutukcu, Sabur Baidya, Sujit Dey, 2024, SLEXNet: Adaptive Inference Using Slimmable Early Exit Neural Networks, https://doi.org/10.1145/3689632 https://dl.acm.org/doi/pdf/10.1145/3689632 (Combined width and depth pruning with slimmable and early exit networks.)
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 8 Jul 2024 (v2), Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965
  • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
  • Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
  • Tuowei Wang, Ruwen Fan, Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren, 29 Oct 2024 (v2), Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management, https://arxiv.org/abs/2410.19274
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
