Aussie AI
LLM Model Pruning Research
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Model pruning is a type of model compression for Large Language Models (LLMs) and other neural networks. Conceptually, this involves removal of weights within the model, which reduces the total computation required for model inference (i.e. "running" the AI model to generate a response).
Types of Model Pruning
The top-level classification is based on whether pruning is done to the model ahead of time, or during inference.
- Static pruning. Changing the model during or after training, but not during inference.
- Dynamic pruning. Inference decisions that effectively prune parts of the network.
There is also a very general classification according to pruning strategy:
- Unstructured pruning. Removing low-magnitude weights ("magnitude pruning") regardless of where they sit in the structure of the model.
- Structured pruning. Everything else, meaning removal of groups of weights from a specific part of the model's structure (e.g. all weights in a layer, channel, or filter).
Note that a hybrid strategy is to do unstructured magnitude pruning of all the weights in one particular type of structure, rather than removing an entire structural component (e.g. layer-specific magnitude pruning).
Unstructured Pruning
Weight pruning, or "magnitude pruning," is unstructured pruning that removes weights with very small magnitudes, including small negative weights. Technically, weight pruning is the general class of pruning decisions made about a single weight at a time, whereas magnitude pruning is the specific case of cutting weights whose absolute value falls below some near-zero threshold. Magnitude pruning also removes exactly-zero weights (see also zero skipping), including the floating-point oddity of "negative zero" if it occurs. Read more about magnitude pruning.
Note that quantization techniques can also increase the number of zero weights throughout the quantized model: with fewer discrete values available, quantization may round low-magnitude weights down to zero. How often this happens depends on the level of quantization and on whether the quantization method rounds or truncates.
The idea is simply that weights with a low magnitude are not contributing much to the overall inference, so skipping these calculations should not greatly affect the overall accuracy of the model, and the research literature broadly bears this out. This is the simplest and most longstanding type of pruning, and it has a large body of research.
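Here is a minimal NumPy sketch of threshold-based magnitude pruning; the threshold value is arbitrary and for illustration only:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out all weights whose absolute value falls below the threshold."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

# Example: prune a small weight matrix at an illustrative threshold of 0.05.
W = np.array([[0.40, -0.02, 0.00], [0.03, -0.75, 0.01]])
print(magnitude_prune(W, 0.05))
# [[ 0.4   0.    0.  ]
#  [ 0.   -0.75  0.  ]]
```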
An important practical aspect of unstructured pruning is that it yields no latency benefit unless multiplications by zero in vector dot products can be efficiently avoided. For example, if a vector with some zeroed elements is still sent whole to an accelerator, any benefit depends on the characteristics of the hardware, or on the algorithms used by the deep learning compiler (or graph compiler), and the latency improvement may be limited.
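As a toy illustration of zero skipping, the following sketch computes a dot product over only the nonzero weights. In practice, getting a real speedup this way requires a sparse storage format (e.g. CSR) and kernels or hardware that exploit it:

```python
import numpy as np

def sparse_dot(w: np.ndarray, x: np.ndarray) -> float:
    """Dot product that multiplies only the surviving (nonzero) weights."""
    nz = np.nonzero(w)[0]              # indices of nonzero weights
    return float(np.dot(w[nz], x[nz]))

w = np.array([0.4, 0.0, -0.75, 0.0, 0.0, 0.2])     # pruned weight vector
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # activations
assert np.isclose(sparse_dot(w, x), np.dot(w, x))  # same result, fewer multiplies
```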
In addition to the simplistic idea of cutting all low weights throughout the model, much research has been done on various techniques to increase the number of such low weights, by making the weight matrices sparser. Research on such issues is addressed under topics such as "magnitude pruning", "sparsification" and "sparse matrix" theory.
Movement Pruning
Another type of unstructured pruning is "movement pruning". This is a pruning of weights during training, based on how they change during the training process. Weights that are "moving" towards zero as training progresses are removed (pruned), whereas weights "moving away from zero" are retained. Note that this relates to both positive and negative weights (with opposite movement directions). It's also possible to combine a movement threshold and an absolute magnitude threshold, so that important weights with large absolute values are not subsequently removed if they change a tiny amount towards zero. Overall, this means movement pruning focuses on the changes to weights as they are trained, rather than on their absolute value at the end of training.
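Here is a rough sketch of the idea. Note that the published method (Sanh et al., below) learns importance scores from gradients during fine-tuning; this toy version instead approximates "movement" as the change in each weight's magnitude between training checkpoints:

```python
import numpy as np

def movement_scores(snapshots: list) -> np.ndarray:
    """Accumulate how much each weight's magnitude changed over training.
    Positive score: the weight moved away from zero (keep it).
    Negative score: it drifted toward zero (a pruning candidate)."""
    score = np.zeros_like(snapshots[0])
    for prev, curr in zip(snapshots, snapshots[1:]):
        score += np.abs(curr) - np.abs(prev)
    return score

def movement_prune(final_weights: np.ndarray, score: np.ndarray,
                   keep_fraction: float = 0.5) -> np.ndarray:
    """Keep the weights with the highest movement scores; zero the rest."""
    k = max(int(keep_fraction * score.size), 1)
    cutoff = np.sort(score.ravel())[::-1][k - 1]
    return np.where(score >= cutoff, final_weights, 0.0)
```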
Research papers on movement pruning:
- Francois Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838, 2021. https://arxiv.org/abs/2109.04838, Code: https://github.com/huggingface/nn_pruning
- Victor Sanh, Thomas Wolf, and Alexander M Rush. “Movement pruning: Adaptive sparsity by fine-tuning”. In: arXiv preprint arXiv:2005.07683 (2020). https://arxiv.org/abs/2005.07683
- David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
- David Spuler, March 2024, Movement Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch33-movement-pruning
Semi-Structured Pruning
Semi-structured pruning is part-way between structured and unstructured pruning. It involves zeroing some weights in an unstructured manner, but only within particular structures, or according to various patterns or limitations.
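A well-known semi-structured pattern is N:M sparsity, in which at most N weights in every group of M consecutive weights may be nonzero (the 2:4 pattern is supported by some GPU sparse tensor cores). Here is a minimal NumPy sketch of this idea; the function name and example values are illustrative:

```python
import numpy as np

def prune_n_of_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights, zeroing the rest. Assumes weights.size is divisible by m."""
    groups = weights.reshape(-1, m).copy()
    # Indices of the (m - n) smallest-magnitude weights in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(weights.shape)

W = np.array([[0.1, -0.9, 0.3, -0.2], [0.5, 0.05, -0.6, 0.4]])
print(prune_n_of_m(W))   # two survivors per group of four
# [[ 0.  -0.9  0.3  0. ]
#  [ 0.5  0.  -0.6  0. ]]
```

Research papers on semi-structured pruning: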
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
Types of Structured Pruning
The more specific types of structured pruning depend on which type of "structure" contains the weights being pruned:
- Layer pruning
- Early exit (effectively dynamic layer pruning)
- Attention head pruning
- Channel pruning
- Token pruning
- Embeddings pruning
- Block pruning
- Filter pruning
- FFN pruning
- Normalization pruning
Another technique similar to structured pruning is to cut weights by using smaller matrices. Advanced matrix algebra can factorize the large weight matrices into smaller "low-rank" matrices with fewer rows and columns (hence, fewer weights); see matrix algebra.
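As a rough illustration, a rank-r factorization replaces an out-by-in weight matrix with two smaller matrices, cutting the parameter count from out*in to r*(out+in). A minimal SVD-based sketch in NumPy, with arbitrary sizes and rank chosen for illustration:

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate an (out x in) matrix W as A @ B, where A is (out x rank)
    and B is (rank x in), using the truncated singular value decomposition."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
A, B = low_rank_factorize(W, rank=64)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative error at rank 64: {err:.3f}")
```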
Hybrid Pruning
There are various hybrid pruning strategies that combine two types of pruning (or combine pruning with non-pruning optimizations such as quantization). For example, you can prune both depth and width at the same time (called "dual pruning"), and there is also "triple pruning" of depth, width, and length. You can also combine structured and unstructured pruning: e.g., select a particular part of the model (such as a layer), but then apply only unstructured weight pruning (magnitude pruning) within that structure, so that only low-value, unimportant weights are removed. This is an area of intensive ongoing research, and many hybrid pruning strategies are being tested.
Multidimensional Pruning
There are many types of pruning, and some are orthogonal to each other. This means it is possible, and desirable, to prune a model from multiple dimensions. Some of the types include:
- Dual pruning (usually depth and width pruning, such as combining layer pruning and channel pruning).
- Triple pruning
Quadruple pruning? Is there a way to do four? For example, can you combine these four dimensions:
- Depth: layer
- Width: attention heads, neurons, or FFNs
- Length: tokens
- Model dimension: embeddings dimension
Survey Papers on Model Pruning
Review papers with coverage of pruning include:
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various types of pruning.)
- T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
- Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
- J Liu, S Tripathi, U Kurup, M Shah, 2020, Pruning algorithms to accelerate convolutional neural networks for edge applications: A survey, arXiv preprint arXiv:2005.04275, https://arxiv.org/abs/2005.04275
- Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
- K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762, PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
- Yu Cheng; Duo Wang; Pan Zhou; Tao Zhang, 2018, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine (Volume 35, Issue 1, January 2018), https://ieeexplore.ieee.org/document/8253600
- G Menghani, 2023, Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
- L Deng, G Li, S Han, L Shi, Y Xie, 2020, Model compression and hardware acceleration for neural networks: A comprehensive survey, Proceedings of the IEEE (Volume 108, Issue 4, April 2020), https://ieeexplore.ieee.org/abstract/document/9043731
- S Xu, A Huang, L Chen, B Zhang, 2020, Convolutional neural network pruning: A survey 2020 39th Chinese Control Conference (CCC), https://ieeexplore.ieee.org/document/9189610
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- H Wang, C Qin, Y Bai, Y Zhang, Y Fu, 2022, Recent advances on neural network pruning at initialization, arXiv preprint arXiv:2103.06460, 2021 (updated May 2022), https://arxiv.org/abs/2103.06460
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631, https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
Model Pruning General Research
Model pruning is a long-standing technique for model compression, with extensive literature and numerous types of pruning.
- Trim insignificant weights (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning
- Pruning comprehensive guide (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide
- Song Han, Huizi Mao, William J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149v5 [cs.CV], 15 Feb 2016, https://arxiv.org/abs/1510.00149
- Kim Martineau, "A foolproof way to shrink deep learning models", MIT News, April 30, 2020, https://news.mit.edu/2020/foolproof-way-shrink-deep-learning-models-0430
- Jonathan Frankle, Michael Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", March 2019, https://arxiv.org/abs/1803.03635
- Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, Yanzhi Wang, "A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM", November 2018, https://arxiv.org/abs/1811.01907
- Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, M Xia, T Gao, Z Zeng, D Chen, arXiv preprint arXiv:2310.06694, Oct 2023, https://arxiv.org/pdf/2310.06694.pdf, Code: https://github.com/princeton-nlp/LLM-Shearing
- Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
- Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, May 2024, Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation, WWW '24: Companion Proceedings of the ACM Web Conference 2024, May 2024, Pages 235–244, https://doi.org/10.1145/3589335.3648321
- Maolin Wang, Yao Zhao, Jiajia Liu, Jingdong Chen, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, Dec 2023, Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup, https://arxiv.org/abs/2312.05795
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (Shows that the big three model compression techniques work not just for compressing big LLMs, but also for making small models even smaller.)
- Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-Based Weight Pruning. Association for Computing Machinery, New York, NY, USA, 907–922. https://doi.org/10.1145/3373376.3378534 (Pattern-based pruning method.)
- W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39
- T Liang, J Glossner, L Wang, S Shi, X Zhang, 2021, Neurocomputing, Pruning and quantization for deep neural network acceleration: A survey, https://arxiv.org/abs/2101.09671
- Abhiroop Bhattacharjee, Yeshwanth Venkatesha, Abhishek Moitra, Priyadarshini Panda, MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning. In: DAC (2022) https://arxiv.org/abs/2204.05274
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
- Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282
- Gale, T., Elsen, E., and Hooker, S., The state of sparsity in deep neural networks, arXiv preprint arXiv:1902.09574, 2019, https://arxiv.org/abs/1902.09574
- Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A., 2022, A fast post-training pruning framework for transformers, arXiv preprint arXiv:2204.09656, https://arxiv.org/abs/2204.09656
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
- Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5058–5066, https://arxiv.org/abs/1707.06342
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
- Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang, 19 Dec 2023, Fluctuation-based Adaptive Structured Pruning for Large Language Models, https://arxiv.org/abs/2312.11983 Code: https://github.com/CASIA-IVA-Lab/FLAP
- Vladimír Boža, 1 Jan 2024, Fast and Optimal Weight Update for Pruned Large Language Models, https://arxiv.org/abs/2401.02938 Code: https://github.com/fmfi-compbio/admm-pruning (Fast algorithm for fine-tuning after pruning to recover any lost model accuracy efficiently.)
- S Guo, J Xu, LL Zhang, M Yang, Oct 2023, Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models, arXiv preprint arXiv:2310.05015, https://arxiv.org/pdf/2310.05015.pdf Code: https://github.com/microsoft/Moonlit/tree/main/Compresso
- Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu, 11 Jun 2024, MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations, https://arxiv.org/abs/2406.07017 Code: https://github.com/ShiningSord/MoreauPruner
- K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
- Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud, March 2021, Accelerating deep neural networks implementation: A survey, https://doi.org/10.1049/cdt2.12016 PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016
- David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Xinyin Ma, Gongfan Fang, Xinchao Wang, May 2023, LLM-Pruner: On the Structural Pruning of Large Language Models, https://arxiv.org/abs/2305.11627
- J. Choi and S. Venkataramani, 2019, Approximate Computing Techniques for Deep Neural Networks. Cham: Springer, 2019, pp. 307–329, Chapter 15, https://link.springer.com/chapter/10.1007/978-3-319-99322-5_15
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Sharath Sreenivas, Vinh Nguyen, Saurav Muralidharan, Marcin Chochowski and Raviraj Joshi, How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model, Aug 14, 2024, https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
- Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
- Sophia R. Cunningham,Dominique Archambault,Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631, https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
- Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu, 15 Oct 2024, MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router, https://arxiv.org/abs/2410.12013 (Pruning applied to MoE.)
- Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin, 21 Oct 2024, Pruning Foundation Models for High Accuracy without Retraining, https://arxiv.org/abs/2410.15567 https://github.com/piuzha/APT
- Mostafa Hussien, Mahmoud Afifi, Kim Khoa Nguyen, Mohamed Cheriet, 21 Oct 2024, Small Contributions, Small Networks: Efficient Neural Network Pruning Based on Relative Importance, https://arxiv.org/abs/2410.16151
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Structured Pruning
Research papers on structured pruning:
- Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques, with much focus on image and video processing models.)
- Abigail See, Minh-Thang Luong, and Christopher D. Manning. Compression of neural machine translation models via pruning. In CoNLL, pages 291–301. ACL, 2016 https://arxiv.org/abs/1606.09274
- Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017, https://arxiv.org/abs/1704.05119
- Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015, https://arxiv.org/abs/1506.02626
- Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar, 9 Feb 2024 (v2), Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes, https://arxiv.org/abs/2402.05406 Code: https://github.com/ldery/Bonsai (Structured pruning of very large LLMs down to 1B or 2B.)
- Zhepeng Wang, Isaacshubhanand Putla, Weiwen Jiang, Youzuo Lin, Oct 2023, Edge-InversionNet: Enabling Efficient Inference of InversionNet on Edge Devices, https://arxiv.org/abs/2310.09667 (Using structured pruning via layerwise filter pruning to run a model on a Raspberry Pi.)
- H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
- David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- Meta, August 14, 2024, How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models, Meta AI Blog, https://ai.meta.com/blog/nvidia-llama/
- Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
- Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419
Block Pruning
Block pruning is structured pruning at a very fine granularity: the vector level or below. A "block" is a small chunk of a tensor, either a contiguous sub-sequence of a vector or a small rectangular region (also called a "tile"). Pruning model weights at this level of granularity is called "block pruning."
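Here is a minimal NumPy sketch of one form of block pruning, zeroing the tiles with the smallest norms; the block size and keep fraction are illustrative choices:

```python
import numpy as np

def block_prune(W: np.ndarray, block: int = 4, keep_fraction: float = 0.5):
    """Zero out entire (block x block) tiles with the smallest L2 norms.
    Assumes both dimensions of W are divisible by the block size."""
    rows, cols = W.shape
    # View W as a grid of tiles: tiles[i, p, j, q] == W[i*block + p, j*block + q]
    tiles = W.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(tiles, axis=(1, 3))       # one norm per tile
    k = max(int(keep_fraction * norms.size), 1)
    cutoff = np.sort(norms.ravel())[::-1][k - 1]
    mask = (norms >= cutoff)[:, None, :, None]       # broadcast over each tile
    return (tiles * mask).reshape(rows, cols)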
Research papers on block pruning:
- Shikhar Tuli, Niraj K. Jha, EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms, arXiv preprint arXiv:2303.13745, 2023, https://arxiv.org/abs/2303.13745
- Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
- Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similariy in model layers.)
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Anonymous authors, 2024, A deeper look at depth pruning of LLMs, https://openreview.net/pdf?id=9B7ayWclwN
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
- Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma, 27 Sep 2024, Token Caching for Diffusion Transformer Acceleration, https://arxiv.org/abs/2409.18523
- Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
- Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419
Vector Pruning
Vector pruning is structured pruning at a very fine granularity: the level of whole vectors. Block pruning is slightly finer-grained still, since a block can be a sub-portion of a vector.
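Here is a minimal NumPy sketch of vector pruning that zeroes whole rows of a weight matrix by their norms; in a real implementation the surviving rows would be repacked into a smaller dense matrix:

```python
import numpy as np

def prune_rows(W: np.ndarray, keep_fraction: float = 0.75) -> np.ndarray:
    """Zero out the whole rows (vectors) of W with the smallest L2 norms."""
    norms = np.linalg.norm(W, axis=1)
    k = max(int(keep_fraction * W.shape[0]), 1)
    cutoff = np.sort(norms)[::-1][k - 1]
    return W * (norms >= cutoff)[:, None]
```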
Research papers on vector pruning:
- David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
- Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similariy in model layers.)
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
Dynamic Pruning
Dynamic pruning refers to pruning of network weights, links, or entire layers at runtime during inference. This differs from "static pruning," which is done offline, either during training or as a post-training optimization that creates a modified model. The types of dynamic pruning may include:
- Dynamic weight pruning (dynamic unstructured pruning): This means suppressing near-zero weights and computed probabilities dynamically, which creates sparsity, sometimes called "network sparsification". This allows optimizations related to sparse matrices to reduce overall computations. Weight magnitude pruning is often done statically, as in most of the research, but can also be done dynamically during inference.
- Dynamic depth pruning: Skipping of inference of entire layers of the model using an "early exit" of the inference loop. See also depth pruning, layer pruning, layer skipping, layer fusion, and shallow decoders.
- Dynamic width pruning: Dynamically reducing the "width" of the model based on the input. See width pruning, attention head pruning, channel pruning, filter pruning.
- Dynamic length pruning: Adaptive to the input to modify internal dimensions related to tokens, embeddings, etc. See length pruning, token pruning, embeddings pruning, autoregressive algorithms.
Note that all types of dynamic pruning incur some extra inference cost from the calculations that decide whether or not to prune. The hope is that the benefit of pruning will exceed the cost of the decision logic. For example, an "early exit" criterion for layer pruning requires extra computation at each layer, which is hopefully recouped by skipping layers often enough.
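Here is a minimal sketch of that trade-off for early exit; `layers` and `exit_head` are hypothetical callables standing in for the model's layers and a small exit classifier, and the confidence threshold is illustrative:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

def run_with_early_exit(layers, exit_head, x, confidence=0.9):
    """Run the layer stack, but after each layer ask a cheap classifier head
    whether the intermediate state is already confident enough to stop.
    The per-layer check is the overhead that skipping the remaining layers
    must repay. Assumes at least one layer."""
    for i, layer in enumerate(layers):
        x = layer(x)
        probs = softmax(exit_head(x))
        if probs.max() >= confidence:
            return probs, i + 1   # exited early after i+1 layers
    return probs, len(layers)     # no exit triggered; ran the full stack
```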
Research Papers on Dynamic Pruning
Some general research on dynamic or adaptive pruning methods:
- Yiwen Guo, Anbang Yao, and Yurong Chen. 2016, Dynamic network surgery for efficient DNNs. Advances in neural information processing systems, 29, 2016. https://arxiv.org/abs/1608.04493 (Dynamic pruning.)
- Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020, Dynamic model pruning with feedback. In International Conference on Learning Representations, https://arxiv.org/abs/2006.07253
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- H Wang, C Qin, Y Bai, Y Zhang, Y Fu, 2022, Recent advances on neural network pruning at initialization, arXiv preprint arXiv:2103.06460, 2021 (updated May 2022), https://arxiv.org/abs/2103.06460
- Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 22 Jan 2024, APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, https://arxiv.org/abs/2401.12200
- David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Jinting Chen, Zhaocheng Zhu, Cheng Li, Yuming Zhao, Oct 2019, Self-Adaptive Network Pruning, https://arxiv.org/abs/1910.08906
- Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4814–4823. https://aclanthology.org/2021.findings-acl.425/ PDF: https://aclanthology.org/2021.findings-acl.425.pdf
- Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, July 2024, APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:60812-60831, 2024. https://proceedings.mlr.press/v235/zhao24g.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/zhao24g/zhao24g.pdf
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- David Spuler, March 2024, Static vs Dynamic Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch33-static-vs-dynamic-pruning
Combined Dynamic Width/Depth Pruning (Dual Pruning)
It's possible to simultaneously prune the depth (e.g. layer pruning, early-exit) and width (i.e., width pruning, channel pruning, token pruning, slimming networks, etc.). This is also called "hybrid pruning" or "dual pruning". See more research papers on dual pruning and triple pruning.
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Pruning Research
Read more about:
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning
- « Research Home