Aussie AI
Magnitude Pruning
Last Updated 3 December, 2024
by David Spuler, Ph.D.
Magnitude pruning zeroes weights that have a small magnitude, meaning a small numeric absolute value. Put another way, it is the removal of near-zero weights, whether positive or negative.
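As a minimal sketch in C++ (the function name and flat array layout here are illustrative assumptions, not code from any of the papers below), the core operation is a single pass over the weights:

    #include <cmath>
    #include <cstddef>

    // Zero any weight whose absolute value falls below the threshold.
    void magnitude_prune(float* weights, std::size_t n, float threshold) {
        for (std::size_t i = 0; i < n; ++i) {
            if (std::fabs(weights[i]) < threshold) {
                weights[i] = 0.0f;  // near-zero weight, positive or negative
            }
        }
    }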
Magnitude pruning is the simplest type of unstructured pruning. In its pure form, any weight in the whole model may be pruned, regardless of which structure it belongs to. It can also be combined with structural pruning by limiting pruning to particular structural units of the model.
Magnitude pruning can be performed during training or after training. Post-training magnitude pruning is conceptually similar to quantization, in that it creates a new model with changed weights. Re-training after pruning is sometimes required to recover accuracy, but it may also be avoided.
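For the post-training case, one common recipe is to choose the threshold from the distribution of weight magnitudes so that a target fraction of weights becomes zero. The following C++ sketch assumes the whole model's weights sit in one flat vector; the function name and layout are illustrative assumptions, not a definitive implementation:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Globally prune the smallest-magnitude weights so that roughly
    // `sparsity` (e.g., 0.9) of all weights become zero.
    void global_magnitude_prune(std::vector<float>& weights, double sparsity) {
        if (weights.empty() || sparsity <= 0.0) return;
        std::vector<float> mags(weights.size());
        for (std::size_t i = 0; i < weights.size(); ++i) {
            mags[i] = std::fabs(weights[i]);
        }
        // The k-th smallest magnitude becomes the pruning threshold.
        std::size_t k = static_cast<std::size_t>(sparsity * (mags.size() - 1));
        std::nth_element(mags.begin(), mags.begin() + k, mags.end());
        const float threshold = mags[k];
        for (float& w : weights) {
            if (std::fabs(w) <= threshold) {
                w = 0.0f;  // ties may prune slightly more than the target
            }
        }
    }

Running the same routine separately on each layer's weights gives layerwise thresholds rather than one global cutoff, a variation noted in several of the papers below.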
Research on Magnitude Pruning
Research papers on magnitude pruning include:
- Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017. https://arxiv.org/abs/1710.01878
- Abigail See, Minh-Thang Luong, and Christopher D. Manning. Compression of neural machine translation models via pruning. In CoNLL, pages 291–301. ACL, 2016, https://arxiv.org/abs/1606.09274
- Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017, https://arxiv.org/abs/1704.05119
- Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin. Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/2010.07611
- Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015, https://arxiv.org/abs/1506.02626 (Iterative pruning and re-training.)
- J. Back, N. Ahn, J. Kim, 2023, Magnitude Attention-based Dynamic Pruning, arXiv preprint arXiv:2306.05056, https://arxiv.org/abs/2306.05056
- Manas Gupta, Efe Camci, Vishandi Rudy Keneta, Abhishek Vaidyanathan, Ritwik Kanodia, Chuan-Sheng Foo, Wu Min, and Lin Jie. Is complexity required for neural network pruning? a case study on global magnitude pruning. arXiv preprint arXiv:2209.14624, 2022, https://arxiv.org/abs/2209.14624
- U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen, “Rigging the lottery: Making all tickets winners,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. Daumé III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 2943–2952. https://proceedings.mlr.press/v119/evci20a.html, https://arxiv.org/abs/1911.11134
- N. Strom, “Sparse connection and pruning in large dynamic artificial neural networks,” 1997. PDF: https://www.nikkostrom.com/publications/euro97/euro97.pdf
- S. Park, J. Lee, S. Mo, and J. Shin, “Lookahead: a far-sighted alternative of magnitude-based pruning,” International Conference on Learning Representations, 2020. https://arxiv.org/abs/2002.04809, Code: https://github.com/alinlab/lookahead_pruning
- Thimm, G. & Fiesler, E., 1995, “Evaluating pruning methods”, in 1995 International Symposium on Artificial Neural Networks, Proc. ISANN '95, pp. A2 20-25, National Chiao-Tung University, Hsinchu, Taiwan, 1995. https://www.semanticscholar.org/paper/Evaluating-pruning-methods-Thimm-Fiesler/80e02a91b0645d076e9584a266978fd322e35f6b
- Z. Wang, Ce Zhu, Zhiqiang Xia, Qi Guo, Y. Liu, 2017, Towards thinner convolutional neural networks through gradually global pruning, Computer Science IEEE International Conference on Image Processing, https://arxiv.org/abs/1703.09916v1 (Pruning a percentage of weights across layers.)
- Yu-xin Zhang, Mingbao Lin, et al. (incl. Rongrong Ji), 2021, Efficient Weight Pruning using Pre-trained Lottery Jackpots, https://arxiv.org/abs/2104.08700v3
- L. Prechelt, 1997, Adaptive parameter pruning in neural networks, https://www.researchgate.net/publication/2283202_Adaptive_Parameter_Pruning_in_Neural_Networks (Adapting the pruning method during training.)
- T. Gale, E. Elsen, and S. Hooker. 2019. The state of sparsity in deep neural networks. arXiv preprint 1902.09574, https://arxiv.org/abs/1902.09574
- Frantar, E.; and Alistarh, D. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774. https://arxiv.org/abs/2301.00774
- F. Manessi, A. Rozza, S. Bianco, P. Napoletano and R. Schettini, "Automated pruning for deep neural network compression", Proc. 24th Int. Conf. Pattern Recognit. (ICPR), pp. 657-664, Aug. 2018. https://arxiv.org/abs/1712.01721 (Magnitude pruning with layerwise thresholds.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-Based Weight Pruning. Association for Computing Machinery, New York, NY, USA, 907–922. https://doi.org/10.1145/3373376.3378534 (Pattern-based pruning method.)
- David Spuler, March 2024, Chapter 33. Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Y. Guo, A. Yao, and Y. Chen. 2016. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379–1387, https://arxiv.org/abs/1608.04493
- M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, 2019, “Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning,” Electronics, vol. 8, no. 11, p. 1321, Nov. 2019. https://www.mdpi.com/2079-9292/8/11/1321
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Joo Hyung Lee, Wonpyo Park, Nicole Elyse Mitchell, Jonathan Pilault, Johan Samir Obando Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Woohyun Han, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart J.C. Bik, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci, 2024, Jaxpruner: A Concise Library for Sparsity Research, Conference on Parsimony and Learning, PMLR 234:515-528, https://proceedings.mlr.press/v234/lee24a.html https://proceedings.mlr.press/v234/lee24a/lee24a.pdf https://openreview.net/forum?id=H2rCZCfXkS https://openreview.net/pdf?id=H2rCZCfXkS
- David Spuler, March 2024, Magnitude Pruning, in Generative AI in C++, https://www.aussieai.com/book/ch33-magnitude-pruning
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- M. Xu, D. Cai, W. Yin, S. Wang, X. Jin, X. Liu, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, ACM Computing Surveys, https://dl.acm.org/doi/pdf/10.1145/3706418
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end)
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Pruning Research
Read more about:
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning