Aussie AI

LLM Model Pruning Research

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Model pruning is a type of model compression for Large Language Models (LLMs) and other neural networks. Conceptually, this involves removal of weights within the model, which reduces the total computation required for model inference (i.e. "running" the AI model to generate a response).

Types of Model Pruning

The top-level classification is whether pruning is applied to the model itself, or performed during inference.

  • Static pruning. Changing the model during or after training, but not during inference.
  • Dynamic pruning. Inference decisions that effectively prune parts of the network.

There is also a very general classification according to strategy:

  • Unstructured pruning. Removing low-magnitude weights ("magnitude pruning") regardless of where they sit in the structure of the model.
  • Structured pruning. Everything else, which means removing groups of weights belonging to a specific part of the model's structure (e.g. all weights in a layer, channel, or filter).

Note that a hybrid strategy is to do unstructured magnitude pruning of all the weights in one particular type of structure, rather than removing an entire structural component (e.g. layer-specific magnitude pruning).

Unstructured Pruning

Weight pruning or "magnitude pruning" is unstructured pruning that removes very low weights, including small negative weights. Technically, weight pruning is the general class of pruning decisions made about a single weight at a time, whereas magnitude pruning is the specific case of pruning weights whose absolute value falls below some small near-zero threshold. Magnitude pruning also obviously removes zero weights (see also zero skipping), and indeed the nebulous oddity called "negative-zero" if that is found. Read more about magnitude pruning.
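The core operation is simple enough to sketch in a few lines. Here is a minimal illustration in Python/NumPy (the function name and the threshold value are purely illustrative); note that comparing absolute values also catches negative-zero, since abs(-0.0) is 0.0:

    import numpy as np

    def magnitude_prune(weights: np.ndarray, threshold: float = 0.01) -> np.ndarray:
        """Zero out every weight whose absolute value is below the threshold."""
        pruned = weights.copy()
        pruned[np.abs(pruned) < threshold] = 0.0   # also catches 0.0 and -0.0
        return pruned

    W = 0.05 * np.random.randn(4, 4)
    W_pruned = magnitude_prune(W, threshold=0.01)
    print(f"Sparsity: {np.mean(W_pruned == 0.0):.0%}")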

Note that quantization techniques can also increase the number of zero weights throughout the quantized model. Quantization may round low-magnitude weights down to zero, because only a smaller number of discrete weight values is representable. However, the effect depends on the level of quantization, and on the choices made about rounding versus truncation in the quantization method.
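As an illustration of that point, here is a toy symmetric round-to-nearest quantizer (the scale value is arbitrary): rounding zeroes any weight smaller than half the quantization scale, whereas truncation (np.trunc) would zero anything smaller than the full scale.

    import numpy as np

    def quantize_round(weights: np.ndarray, scale: float) -> np.ndarray:
        """Symmetric round-to-nearest quantization; small weights collapse to zero."""
        return np.round(weights / scale) * scale

    W = np.array([0.004, -0.003, 0.02, -0.5])
    print(quantize_round(W, scale=0.01))   # approx. [ 0.  -0.   0.02  -0.5 ]
    # Note the "-0." — quantization can even produce the negative-zero mentioned above.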

The idea is simply that weights with a low magnitude contribute little to the overall inference, and hence skipping these calculations should not greatly affect the overall accuracy of the model. Both theory and practice show that this works quite well. This is the simplest and most longstanding type of pruning, and it has a large body of research.

An important practical caveat of unstructured pruning is that it yields no latency benefit unless the multiplications by zero in the vector dot products can actually be avoided. For example, if a vector with some zeroed elements is still sent whole to an accelerator, any speedup depends on the characteristics of the hardware, or on the algorithms used by the deep learning compiler (or graph compiler), and the latency improvement may be limited.
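The following toy comparison (illustrative only) shows why: a dense dot product performs the same number of multiplications whether or not the weights are zero, while a sparse index/value representation actually skips them.

    import numpy as np

    def dense_dot(w, x):
        # Multiplies every element, including the pruned (zero) weights.
        return sum(wi * xi for wi, xi in zip(w, x))

    def sparse_dot(indices, values, x):
        # Stores only the nonzero weights as (index, value) pairs; zeros cost nothing.
        return sum(v * x[i] for i, v in zip(indices, values))

    w = np.array([0.0, 0.7, 0.0, 0.0, -1.2])   # 60% sparse after pruning
    x = np.random.randn(5)
    nz = np.nonzero(w)[0]
    assert np.isclose(dense_dot(w, x), sparse_dot(nz, w[nz], x))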

In addition to the simplistic idea of cutting all low weights throughout the model, much research has been done on various techniques to increase the number of such low weights, by making the weight matrices sparser. Research on such issues is addressed under topics such as "magnitude pruning", "sparsification" and "sparse matrix" theory.

Movement Pruning

Another type of unstructured pruning is "movement pruning". This is a pruning of weights during training, based on how they change during the training process. Weights that are "moving" towards zero as training progresses are removed (pruned), whereas weights "moving away from zero" are retained. Note that this relates to both positive and negative weights (with opposite movement directions). It's also possible to combine a movement threshold and an absolute magnitude threshold, so that important weights with large absolute values are not subsequently removed if they change a tiny amount towards zero. Overall, this means movement pruning focuses on the changes to weights as they are trained, rather than on their absolute value at the end of training.
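As a rough sketch of the intuition (the published movement pruning method learns importance scores during fine-tuning; the simplified version below just accumulates -gradient × weight, which grows for weights moving away from zero under gradient descent):

    import numpy as np

    def update_movement_scores(scores, weights, grads):
        # Gradient descent moves a weight away from zero when the gradient has
        # the opposite sign to the weight, making -grad * weight positive; a
        # weight being pushed toward zero accumulates a negative score instead.
        return scores + (-grads * weights)

    def movement_prune(weights, scores, sparsity=0.5):
        # Zero the weights with the lowest accumulated movement scores.
        cutoff = np.quantile(scores, sparsity)
        pruned = weights.copy()
        pruned[scores < cutoff] = 0.0
        return pruned

    rng = np.random.default_rng(0)
    W = rng.standard_normal(10)
    S = np.zeros_like(W)
    for _ in range(100):
        grads = 0.1 * rng.standard_normal(10)   # stand-in for real gradients
        S = update_movement_scores(S, W, grads)
    print(movement_prune(W, S))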

Research papers on movement pruning:

Semi-Structured Pruning

Semi-structured pruning is part-way between structured and unstructured pruning. It involves zeroing some weights in an unstructured manner, but doing so only within particular structures, or according to various patterns or limitations.
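A well-known example is N:M sparsity, such as the 2:4 pattern supported by recent NVIDIA hardware: within every group of four consecutive weights, at most two may be nonzero. A minimal sketch (assuming the weight count is divisible by four):

    import numpy as np

    def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
        """Keep the 2 largest-magnitude weights in each group of 4 (2:4 sparsity)."""
        groups = weights.reshape(-1, 4)
        # Indices of the two smallest magnitudes within each group of four.
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]
        out = groups.copy()
        np.put_along_axis(out, drop, 0.0, axis=1)
        return out.reshape(weights.shape)

    W = np.random.randn(2, 8)
    print(prune_2_of_4(W))   # every run of 4 weights now contains exactly 2 zeros

This regularity is what makes the pattern hardware-friendly: the accelerator knows in advance that half the weights in every block can be skipped.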

  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325

Types of Structured Pruning

The more specific types of structured pruning depend on which type of "structure" contains the weights being pruned:

Another technique similar to structured pruning is to cut weights by using smaller matrices. Advanced matrix algebra can be used to factorize a large matrix into smaller "low-rank" matrices with fewer rows and columns (and hence fewer weights); see matrix algebra.
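A minimal sketch of the idea using truncated SVD (the rank and matrix sizes here are illustrative):

    import numpy as np

    def low_rank_factorize(W: np.ndarray, rank: int):
        """Approximate W (m x n) as A @ B, with A (m x rank) and B (rank x n)."""
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]   # fold the singular values into A
        B = Vt[:rank, :]
        return A, B

    W = np.random.randn(512, 512)
    A, B = low_rank_factorize(W, rank=64)
    # Parameters: 512*512 = 262,144 versus 2*512*64 = 65,536 (a 4x reduction).
    print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))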

Hybrid pruning

There are various hybrid pruning strategies that combine two types of pruning (or combine pruning with non-pruning optimizations such as quantization). For example, you can prune both depth and width at the same time (called "dual pruning"), and there's also "triple pruning" of depth, width, and length. Or you can combine structured and unstructured pruning: e.g. select a particular structure of the model (such as a layer), but then only do unstructured weight pruning (magnitude pruning) within that structure, so that only low-value unimportant weights are pruned. This is an area of intensive ongoing research, and many hybrid pruning strategies are being tested.
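A toy sketch of that last combination, using mean absolute weight as a (deliberately naive, assumed) layer-importance proxy; real methods use far more careful importance estimates:

    import numpy as np

    def hybrid_prune(layer_weights, layer_keep=0.5, weight_sparsity=0.5):
        """Structured step: drop whole layers with the lowest mean |weight|.
        Unstructured step: magnitude-prune inside the surviving layers."""
        importance = [np.mean(np.abs(W)) for W in layer_weights]
        n_keep = max(1, int(len(layer_weights) * layer_keep))
        keep_ids = sorted(int(i) for i in np.argsort(importance)[-n_keep:])
        pruned = []
        for i in keep_ids:
            W = layer_weights[i].copy()
            cutoff = np.quantile(np.abs(W), weight_sparsity)
            W[np.abs(W) < cutoff] = 0.0
            pruned.append(W)
        return keep_ids, pruned

    rng = np.random.default_rng(0)
    layers = [s * rng.standard_normal((16, 16)) for s in (1.0, 0.1, 0.8, 0.05)]
    kept, pruned = hybrid_prune(layers)
    print(kept)   # -> [0, 2]: the two low-magnitude layers are dropped entirely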

Multidimensional Pruning

There are many types of pruning, and some are orthogonal to each other. This means it is possible, and desirable, to prune a model along multiple dimensions at once. Some of the types include:

Quadruple pruning? Is there a way to do four? For example, can you combine the four dimensions:

  • Depth: layer
  • Width: attention heads, neurons, or FFNs
  • Length: tokens, and
  • Model dimension: embeddings dimension

Survey Papers on Model Pruning

Review papers with coverage of pruning include:

Model Pruning General Research

Model pruning is a long-standing technique for model compression, with extensive literature and numerous types of pruning.

Structured Pruning

Research papers on structured pruning:

  • Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Abigail See, Minh-Thang Luong, and Christopher D. Manning. Compression of neural machine translation models via pruning. In CoNLL, pages 291–301. ACL, 2016 https://arxiv.org/abs/1606.09274
  • Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017, https://arxiv.org/abs/1704.05119
  • Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015, https://arxiv.org/abs/1506.02626
  • Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar, 9 Feb 2024 (v2), Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes, https://arxiv.org/abs/2402.05406 Code: https://github.com/ldery/Bonsai (Structured pruning of very large LLMs down to 1B or 2B.)
  • Zhepeng Wang, Isaacshubhanand Putla, Weiwen Jiang, Youzuo Lin, Oct 2023, Edge-InversionNet: Enabling Efficient Inference of InversionNet on Edge Devices, https://arxiv.org/abs/2310.09667 (Using structured pruning via layerwise filter pruning to run a model on a Raspberry Pi.)
  • H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
  • David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 16 Jul 2024, MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models, https://arxiv.org/abs/2407.11681
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
  • Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
  • Meta, August 14, 2024, How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models, Meta AI Blog, https://ai.meta.com/blog/nvidia-llama/
  • Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 19 Jul 2024, Compact Language Models via Pruning and Knowledge Distillation, https://arxiv.org/abs/2407.14679 https://github.com/NVlabs/Minitron (Combination of distillation and structured pruning on the depth and width dimensions.)
  • Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 26 Aug 2024 (v2), LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796
  • Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
  • Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
  • Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
  • Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Haihang Wu, 9 Dec 2024, LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation, https://arxiv.org/abs/2412.06419

Block Pruning

Block pruning is structured pruning at a very fine granularity, at the vector level or below. A "block" is a small chunk of a tensor, either a sub-sequence of a vector or a small rectangular region (also called a "tile"), and pruning model weights at this level of granularity is called "block pruning."
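A minimal sketch (assuming the matrix dimensions divide evenly by the tile size): score each tile by its Frobenius norm and zero out the weakest tiles.

    import numpy as np

    def block_prune(W: np.ndarray, block=(4, 4), sparsity=0.5) -> np.ndarray:
        """Zero out whole tiles of W with the smallest Frobenius norms."""
        bh, bw = block
        h, w = W.shape
        tiles = W.reshape(h // bh, bh, w // bw, bw)
        norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))    # one norm per tile
        cutoff = np.quantile(norms, sparsity)
        mask = (norms >= cutoff)[:, None, :, None]        # broadcast over tiles
        return (tiles * mask).reshape(h, w)

    W = np.random.randn(16, 16)
    print(block_prune(W, block=(4, 4), sparsity=0.5))

Because each zeroed region is a contiguous tile, a GEMM kernel can skip the whole block, which is what gives this approach its speed advantage over unstructured sparsity.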

Research papers on block pruning:

Vector Pruning

Vector pruning is structured pruning at a very fine granularity: the level of whole vectors. Block pruning is slightly lower granularity, since a block can be a sub-portion of a vector.
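A minimal sketch, pruning whole rows of a weight matrix by their L2 norms (the sparsity level is illustrative):

    import numpy as np

    def vector_prune(W: np.ndarray, sparsity=0.25) -> np.ndarray:
        """Zero out the entire rows of W with the smallest L2 norms. Unlike
        unstructured pruning, whole zeroed rows can simply be skipped (or
        physically removed) in the matrix-vector product."""
        norms = np.linalg.norm(W, axis=1)
        cutoff = np.quantile(norms, sparsity)
        return np.where((norms >= cutoff)[:, None], W, 0.0)

    W = np.random.randn(8, 16)
    print(vector_prune(W))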

Research papers on vector pruning:

  • David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
  • Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
  • Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)

Dynamic Pruning

Dynamic pruning refers to pruning of network weights, links, or entire layers at runtime during inference. This differs from "static pruning," which is done offline, either during training or in a post-training optimization that creates a modified model. The types of dynamic pruning may include:

Note that all types of dynamic pruning incur some extra inference cost from the calculations that decide whether or not to prune. The hope is that the benefit of pruning will exceed the cost of the decision logic. For example, an "early exit" criterion for layer pruning requires extra computation at each layer, which is hopefully recouped by skipping layers often enough.
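A toy sketch of that trade-off for early exit (everything here is illustrative: the layers are stand-ins for transformer layers, and the confidence threshold is set artificially low so the demo actually exits early):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def forward_with_early_exit(x, layers, exit_head, threshold):
        # After each layer, a small "exit head" estimates confidence; this is
        # the extra per-layer decision cost. Once confidence clears the
        # threshold, all remaining layers are skipped.
        probs = None
        for i, layer in enumerate(layers):
            x = np.tanh(layer @ x)             # stand-in for a transformer layer
            probs = softmax(exit_head @ x)     # the cheap decision computation
            if probs.max() >= threshold:
                return probs, i + 1
        return probs, len(layers)

    rng = np.random.default_rng(0)
    layers = [0.1 * rng.standard_normal((64, 64)) for _ in range(12)]
    exit_head = 0.1 * rng.standard_normal((10, 64))
    probs, used = forward_with_early_exit(
        rng.standard_normal(64), layers, exit_head, threshold=0.15)
    print(f"Exited after {used} of 12 layers")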

Research Papers on Dynamic Pruning

Some general research on dynamic or adaptive pruning methods:

Combined Dynamic Width/Depth Pruning (Dual Pruning)

It's possible to simultaneously prune the depth (e.g. layer pruning, early exit) and the width (e.g. width pruning, channel pruning, token pruning, slimming networks). This is also called "hybrid pruning" or "dual pruning". See more research papers on dual pruning and triple pruning.

More Research on Pruning Types

More AI Pruning Research

Read more about: