Aussie AI
Triple Axis Pruning
-
Last Updated 25 September, 2024
-
by David Spuler, Ph.D.
Structured pruning methods are often categorized according to the model dimension that they aim to reduce. Weights can be structurally pruned along the three major axes of a model: depth, width, and length.
- Depth pruning. The weights are pruned by removing layers to make the model "shallower". Techniques include layer pruning, inference loop early exit, and "shallow decoder" Transformer architectures. Note that choosing the model meta-parameter of the number of layers via neural architecture search (NAS) is conceptually very similar to static layer pruning. Also, dynamic early exit with a decision condition based only on a fixed number of layers (e.g., always exit after 10 layers) is effectively static layer pruning, but with wasted storage space for the unused layers of weights.
- Width pruning. The "width" of the model is the fan-out of the incoming embedding data across multiple attention heads or internal neural nodes. Width pruning is sometimes called "thinning" or "slimming" of the model (see slimmable networks). Width pruning strategies include attention head pruning, filter pruning, and channel pruning. Read more about: width pruning.
- Length pruning. The third dimension of the model is the model size, which determines the fixed size of the vectors (embeddings) that propagate through the width and depth of the model. Note that choosing the meta-parameters of embedding size and context window (e.g., via NAS) is conceptually similar to static length pruning. Length pruning strategies include token pruning and embeddings pruning. Also related is autoregression research. Of the three axes, length pruning has had the least research. Read more about: length pruning.
Note that "length" is mainly applicable to text transformers. In vision transformers, the third dimension is the image, or patches of the image.
Triple Axis Pruning Research
Research papers on "triple pruning" (see also dual pruning research):
- W. Wang, M. Chen, S. Zhao, L. Chen, J. Hu, H. Liu, D. Cai, X. He, and W. Liu, 2020, Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework, https://arxiv.org/abs/2010.04879
- J Guo, J Liu, D Xu, 2021, JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing, IEEE Transactions on Circuits and Systems for Video Technology (Volume 32, Issue 6, June 2022), https://ieeexplore.ieee.org/abstract/document/9516010/
- H Kong, X Luo, S Huai, D Liu, 2023, EMNAPE: Efficient Multi-Dimensional Neural Architecture Pruning for EdgeAI, 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), https://ieeexplore.ieee.org/document/10137122, https://dr.ntu.edu.sg/bitstream/10356/167488/2/camera-ready-DATE.pdf (Triple-pruning algorithm with comparison to various dual pruning algorithms.)
- Z Hou, SY Kung, Multi-dimensional model compression of vision transformer, IEEE International Conference on Multimedia and Expo (ICME), 2022, https://arxiv.org/pdf/2201.00043 (Pruning of attention heads, neurons, and sequence dimensions jointly.)
- Zechun Liu, Xiangyu Zhang, Zhiqiang Shen, Zhe Li, Yichen Wei, Kwang-Ting Cheng, Jian Sun, Sep 2021, Joint Multi-Dimension Pruning via Numerical Gradient Update, https://arxiv.org/abs/2005.08931
- Zechun Liu, Xiangyu Zhang, Zhiqiang Shen, Zhe Li, Yichen Wei, Kwang-Ting Cheng, and Jian Sun, 2020, “Joint multi-dimension pruning,” arXiv preprint arXiv:2005.08931, https://arxiv.org/abs/2005.08931v1
- Z Hou, SY Kung, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/papers/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.pdf
- Mingxing Tan and Quoc V. Le, 2019, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, arXiv preprint arXiv:1905.11946, https://arxiv.org/abs/1905.11946 Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
- David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- S. Kim, J. Lee, S. Kang, D. Han, W. Jo and H. -J. Yoo, 2022, TSUNAMI: Triple Sparsity-Aware Ultra Energy-Efficient Neural Network Training Accelerator With Multi-Modal Iterative Pruning, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 4, pp. 1494-1506, April 2022, doi: 10.1109/TCSI.2021.3138092, https://ieeexplore.ieee.org/abstract/document/9669119
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
3D CNN Model Pruning
This is not quite what is meant by "triple axis pruning", and it is limited to Convolutional Neural Networks (CNNs). It is a conceptually different type of "triple-dimensional pruning", but similar in its goals: CNN weight tensors can have three or more dimensions, and these can be pruned along multiple dimensions at once, as in the sketch below.
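As an illustration only, here is a minimal NumPy sketch of magnitude-based structured pruning of a 3D convolution weight tensor along three of its dimensions (output filters, input channels, and kernel depth). The tensor shape, scoring rule, and keep ratios are arbitrary choices for this sketch, not taken from any of the cited papers.

```python
import numpy as np

# Hypothetical 3D convolution weight tensor:
# (output channels, input channels, kernel depth, kernel height, kernel width).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32, 3, 3, 3))

# Filter pruning: rank output filters by L1 magnitude and keep the top 48 of 64.
filter_scores = np.abs(W).sum(axis=(1, 2, 3, 4))
W = W[np.argsort(filter_scores)[-48:]]

# Channel pruning: rank input channels the same way and keep 24 of 32.
channel_scores = np.abs(W).sum(axis=(0, 2, 3, 4))
W = W[:, np.argsort(channel_scores)[-24:]]

# Kernel-depth pruning: drop the lowest-magnitude slice of the 3x3x3 kernel's depth axis.
depth_scores = np.abs(W).sum(axis=(0, 1, 3, 4))
W = np.delete(W, np.argmin(depth_scores), axis=2)

print("Pruned 3D conv weight shape:", W.shape)   # (48, 24, 2, 3, 3)
```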
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about:
- Dual pruning
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning