Aussie AI
Triple Axis Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Triple Axis Pruning
Structured pruning methods are often categorized according to the dimension of the model that they aim to reduce. Weights can be structurally pruned along the three major axes of the model: depth, width, and length.
Depth pruning. The weights are pruned by removing layers to make the model “shallower”. Techniques include layer pruning, inference loop early exit, and “shallow decoder” Transformer architectures. Note that choosing the model meta-parameter of the number of layers via neural architecture search (NAS) is conceptually very similar to static layer pruning. Similarly, dynamic early exit with a decision condition based only on a fixed number of layers (e.g. always exit after 10 layers) is effectively static layer pruning, but with wasted storage space for the unused layers of weights. See Chapter 47 for more details on depth pruning, such as early exit inference and layer pruning.
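As a rough illustration of the dynamic depth idea, here is a minimal C++ sketch of a decoder layer loop with a confidence-based early exit. The per-layer transform, the confidence estimate, and the threshold are simplified placeholders for illustration, not real inference code.

#include <cstdio>
#include <numeric>
#include <vector>

// Placeholder per-layer transform (a real decoder layer would run attention and FFN here).
static void decode_layer(int /*layer*/, std::vector<float>& embedding)
{
    for (float& x : embedding) x += 0.05f;   // stand-in computation only
}

// Placeholder confidence estimate (a real model might use the output probabilities).
static float exit_confidence(const std::vector<float>& embedding)
{
    float sum = std::accumulate(embedding.begin(), embedding.end(), 0.0f);
    return sum / (float)embedding.size();
}

// Dynamic depth pruning via early exit: stop the layer loop once confidence is high enough.
static int decode_with_early_exit(std::vector<float>& embedding, int num_layers, float threshold)
{
    int layers_run = 0;
    for (int layer = 0; layer < num_layers; ++layer) {
        decode_layer(layer, embedding);
        ++layers_run;
        if (exit_confidence(embedding) >= threshold)
            break;   // early exit: the remaining layers are skipped
    }
    return layers_run;
}

int main()
{
    std::vector<float> embedding(8, 0.5f);
    int used = decode_with_early_exit(embedding, 12, 0.6f);
    printf("Ran %d of 12 layers\n", used);
    return 0;
}

In practice, the exit decision is usually evaluated only at a few designated layers, because the confidence test itself adds cost if it runs after every layer.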
Width pruning. The “width” of the model is the fanning out of the incoming embeddings data across multiple attention heads or internal neural nodes. Width pruning is sometimes called “thinning” or “slimming” of the model (see slimmable networks). Width pruning strategies include attention head pruning, filter pruning, and channel pruning. See Chapter 48 for more on width pruning.
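Here is a minimal C++ sketch of the attention head pruning idea: a per-head mask marks which heads survive, and pruned heads are skipped entirely. The per-head computation is a stand-in placeholder, and the mask values are illustrative assumptions.

#include <cstdio>
#include <vector>

// Placeholder for one attention head's contribution
// (a real model would compute QKV attention for this head).
static float head_output(int head, float input)
{
    return input * (1.0f + 0.1f * (float)head);   // stand-in computation only
}

// Width pruning sketch: compute only the heads whose mask entry is true;
// pruned heads are skipped entirely, "thinning" the layer.
static float pruned_multi_head(float input, const std::vector<bool>& head_mask)
{
    float total = 0.0f;
    for (size_t h = 0; h < head_mask.size(); ++h) {
        if (!head_mask[h]) continue;   // this head has been pruned away
        total += head_output((int)h, input);
    }
    return total;
}

int main()
{
    // 8 heads, with heads 3, 5 and 7 pruned (e.g. chosen by an importance metric).
    std::vector<bool> head_mask = { true, true, true, false, true, false, true, false };
    printf("Output with pruned heads: %f\n", pruned_multi_head(1.0f, head_mask));
    return 0;
}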
Length pruning. The third dimension of the model is the internal model size, which sets the fixed length of the vectors (embeddings) that propagate through the width and depth of the model. Note that choosing the meta-parameters of embedding size and context window (e.g. via NAS) is conceptually similar to static length pruning. Length pruning strategies include token pruning and embeddings pruning. Also related is autoregression research. Of the three axes, length pruning has had the least research. Note that “length” is mainly applicable to text transformers; in vision transformers, the third dimension is the image, or patches of the image. See Chapter 49 for more on length pruning.
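As a rough illustration of token pruning on the length axis, the C++ sketch below drops tokens whose importance score falls below a threshold, so that downstream layers process a shorter sequence. The importance scores and threshold are illustrative assumptions; a real system would derive them from something like attention weights.

#include <cstdio>
#include <vector>

// Length (token) pruning sketch: keep only tokens whose importance score exceeds
// a threshold, shortening the sequence that later layers must process.
static std::vector<int> prune_tokens(const std::vector<int>& tokens,
                                     const std::vector<float>& importance,
                                     float threshold)
{
    std::vector<int> kept;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (importance[i] >= threshold)
            kept.push_back(tokens[i]);   // retain this token
        // tokens below the threshold are dropped from the sequence
    }
    return kept;
}

int main()
{
    std::vector<int> tokens = { 101, 42, 7, 256, 13 };
    std::vector<float> importance = { 0.9f, 0.1f, 0.8f, 0.05f, 0.7f };
    std::vector<int> kept = prune_tokens(tokens, importance, 0.5f);
    printf("Kept %zu of %zu tokens\n", kept.size(), tokens.size());
    return 0;
}

The same idea generalizes from removing whole tokens to removing individual embedding dimensions (embeddings pruning).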
Dual pruning. Dual pruning is the combination of width pruning and depth pruning. Depth pruning involves pruning or skipping the layers of the model, such as in layer pruning, early exiting or the shallow decoder architecture. Width pruning techniques include attention head pruning, slimmable networks, channel pruning, filter pruning, and other strategies. Some papers also describe combinations of multiple pruning strategies as “hybrid pruning”.
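A minimal sketch of how the two axes might be combined in one inference loop: a per-layer keep flag implements depth pruning, and a per-head mask inside each kept layer implements width pruning. All of the masks and the per-layer computation here are illustrative placeholders, not a production implementation; in practice the masks would come from an offline importance analysis.

#include <cstdio>
#include <vector>

// Dual pruning sketch: a per-layer "keep" flag (depth axis) combined with a
// per-head mask inside each kept layer (width axis).
struct LayerPruneConfig {
    bool keep_layer;               // false = the whole layer is pruned (depth)
    std::vector<bool> head_mask;   // per-head keep flags (width)
};

static void run_layer(int layer, const std::vector<bool>& head_mask, float& state)
{
    for (size_t h = 0; h < head_mask.size(); ++h) {
        if (head_mask[h])
            state += 0.01f * (float)(layer + 1);   // stand-in for one head's work
    }
}

int main()
{
    // 4-layer model with layer 2 removed and some heads pruned in each kept layer.
    std::vector<LayerPruneConfig> config = {
        { true,  { true, true, false, true } },
        { true,  { true, false, true, true } },
        { false, { } },                           // depth-pruned layer
        { true,  { true, true, true, false } },
    };
    float state = 0.0f;
    for (size_t layer = 0; layer < config.size(); ++layer) {
        if (!config[layer].keep_layer) continue;  // skip the depth-pruned layer
        run_layer((int)layer, config[layer].head_mask, state);
    }
    printf("Final state: %f\n", state);
    return 0;
}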
Research papers on dual pruning:
- Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin, 2020, Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, JSTSP, https://arxiv.org/abs/1907.04523
- X. Xu, M. S. Park, and C. Brick, 2018, Hybrid pruning: Thinner sparse networks for fast inference on edge devices, ICLR, https://arxiv.org/abs/1811.00482
- Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K. Jha, 2021, Fully dynamic inference with deep neural networks, IEEE Transactions on Emerging Topics in Computing, https://arxiv.org/abs/2007.15151
- Ali Ehteshami Bejnordi and Ralf Krestel, 2020, Dynamic channel and layer gating in convolutional neural networks, In KI, https://dl.acm.org/doi/10.1007/978-3-030-58285-2_3
- Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui, 2022, Width & Depth Pruning for Vision Transformers, Vol. 36 No. 3: AAAI-22 Technical Tracks 3 / AAAI Technical Track on Computer Vision III, DOI: https://doi.org/10.1609/aaai.v36i3.20222, https://ojs.aaai.org/index.php/AAAI/article/view/20222, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/20222/19981
- Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with Adaptive Width and Depth, arXiv preprint arXiv:2004.04037 (Oct 2020), https://arxiv.org/abs/2004.04037, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT
- H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. 2020, Once for all: Train one network and specialize it for efficient deployment, In International Conference on Learning Representations, 2020. https://arxiv.org/abs/1908.09791
- H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han. 2020, HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, In Annual Meeting of the Association for Computational Linguistics, 2020. https://aclanthology.org/2020.acl-main.686/
- Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014. https://arxiv.org/abs/1412.6550
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks, https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width- and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
- Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
- Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference, In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html (Early exit of layers and network selection.)
- Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017, Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017), https://doi.org/10.48550/ARXIV.1704.04861, https://arxiv.org/abs/1704.04861 (Combines depthwise separable convolutions and thinning at each layer.)
- Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. 2016, Learning structured sparsity in deep neural networks, In NIPS 2016. Code: https://github.com/wenwei202/caffe/tree/scnn (Combined filter and layer pruning.)
- Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., and Doermann, D. S. 2019, Towards optimal structured CNN pruning via generative adversarial learning, In CVPR, 2019. https://arxiv.org/abs/1903.09291 (Similar to a combined filter and layer pruning algorithm.)
- Zehao Huang and Naiyan Wang, 2018, Data-driven sparse structure selection for deep neural networks, in ECCV, 2018. https://arxiv.org/abs/1707.01213, Code: https://github.com/huangzehao/sparse-structure-selection (Not typical width-depth pruning, but a combined pruning that uses sparsification to force weight structures to zero, allowing pruning of whole structures.)
- Tianli Zhao, Xi Sheryl Zhang, Wentao Zhu, Jiaxing Wang, Sen Yang, Ji Liu, Jian Cheng, Nov 2021, Joint Channel and Weight Pruning for Model Acceleration on Mobile Devices, https://arxiv.org/abs/2110.08013
- H Litao, X Fei, G Xiaoyang, Y Tingting, June 2023, Research on Model Compression Method Based on Deep Separable Convolutional and Pruning, Available at SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4478190 (Somewhat related to dual pruning; also adds quantization.)
- Xiaoying Zhi, Varun Babbar, Pheobe Sun, Fran Silavong, Ruibo Shi, Sean Moran, Sep 2023, A New Baseline for Green AI: Finding the Optimal Sub-Network via Layer and Channel Pruning, https://arxiv.org/abs/2302.10798 (Layer and channel pruning combined.)
See research papers on dual pruning (hybrid pruning) at https://www.aussieai.com/research/dual-pruning
Triple pruning. Triple axis pruning combines all three dimensions at once: depth (layers), width (attention heads, channels, or filters), and length (tokens or embedding dimensions). Research papers on triple axis pruning:
- W. Wang, M. Chen, S. Zhao, L. Chen, J. Hu, H. Liu, D. Cai, X. He, and W. Liu, 2020, Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework, https://arxiv.org/abs/2010.04879
- J Guo, J Liu, D Xu, 2021, JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing, IEEE Transactions on Circuits and Systems for Video Technology (Volume 32, Issue 6, June 2022), https://ieeexplore.ieee.org/abstract/document/9516010/
- H Kong, X Luo, S Huai, D Liu, 2023, EMNAPE: Efficient Multi-Dimensional Neural Architecture Pruning for Edge AI, 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), https://ieeexplore.ieee.org/document/10137122, https://dr.ntu.edu.sg/bitstream/10356/167488/2/camera-ready-DATE.pdf (Triple-pruning algorithm with comparison to various dual pruning algorithms.)
- Z Hou, SY Kung, 2022, Multi-dimensional model compression of vision transformer, Conference on Multimedia and Expo (ICME), https://arxiv.org/pdf/2201.00043 (Pruning of attention heads, neurons, and sequence dimensions jointly.)
- Zechun Liu, Xiangyu Zhang, Zhiqiang Shen, Zhe Li, Yichen Wei, Kwang-Ting Cheng, Jian Sun, Sep 2021, Joint Multi-Dimension Pruning via Numerical Gradient Update, https://arxiv.org/abs/2005.08931 (An earlier 2020 version appeared as “Joint multi-dimension pruning”, https://arxiv.org/abs/2005.08931v1)
- Z Hou, SY Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/papers/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.pdf
For more research on triple pruning, see https://www.aussieai.com/research/triple-pruning.
Quadruple pruning? Is there a way to do four? For example, can you combine layer, width, token, and model dimension pruning? I haven't yet seen a research paper with this.
Hybrid pruning. There are various hybrid pruning strategies, in which pruning is combined with other non-pruning optimizations such as quantization. You can even combine structured and unstructured pruning: select a particular structure of the model (e.g. a layer), as in structured pruning, but then apply unstructured weight pruning (magnitude pruning) within that structure, so that its low-value, unimportant weights are removed.
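As a rough sketch of this structured-plus-unstructured combination, the C++ example below selects one layer as the target structure and applies magnitude pruning only within it, zeroing its low-magnitude weights. The weight values, the chosen layer, and the threshold are all illustrative assumptions.

#include <cmath>
#include <cstdio>
#include <vector>

// Unstructured magnitude pruning applied inside one chosen structure (here, one
// layer's weights): any weight with absolute value below the threshold is zeroed.
static size_t magnitude_prune(std::vector<float>& weights, float threshold)
{
    size_t zeroed = 0;
    for (float& w : weights) {
        if (std::fabs(w) < threshold) {
            w = 0.0f;   // prune this low-magnitude weight
            ++zeroed;
        }
    }
    return zeroed;
}

int main()
{
    // Hybrid sketch: layer 1 is the target structure, and unstructured magnitude
    // pruning is applied only within that layer (the other layers are untouched).
    std::vector<std::vector<float>> layer_weights = {
        { 0.8f, -0.02f, 0.5f, 0.01f },
        { 0.3f, 0.04f, -0.6f, 0.002f },
        { -0.7f, 0.03f, 0.9f, -0.005f },
    };
    int target_layer = 1;   // structure chosen by a structured pruning analysis
    size_t zeroed = magnitude_prune(layer_weights[target_layer], 0.05f);
    printf("Layer %d: zeroed %zu of %zu weights\n",
           target_layer, zeroed, layer_weights[target_layer].size());
    return 0;
}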