Aussie AI
Channel Pruning
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Channel pruning is a model inference optimization that reduces computation along the width dimension of a model. It originated with CNNs, where it removes entire convolutional channels, and is analogous to attention head pruning in Transformer architectures.
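As a concrete illustration, one common static approach is magnitude-based channel pruning: rank a convolutional layer's output channels by the L1 norm of their weights and drop the weakest ones. The sketch below is a minimal, hypothetical NumPy example of this idea (function name and `keep_ratio` parameter are illustrative, not from any particular paper):

```python
import numpy as np

def prune_channels(weights, keep_ratio=0.5):
    """Magnitude-based channel pruning sketch.

    weights: conv kernel of shape (out_channels, in_channels, kH, kW).
    Drops the output channels with the smallest L1 weight norms,
    keeping a keep_ratio fraction of channels.
    Returns the pruned kernel and the indices of the kept channels.
    """
    # L1 norm of each output channel's weights
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(weights.shape[0] * keep_ratio))
    # Keep the largest-norm channels, preserving original channel order
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return weights[keep], keep

# Example: a 4-channel kernel where channel 0 is all zeros
w = np.zeros((4, 2, 3, 3))
w[1], w[2], w[3] = 1.0, 2.0, 3.0
pruned, kept = prune_channels(w, keep_ratio=0.5)
print(pruned.shape, kept)  # (2, 2, 3, 3) [2 3]
```

In a real pipeline, pruning a layer's output channels also requires removing the matching input channels of the next layer, and accuracy is usually recovered with fine-tuning; dynamic variants (such as feature boosting and suppression, below) instead gate channels per input at runtime.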
Research on Channel Pruning
Research papers on channel pruning include:
- M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of adaptive inference optimization techniques, with much focus on image and video processing optimizations.)
- Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert D. Mullins, and Cheng-Zhong Xu. 2019. Dynamic Channel Pruning: Feature Boosting and Suppression. Proceedings of the 7th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1810.05331
- Mogaka, O.M., Zewail, R., Inoue, K. et al. TinyEmergencyNet: a hardware-friendly ultra-lightweight deep learning model for aerial scene image classification. J Real-Time Image Proc 21, 51 (2024). https://doi.org/10.1007/s11554-024-01430-y https://link.springer.com/article/10.1007/s11554-024-01430-y#citeas (Use of both power-of-two quantization and channel pruning for fast image analysis.)
- Ji Liu, Dehua Tang, Yuanxian Huang, Li Zhang, Xiaocheng Zeng, Dong Li, Mingjie Lu, Jinzhang Peng, Yu Wang, Fan Jiang, Lu Tian, Ashish Sirasao, 12 Jan 2024, UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer, https://arxiv.org/abs/2401.06426 (Block pruning strategy gives a type of depth pruning.)
- LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs 2023, Journal of Supercomputing https://doi.org/10.1007/s11227-023-05212-4
- LRP-based network pruning and policy distillation of robust and non-robust DRL agents for embedded systems 2023, Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.7351
- David Spuler, March 2024, Chapter 48. Width Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Y Li, K Adamczewski, W Li, S Gu, 2022, Revisiting random channel pruning for neural network compression, http://openaccess.thecvf.com/content/CVPR2022/html/Li_Revisiting_Random_Channel_Pruning_for_Neural_Network_Compression_CVPR_2022_paper.html
- Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In ECCV, 2018, https://arxiv.org/abs/1804.03230
- Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
- Bejnordi, B.E., Blankevoort, T., Welling, M.: Batch-shaping for learning conditional channel gated networks. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=Bke89JBtvB
- Gao, X., Zhao, Y., Dudziak, Ł., Mullins, R., Xu, C.-Z: Dynamic channel pruning: Feature boosting and suppression. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=BJxh2j0qYm
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Xiaotong Luo; Zekun Ai; Qiuyuan Liang; Yuan Xie, 06 August 2024, EdgeFormer: Edge-aware Efficient Transformer for Image Super-resolution, IEEE Transactions on Instrumentation and Measurement ( Early Access), DOI: 10.1109/TIM.2024.3436070, https://ieeexplore.ieee.org/abstract/document/10623619 https://github.com/xiaotongtt/EdgeFormer
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
- Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang, 16 Sep 2024, CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios, https://arxiv.org/abs/2409.10593 (KV cache compression on the "channel" or "width" dimension.)
- Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end)
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about: