Aussie AI
Layer Pruning
Last Updated 12 December, 2024
by David Spuler, Ph.D.
Layer pruning is a type of structured pruning because it prunes entire layers. More precisely, it is a type of "depth pruning" because it reduces the depth of the stacks of encoders and/or decoders in the Transformer architecture. This technique can sometimes be called "layer compaction".
Dynamic layer pruning is called "early exiting" if all remaining layers are skipped, or "layer skipping" if only the current layer is skipped. Reducing the number of decoder layers is called a "shallow decoder" architecture, which has been found to be effective because encoder layers are more important in a Transformer than decoder layers. Layer pruning is also related to "layer fusion" (usually via parameter sharing) and layer reordering.
Layer pruning refers to removing one or more entire layers from the model, and is a subtype of "depth pruning". Most AI models have multiple hidden layers of nodes, and sometimes a layer can be removed without too great a loss in model accuracy. A layer can be removed statically, to create a new model file, or dynamically, via some adaptive criteria. Most of the literature focuses on dynamic layer pruning via early exit of the inference algorithm, which stops once it detects that a threshold of accuracy has been achieved.
Layer pruning can be combined with many other methods to create hybrid optimizations. For example, it is orthogonal to quantization, width pruning (e.g. attention head pruning), and length pruning (e.g. token pruning, embeddings pruning).
Static layer pruning
Static layer pruning is the removal of entire layers of weights from a model file. This involves detecting layers that add minimal value during training (or post-training but pre-inference). It seems to have less chance of success than dynamic methods, and is relatively under-researched. This is related to the training design choice of how many layers to use in a model, which was once more art than science, but has received more research attention recently (see "neural architecture search"). Interestingly, some of the "early exit" and "layer skipping" inference techniques effectively change the number of model layers from a static constant into a dynamic choice, and the generalization of that idea to dynamic layer management may warrant further research.
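As a concrete illustration, the sketch below shows static layer pruning in PyTorch-style code, under the assumption that the model stores its Transformer blocks in an nn.ModuleList attribute named "layers"; the attribute name and the pruned indices are hypothetical, and real model classes differ.

import torch
import torch.nn as nn

def prune_layers_static(model: nn.Module, layers_to_remove: set) -> nn.Module:
    # Keep only the layers not marked for removal; assumes model.layers is an nn.ModuleList.
    model.layers = nn.ModuleList(
        layer for i, layer in enumerate(model.layers) if i not in layers_to_remove
    )
    return model

# Hypothetical usage: drop two layers judged unimportant, then save the smaller model file.
# pruned = prune_layers_static(model, layers_to_remove={10, 11})
# torch.save(pruned.state_dict(), "pruned_model.pt")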
General research on layer pruning:
- Sabina Pokhrel, "4 Popular Model Compression Techniques Explained", January 19, 2022, https://xailient.com/blog/4-popular-model-compression-techniques-explained/
- Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving. In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
- H Sajjad, F Dalvi, N Durrani, P Nakov, 2020, Poor Man's BERT: Smaller and Faster Transformer Models, arXiv preprint arXiv:2004.03844 https://arxiv.org/abs/2004.03844v1
- E Youn, S Prabhu, S Chen, 2023, Compressing Vision Transformers for Low-Resource Visual Learning, arXiv preprint arXiv:2309.02617, PDF: https://arxiv.org/pdf/2309.02617.pdf
- Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin. Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/2010.07611
- Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
- Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen, 7 Mar 2024 (v2), ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, https://arxiv.org/abs/2403.03853
- Pedram Rostami, Mohammad Javad Dousti, 10 Nov 2024, CULL-MT: Compression Using Language and Layer pruning for Machine Translation, https://arxiv.org/abs/2411.06506
Dynamic Layer Pruning (Inference Early Exit)
There are several papers on "early exit" from the inference algorithm, without processing all the layers, which is effectively dynamic pruning of the model's layers at runtime during inference. Other names for "early exit" in the literature include "early stopping" and "dropout". This overall technique can be categorized as "dynamic depth pruning" or "dynamic depth models". The method relies on the assumption that each successive layer makes the results more accurate, but with diminishing changes. After a few layers, the results may be "good enough" to decide on the outcome without finishing all the layers.
Dynamic pruning techniques have to use a decision method, usually called a "classifier", to choose whether or not to exit at a given layer. Various methods have been researched for this decision.
Always-exit test: Note that dynamic layer pruning with a simplistic decision test, such as always exiting at layer N=5, is effectively the same as static layer pruning of layers N>=6, but without the benefit of reduced model storage space. However, implementing this always-exit test dynamically can still be useful when testing the efficacy of the model in terms of the meta-parameter N (i.e., when deciding the number of layers to use): the model's accuracy for different values of N can be tested dynamically without rebuilding the model file.
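As an illustration of such an exit classifier, here is a minimal sketch assuming per-layer exit heads that map hidden states to logits, plus a simple softmax-confidence threshold; all function and parameter names are illustrative rather than taken from any particular framework.

import torch

def forward_with_early_exit(layers, exit_heads, final_norm, lm_head, hidden, threshold=0.9):
    # Run layers one at a time; exit early once the intermediate prediction is confident.
    # Assumes batch size 1 and per-layer exit heads that were trained to produce logits.
    for layer, exit_head in zip(layers, exit_heads):
        hidden = layer(hidden)
        logits = exit_head(hidden[:, -1, :])              # predict from the last token position
        confidence = torch.softmax(logits, dim=-1).max().item()
        if confidence >= threshold:                       # "good enough": skip all remaining layers
            return logits
    return lm_head(final_norm(hidden[:, -1, :]))          # no early exit: full-depth prediction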
Various research papers on dynamic layer pruning (see also early exit research for even more):
- Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin, DeeBERT: Dynamic early exiting for accelerating BERT inference, arXiv preprint arXiv:2004.12993, 2020, https://arxiv.org/abs/2004.12993 Code: https://github.com/castorini/DeeBERT (This is a form of dynamic layer pruning.)
- Angela Fan, Edouard Grave, and Armand Joulin, Reducing transformer depth on demand with structured dropout, 2019, arXiv:1909.11556, https://arxiv.org/abs/1909.11556
- Surat Teerapittayanon, Bradley McDanel, and Hsiang Tsung Kung, BranchyNet: Fast inference via early exiting from deep neural networks, 2017, arXiv:1709.01686, https://arxiv.org/abs/1709.01686
- S. Teerapittayanon, B. McDanel, H.T. Kung, Distributed deep neural networks over the cloud, the edge and end devices, IEEE, Atlanta, GA, USA, 2017, pp. 328–339, doi:10.1109/ICDCS.2017.226., 5–8 June, https://doi.org/10.1109/ICDCS.2017.226
- Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang, Accelerating BERT Inference for Sequence Labeling via Early-Exit, May 2021, https://arxiv.org/abs/2105.13878
- Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis, Multi-Exit Vision Transformer for Dynamic Inference, June 2021, https://arxiv.org/abs/2106.15183
- Nikolaos Passalis, Jenni Raitoharju, Anastasios Tefas, Moncef Gabbouj, Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits, Pattern Recognition Volume 105, September 2020, 107346, https://doi.org/10.1016/j.patcog.2020.107346
- Xiangjie Li, Chenfei Lou, Yuchi Chen, Zhengping Zhu, Yingtao Shen, Yehan Ma, An Zou, Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference, DOI: https://doi.org/10.1609/aaai.v37i7.26042, https://ojs.aaai.org/index.php/AAAI/article/view/26042
- Stefanos Laskaridis, Alexandros Kouris, Nicholas D. Lane, Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions, EMDL'21: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, June 2021, Pages 1–6, https://doi.org/10.1145/3469116.3470012, https://dl.acm.org/doi/abs/10.1145/3469116.3470012
- Vanderlei Bonato and Christos Bouganis, 2021, Class-specific early exit design methodology for convolutional neural networks, Applied Soft Computing (2021), https://www.sciencedirect.com/science/article/abs/pii/S1568494621002398, https://doi.org/10.1016/j.asoc.2021.107316, https://spiral.imperial.ac.uk/bitstream/10044/1/90316/2/Paper___Early_Exit___Applied_Soft_Computing.pdf
- E. Baccarelli, S. Scardapane, M. Scarpiniti, A. Momenzadeh, A. Uncini, Optimized training and scalable implementation of Conditional Deep Neural Networks with early exits for Fog-supported IoT applications, Information Sciences 521 (June 2020), 107–143, DOI: https://doi.org/10.1016/j.ins.2020.02.041, http://www.sciencedirect.com/science/article/pii/
- S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, When edge meets learning: Adaptive control for resource-constrained distributed machine learning, in: IEEE Conference on Computer Communications (IEEE INFOCOM 2018), 2018, pp. 63–71, doi:10.1109/INFOCOM.2018.8486403, https://doi.org/10.1109/INFOCOM.2018.8486403, Honolulu, HI, USA, 16–19 April, 2018.
- Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei, BERT Loses Patience: Fast and Robust Inference with Early Exit, https://doi.org/10.48550/arXiv.2006.04152, https://arxiv.org/abs/2006.04152
- S. Scardapane, M. Scarpiniti, E. Baccarelli, A. Uncini, Why should we add early exits to neural networks?, Cognitive Computation 12 (5) (2020), 954–966, doi:10.1007/s12559-020-09734-4, http://dx.doi.org/10.1007/s12559-020-09734-4
- Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, Bin He, A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2021, DOI: 10.18653/v1/2021.naacl-main.162 https://aclanthology.org/2021.naacl-main.162/
- Zizhao Wang, Wei Bao, Dong Yuan, Liming Ge, Nguyen H. Tran, Albert Y. Zomaya, SEE: Scheduling Early Exit for Mobile DNN Inference during Service Outage, MSWIM '19: Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, November 2019, Pages 279–288, https://doi.org/10.1145/3345768.3355917, https://dl.acm.org/doi/abs/10.1145/3345768.3355917
- Xinrui Tan, Hongjia Li, Liming Wang, Xueqing Huang, Zhen Xu, Empowering Adaptive Early-Exit Inference with Latency Awareness, DOI: https://doi.org/10.1609/aaai.v35i11.17181, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/17181/16988
- Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, Oct 2021, https://arxiv.org/abs/2111.00230, https://doi.org/10.48550/arXiv.2111.00230
- A Görmez, E Koyuncu, Pruning Early Exit Networks 2022, arXiv preprint arXiv:2207.03644, https://arxiv.org/pdf/2207.03644
- Y Sun, J Li, X Xu, 2022, Meta-GF: Training Dynamic-Depth Neural Networks Harmoniously, European Conference on Computer Vision (ECCV 2022), https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136710691.pdf (Code: https://github.com/SYVAE/MetaGF)
- Y Peng, J Lee, S Watanabe, 2023, I3D: Transformer architectures with input-dependent dynamic depth for speech recognition, ICASSP 2023, IEEE, https://ieeexplore.ieee.org/abstract/document/10096662, PDF: https://arxiv.org/pdf/2303.07624
- S Lin, B Ji, R Ji, A Yao, A closer look at branch classifiers of multi-exit architectures, 2022, arXiv preprint arXiv:2204.13347, PDF: https://arxiv.org/pdf/2204.13347
- Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
- Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Snehasis Dey, 2024, Differentiable Slimming for Transformers with Improved Memory Efficiency, College of Engineering Bhubaneswar, Odisha , https://ijte.in/pdf/EE14.pdf (Dual pruning by attention head pruning for slimmable networks combined with layer pruning.)
- Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts, 26 Mar 2024, The Unreasonable Ineffectiveness of the Deeper Layers, https://arxiv.org/abs/2403.17887 (Static layer pruning with some PEFT re-training after removing layers, with quantization and QLoRA.)
- Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
- Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367 Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
- Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder. In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648
- Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017 https://arxiv.org/abs/1605.07648 (Uses both shallow and deep networks with dropout, but not fully layer-wise; similar to cascades.)
- Liu, L.; and Deng, J. 2018. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. In Thirty-Second AAAI Conference on Artificial Intelligence, https://arxiv.org/abs/1701.00299
- Yijin Liu, Fandong Meng, Jie Zhou, Yufeng Chen, and Jinan Xu. Dec 2020. Explicitly modeling adaptive depths for transformer. arXiv preprint arXiv:2004.13542. https://arxiv.org/abs/2004.13542v2 (Adaptive layer pruning, where harder words in the input make the model go deeper into the layers.)
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
- Raj Dabre, Raphael Rubino, and Atsushi Fujita. 2020. Balancing cost and benefit with tied-multi transformers. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 24–34, Online. Association for Computational Linguistics. https://arxiv.org/abs/2002.08614 (Choose number of layers for encoder and decoder based on input; dynamic layer pruning)
- K. Kim, F. Wu, Y. Peng, et al., “E-branchformer: Branch-former with enhanced merging for speech recognition,” arXiv:2210.00077, 2022. https://arxiv.org/abs/2210.00077
- Cristóbal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto, DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference, Sep 2021, ArXiv preprint, abs/2109.11745, https://arxiv.org/abs/2109.11745
- Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation. CoRR, abs/2010.02416. https://arxiv.org/abs/2010.02416
- J. Jin, A. Dundar, and E. Culurciello. 2014, Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, https://arxiv.org/abs/1412.5474
- Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Li Li, 18 Jan 2024, When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference, https://arxiv.org/abs/2401.09964 (Analysing the importance of different layers in code completion use case.)
- Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 4 Mar 2024, Not all Layers of LLMs are Necessary during Inference, https://arxiv.org/abs/2403.02181
- Minjia Zhang and Yuxiong He, 2020, Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. In Advances in Neural Information Processing Systems, volume 33, pp. 14011–14023, 2020. https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html
- Ofir Press, Noah A. Smith, Omer Levy, Apr 2020, Improving Transformer Models by Reordering their Sublayers, https://arxiv.org/abs/1911.03864
- David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied Transformers: Neural machine translation with shared encoder and decoder. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5466–5473, Honolulu, USA. https://aaai.org/ojs/index.php/AAAI/article/view/4487
- Yifan Peng, Jaesong Lee, Shinji Watanabe, I3D: Transformer architectures with input-dependent dynamic depth for speech recognition, https://arxiv.org/abs/2303.07624
- S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for BERT model compression, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4322–4331. URL: https://www.aclweb.org/anthology/D19-1441. doi:10.18653/v1/D19-1441.
- J Yang, Y Yin, L Yang, S Ma, H Huang, 2022, Gtrans: Grouping and fusing transformer layers for neural machine translation, https://arxiv.org/pdf/2207.14467
- Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. On the Effect of Dropping Layers of Pre-trained Transformer Models. arXiv preprint arXiv:2004.03844 (Aug 2020), https://arxiv.org/abs/2004.03844
- Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017, https://arxiv.org/abs/1704.04861
- Jungmin Yun, Mihyeon Kim, Youngbin Kim, 3 Jun 2024, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification, https://arxiv.org/abs/2406.01283
- Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
- Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
- Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
- Jinuk Kim, Marwa El Halabi, Mingi Ji, Hyun Oh Song, July 2024, LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:23825-23842, 2024, https://proceedings.mlr.press/v235/kim24c.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/kim24c/kim24c.pdf Code: https://github.com/snu-mllab/LayerMerge
- Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Yang He, Lingao Xiao, 30 Nov 2023 (v2), Structured Pruning for Deep Convolutional Neural Networks: A survey, https://arxiv.org/abs/2303.00566 https://arxiv.org/pdf/2303.00566 https://ieeexplore.ieee.org/abstract/document/10330640 https://github.com/he-y/Awesome-Pruning https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey (Extensive survey of pruning for CNNs, not LLMs.)
- Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
- Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein, 9 Oct 2024, LLM Compression with Neural Architecture Search, https://arxiv.org/abs/2410.06479 (NAS with width/attention head and layer pruning.)
- Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin, 21 Oct 2024, Pruning Foundation Models for High Accuracy without Retraining, https://arxiv.org/abs/2410.15567 https://github.com/piuzha/APT
- Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4814–4823. https://aclanthology.org/2021.findings-acl.425/ PDF: https://aclanthology.org/2021.findings-acl.425.pdf
- Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen, 3 Jul 2024, DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs. https://arxiv.org/abs/2407.11030
- J. Li, Q. Li and P. Wang, 2024, From Static to Dynamic: A Deeper, Faster, and Adaptive Language Modeling Approach, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650050, https://ieeexplore.ieee.org/abstract/document/10650050 (Uses a preliminary "estimator module" to decide which layers to use.)
- Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15
- Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong, 4 Oct 2024, UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference, https://arxiv.org/abs/2410.03090
- Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
Dynamic Layer Skipping
Layer skipping refers to bypassing the processing of one layer and moving on to the next, rather than "early exiting" to skip all remaining layers. Although much of the existing research is about early exit to skip all further layers (depth pruning), there is some research on choosing to skip a single layer, or on per-layer early-exit algorithms. Some example policies for layer skipping could be:
- Skip all remaining layers (early exiting)
- Skip some early layers
- Skip some middle layers
- Skip selected layers (e.g. every second layer)
- Skip random layers (stochastic layer skipping)
Some ways to generalize the method include:
- Skip encoder vs decoder layers differently (see shallow decoder)
- Skip prefill vs decoding phase layers differently (in decoder-only Transformers like GPT)
This is also a form of dynamic depth pruning, because it reduces the number of layers that the model executes, according to some decision criterion.
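A minimal sketch of one such skipping policy is shown below, assuming a stack of Transformer blocks that each map hidden states to hidden states of the same shape; the specific policy and all names are illustrative only.

def forward_with_layer_skipping(layers, hidden, skip_every=2, protected=(0, 1)):
    # Skip every skip_every-th layer, except a few "protected" early layers;
    # a learned router or per-token classifier could make this decision instead.
    for i, layer in enumerate(layers):
        if i not in protected and i % skip_every == 0:
            continue              # skip this layer: the hidden state passes through unchanged
        hidden = layer(hidden)
    return hidden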
- Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez, Skipnet: Learning dynamic routing in convolutional networks, In ECCV, 2018, https://arxiv.org/abs/1711.09485
- Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016, https://arxiv.org/abs/1603.08983
- Jianghao Shen, Yue Wang, Pengfei Xu, Yonggan Fu, Zhangyang Wang, Yingyan Lin, Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference, January 2020, DOI: https://doi.org/10.1609/aaai.v34i04.6025, https://arxiv.org/abs/2001.00705
- YG Jiang, C Cheng, H Lin, Y Fu, 2020, Learning Layer-Skippable Inference Network, IEEE Transactions on Image Processing, Volume 29, pp. 8747-8759, 28 August 2020, https://ieeexplore.ieee.org/abstract/document/9180094
- H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, A convolutional neural network cascade for face detection, 2015, in CVPR, https://paperswithcode.com/paper/a-convolutional-neural-network-cascade-for
- F. Yang, W. Choi, and Y. Lin, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, 2016, in CVPR, https://ieeexplore.ieee.org/document/7780603
- Andreas Veit and Serge Belongie, Convolutional networks with adaptive inference graphs, In ECCV, 2018, https://arxiv.org/abs/1711.11503
- X. Dong, J. Huang, Y. Yang, and S. Yan, More is less: A more complicated network with less inference complexity,” in CVPR, 2017. https://arxiv.org/abs/1703.08651
- Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov, On the Effect of Dropping Layers of Pre-trained Transformer Models arXiv preprint arXiv:2004.03844, 2020 (revised Aug 2022), https://arxiv.org/abs/2004.03844 (Examined dropping alternative layers, layer fusion, and other layer pruning strategies.)
- Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic layer depth routing based on easy vs hard queries to optimize training.)
- Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang, 22 Mar 2024, Hierarchical Skip Decoding for Efficient Autoregressive Text Generation, https://arxiv.org/abs/2403.14919 (A new decoding algorithm called Hierarchical Skip Decoding involving layer skipping.)
- Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
- Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
- Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng, 10 Apr 2024, CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers, https://arxiv.org/abs/2404.06709 (Similar to layer skipping or layer fusion, but concurrently calculates some layers that seem to be less important, rather than running the layers sequentially.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzed layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, 7 Apr 2024, Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models, https://arxiv.org/abs/2404.04900 (Token-specific layer routing is similar to layer skipping and dynamic depth pruning.)
- David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-token layer skipping for a type of adaptive inference with conditional computation.)
- Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
- Haoyu Wang, Yaqing Wang, Tianci Liu, Tuo Zhao, and Jing Gao, 2023, HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference https://aclanthology.org/2023.findings-emnlp.283.pdf (Layer skipping during fine-tuning.)
- Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
- Ofir Press, Noah A. Smith, Omer Levy, Apr 2020, Improving Transformer Models by Reordering their Sublayers, https://arxiv.org/abs/1911.03864
- Rafael Fão de Moura, Paulo C Santos, João Paulo C de Lima, Marco AZ Alves, Antonio CS Beck, and Luigi Carro. 2019. Skipping CNN convolutions through efficient memoization. In International Conference on Embedded Computer Systems. Springer, 65–76. https://link.springer.com/chapter/10.1007/978-3-030-27562-4_5
- J Park, DY Kim, YH Moon, 2022, Lazy Net: Lazy Entry Neural Networks for Accelerated and Efficient Inference 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), https://ieeexplore.ieee.org/abstract/document/9953031
- Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama, 2017, Adaptive Neural Networks for Efficient Inference, Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017. http://proceedings.mlr.press/v70/bolukbasi17a.html http://proceedings.mlr.press/v70/bolukbasi17a/bolukbasi17a.pdf
- Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang, 3 Jun 2024, Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching, https://arxiv.org/abs/2406.01733 Code: https://github.com/horseee/learning-to-cache (Layer skipping in diffusion transformers via layer caching.)
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- J. Jin, A. Dundar, and E. Culurciello. 2014, Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, https://arxiv.org/abs/1412.5474
- Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV
- David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (To achieve higher accuracy, this model re-traverses some of the layers, which achieves higher model accuracy from the same size model without more memory.)
- Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
- Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang, 2 Jul 2024, SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules, https://arxiv.org/abs/2407.02031 (Efficient diffusion models in systems with multi-LoRA, ControlNets, and other multi-module add-ons, including parallelizing execution of add-ons and more efficient loading of LoRA with faster updating or "patching" of model weights, including by performing some layers in parallel without LoRA weights, while loading the LoRA adapters.)
- Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen, 3 Jul 2024, DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs. https://arxiv.org/abs/2407.11030
- H Wang, 2024, Minimalism Yields Maximum Results: Deep Learning with Limited Resource, Ph.D. Thesis, Purdue University, PDF: https://hammer.purdue.edu/articles/thesis/Minimalism_Yields_Maximum_Results_Deep_Learning_with_Limited_Resource/26349415/1/files/47855029.pdf
- Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane, 16 Aug 2024, Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning, https://arxiv.org/abs/2408.08670 (Faster fine-tuning by selecting layers, freezing layers, or slimming them to fewer fine-tuned parameters.)
- Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 9 Jul 2024 (v3), Not All Layers of LLMs Are Necessary During Inference, https://arxiv.org/abs/2403.02181
- Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
- Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 16 Aug 2024 (v2), DyCE: Dynamically Configurable Exiting for Deep Learning Compression and Real-time Scaling, https://arxiv.org/abs/2403.01695
- J. Li, Q. Li and P. Wang, 2024, From Static to Dynamic: A Deeper, Faster, and Adaptive Language Modeling Approach, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650050, https://ieeexplore.ieee.org/abstract/document/10650050 (Uses a preliminary "estimator module" to decide which layers to use.)
- Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez, Ram Pasunuru, Scott Yih, Sravya Popuri, Xing Liu, Carole-Jean Wu, 30 Sep 2024, Characterizing and Efficiently Accelerating Multimodal Generation Model Inference, https://arxiv.org/abs/2410.00215 (Analyzes the bottlenecks in inference, finding the usual problems of autoregression, but also more interesting issues such as that linear kernels can be expensive, and KV cache reordering is a bottleneck in beam search, and layer skipping is analyzed.)
- Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
- Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
- Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
- Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
- Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4814–4823. https://aclanthology.org/2021.findings-acl.425/ PDF: https://aclanthology.org/2021.findings-acl.425.pdf
- Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15
- Y Zhou, C Zhou, W Xie, X Wang, J Chen, Z Ni, J Li, 2024, The Benefits in Shallow: Merge Decoding Across Large Language Model Layers. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science(), vol 15360. Springer, Singapore. https://doi.org/10.1007/978-981-97-9434-8_30 https://link.springer.com/chapter/10.1007/978-981-97-9434-8_30
- Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
Layer Skipping and KV Caching
All forms of dynamic layer pruning, such as layer skipping and early exit, have a problem when used with KV caching. These two optimizations seem like they should be orthogonal and combine well, but there is a catch: when a layer is skipped, or multiple layers are skipped by exiting early, the KV cache for those layers is not computed, and it will be out-of-date the next time those layers are executed.
Various methods to fix the KV cache have been examined by researchers. At a minimum, any layer that is skipped needs to track a flag that its KV cache is invalid, so the stale entries can be re-computed when required. There are also more advanced solutions; for more details about this research, see: KV caching with early exit.
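A minimal sketch of this bookkeeping is shown below, using a simple per-layer list-based cache; all class and method names are hypothetical rather than from any particular framework.

class PerLayerKVCache:
    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]
        self.cached_up_to = [0] * num_layers      # number of token positions cached per layer

    def append(self, layer_idx, k, v, position):
        # Called only when the layer actually executes for this token position.
        self.keys[layer_idx].append(k)
        self.values[layer_idx].append(v)
        self.cached_up_to[layer_idx] = position + 1

    def is_stale(self, layer_idx, position):
        # True if earlier positions were skipped for this layer and never cached,
        # meaning the cache must be recomputed (or approximated) before reuse.
        return self.cached_up_to[layer_idx] < position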
Layer Reordering
An interesting technique that generalizes the use of layers is "layer reordering". The idea is motivated by the realization that Transformer layers are building blocks which output the same format as their input. Hence, not only can you remove a layer (early exit or layer pruning), skip a layer (layer skipping), or run the same layer twice (layer fusion), but the idea can be generalized in any way. You can pick and choose which layers to run, in what order, and how often; you could even run every layer twice, or run all the layers in reverse.
Layer reordering usually refers to entire Transformer layers. For other types of merging or reordering of separate sub-layer structures within Transformer layers, see kernel operator fusion. For discussion of the order of layer normalization subcomponents, see normalization reordering.
Layer reordering seems like it shouldn't work. After all, didn't we expend all those GPU cycles in training to carefully work out the correct weights for each layer? Isn't it true that the first layers do the broad analysis and the upper layers do the finessing? So early exiting makes some kind of sense, because it just skips the finer details at the end, but randomly reordering things seems weird. Nevertheless, there are some research papers that explore layer reordering and its generalizations.
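The mechanics are simple precisely because the layers are interchangeable building blocks. Here is a minimal sketch, with hypothetical names, of running layers in an arbitrary order; note that a reordered model generally needs re-training or fine-tuning to perform well.

def forward_with_layer_order(layers, hidden, order):
    # Run the blocks in whatever order is given, possibly visiting a layer more than once.
    for idx in order:
        hidden = layers[idx](hidden)
    return hidden

# Hypothetical examples: all layers in reverse, or every layer visited twice.
# hidden = forward_with_layer_order(layers, hidden, order=reversed(range(len(layers))))
# hidden = forward_with_layer_order(layers, hidden, order=[i for i in range(len(layers)) for _ in (0, 1)])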
- Ofir Press, Noah A. Smith, Omer Levy, Improving Transformer Models by Reordering their Sublayers, arXiv preprint arXiv:1911.03864, 2019, https://arxiv.org/abs/1911.03864 (Layer reordering! Includes analysis of multiple layers, and also reordering self-attention and feed-forward sub-components in a "sandwich" architecture.)
- Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu, Mar 2021, IOT: Instance-wise Layer Reordering for Transformer Structures, https://arxiv.org/abs/2103.03457
- Elicia Ye, March 2023, Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation, https://arxiv.org/abs/2302.02123
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
- David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (To achieve higher accuracy, this model re-traverses some of the layers, which achieves higher model accuracy from the same size model without more memory.)
- Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
- Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
- Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi, 5 Jul 2024, LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order, https://arxiv.org/abs/2407.04513
Layer Approximation
The idea of approximating a layer with something faster has received much less attention than simply removing layers entirely!
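As an illustration of the structural substitution, here is a minimal sketch in which a full Transformer block is replaced by a single residual linear map; the fitting of that map (e.g. by least squares or distillation against the original layer's outputs) is omitted, and all names and indices are hypothetical.

import torch.nn as nn

class LinearLayerApproximation(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)   # one cheap linear map per replaced block

    def forward(self, hidden):
        return hidden + self.proj(hidden)                 # residual form, mirroring the original block

# Hypothetical usage: replace a low-importance block in an nn.ModuleList of layers.
# model.layers[10] = LinearLayerApproximation(hidden_size=4096)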
Research papers on layer approximations:
- Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
- Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng, 10 Apr 2024, CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers, https://arxiv.org/abs/2404.06709 (Similar to layer skipping or layer fusion, but concurrently calculates some layers that seem to be less important, rather than running the layers sequentially.)
- Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
- Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
KV Caching and Layer Pruning
There are analogous techniques that can be applied to layers in the KV cache data. KV caching involves storing a per-layer set of data, and many optimizations have been researched. Read more about these KV cache research areas:
- KV cache layer pruning
- KV cache layer fusion
- KV caching in early exit of layers
- KV caching (overall)
Layer Importance
Research has found that the early layers of a model tend to make the bigger contextual decisions, whereas the later layers tend to choose between a few acceptable tokens. This explains why early exit of layers (dynamic layer pruning) can lead to acceptable output, but loses some of the finesse. For larger models, there are three main zones: initial layers, middle layers, and later layers.
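One common signal used to quantify per-layer importance is how little a layer changes its hidden state: layers whose output is nearly identical to their input are candidates for pruning or skipping. Here is a minimal sketch of such a cosine-similarity-based score, assuming layers that map hidden states to hidden states of the same shape; the names are illustrative.

import torch
import torch.nn.functional as F

def layer_importance_scores(layers, hidden):
    # Score each layer as 1 minus the cosine similarity between its input and output;
    # a low score means the layer barely changes the hidden state and may be prunable.
    scores = []
    for layer in layers:
        out = layer(hidden)
        cos = F.cosine_similarity(hidden.flatten(1), out.flatten(1), dim=-1).mean()
        scores.append(1.0 - cos.item())
        hidden = out
    return scores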
Research papers that examine layer importance include:
- Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
- Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods, or to layer fusion of weights.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- BS Akash, V Singh, A Krishna, LB Murthy, L Kumar, April 2024, Investigating BERT Layer Performance and SMOTE Through MLP-Driven Ablation on Gittercom, Lecture Notes on Data Engineering and Communications Technologies (LNDECT,volume 200), https://link.springer.com/chapter/10.1007/978-3-031-57853-3_25
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
- Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
- Xu Cheng, Lei Cheng, Zhaoran Peng, Yang Xu, Tian Han, Quanshi Zhang, July 2024, Layerwise Change of Knowledge in Neural Networks, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8038-8059, 2024, https://proceedings.mlr.press/v235/cheng24b.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/cheng24b/cheng24b.pdf
- Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 9 Jul 2024 (v3), Not All Layers of LLMs Are Necessary During Inference, https://arxiv.org/abs/2403.02181
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Benjamin L. Badger, 2 Sep 2024, Masked Mixers for Language Generation and Retrieval, https://arxiv.org/abs/2409.01482
- Amit Ben Artzy, Roy Schwartz, 5 Sep 2024, Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers, https://arxiv.org/abs/2409.03621
- Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li, 5 Sep 2024, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (This survey is about making attention mechanisms more performant, accurate and intelligent, rather than improving efficiency.)
- Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for vision models.)
- Bernard Ryhede Bengtsson, Joel Bengs, 2024, Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM, MSc Thesis, Lund University, Sweden, https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9174462&fileOId=9174463
- Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 22 Sep 2024, EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models, https://arxiv.org/abs/2409.14595
- Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer, 2 Oct 2024, Attention layers provably solve single-location regression, https://arxiv.org/abs/2410.01537
- Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei, 17 Oct 2024, Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs, https://arxiv.org/abs/2410.13835
- Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar, 29 Oct 2024, On the Role of Depth and Looping for In-Context Learning with Task Diversity, https://arxiv.org/abs/2410.21698
- Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, Xiao-Ming Wu, 23 Oct 2024, Understanding Layer Significance in LLM Alignment, https://arxiv.org/abs/2410.17875
- Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu, 8 Dec 2024, XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference, https://arxiv.org/abs/2412.05896
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Pruning Research
Read more about other types of pruning:
- AI model pruning overview
- Head pruning
- Token pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture