Aussie AI

Layer Pruning

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Layer pruning is a type of structured pruning because it prunes entire layers. More precisely, it is a type of "depth pruning" because it reduces the depth of the stacks of encoders and/or decoders in the Transformer architecture. This technique can sometimes be called "layer compaction".

Dynamic layer pruning is also called early exiting if all remaining layers are skipped, or "layer skipping" if only the current layer is skipped. Reducing layers in the decoder is called a "shallow decoder", which has been found to be effective, because encoder layers are more important in a Transformer than decoder layers. Layer pruning is also related to "layer fusion" (usually via parameter sharing) and layer reordering.

Layer pruning refers to removing one or more entire layers from the model, which is a subtype of "depth pruning". Most AI models have multiple hidden layers of nodes, and sometimes a layer can be removed without too great a loss in model accuracy. The layer can be removed statically, to create a new model file, or dynamically, via some adaptive criterion. Most of the literature focuses on dynamic layer pruning via early exit of the inference algorithm when it detects that a threshold of accuracy has been achieved.

Layer pruning can be combined with many other methods to create hybrid optimizations. For example, it is orthogonal to quantization, width pruning (e.g. attention head pruning), and length pruning (e.g. token pruning, embeddings pruning).

Static layer pruning

Static layer pruning is the removal of entire layers of weights from a model file. This would involve detecting layers that add minimal value during training (or post-training but pre-inference), but it seems to have less chance of success and is relatively under-researched. It is related to the training design choice of how many layers to use in a model, which was once more art than science but has received more research attention recently (see "neural architecture search"). Interestingly, some of the "early exit" and "layer skipping" inference techniques effectively change the number of model layers from a static constant to a dynamic choice, and the generalization of that to dynamic layer management may warrant further research.
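
As a concrete illustration, the sketch below shows static layer pruning in PyTorch: selected layers are simply dropped from the model's layer stack before saving a smaller model file. The TinyTransformer class, the choice of pruned layer indices, and the output file name are hypothetical placeholders, not any particular library's API.

    # Minimal sketch of static layer pruning in PyTorch (illustrative only).
    import torch
    import torch.nn as nn

    class TinyTransformer(nn.Module):
        def __init__(self, d_model=64, nhead=4, num_layers=6):
            super().__init__()
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                 for _ in range(num_layers)])

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    model = TinyTransformer(num_layers=6)

    # Keep all layers except those judged unimportant (indices 3 and 4 are
    # arbitrary examples; a real method would measure layer importance first).
    keep = [i for i in range(len(model.layers)) if i not in (3, 4)]
    model.layers = nn.ModuleList([model.layers[i] for i in keep])

    torch.save(model.state_dict(), "pruned_model.pt")   # smaller static model file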

General research on layer pruning:

  • Sabina Pokhrel, "4 Popular Model Compression Techniques Explained", January 19, 2022, https://xailient.com/blog/4-popular-model-compression-techniques-explained/
  • Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving. In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
  • H Sajjad, F Dalvi, N Durrani, P Nakov, 2020, Poor Man's BERT: Smaller and Faster Transformer Models, arXiv preprint arXiv:2004.03844 https://arxiv.org/abs/2004.03844v1
  • E Youn, S Prabhu, S Chen, 2023, Compressing Vision Transformers for Low-Resource Visual Learning, arXiv preprint arXiv:2309.02617, PDF: https://arxiv.org/pdf/2309.02617.pdf
  • Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin. Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/2010.07611
  • Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
  • Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, Weipeng Chen, 7 Mar 2024 (v2), ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, https://arxiv.org/abs/2403.03853
  • Pedram Rostami, Mohammad Javad Dousti, 10 Nov 2024, CULL-MT: Compression Using Language and Layer pruning for Machine Translation, https://arxiv.org/abs/2411.06506

Dynamic Layer Pruning (Inference Early Exit)

There are several papers on "early exit" of the inference algorithm, without processing all the layers, which is effectively dynamic pruning of the model's layers at runtime during inference. Other names for "early exit" in the literature include "early stopping" and "dropout". This overall technique can be categorized as "dynamic depth pruning" or "dynamic depth models". The method relies on the assumption that each successive layer makes the results more accurate, but with diminishing changes. After a few layers, the results may be "good enough" to decide on the outcome without finishing all the layers.

Dynamic pruning techniques have to use a decision method, usually called a "classifier", to choose whether or not to exit at a given layer. Various methods have been researched for this decision.

Always-exit test: It should be noted that dynamic layer pruning with a simplistic decision test, such as always exiting at layer N=5, is effectively the same as static layer pruning of layers N>=6, but without the benefit of reduced model storage space. However, implementing this always-exit test dynamically can still be useful when testing the efficacy of the model in terms of the meta-parameter N (i.e. when deciding how many layers to use). Accuracy of the model for different values of N can be tested dynamically without rebuilding the model file.
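
As an illustration of such a decision method, the sketch below combines a simple confidence-based exit classifier (exit when the maximum softmax probability passes a threshold) with the always-exit-at-layer-N test described above. It assumes hidden states shaped [batch, seq, d_model] and a shared unembedding matrix that can be applied to intermediate layers; the threshold value and function names are hypothetical.

    # Minimal sketch of a dynamic early-exit decision (illustrative assumptions only).
    import torch.nn.functional as F

    def early_exit_forward(layers, unembed, x, threshold=0.9, always_exit_at=None):
        """Run layers until confident enough, or until fixed layer always_exit_at."""
        for i, layer in enumerate(layers):
            x = layer(x)
            if always_exit_at is not None:
                if i + 1 == always_exit_at:          # always-exit test at fixed layer N
                    break
                continue
            logits = x[:, -1, :] @ unembed           # project last position to vocabulary
            confidence = F.softmax(logits, dim=-1).max(dim=-1).values
            if bool((confidence > threshold).all()): # exit once every sequence is confident
                break
        return x, i + 1                              # hidden state and layers actually run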

There are many research papers on dynamic layer pruning; see also the early exit research for even more.

Dynamic Layer Skipping

Layer skipping refers to bypassing the processing of one layer and moving on to the next, rather than "early exiting" to skip all remaining layers. Although much of the existing research is about early exit to skip all further layers (depth pruning), there is some research on choosing to skip a single layer, or on per-layer early exit algorithms. Some example policies for layer skipping could be:

  • Skip all remaining layers (early exiting)
  • Skip some early layers
  • Skip some middle layers
  • Skip selected layers (e.g. every second layer)
  • Skip random layers (stochastic layer skipping)

Some ways to generalize the method include:

  • Skip encoder vs decoder layers differently (see shallow decoder)
  • Skip prefill vs decoding phase layers differently (in decoder-only Transformers like GPT)

This is also a form of dynamic depth pruning, because it reduces the number of layers that the model executes, according to some criterion.
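
The sketch below shows a layer-skipping forward pass under two of the simple policies above (skip every second layer, or skip random layers); the policy names and the skip probability are illustrative assumptions rather than any established scheme.

    # Minimal sketch of dynamic layer skipping (illustrative only).
    import random

    def skipping_forward(layers, x, policy="every_second", p_skip=0.5):
        for i, layer in enumerate(layers):
            if policy == "every_second" and i % 2 == 1:
                continue                  # skip selected layers (every second one)
            if policy == "stochastic" and random.random() < p_skip:
                continue                  # stochastic (random) layer skipping
            x = layer(x)                  # residual layers keep shapes compatible
        return x

Research papers on layer skipping include: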

  • Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez, Skipnet: Learning dynamic routing in convolutional networks, In ECCV, 2018, https://arxiv.org/abs/1711.09485
  • Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016, https://arxiv.org/abs/1603.08983
  • Jianghao Shen, Yue Wang, Pengfei Xu, Yonggan Fu, Zhangyang Wang, Yingyan Lin, Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference, January 2020, DOI: https://doi.org/10.1609/aaai.v34i04.6025, https://arxiv.org/abs/2001.00705
  • YG Jiang, C Cheng, H Lin, Y Fu, 2020, Learning Layer-Skippable Inference Network, IEEE Transactions on Image Processing, Volume 29, pp. 8747-8759, 28 August 2020, https://ieeexplore.ieee.org/abstract/document/9180094
  • H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, A convolutional neural network cascade for face detection, 2015, in CVPR, https://paperswithcode.com/paper/a-convolutional-neural-network-cascade-for
  • F. Yang, W. Choi, and Y. Lin, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, 2016, in CVPR, https://ieeexplore.ieee.org/document/7780603
  • Andreas Veit and Serge Belongie, Convolutional networks with adaptive inference graphs, In ECCV, 2018, https://arxiv.org/abs/1711.11503
  • X. Dong, J. Huang, Y. Yang, and S. Yan, More is less: A more complicated network with less inference complexity, in CVPR, 2017. https://arxiv.org/abs/1703.08651
  • Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov, On the Effect of Dropping Layers of Pre-trained Transformer Models arXiv preprint arXiv:2004.03844, 2020 (revised Aug 2022), https://arxiv.org/abs/2004.03844 (Examined dropping alternative layers, layer fusion, and other layer pruning strategies.)
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic layer depth routing based on easy vs hard queries to optimize training.)
  • Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang, 22 Mar 2024, Hierarchical Skip Decoding for Efficient Autoregressive Text Generation, https://arxiv.org/abs/2403.14919 (A new decoding algorithm called Hierarchical Skip Decoding involving layer skipping.)
  • Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
  • Yijin Liu, Fandong Meng, Jie Zhou, 10 Apr 2024, Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy, https://arxiv.org/abs/2404.06954 Code: https://github.com/Adaxry/Unified_Layer_Skipping (Layer skipping with choosing globally which layers to skip in an orderly way for all tokens based on speedup required. All tokens skip the exact same layers, which avoids the problem with out-of-date KV caches.)
  • Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng, 10 Apr 2024, CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers, https://arxiv.org/abs/2404.06709 (Similar to layer skipping or layer fusion, but concurrently calculates some layers that seem to be less important, rather than running the layers sequentially.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzed layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, 7 Apr 2024, Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models, https://arxiv.org/abs/2404.04900 (Token-specific layer routing is similar to layer skipping and dynamic depth pruning.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-token layer skipping for a type of adaptive inference with conditional computation.)
  • Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
  • Haoyu Wang, Yaqing Wang, Tianci Liu, Tuo Zhao, and Jing Gao, 2023, HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference https://aclanthology.org/2023.findings-emnlp.283.pdf (Layer skipping during fine-tuning.)
  • Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. CVPR, pages 8817–8826, 2018. https://arxiv.org/abs/1711.08393 Code: https://github.com/Tushar-N/blockdrop
  • Ofir Press, Noah A. Smith, Omer Levy, Apr 2020, Improving Transformer Models by Reordering their Sublayers, https://arxiv.org/abs/1911.03864
  • Rafael Fão de Moura, Paulo C Santos, João Paulo C de Lima, Marco AZ Alves, Antonio CS Beck, and Luigi Carro. 2019. Skipping CNN convolutions through efficient memoization. In International Conference on Embedded Computer Systems. Springer, 65–76. https://link.springer.com/chapter/10.1007/978-3-030-27562-4_5
  • J Park, DY Kim, YH Moon, 2022, Lazy Net: Lazy Entry Neural Networks for Accelerated and Efficient Inference 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), https://ieeexplore.ieee.org/abstract/document/9953031
  • Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama, 2017, Adaptive Neural Networks for Efficient Inference, Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017. http://proceedings.mlr.press/v70/bolukbasi17a.html http://proceedings.mlr.press/v70/bolukbasi17a/bolukbasi17a.pdf
  • Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang, 3 Jun 2024, Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching, https://arxiv.org/abs/2406.01733 Code: https://github.com/horseee/learning-to-cache (Layer skipping in diffusion transformers via layer caching.)
  • Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
  • Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
  • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
  • J. Jin, A. Dundar, and E. Culurciello. 2014, Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, https://arxiv.org/abs/1412.5474
  • Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV
  • David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (To achieve higher accuracy, this model re-traverses some of the layers, which achieves higher model accuracy from the same size model without more memory.)
  • Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
  • Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang, 2 Jul 2024, SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules, https://arxiv.org/abs/2407.02031 (Efficient diffusion models in systems with multi-LoRA, ControlNets, and other multi-module add-ons, including parallelizing execution of add-ons and more efficient loading of LoRA with faster updating or "patching" of model weights, including by performing some layers in parallel without LoRA weights, while loading the LoRA adapters.)
  • Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen, 3 Jul 2024, DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs. https://arxiv.org/abs/2407.11030
  • H Wang, 2024, Minimalism Yields Maximum Results: Deep Learning with Limited Resource, Ph.D. Thesis, Purdue University, PDF: https://hammer.purdue.edu/articles/thesis/Minimalism_Yields_Maximum_Results_Deep_Learning_with_Limited_Resource/26349415/1/files/47855029.pdf
  • Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane, 16 Aug 2024, Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning, https://arxiv.org/abs/2408.08670 (Faster fine-tuning by selecting layers, freezing layers, or slimming them to fewer fine-tuned parameters.)
  • Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 9 Jul 2024 (v3), Not All Layers of LLMs Are Necessary During Inference, https://arxiv.org/abs/2403.02181
  • Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 16 Aug 2024 (v2), DyCE: Dynamically Configurable Exiting for Deep Learning Compression and Real-time Scaling, https://arxiv.org/abs/2403.01695
  • J. Li, Q. Li and P. Wang, 2024, From Static to Dynamic: A Deeper, Faster, and Adaptive Language Modeling Approach, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650050, https://ieeexplore.ieee.org/abstract/document/10650050 (Uses a preliminary "estimator module" to decide which layers to use.)
  • Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez, Ram Pasunuru, Scott Yih, Sravya Popuri, Xing Liu, Carole-Jean Wu, 30 Sep 2024, Characterizing and Efficiently Accelerating Multimodal Generation Model Inference, https://arxiv.org/abs/2410.00215 (Analyzes the bottlenecks in inference, finding the usual problems of autoregression, but also more interesting issues such as that linear kernels can be expensive, and KV cache reordering is a bottleneck in beam search, and layer skipping is analyzed.)
  • Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev, 1 Oct 2024, Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity, https://arxiv.org/abs/2410.01028 (Self-speculative decoding that removes layers based on cosine similarity.)
  • Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li, 9 Oct 2024, SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration, https://arxiv.org/abs/2410.06916 (Self-speculative decoding using layer skipping, rather than early exit.)
  • Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
  • Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
  • Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. 2021. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4814–4823. https://aclanthology.org/2021.findings-acl.425/ PDF: https://aclanthology.org/2021.findings-acl.425.pdf
  • Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15
  • Y Zhou, C Zhou, W Xie, X Wang, J Chen, Z Ni, J Li, 2024, The Benefits in Shallow: Merge Decoding Across Large Language Model Layers. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science(), vol 15360. Springer, Singapore. https://doi.org/10.1007/978-981-97-9434-8_30 https://link.springer.com/chapter/10.1007/978-981-97-9434-8_30
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680

Layer Skipping and KV Caching

All forms of dynamic layer pruning, such as layer skipping and early exit, have a complication when used with KV caching. These two optimizations seem like they should be orthogonal and combine well, but when a layer is skipped, or multiple layers are skipped by exiting early, the KV cache for those layers is not computed, and it will be out-of-date the next time those layers are not skipped.

Various methods to fix the KV cache have been examined by researchers. At a minimum, any layer that is skipped needs to track a flag that its KV cache is invalid, so that the cache can be re-computed when required. There are also more advanced solutions; for more details about this research, see: KV caching with early exit.
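
A minimal sketch of that bookkeeping is shown below: each layer's cache tracks how many token positions it is valid for, and a layer that was skipped for earlier tokens must have its missing entries recomputed (or otherwise repaired) before its cache can be trusted again. The cache class, the skip policy callback, and the per-layer interface are hypothetical assumptions, not a specific framework's API.

    # Minimal sketch of per-layer KV cache validity tracking under layer skipping.
    class LayerKVCache:
        """Per-layer cache of (K, V) entries plus a validity counter."""
        def __init__(self):
            self.kv = []             # one (K, V) entry per processed token position
            self.valid_upto = 0      # number of token positions that are up to date

        def is_stale(self, num_prev_tokens):
            return self.valid_upto < num_prev_tokens

    def process_token(layers, caches, hidden, num_prev_tokens, skip_layer):
        """skip_layer(i) is any policy deciding whether to skip layer i for this token."""
        for i, layer in enumerate(layers):
            cache = caches[i]
            if skip_layer(i):
                continue                         # this layer's cache now falls behind
            if cache.is_stale(num_prev_tokens):
                # Earlier tokens skipped this layer: the missing K/V entries must be
                # recomputed, or repaired by a propagation scheme, before reuse.
                pass                             # placeholder for a repair strategy
            hidden, k, v = layer(hidden, cache.kv)   # assumed per-layer interface
            cache.kv.append((k, v))
            cache.valid_upto = num_prev_tokens + 1
        return hidden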

Layer Reordering

An interesting technique that generalizes the use of layers is "layer reordering". The idea is motivated by the realization that Transformer layers are building blocks whose output has the same format as their input. Hence, not only can you remove a layer (early exit or layer pruning), skip a layer (layer skipping), or run the same layer twice (layer fusion), but the idea can be generalized in any way. You can pick and choose which layers to run, in what order, and how often. You could even run every layer twice, or run all the layers in reverse order.

Layer reordering usually refers to entire Transformer layers. For other types of merging or reordering of separate sub-layer structures within Transformer layers, see kernel operator fusion. For discussion of the order of layer normalization subcomponents, see normalization reordering.

Layer reordering seems like it shouldn't work. After all, didn't we expend all those GPU cycles in training to carefully work out the correct weights for each layer? Isn't it true that the first layers do the broad analysis and the upper layers do the finessing? So early exiting makes some kind of sense, because it just skips the finer details at the end, but randomly reordering things seems weird. Nevertheless, there are some research papers that explore layer reordering and its generalizations.
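
To make the idea concrete, here is a minimal sketch: because every Transformer layer maps a hidden-state tensor to a tensor of the same shape, any sequence of layer indices is a valid execution order, including repeats and reversal. The example orderings are arbitrary illustrations.

    # Minimal sketch of layer reordering (illustrative only).
    def reordered_forward(layers, x, order):
        for i in order:               # 'order' is any sequence of layer indices
            x = layers[i](x)
        return x

    # Example orders for a 6-layer stack (arbitrary illustrations):
    normal_order   = [0, 1, 2, 3, 4, 5]
    reversed_order = [5, 4, 3, 2, 1, 0]
    doubled_order  = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # run each layer twice

Research papers on layer reordering and its generalizations: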

  • Ofir Press, Noah A. Smith, Omer Levy, Improving Transformer Models by Reordering their Sublayers, arXiv preprint arXiv:1911.03864, 2019, https://arxiv.org/abs/1911.03864 (Layer reordering! Includes analysis of multiple layers, and also reordering self-attention and feed-forward sub-components in a "sandwich" architecture.)
  • Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu, Mar 2021, IOT: Instance-wise Layer Reordering for Transformer Structures, https://arxiv.org/abs/2103.03457
  • Elicia Ye, March 2023, Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation, https://arxiv.org/abs/2302.02123
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations, September 2019. https://openreview.net/forum?id=H1eA7AEtvS
  • David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (To achieve higher accuracy, this model re-traverses some of the layers, which achieves higher model accuracy from the same size model without more memory.)
  • Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
  • Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677
  • Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi, 5 Jul 2024, LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order, https://arxiv.org/abs/2407.04513

Layer Approximation

The idea of approximating a layer with something faster has received much less attention than simply removing layers outright!
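
As one illustration, an unimportant Transformer layer could be replaced by a cheap linear map fitted (e.g. by least squares) to reproduce that layer's input-to-output transformation on sample hidden states. The sketch below assumes sample matrices X_in and X_out shaped [num_samples, d_model], collected before and after the layer being replaced; it is a sketch of the general idea, not any particular paper's method.

    # Minimal sketch of approximating a layer with a fitted linear map (illustrative).
    import torch
    import torch.nn as nn

    def fit_linear_approximation(X_in, X_out):
        # Least-squares fit: find W such that X_in @ W is close to X_out.
        W = torch.linalg.lstsq(X_in, X_out).solution
        approx = nn.Linear(X_in.shape[1], X_out.shape[1], bias=False)
        with torch.no_grad():
            approx.weight.copy_(W.T)          # nn.Linear computes x @ weight.T
        return approx

    # Usage sketch: model.layers[k] = fit_linear_approximation(X_in, X_out)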

Research papers on layer approximations:

  • Xiaodong Chen, Yuxuan Hu, Jing Zhang, 28 Mar 2024, Compressing Large Language Models by Streamlining the Unimportant Layer, https://arxiv.org/abs/2403.19135 (Finds the less important layers and either prunes them or replaces them with a faster approximate layer.)
  • Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng, 10 Apr 2024, CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers, https://arxiv.org/abs/2404.06709 (Similar to layer skipping or layer fusion, but concurrently calculates some layers that seem to be less important, rather than running the layers sequentially.)
  • Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
  • Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan, 9 Jul 2024, Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules, https://arxiv.org/abs/2407.06677

KV Caching and Layer Pruning

There are analogous techniques that can be applied to the per-layer data in the KV cache. KV caching involves storing a set of data for each layer, and many optimizations of this data have been researched; read more in the KV cache research areas, such as KV caching with early exit.

Layer Importance

Research has found that the early layers of a model tend to make the bigger contextual decisions, whereas the later layers tend to choose between a few acceptable tokens. This explains why early exit of layers (dynamic layer pruning) can lead to acceptable output, but loses some of the finesse. For larger models, there are three main zones: initial layers, middle layers, and later layers.
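
One simple way to quantify layer importance, used in various forms by several of the papers below, is to score each layer by how much it changes the hidden state, e.g. one minus the cosine similarity between a layer's input and output; layers that barely change the hidden state are candidates for pruning or skipping. The sketch below assumes sample hidden states shaped [batch, seq, d_model] and is only an illustration of this family of metrics.

    # Minimal sketch of a cosine-similarity-based layer importance score.
    import torch.nn.functional as F

    def layer_importance_scores(layers, x):
        # x: sample hidden states, shaped [batch, seq, d_model]
        scores = []
        for layer in layers:
            y = layer(x)
            cos = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
            scores.append(1.0 - cos.item())   # near zero => layer changes little
            x = y
        return scores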

Research papers that examine layer importance include:

  • Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
  • Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods or to layer fusion of weights.)
  • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
  • BS Akash, V Singh, A Krishna, LB Murthy, L Kumar, April 2024, Investigating BERT Layer Performance and SMOTE Through MLP-Driven Ablation on Gittercom, Lecture Notes on Data Engineering and Communications Technologies (LNDECT,volume 200), https://link.springer.com/chapter/10.1007/978-3-031-57853-3_25
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
  • Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on "features", the middle layers focus on "ensemble predictions" and the latter layers "sharpen" or finalize, with a lot of suppression happening near the end.)
  • Xu Cheng, Lei Cheng, Zhaoran Peng, Yang Xu, Tian Han, Quanshi Zhang, July 2024, Layerwise Change of Knowledge in Neural Networks, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8038-8059, 2024, https://proceedings.mlr.press/v235/cheng24b.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/cheng24b/cheng24b.pdf
  • Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 9 Jul 2024 (v3), Not All Layers of LLMs Are Necessary During Inference, https://arxiv.org/abs/2403.02181
  • Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
  • Benjamin L. Badger, 2 Sep 2024, Masked Mixers for Language Generation and Retrieval, https://arxiv.org/abs/2409.01482
  • Amit Ben Artzy, Roy Schwartz, 5 Sep 2024, Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers, https://arxiv.org/abs/2409.03621
  • Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li, 5 Sep 2024, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (This survey is about making attention mechanisms more performant, accurate and intelligent, rather than improving efficiency.)
  • Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for vision models.)
  • Bernard Ryhede Bengtsson, Joel Bengs, 2024, Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM, MSc Thesis, Lund University, Sweden, https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9174462&fileOId=9174463
  • Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 22 Sep 2024, EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models, https://arxiv.org/abs/2409.14595
  • Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer, 2 Oct 2024, Attention layers provably solve single-location regression, https://arxiv.org/abs/2410.01537
  • Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei, 17 Oct 2024, Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs, https://arxiv.org/abs/2410.13835
  • Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar, 29 Oct 2024, On the Role of Depth and Looping for In-Context Learning with Task Diversity, https://arxiv.org/abs/2410.21698
  • Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, Xiao-Ming Wu, 23 Oct 2024, Understanding Layer Significance in LLM Alignment, https://arxiv.org/abs/2410.17875
  • Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu, 8 Dec 2024, XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference, https://arxiv.org/abs/2412.05896

More Research on Pruning Types

More AI Pruning Research

Read more about other types of pruning, such as width pruning (e.g. attention head pruning) and length pruning (e.g. token pruning, embeddings pruning).