
Static Layer Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Static layer pruning is the removal of entire layers of weights from a model file. It involves detecting layers that add minimal value during training (or post-training but pre-inference), but it seems to have a lower chance of success than other pruning methods and is relatively under-researched. It is related to the training design choice of how many layers to use in a model, which was once more art than science, but has received more recent research attention as Neural Architecture Search (NAS). Interestingly, some of the “early exit” and “layer skipping” inference techniques effectively change the number of model layers from a static constant to a dynamic choice, and the generalization of that idea to dynamic layer management may warrant further research.
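
To make the idea concrete, below is a minimal C++ sketch of static layer pruning on a toy model, where each layer is simply a flat vector of weights. The Layer struct, the layer_importance score (mean absolute weight magnitude), and the threshold test in prune_layers are all illustrative assumptions, not a prescribed method; real pruning criteria differ across the research papers listed below.

    // Toy sketch of static layer pruning: score each layer, then remove
    // low-scoring layers offline (post-training, pre-inference).
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Layer {
        std::vector<float> weights;  // all of the layer's weights, flattened
    };

    // One simple (assumed) importance heuristic: mean absolute weight magnitude.
    float layer_importance(const Layer& layer) {
        if (layer.weights.empty()) return 0.0f;
        float sum = 0.0f;
        for (float w : layer.weights) sum += std::fabs(w);
        return sum / static_cast<float>(layer.weights.size());
    }

    // Remove every layer scoring below the threshold; the pruned model
    // would then be written back out as a smaller model file.
    void prune_layers(std::vector<Layer>& model, float threshold) {
        model.erase(
            std::remove_if(model.begin(), model.end(),
                           [threshold](const Layer& layer) {
                               return layer_importance(layer) < threshold;
                           }),
            model.end());
    }

Because the pruning happens once, offline, the cost is paid before deployment, and every subsequent inference runs on the smaller model.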

Pruning beats re-training. If you discover that the last few layers of your model can be pruned completely, why wouldn't you just re-train the model with a smaller number of layers? Well, firstly, there's the high cost of training. Furthermore, even if these layers are redundant at the end of training, that doesn't necessarily mean they were unnecessary during training. It's quite possible that weights propagated through these layers in the early stages of training, only becoming unimportant in the later stages.
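
Continuing the toy sketch above, pruning the last few layers completely is just a truncation of the layer list (again, the std::vector<Layer> representation and the prune_last_layers helper are illustrative, not a standard API):

    // Drop the trailing num_to_drop layers outright, mirroring the case where
    // the final layers turn out to add little value after training.
    void prune_last_layers(std::vector<Layer>& model, size_t num_to_drop) {
        if (num_to_drop >= model.size()) return;  // refuse to empty the model
        model.resize(model.size() - num_to_drop);
    }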

Layer pruning research. Research papers on static layer pruning or layer pruning in general:

  1. Sabina Pokhrel, January 19, 2022, 4 Popular Model Compression Techniques Explained, https://xailient.com/blog/4-popular-model-compression-techniques-explained/
  2. Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin, 2021, DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 922–930, https://arxiv.org/abs/2002.06987
  3. H Sajjad, F Dalvi, N Durrani, P Nakov, 2020, Poor Man's BERT: Smaller and Faster Transformer Models, arXiv preprint arXiv:2004.03844, https://arxiv.org/abs/2004.03844v1
  4. E Youn, S Prabhu, S Chen, 2023, Compressing Vision Transformers for Low-Resource Visual Learning, arXiv preprint arXiv:2309.02617, PDF: https://arxiv.org/pdf/2309.02617.pdf
  5. Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin, 2020, Layer-adaptive sparsity for the magnitude-based pruning, in International Conference on Learning Representations, https://arxiv.org/abs/2010.07611
  6. Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization; also investigates the effect of the number of decoder layers.)

For more research on layer pruning, refer to https://www.aussieai.com/research/layer-pruning.

