Aussie AI
Length Pruning
Last Updated 26 September, 2024
by David Spuler, Ph.D.
Length pruning is weight pruning along one of the three axes of pruning. The other two axes are width pruning (e.g., attention head pruning) and depth pruning (e.g., layer pruning and early exit). All three types of pruning are mostly orthogonal to each other and can be combined into triple pruning.
The main types of length pruning along the "lengthwise" dimension of the inputs are:
- Token pruning
- Embeddings pruning
- Non-autoregressive optimizations (not really pruning, but a closely related issue)
Other non-pruning AI model techniques that operate on the same "lengthwise" dimension include:
- Long context window optimizations
- Length generalization
- Input padding removal (see the sketch after this list)
- Attention linearization optimizations
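As a concrete illustration of input padding removal, the following is a minimal NumPy sketch, not taken from any particular library: it packs the non-padded tokens of a batch into one contiguous array, runs a per-token operation on only the real tokens, and scatters the results back into the padded layout. This is the general idea behind padding-free execution in engines such as ByteTransformer (see the research list below); the function names remove_padding and restore_padding are illustrative assumptions.

```python
import numpy as np

def remove_padding(x: np.ndarray, mask: np.ndarray):
    """Pack real tokens into a contiguous array.

    x:    (batch, seq_len, dim) activations, zero-padded per sequence.
    mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Returns the packed tokens and their flat positions for unpacking.
    """
    idx = np.nonzero(mask.reshape(-1))[0]        # flat positions of real tokens
    packed = x.reshape(-1, x.shape[-1])[idx]     # (num_real_tokens, dim)
    return packed, idx

def restore_padding(packed: np.ndarray, idx: np.ndarray, batch: int, seq_len: int):
    """Scatter packed results back into the zero-padded (batch, seq_len, dim) layout."""
    out = np.zeros((batch * seq_len, packed.shape[-1]), dtype=packed.dtype)
    out[idx] = packed
    return out.reshape(batch, seq_len, packed.shape[-1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, seq_len, dim = 2, 6, 4
    x = rng.standard_normal((batch, seq_len, dim))
    # Two sequences of real lengths 3 and 5, zero-padded to length 6.
    mask = np.array([[1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
    w = rng.standard_normal((dim, dim))

    packed, idx = remove_padding(x, mask)
    y_packed = np.maximum(packed @ w, 0.0)       # FFN-style op on 8 real tokens, not 12
    y = restore_padding(y_packed, idx, batch, seq_len)
    print(packed.shape, y.shape)                 # (8, 4) (2, 6, 4)
```

The per-token matrix multiplication runs over the 8 real tokens rather than all 12 padded positions; ByteTransformer (Zhai et al., 2023, below) applies this kind of padding-free processing throughout the whole model.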
Length Pruning Research
The term "length pruning" can mean a few different things in the literature. It can mean avoiding redundant computation on the zero-padding in the input vectors, as in Zhai et al. (2023), or cutting tokens out of the input stream (see token pruning). It can also mean reducing the size of the embeddings to shrink the memory footprint of the embedding matrix (see embeddings pruning), or "length prediction" for the decoder output. Finally, it can refer to managing the size of the inputs to reduce the autoregression bottleneck (see non-autoregressive algorithms).
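To make the token pruning interpretation concrete, here is a hypothetical sketch (not code from any of the cited papers) that drops the tokens receiving the least total attention and keeps a fixed fraction of the sequence. The function name prune_tokens_by_attention and the keep_ratio parameter are assumptions for this example only; real methods such as PoWER-BERT learn which tokens to eliminate during training and typically need re-training to recover accuracy.

```python
import numpy as np

def prune_tokens_by_attention(hidden_states: np.ndarray,
                              attention_probs: np.ndarray,
                              keep_ratio: float = 0.5):
    """Keep the most-attended tokens and drop the rest.

    hidden_states:   (seq_len, hidden_dim) token vectors after a layer.
    attention_probs: (seq_len, seq_len) softmaxed attention weights.
    Returns the pruned hidden states and the indices of the kept tokens.
    """
    seq_len = hidden_states.shape[0]
    # Importance of a token = total attention it receives from all queries.
    importance = attention_probs.sum(axis=0)              # (seq_len,)
    num_keep = max(1, int(seq_len * keep_ratio))
    # Indices of the top tokens, restored to original sequence order.
    keep_idx = np.sort(np.argsort(importance)[-num_keep:])
    return hidden_states[keep_idx], keep_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, hidden_dim = 8, 16
    h = rng.standard_normal((seq_len, hidden_dim))
    logits = rng.standard_normal((seq_len, seq_len))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    pruned, kept = prune_tokens_by_attention(h, attn, keep_ratio=0.5)
    print(pruned.shape, kept)    # (4, 16) and the 4 kept token positions
```

Subsequent layers then process only the kept tokens, which shortens the sequence length for every downstream attention and FFN computation.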
Research papers directly related to "length pruning" include:
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052, Code: https://github.com/bytedance/ByteTransformer (Avoiding zero-padding in input vectors throughout the whole model is the length-wise pruning, with various other optimizations.)
- Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Makes stochastic length decisions and attempts to optimize length during training.)
- Ofir Press, Noah A Smith, and Mike Lewis, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409 (2021), https://arxiv.org/abs/2108.12409
- Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., Adaptive attention span in transformers. In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
- Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, July 2022, https://arxiv.org/abs/2208.00483
- Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma, Power-BERT: Accelerating BERT inference via progressive word-vector elimination, In International Conference on Machine Learning, pages 3690–3699, PMLR, 2020, https://arxiv.org/abs/2001.08950, https://doi.org/10.48550/arXiv.2001.08950 (Identifies unimportant word vectors during training, removes them, addresses accuracy with re-training.)
- Bowei He, Xu He, Renrui Zhang, Yingxue Zhang, Ruiming Tang, Chen Ma, Dynamic Embedding Size Search with Minimum Regret for Streaming Recommender System, Aug 2023, https://arxiv.org/abs/2308.07760
- Xing Shi and Kevin Knight. Speeding up neural machine translation decoding by shrinking run-time vocabulary. In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
- Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNMT, 2018. https://www.aclweb.org/anthology/W18-2715
- Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805 (A form of dynamic token pruning that drops tokens from the context, thereby addressing the quadratic attention cost as it relates to autoregression, with the need for some re-training.)
- Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer. arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Mostly focused on simplifying attention heads and FFN, but also adjusts the internal dimension.)
- Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer (During training, decreases the sequence length of the hidden states through middle encoder layers, but expands it again with up-sampling at the end of the decoder.)
- Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. https://arxiv.org/abs/2005.14187
- Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, Rene Bidart, Ph.D. thesis, 2023, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
- H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
- Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
- Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
- Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey, 30 Jan 2023 (v3), Embedding Recycling for Language Models, https://arxiv.org/abs/2207.04993
- L. Denoyer and P. Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014. https://arxiv.org/abs/1410.0510
- Jiahui Yu and Thomas S Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pages 1803–1811. https://arxiv.org/abs/1903.05134 Code: https://github.com/JiahuiYu/slimmable_networks
- Hankook Lee and Jinwoo Shin. 2018. Anytime neural prediction via slicing networks vertically. arXiv preprint arXiv:1807.02609. https://arxiv.org/abs/1807.02609 (Training multiple thin networks.)
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054 Code: part of: https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019, https://arxiv.org/abs/1909.08053
- Zhai, Yujia, 2023, Ph.D. thesis, Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications, Computer Science, University of California, Riverside, https://escholarship.org/content/qt8s28g07q/qt8s28g07q.pdf (Includes examination of padding-free algorithms such as ByteTransformer.)
- Cao, Q.; Trivedi, H.; Balasubramanian, A.; and Balasubramanian, N. 2020. Deformer: Decomposing pre-trained transformers for faster question answering. arXiv preprint arXiv:2005.00697, https://arxiv.org/abs/2005.00697 Code: https://github.com/StonyBrookNLP/deformer
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860
- Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. Pyramid-BERT: Reducing complexity via successive core-set based token selection. arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
- Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886, 2020. https://arxiv.org/abs/2004.11886 Code: https://github.com/mit-han-lab/lite-transformer
- Ziru Liu, Kecheng Chen, Fengyi Song, Bo Chen, Xiangyu Zhao, Huifeng Guo, Ruiming Tang, Aug 2023, AutoAssign+: Automatic Shared Embedding Assignment in Streaming Recommendation, https://arxiv.org/abs/2308.06965
- J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689 PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
- H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
- David Spuler, March 2024, Chapter 49. Length Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, Tushar Sharma, 4 Jul 2024, ALPINE: An adaptive language-agnostic pruning method for language models for code, https://arxiv.org/abs/2407.04147
- Yufei Huang, Xu Han, Maosong Sun, 12 Aug 2024, FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection, https://arxiv.org/abs/2408.06333 https://github.com/thunlp/FastFiD (Sentence-level pruning after encoding on the input text dimension.)
- Wang, X, Sep 2024, KELTP: Keyword-Enhanced Learned Token Pruning for Knowledge-Grounded Dialogue. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_16 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_16 (Adaptive removal of low-attention tokens during inference.)
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Pruning Research
Read more about other types of pruning:
- Model pruning overview
- Layer pruning
- Head pruning
- Token pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture