Aussie AI

Research on Length Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The term “length pruning” is used for several different things in the research literature. It can mean avoiding redundant computation on the zero-padding in the input vectors, as in Zhai et al. (2023). It can mean cutting tokens out of the input stream, as in token pruning or prompt compression. It can also mean reducing the size of the embedding vectors to shrink the embedding matrix, a type of embeddings pruning. It may also refer to “length prediction” of the decoder output, or to managing the size of the inputs to reduce the autoregression bottleneck, as in non-autoregressive decoding algorithms.
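
To make the first sense concrete (skipping computation on zero-padding), here is a minimal C++ sketch comparing a naive loop over a padded batch with a length-pruned loop that stops at each sequence's true length. This is only an illustration of the general idea, not code from ByteTransformer or any paper below; the names (apply_ffn, seq_lens, kDim) are placeholders for whatever per-token computation and metadata a real engine would use.

    // Illustrative sketch only: skipping zero-padding when applying a per-token
    // operation to a batch of variable-length sequences. The naive loop touches
    // every padded slot; the length-pruned loop stops at each true length.
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kDim = 512;   // embedding dimension (placeholder value)

    // Stand-in for any per-token computation (e.g., one layer's FFN).
    void apply_ffn(float* token_embedding /* kDim floats */) {
        for (std::size_t i = 0; i < kDim; ++i)
            token_embedding[i] *= 1.0f;  // placeholder for the real arithmetic
    }

    // Naive version: iterates over every slot, including zero-padding.
    void ffn_padded(std::vector<float>& batch,   // batch_size * max_len * kDim floats
                    std::size_t batch_size, std::size_t max_len) {
        for (std::size_t b = 0; b < batch_size; ++b)
            for (std::size_t t = 0; t < max_len; ++t)
                apply_ffn(&batch[(b * max_len + t) * kDim]);   // wasted work on padding
    }

    // Length-pruned version: visits only the real tokens of each sequence.
    void ffn_length_pruned(std::vector<float>& batch,
                           const std::vector<std::size_t>& seq_lens,  // true lengths
                           std::size_t max_len) {
        for (std::size_t b = 0; b < seq_lens.size(); ++b)
            for (std::size_t t = 0; t < seq_lens[b]; ++t)    // stop at the true length
                apply_ffn(&batch[(b * max_len + t) * kDim]);
    }

The saving is proportional to the fraction of padding in the batch, which is why this kind of length pruning matters most when the sequence lengths within a batch vary widely.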

Research papers on length pruning:

  1. Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052, Code: https://github.com/bytedance/ByteTransformer (Avoids computation on zero-padding in the input vectors throughout the whole model, which is the length-wise pruning aspect, along with various other optimizations.)
  2. Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Makes stochastic length decisions and attempts to optimize length during training.)
  3. Ofir Press, Noah A Smith, and Mike Lewis, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409 (2021), https://arxiv.org/abs/2108.12409
  4. Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., 2019, Adaptive attention span in transformers, In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
  5. Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, July 2022, https://arxiv.org/abs/2208.00483
  6. Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma, 2020, Power-BERT: Accelerating BERT inference via progressive word-vector elimination, In International Conference on Machine Learning, pages 3690–3699, PMLR, 2020, https://arxiv.org/abs/2001.08950, https://doi.org/10.48550/arXiv.2001.08950 (Identifies unimportant word vectors during training, removes them, addresses accuracy with re-training.)
  7. Bowei He, Xu He, Renrui Zhang, Yingxue Zhang, Ruiming Tang, Chen Ma, 2023, Dynamic Embedding Size Search with Minimum Regret for Streaming Recommender System, Aug 2023, https://arxiv.org/abs/2308.07760
  8. Xing Shi and Kevin Knight. 2017, Speeding up neural machine translation decoding by shrinking run-time vocabulary, In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
  9. Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT, 2018. https://www.aclweb.org/anthology/W18-2715
  10. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, 2023, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805 (A form of dynamic token pruning that drops tokens from the context, thereby addressing the quadratic attention cost as it relates to autoregression, with the need for some re-training; a simplified sketch of this idea appears after this list.)
  11. Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer, arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Mostly focused on simplifying attention heads and FFN, but also adjusts the internal dimension.)
  12. Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing, In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer (During training, decreases the sequence length of the hidden states through middle encoder layers, but expands it again with up-sampling at the end of the decoder.)
  13. Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing, arXiv preprint arXiv:2005.14187. https://arxiv.org/abs/2005.14187
  14. Rene Bidart, 2023, Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, Ph.D. thesis, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
  15. H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585, https://arxiv.org/pdf/2208.03646
  16. Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined BERT compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
  17. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
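
Several of the papers above (notably items 6 and 10) treat length pruning as dropping unimportant tokens from the sequence as processing proceeds. The C++ sketch below shows the general shape of that idea: cached context entries with low importance scores are removed, so that later attention steps see a shorter sequence. This is a simplified, assumption-laden sketch, not the algorithm from any cited paper; how the importance scores are computed, the threshold, and the policy of protecting the most recent tokens are all placeholders, and real methods typically need some re-training to tolerate the dropped tokens.

    // Illustrative sketch only: dropping low-importance tokens from a decoder's
    // cached context so later attention steps operate over a shorter sequence.
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct CachedToken {
        std::vector<float> key;    // cached K vector for this position
        std::vector<float> value;  // cached V vector for this position
        float importance = 0.0f;   // e.g., accumulated attention received (assumed metric)
    };

    // Remove context tokens scoring below the threshold, always keeping the
    // most recent min_keep tokens (an assumed policy, not from any cited paper).
    void prune_context(std::vector<CachedToken>& context,
                       float threshold, std::size_t min_keep) {
        if (context.size() <= min_keep) return;
        const std::size_t keep_from = context.size() - min_keep;  // recent tokens protected
        std::vector<CachedToken> kept;
        kept.reserve(context.size());
        for (std::size_t i = 0; i < context.size(); ++i) {
            if (i >= keep_from || context[i].importance >= threshold)
                kept.push_back(std::move(context[i]));
        }
        context.swap(kept);
    }

Because attention cost grows with the square of the context length during autoregressive decoding, each token removed this way also cheapens every subsequent decoding step.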

For more research papers on length pruning, see https://www.aussieai.com/research/length-pruning.

 
