Aussie AI

49. Length Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“If I had more time,
I would have written a shorter letter.”

— Marcus Tullius Cicero.

 

 

What is Length Pruning?

Length pruning refers to applying pruning to the model dimension corresponding to the user's input sequence and the propagation of that sequence through the model. There is much confusing terminology in the research on length pruning and token pruning, but I have attempted to categorize the main types of pruning along the “lengthwise” model dimension as follows:

  • Token pruning
  • Embeddings pruning
  • Prompt compression

Other non-pruning AI model optimization techniques that operate on the same “lengthwise” dimension include:

  • Long context window optimizations
  • Length generalization
  • Input padding removal
  • Batching of multiple prompt queries (without padding)
  • Attention linearization optimizations
  • Non-autoregressive optimizations

Length pruning is structured weight pruning on one of the three axes of pruning. The other two axes are width pruning (e.g. attention head pruning) and depth pruning (e.g. layer pruning and early exit). All three types of pruning are mostly orthogonal to each other and can be combined into triple pruning.
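As a rough illustration of how independent these three axes are, here is a minimal, hypothetical configuration sketch (all names are invented for this illustration, not from any particular framework) in which width, depth, and length pruning are separate settings that could be enabled together for triple pruning:

    // Hypothetical sketch: the three pruning axes as independent settings.
    struct PruningConfig {
        // Depth pruning: e.g. skip layers or allow early exit.
        int   num_layers_to_skip = 0;
        bool  enable_early_exit = false;

        // Width pruning: e.g. remove some attention heads.
        int   num_heads_to_prune = 0;

        // Length pruning: e.g. prune unimportant tokens from the input sequence.
        bool  enable_token_pruning = false;
        float token_importance_threshold = 0.0f;  // prune tokens scoring below this
    };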

Research on Length Pruning

The term “length pruning” can apparently mean a few different things in the research literature. It can mean avoiding redundant computations from the padding in the input vector, such as in Zhai et al. (2023). It can mean cutting tokens out of the input stream, as in token pruning or prompt compression. It can also mean changing the size of the embeddings to reduce the memory size of the embedding matrix, a type of embeddings pruning. It may also mean “length prediction” of the decoder output. And it can refer to managing the size of the inputs to reduce the auto-regression bottleneck, as in non-autoregressive decoding algorithms.

Research papers on length pruning:

  1. Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052, Code: https://github.com/bytedance/ByteTransformer (Avoiding zero-padding in input vectors throughout the whole model is the length-wise pruning, with various other optimizations.)
  2. Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Makes stochastic length decisions and attempts to optimize length during training.)
  3. Ofir Press, Noah A Smith, and Mike Lewis, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409 (2021), https://arxiv.org/abs/2108.12409
  4. Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., 2019, Adaptive attention span in transformers, In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
  5. Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, July 2022, https://arxiv.org/abs/2208.00483
  6. Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma, 2020, Power-BERT: Accelerating BERT inference via progressive word-vector elimination, In International Conference on Machine Learning, pages 3690–3699, PMLR, 2020, https://arxiv.org/abs/2001.08950, https://doi.org/10.48550/arXiv.2001.08950 (Identifies unimportant word vectors during training, removes them, addresses accuracy with re-training.)
  7. Bowei He, Xu He, Renrui Zhang, Yingxue Zhang, Ruiming Tang, Chen Ma, 2023, Dynamic Embedding Size Search with Minimum Regret for Streaming Recommender System, Aug 2023, https://arxiv.org/abs/2308.07760
  8. Xing Shi and Kevin Knight. 2017, Speeding up neural machine translation decoding by shrinking run-time vocabulary, In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
  9. Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT, 2018. https://www.aclweb.org/anthology/W18-2715
  10. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, 2023, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805 (A form of dynamic token pruning that drops tokens from the context, thereby addressing the quadratic attention cost as it relates to autoregression, with the need for some re-training.)
  11. Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer, arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Mostly focused on simplifying attention heads and FFN, but also adjusts the internal dimension.)
  12. Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing, In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/2cd2915e69546904e4e5d4a2ac9e1652-Abstract.html, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer (During training, decreases the sequence length of the hidden states through middle encoder layers, but expands it again with up-sampling at the end of the decoder.)
  13. Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing, arXiv preprint arXiv:2005.14187. https://arxiv.org/abs/2005.14187
  14. Rene Bidart, 2023, Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, Ph.D. thesis, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
  15. H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
  16. Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
  17. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)

For more research papers on length pruning, see https://www.aussieai.com/research/length-pruning.

Token Pruning

Token pruning is a type of model “length pruning” that aims to address the cost of processing input sequences and the related embeddings. It is closely related to “embeddings pruning”, but is orthogonal to depth pruning (e.g. layer pruning) or width pruning (e.g. attention head pruning).

Input token pruning refers to removing tokens that are assessed as having low importance, based on some evaluation metric. It is also called “prompt compression” because the user's input prompt is shortened by removing less important tokens. From the model's perspective, this shortens the input sequence in proportion to the number of tokens pruned, reducing computation throughout the model. This technique suffers the trade-off of accuracy loss in the model, as it is difficult to predict which tokens can safely be pruned.

The probabilities and weights associated with a pruned token are lost, and no longer affect the output for other tokens. Token pruning and prompt compression can nevertheless be effective in use cases such as summarization or concept classification, since some of the common, small words in the input sequence are obviously less important.
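As a concrete illustration, the sketch below shows a very crude form of static token pruning (prompt compression) that simply drops common stopword tokens before the prompt reaches the model. This is a simplified example of my own, not a published algorithm; real token pruning methods score tokens with a learned importance metric rather than a fixed stopword list.

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Crude static token pruning: drop common "stopword" tokens from the prompt.
    std::vector<std::string> prune_tokens(const std::vector<std::string>& tokens)
    {
        static const std::unordered_set<std::string> stopwords = {
            "a", "an", "the", "of", "to", "and", "or", "is", "are"
        };
        std::vector<std::string> kept;
        kept.reserve(tokens.size());
        for (const auto& tok : tokens) {
            if (stopwords.count(tok) == 0) {
                kept.push_back(tok);   // keep tokens deemed important
            }
            // else: token is pruned and never enters the model's computation
        }
        return kept;
    }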

Research papers on token pruning:

  1. Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, 2021, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, Oct 2021, https://arxiv.org/abs/2111.00230, https://doi.org/10.48550/arXiv.2111.00230
  2. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Joseph Hassoun, and Kurt Keutzer, 2021, Learned token pruning for transformers, arXiv preprint arXiv:2107.00910, 2021, https://arxiv.org/abs/2107.00910
  3. Hanrui Wang, Zhekai Zhang, and Song Han, 2021, SpAtten: Efficient sparse attention architecture with cascade token and head pruning, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021, https://arxiv.org/abs/2012.09852
  4. Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma, 2020, Power-bert: Accelerating bert inference via progressive word-vector elimination, In International Conference on Machine Learning, pages 3690–3699, PMLR, 2020, https://arxiv.org/abs/2001.08950, https://doi.org/10.48550/arXiv.2001.08950
  5. Deming Ye, Yankai Lin, Yufei Huang, and Maosong Sun, 2021, TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference, arXiv preprint arXiv:2105.11618 (2021), https://arxiv.org/abs/2105.11618
  6. Ofir Press, Noah A Smith, and Mike Lewis, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409 (2021), https://arxiv.org/abs/2108.12409
  7. Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant, 2021, A Study on Token Pruning for ColBERT, Dec 2021, https://arxiv.org/abs/2112.06540
  8. Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V Le. 2020, Funnel-transformer: Filtering out sequential redundancy for efficient language processing, Proceedings of NeurIPS, June 2020, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer
  9. S Ren, Q Jia, KQ Zhu, 2023, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression

For more research papers on token pruning, see https://www.aussieai.com/research/token-pruning.

Dynamic Token Pruning

Dynamic token pruning is where the choice of which tokens to discard is made at inference time, rather than fixed in advance. This is a form of dynamic length pruning of the model.

Another related lengthwise technique is to skip the attention computations for some tokens, which has been researched as a way to speed up Transformer attention (i.e. to reduce the quadratic dependence on input length). This is effectively attention-specific token pruning, where the other weights for the token may still be used. See Chapter 20 for more on long context research.
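To make the idea concrete, here is a small sketch of one way dynamic token pruning could be implemented: given per-token importance scores computed during inference (e.g. summed attention probabilities from the previous layer), keep only the k highest-scoring token positions. The scoring scheme and interface are assumptions for illustration, not a specific published method.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Select the k most important token positions, preserving sequence order.
    std::vector<size_t> select_tokens_to_keep(const std::vector<float>& scores,
                                              size_t k)
    {
        std::vector<size_t> idx(scores.size());
        std::iota(idx.begin(), idx.end(), 0);        // 0, 1, 2, ..., n-1
        if (k >= idx.size()) return idx;             // nothing to prune

        // Partial sort so the k highest-scoring positions come first.
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
            [&](size_t a, size_t b) { return scores[a] > scores[b]; });
        idx.resize(k);

        std::sort(idx.begin(), idx.end());           // restore original order
        return idx;
    }

The surviving indices would then be used to gather the corresponding rows of the layer's activations, so that later layers process a shorter sequence.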

Research papers on dynamic token pruning:

  1. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, 14 August 2022, Learned Token Pruning for Transformers, https://dl.acm.org/doi/abs/10.1145/3534678.3539260, PDF: https://dl.acm.org/doi/pdf/10.1145/3534678.3539260
  2. Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin & Yanzhi Wang, 2022, SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning, Nov 2022, LNCS volume 13671, https://link.springer.com/chapter/10.1007/978-3-031-20083-0_37, Code: https://github.com/PeiyanFlying/SPViT
  3. Peiyan Dong; Mengshu Sun; Alec Lu; Yanyue Xie; Kenneth Liu; Zhenglun Kong; Xin Meng; Zhengang Li; 2023, Heatvit: Hardware-efficient adaptive token pruning for vision transformers, 2023, IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2023, DOI: 10.1109/HPCA56546.2023.10071047, https://ieeexplore.ieee.org/abstract/document/10071047
  4. Ling Li, David Thorsley, Joseph Hassoun Oct 2022, SaiT: Sparse Vision Transformers through Adaptive Token Pruning, https://arxiv.org/abs/2210.05832
  5. Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh, 2021, Dynamicvit: Efficient vision transformers with dynamic token sparsification, Advances in Neural Information Processing Systems 34 (NeurIPS 2021), https://proceedings.neurips.cc/paper_files/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html, PDF: https://proceedings.neurips.cc/paper_files/paper/2021/file/747d3443e319a22747fbb873e8b2f9f2-Paper.pdf
  6. J Li, LL Zhang, J Xu, Y Wang, S Yan, Y Xia, 2023, Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference, https://arxiv.org/abs/2306.14393
  7. Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang & Xiaohui Xie, 2022, PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation, ECCV 2022: Computer Vision, pp 424–442, LNCS volume 13665, https://link.springer.com/chapter/10.1007/978-3-031-20065-6_25
  8. Xiangcheng Liu, Tianyi Wu, Guodong Guo, 2022, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Sep 2022, https://arxiv.org/abs/2209.13802
  9. Luca Soldaini and Alessandro Moschitti. 2020. The Cascade Transformer: An application for efficient answer sentence selection, In Proceedings of ACL, pages 5697–5708, https://arxiv.org/abs/2005.02534
  10. Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Technique is the “Length adaptive transformer” or LAT)
  11. Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  12. Yuang Liu, Qiang Zhou, Jing Wang, Zhibin Wang, Fan Wang, Jun Wang, Wei Zhang, 2023, Dynamic Token-Pass Transformers for Semantic Segmentation, August 2023, DOI: 10.48550/arXiv.2308.01944, https://ui.adsabs.harvard.edu/abs/2023arXiv230801944L/abstract, PDF: https://arxiv.org/pdf/2308.01944.pdf
  13. Mohsen Fayyaz, Soroush Abbasi Kouhpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, and Juergen Gall. 2023, ATS: Adaptive token sampling for efficient vision transformers, In ECCV, July 2022, https://arxiv.org/abs/2111.15667v1
  14. Hongjie Wang, Bhishma Dedhia, Niraj K. Jha, 2023, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, May 2023, https://arxiv.org/abs/2305.17328
  15. Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2023, Revisiting Token Pruning for Object Detection and Instance Segmentation, June 2023, https://arxiv.org/abs/2306.07050
  16. Xiangcheng Liu, Tianyi Wu, Guodong Guo, July 2023, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, https://arxiv.org/abs/2209.13802
  17. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models, arXiv preprint arXiv:2105.14636 (2021), https://arxiv.org/abs/2105.14636v1
  18. Victor Sanh, Thomas Wolf, and Alexander M Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning, arXiv preprint arXiv:2005.07683 (2020), https://arxiv.org/abs/2005.07683
  19. Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall, 2022, Adaptive Token Sampling For Efficient Vision Transformers, July 2022, https://arxiv.org/abs/2111.15667
  20. Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. Oct 2020. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, arXiv preprint arXiv:2010.01791 (2020), https://arxiv.org/abs/2010.01791
  21. François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Sep 2021. Block pruning for faster transformers, arXiv preprint arXiv:2109.04838 (2021), https://arxiv.org/abs/2109.04838
  22. Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  23. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022, Not all patches are what you need: Expediting vision transformers via token reorganizations, arXiv preprint arXiv:2202.07800, Apr 2022, https://arxiv.org/abs/2202.07800
  24. Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. 2022, Evo-ViT: Slow-fast token evolution for dynamic vision transformer, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022, https://arxiv.org/abs/2108.01390, Code: https://github.com/YifanXu74/Evo-ViT
  25. Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. 2021, IA-RED2: Interpretability-aware redundancy reduction for vision transformers, In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.12620
  26. Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. 2022, AdaViT: Adaptive tokens for efficient vision transformer, arXiv preprint arXiv:2112.07658, 2021 (revised Oct 2022), https://arxiv.org/abs/2112.07658, Code: https://a-vit.github.io/
  27. Hao Yu and Jianxin Wu. 2021, A unified pruning framework for vision transformers, arXiv preprint arXiv:2111.15127, Nov 2021, https://arxiv.org/abs/2111.15127
  28. Youwei Liang, Chongjian GE, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022, EViT: Expediting vision transformers via token reorganizations, In International Conference on Learning Representations (ICLR), Jan 2022, https://openreview.net/forum?id=BjyvwnXXVn_, PDF: https://openreview.net/pdf?id=BjyvwnXXVn_, Code: https://github.com/youweiliang/evit
  29. Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021, Dynamicvit: Efficient vision transformers with dynamic token sparsification, In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.02034, Code: https://github.com/raoyongming/DynamicViT
  30. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan. 2021, Tokens-to-token vit: Training vision transformers from scratch on imagenet, In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 558–567, October 2021, https://arxiv.org/abs/2101.11986
  31. Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, and Jan Kautz. 2021, Nvit: Vision transformer compression and parameter redistribution, arXiv preprint arXiv:2110.04869, 2021, PDF: https://arxiv.org/pdf/2110.04869v1.pdf
  32. Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. 2022, Patch slimming for efficient vision transformers, arXiv preprint arXiv:2106.02852, 2021 (revised Apr 2022). https://arxiv.org/abs/2106.02852
  33. Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. 2022, Unified visual transformer compression, arXiv preprint arXiv:2203.08243, Mar 2022, https://arxiv.org/abs/2203.08243
  34. Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. 2021, Chasing sparsity in vision transformers: An end-to-end exploration, Advances in Neural Information Processing Systems, 34, 2021, https://arxiv.org/abs/2106.04533
  35. Mingjian Zhu, Yehui Tang, and Kai Han. 2021, Vision transformer pruning, arXiv preprint arXiv:2104.08500, Aug 2021, https://arxiv.org/abs/2104.08500
  36. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440, 2016 (revised June 2017), https://arxiv.org/abs/1611.06440
  37. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016, Pruning filters for efficient convnets, arXiv preprint arXiv:1608.08710, 2016, https://arxiv.org/abs/1608.08710
  38. Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017, Learning efficient convolutional networks through network slimming, In ICCV 2017, Aug 2017, https://arxiv.org/abs/1708.06519
  39. Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, ThiNet: A filter level pruning method for deep neural network compression, In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017, https://arxiv.org/abs/1707.06342
  40. Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. 2021, All tokens matter: Token labeling for training better vision transformers, arXiv preprint arXiv:2104.10858, June 2021, https://arxiv.org/abs/2104.10858, Code: https://github.com/zihangJiang/TokenLabeling
  41. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models, arXiv preprint arXiv:2105.14636 (2021), PDF: https://arxiv.org/pdf/2105.14636v1.pdf, Code: https://github.com/yaozhewei/mlpruning.git
  42. Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, 2022, Transkimmer: Transformer Learns to Layer-wise Skim, May 2022, In ACL 2022, https://arxiv.org/abs/2205.07324 (This paper does per-layer dynamic token pruning.)
  43. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer, May 11, 2023, Learned Token Pruning for Efficient Transformer Inference, Masters Thesis, Technical Report No. UCB/EECS-2023-119, Electrical Engineering and Computer Sciences, University of California, Berkeley, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.html PDF: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.pdf (Learns threshold-based token pruning parameters; novel approach to token pruning during attention. Also contains good literature survey on token pruning.)
  44. Ofir Press, Noah A Smith, and Mike Lewis. 2022, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409, 2021 (revised Apr 2022), https://arxiv.org/abs/2108.12409 (Attention with Linear Biases (ALiBi) paper)
  45. Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, Token and Head Adaptive Transformers for Efficient Natural Language Processing, https://aclanthology.org/2022.coling-1.404/ (Combination of token pruning and attention head pruning, i.e. length/width pruning combined)
  46. Zejiang Hou, Sun-Yuan Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
  47. K Luo, H Li, X Zhou, B Huang, 2022, An Attention-Based Token Pruning Method for Vision Transformers, International Joint Conference, IJCRS 2022, Suzhou, China, November 11–14, 2022, Proceedings, Nov 2022, Pages 274–288, https://doi.org/10.1007/978-3-031-21244-4_21
  48. Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2092-2101, http://openaccess.thecvf.com/content/CVPR2023/html/Wei_Joint_Token_Pruning_and_Squeezing_Towards_More_Aggressive_Compression_of_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  49. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019, Transformer-xl: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860 (Related to length pruning and context length, although not fully token pruning.)
  50. Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. 2022, Pyramid-BERT: Reducing complexity via successive core-set based token selection, arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
  51. Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2022, Token merging: Your vit but faster, arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
  52. Y Guan, Z Li, Z Lin, Y Zhu, J Leng, M Guo, 2022, Block-skim: Efficient question answering for transformer, Proceedings of the AAAI, 2022, https://doi.org/10.1609/aaai.v36i10.21316, https://ojs.aaai.org/index.php/AAAI/article/view/21316, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/21316/21065
  53. Q Tang, B Zhang, J Liu, F Liu, Y Liu, 2023, Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation, arXiv preprint arXiv:2308.01045, 2023, https://arxiv.org/abs/2308.01045
  54. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, 2023, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805
  55. Hansen, C.; Hansen, C.; Alstrup, S.; Simonsen, J. G.; and Lioma, C. 2018. Neural Speed Reading with Structural-Jump-LSTM, In International Conference on Learning Representations, https://arxiv.org/abs/1904.00761
  56. Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural Speed Reading via Skim-RNN, In International Conference on Learning Representations, https://arxiv.org/abs/1711.02085
  57. Adams Wei Yu, Hongrae Lee, Quoc V. Le. 2017. Learning to Skim Text, In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), https://arxiv.org/abs/1704.06877
  58. Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438, PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
  59. L. Denoyer and P. Gallinari. 2014, Deep sequential neural network, arXiv preprint arXiv:1410.0510, 2014. https://arxiv.org/abs/1410.0510 (Input adaptive method, somewhat related to token pruning.)
  60. Hochreiter, S.; and Schmidhuber, J., 1997. Long short-term memory, Neural computation, 9(8): 1735–1780. https://ieeexplore.ieee.org/abstract/document/6795963 (Early paper, somewhat related to token skimming.)
  61. Campos, V.; Jou, B.; Giro-i-Nieto, X.; Torres, J.; and Chang, S., 2017. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, CoRR, abs/1708.06834. https://arxiv.org/abs/1708.06834, Code: https://imatge-upc.github.io/skiprnn-2017-telecombcn/
  62. Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. 2022, DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration, In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 14–26, 2022. https://dl.acm.org/doi/pdf/10.1145/3503222.3507738 (Involves some reordering of tokens.)
  63. Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNMT, 2018. https://www.aclweb.org/anthology/W18-2715
  64. Xing Shi and Kevin Knight. 2017, Speeding up neural machine translation decoding by shrinking run-time vocabulary, In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
  65. Gurvan L’Hostis, David Grangier, and Michael Auli. 2016. Vocabulary Selection Strategies for Neural Machine Translation, Arxiv preprint arXiv:1610.00072, https://arxiv.org/abs/1610.00072
  66. Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar, 2022, AdapLeR: Speeding up Inference by Adaptive Length Reduction, arXiv preprint arXiv:2203.08991, https://arxiv.org/abs/2203.08991 Code: https://github.com/amodaresi/AdapLeR
  67. Hansen, C., Hansen, C., Alstrup, S., Simonsen, J. G., and Lioma, C. (2019). Neural speed reading with structural-jump-LSTM, In International Conference on Learning Representations. https://arxiv.org/abs/1904.00761, https://openreview.net/forum?id=B1xf9j
  68. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
  69. Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Does some token pruning but is primarily focused on memory optimization, including with token-based data sharding for allocation to different memory banks.)
  70. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey, Sep 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, https://arxiv.org/abs/2309.08600 (Analysis has some relevance to tokenization and token pruning.)
  71. H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Dynamic token pruning for prompt compression.)
  72. X Xu, C Li, Y Chen, X Chang, J Liu, S Wang, Oct 2023, No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, arXiv preprint arXiv:2310.05654, https://arxiv.org/pdf/2310.05654.pdf (Suggests “token idling” that allows reuse of pruned tokens in later layers)
  73. Yucheng Li. April 2023. Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering, ArXiv preprint abs/2304.12102. https://arxiv.org/abs/2304.12102 (Token pruning for prompt compression.)
  74. Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length, NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
  75. Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens, arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
  76. Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)

For more research papers on dynamic token pruning, see https://www.aussieai.com/research/token-pruning#dynamic.

Embeddings Matrix Pruning

Pruning of embeddings doesn't receive much research attention, because the embedding lookup isn't a bottleneck in most Transformers. Most of the research on pruning embeddings has focused on compacting the embedding matrices for use on smaller devices, rather than on speeding up inference. The conversion of tokens into embedding vectors uses a single embedding matrix, which can be large if the model's vocabulary is large. Various pruning approaches exist using matrix compression techniques such as sparsity or hashing.
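As one example of the compression idea, the sketch below uses the “hashing trick”: instead of one row per vocabulary entry, many token ids share rows of a much smaller table chosen by a hash of the id. This is a simplified illustration under my own assumptions (class and function names invented), not the method of any particular paper; other approaches use sparsity or low-rank factorization instead.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hashed embedding table: many token ids share rows of a small table.
    class HashedEmbedding {
    public:
        HashedEmbedding(size_t num_rows, size_t dim)
            : num_rows_(num_rows), dim_(dim), table_(num_rows * dim, 0.0f) {}

        // Look up the (shared) embedding vector for a token id.
        const float* lookup(uint32_t token_id) const {
            size_t row = hash(token_id) % num_rows_;   // many ids map to one row
            return &table_[row * dim_];
        }

    private:
        static size_t hash(uint32_t x) {               // simple integer mixing hash
            x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16;
            return x;
        }
        size_t num_rows_;
        size_t dim_;
        std::vector<float> table_;                     // num_rows_ x dim_ weights
    };

The memory saving comes from choosing num_rows_ much smaller than the vocabulary size, at the cost of collisions between token ids.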

Research papers on embeddings matrix pruning:

  1. Daochen Zha, Louis Feng, Bhargav Bhushanam, Dhruv Choudhary, Jade Nie, Yuandong Tian, Jay Chae, Yinbin Ma, Arun Kejariwal, Xia Hu, 2022, AutoShard: Automated Embedding Table Sharding for Recommender Systems, https://dl.acm.org/doi/abs/10.1145/3534678.3539034, https://arxiv.org/abs/2208.06399
  2. A Desai, L Chou, A Shrivastava, 2022, Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems, Conference on Machine Learning and Systems, https://arxiv.org/abs/2108.02191
  3. Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. 2020. Memory-efficient embedding for recommendations, arXiv preprint arXiv:2006.14827 (2020), https://arxiv.org/abs/2006.14827
  4. Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2019. Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems, arXiv preprint arXiv:1909.11810 (2019), https://arxiv.org/abs/1909.11810
  5. Nicola Tonellotto, Craig Macdonald, 2021, Query Embedding Pruning for Dense Retrieval, CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia, https://arxiv.org/abs/2108.10341
  6. IamAdiSri, 2021, Pruning a model embedding matrix for memory efficiency, April 2021, Hugging Face discussion board, https://discuss.huggingface.co/t/pruning-a-model-embedding-matrix-for-memory-efficiency/5502/7
  7. Raphael Shu and Hideki Nakayama. 2017, Compressing word embeddings via deep compositional code learning, In ICLR (Poster). OpenReview.net, 2018, https://arxiv.org/abs/1711.01068
  8. Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. 2020, Compositional embeddings using complementary partitions for memory-efficient recommendation systems, In KDD, pp. 165-175. ACM, 2020, https://arxiv.org/abs/1909.02107
  9. Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized Embedding Layers for Efficient Model Compression, arXiv preprint arXiv:1901.10787 (2019), updated Feb 2020, https://arxiv.org/abs/1901.10787v1
  10. Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015, Sparse overcomplete word vector representations, In ACL (1), pp. 1491-1500. The Association for Computer Linguistics, 2015, https://arxiv.org/abs/1506.02004
  11. Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. 2016, Compressing neural language models by sparse word representations, In ACL (1). The Association for Computer Linguistics, 2016, https://arxiv.org/abs/1610.03950 (Sparse matrix via common and rare word embeddings)
  12. Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, 2021, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper with section on “Compact Embeddings”.)
  13. Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
  14. Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing, In IJCAI. 2046–2052, https://dl.acm.org/doi/10.5555/3060832.3060907
  15. Aliakbar Panahi, Seyran Saeedi, and Tom Arodz. 2019. word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement, In ICLR. https://arxiv.org/abs/1911.04975
  16. Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019, AutoInt: Automatic feature interaction learning via self-attentive neural networks, In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019, https://arxiv.org/abs/1810.11921, Code: https://github.com/DeepGraphLearning/RecommenderSystems
  17. Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2019, Mixed dimension embeddings with application to memory-efficient recommendation systems, arXiv preprint arXiv:1909.11810, 2019 (preprint revised Feb 2021), https://arxiv.org/abs/1909.11810
  18. Xiaorui Wu, Hong Xu, Honglin Zhang, Huaming Chen, and Jian Wang. 2019, Saec: Similarity-aware embedding compression in recommendation systems, CoRR, abs/1903.00103, 2019, https://arxiv.org/abs/1903.00103
  19. Martin Andrews. 2016, Compressing word embeddings, CoRR, abs/1511.06397, 2015 (revised May 2016), https://arxiv.org/abs/1511.06397v2
  20. Lili Mou, Ran Jia, Yan Xu, Ge Li, Lu Zhang, and Zhi Jin. 2016, Distilling word embeddings: An encoding approach, In CIKM, pp. 1977–1980. ACM, 2016. https://arxiv.org/abs/1506.04488 (Distillation of embeddings.)
  21. Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018, GroupReduce: Block-wise low-rank approximation for neural language model shrinking, In NeurIPS, pp. 11011–11021, 2018. https://arxiv.org/abs/1806.06950 (Using low-rank matrices for vocabulary and embeddings.)
  22. Maximilian Lam. 2018, Word2bits - quantized word vectors, CoRR, abs/1803.05651, 2018, https://arxiv.org/abs/1803.05651 (Quantization ideas leads to compression of word vectors and embeddings.)
  23. Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015, Sparse overcomplete word vector representations, In ACL (1), pp. 1491–1500. The Association for Computer Linguistics, 2015. https://arxiv.org/abs/1506.02004 (Binary quantization in relation to word vector embeddings.)
  24. Alexei Baevski and Michael Auli. 2019, Adaptive input representations for neural language modeling, In ICLR, 2019, https://arxiv.org/abs/1809.10853 (Faster training with adaptive embeddings size.)

For more research papers on embeddings matrix pruning and optimizations, see https://www.aussieai.com/research/embeddings.

Embedding Size Optimization with NAS

A conceptually simple way to reduce embedding size is to choose a smaller embedding dimension, which is a model hyper-parameter fixed before training. Optimizing this number is a sub-problem of “neural architecture search” (NAS), also called “hyper-parameter optimization” (HPO). The embedding-specific NAS problem has some research papers, listed below, followed by a minimal search sketch:

  1. Haochen Liu, Xiangyu Zhao, Chong Wang, Xiaobing Liu, and Jiliang Tang. 2020. Automated embedding size search in deep recommender systems, In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2307–2316, https://dl.acm.org/doi/abs/10.1145/3397271.3401436
  2. L Qu, Y Ye, N Tang, L Zhang, Y Shi, H Yin, 2022, Single-shot embedding dimension search in recommender system, 2022, https://dl.acm.org/doi/abs/10.1145/3477495.3532060, https://arxiv.org/abs/2204.03281
  3. Xiangyu Zhao, Chong Wang, Ming Chen, Xudong Zheng, Xiaobing Liu, and Jiliang Tang, 2020, AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations, CoRR abs/2002.11252 (2020). arXiv:2002.11252, https://arxiv.org/abs/2002.11252
  4. Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding, In Advances in Neural Information Processing Systems. 887–898, https://arxiv.org/abs/1812.04224
  5. Maxim Naumov. 2019. On the Dimensionality of Embeddings for Sparse Features and Data, arXiv preprint arXiv:1901.02103 (2019), https://arxiv.org/abs/1901.02103
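As a minimal sketch of the hyper-parameter search itself (my own simplification, not a method from the papers above), the function below runs a naive grid search over candidate embedding dimensions, where the evaluate callback stands in for a full train-and-validate run. Real NAS/HPO methods use far more sample-efficient strategies than grid search.

    #include <functional>
    #include <limits>
    #include <vector>

    // Naive grid search over candidate embedding dimensions.
    // evaluate(dim) is assumed to train a model with that embedding size
    // and return a validation score (higher is better).
    int search_embedding_dim(const std::vector<int>& candidates,
                             const std::function<float(int)>& evaluate)
    {
        int best_dim = candidates.front();             // assumes a non-empty list
        float best_score = -std::numeric_limits<float>::infinity();
        for (int dim : candidates) {
            float score = evaluate(dim);               // e.g. validation accuracy
            if (score > best_score) {
                best_score = score;
                best_dim = dim;
            }
        }
        return best_dim;
    }

    // Example usage: search_embedding_dim({128, 256, 512, 768}, my_train_and_validate);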

For more research papers on embeddings NAS, see https://www.aussieai.com/research/embeddings#nas.

 
