Aussie AI

Token Pruning

  • Last Updated 8 December, 2024
  • by David Spuler, Ph.D.

What is Token Pruning?

Token pruning is a type of model "length pruning" that aims to address the cost of processing input sequences and the related embeddings. It is closely related to "embeddings pruning", but is orthogonal to depth pruning (e.g. layer pruning) or width pruning (e.g. attention head pruning).

Types of token pruning along the lengthwise dimension include:

  • Input token pruning
  • Dynamic token pruning
  • Prompt compression
  • Context compression
  • Token merging
  • Token skipping

There is a lot of overlap in the terminology used in research papers in this space (i.e., token pruning). For example, I'm not sure there's much difference between these:

  • Token skipping
  • Token dropping
  • Input token pruning

However, token merging is a distinct technique, which merges two tokens into one, rather than simply skipping one of them. And both "prompt compression" and "context compression" are arguably generalizations of all of these techniques. Another related area is "KV cache token pruning" to reduce the in-memory size of the cached data.

KV Caching and Token Pruning

There are analogous optimization techniques along the lengthwise input token dimension of the KV cache data, such as KV cache token pruning and KV cache compression; see the related KV cache research areas.

Pruning Dimensions

Pruning of LLMs can be done on four dimensions:

  • Depth pruning (e.g., layer pruning)
  • Width pruning (e.g., attention head pruning)
  • Length pruning (e.g., token pruning)
  • Embedding dimension pruning (embeddings pruning)

And these four approaches are orthogonal, so they can also be combined into multi-dimensional pruning.

Input Token Pruning

Token pruning refers to removing tokens that have a low probability or low importance, based on some evaluation criterion. This reduces the vocabulary size in proportion to the tokens pruned, thereby reducing the overall model size. The technique suffers from a trade-off of accuracy loss in the model, as it is difficult to predict which tokens to prune. The probabilities and weights for a pruned token are lost, so they no longer affect the output of other tokens, and the pruned token itself cannot be output. Token pruning can nevertheless be effective in use cases such as summarization or concept classification, since some of the common, small words in the input sequence are obviously less important.

Token pruning effectively zeros the weights associated with that token. Hence, it should be noted that various types of "weight pruning" and "sparsification" of matrices are somewhat related to token pruning, as they can sometimes effectively reduce a token's effect on inference to zero.

Another related technique is avoiding attention computations on some tokens, which has been researched as a way to speed up transformer attention (i.e., to reduce its quadratic dependence on input length). This is effectively attention-specific token pruning, where the other weights for the token may still be used.
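
To make the basic mechanism concrete, here is a minimal sketch (not from any particular paper) of pruning low-importance tokens from an input sequence before the model processes it. The scoring function, the keep ratio, and the function name are illustrative assumptions; real systems typically derive importance scores from attention weights or learned saliency estimates.

```python
import numpy as np

def prune_input_tokens(token_ids, importance_scores, keep_ratio=0.7):
    """Remove the lowest-importance tokens from an input sequence.

    token_ids:         list of token ids for the input sequence.
    importance_scores: one score per token (e.g., saliency or attention based).
    keep_ratio:        fraction of tokens to retain (illustrative setting).
    """
    assert len(token_ids) == len(importance_scores)
    n_keep = max(1, int(len(token_ids) * keep_ratio))
    # Keep the indices of the highest-scoring tokens, in original order.
    keep_idx = np.sort(np.argsort(importance_scores)[-n_keep:])
    return [token_ids[i] for i in keep_idx]

# Example: prune a 10-token sequence down to its 7 most important tokens.
tokens = list(range(100, 110))
scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.95, 0.3, 0.6, 0.15]
print(prune_input_tokens(tokens, scores))
```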

Research papers on the specific types of token pruning are listed in the subsections below.

Token Merging

Token merging is a type of token pruning in which two (or more) sequential tokens are merged into a single token. This compresses the input and reduces the overall number of tokens to be processed, thereby speeding up inference (or training).
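
As a rough sketch of the mechanism, assuming tokens are represented by embedding vectors and that the most similar adjacent pairs are merged by averaging (a simplification of ToMe-style merging); the similarity measure, merge count, and function name are illustrative choices, not the method of any specific paper below.

```python
import numpy as np

def merge_most_similar_pairs(embeddings, n_merge):
    """Merge the n_merge most similar adjacent token pairs by averaging them.

    embeddings: array of shape (num_tokens, dim).
    n_merge:    number of adjacent pairs to merge (sequence shrinks by n_merge).
    """
    emb = np.asarray(embeddings, dtype=float)
    # Cosine similarity between each token and its right neighbour.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-8, None)
    sim = np.sum(unit[:-1] * unit[1:], axis=1)   # shape (num_tokens - 1,)
    # Merge the most similar pairs, working right-to-left so that
    # earlier indices stay valid after each deletion.
    pairs = sorted(np.argsort(sim)[-n_merge:], reverse=True)
    merged = list(emb)
    for i in pairs:
        merged[i] = (merged[i] + merged[i + 1]) / 2.0
        del merged[i + 1]
    return np.stack(merged)

# Example: merge 2 pairs in a sequence of 6 token embeddings.
x = np.random.randn(6, 8)
print(merge_most_similar_pairs(x, n_merge=2).shape)  # (4, 8)
```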

  • Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim, 20 Mar 2024, vid-TLDR: Training Free Token merging for Light-weight Video Transformer, https://arxiv.org/abs/2403.13347, Code: https://github.com/mlvlab/vid-TLDR (Token merging in video with a focus on the background of the image.)
  • Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
  • Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
  • Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung-Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh T. Nguyen, Mathias Niepert, 25 May 2024, Accelerating Transformers with Spectrum-Preserving Token Merging, https://arxiv.org/abs/2405.16148
  • Maxim Bonnaerens, Joni Dambre, 17 Aug 2023 (v2), Learned Thresholds Token Merging and Pruning for Vision Transformers, https://arxiv.org/abs/2307.10780
  • Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi, 27 May 2023, PuMer: Pruning and Merging Tokens for Efficient Vision Language Models, https://arxiv.org/abs/2305.17530
  • Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia, 19 May 2023, Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding, https://arxiv.org/abs/2305.11392
  • Daniel Bolya, Judy Hoffman, 30 Mar 2023, Token Merging for Fast Stable Diffusion, https://arxiv.org/abs/2303.17604
  • Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme, 24 Feb 2022, Learning to Merge Tokens in Vision Transformers, https://arxiv.org/abs/2202.12015
  • Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, 2 Dec 2023, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, https://arxiv.org/abs/2312.01026
  • Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang, 25 Apr 2024, TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning, https://arxiv.org/abs/2404.16635
  • Daniel Kienzle, Marco Kantonis, Robin Schön, Rainer Lienhart, 23 May 2024, Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation, https://arxiv.org/abs/2405.14467
  • Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin, 31 Mar 2024, A General and Efficient Training for Transformer via Token Expansion, https://arxiv.org/abs/2404.00672 Code: https://github.com/Osilly/TokenExpansion (Token merging to accelerate training.)
  • Xu, L., Wang, L. & Guo, Z., 2024, ATFTrans: attention-weighted token fusion transformer for robust and efficient object tracking. Neural Comput & Applic 36, 7043–7056 (2024). https://doi.org/10.1007/s00521-024-09444-0 https://link.springer.com/article/10.1007/s00521-024-09444-0
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
  • Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations including a NAS approached based on building blocks and also combined token pruning/merging for input compression.)
  • Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O:Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
  • Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
  • Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
  • Yancheng Wang, Yingzhen Yang, 21 Jul 2024, Efficient Visual Transformer by Learnable Token Merging, https://arxiv.org/abs/2407.15219 Code: https://github.com/Statistical-Deep-Learning/LTM
  • Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
  • Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
  • J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
  • Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248

Dynamic Token Pruning

Dynamic token pruning is where the choice of which tokens to discard is made adaptively during inference. This is a form of dynamic length pruning of the model.
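
To illustrate the general flavour (not the method of any specific paper below), the sketch below drops the tokens that received the least attention in the previous layer before running the next layer, a common heuristic in attention-based dynamic token pruning; the score aggregation and the per-layer keep ratio are assumptions for illustration.

```python
import numpy as np

def dynamic_prune_by_attention(hidden, attn_probs, keep_ratio=0.8):
    """Drop the tokens that received the least attention in the previous layer.

    hidden:     (num_tokens, dim) hidden states entering the next layer.
    attn_probs: (num_heads, num_tokens, num_tokens) attention probabilities
                from the previous layer.
    keep_ratio: fraction of tokens retained at this layer (illustrative).
    """
    num_tokens = hidden.shape[0]
    # Importance of token j = total attention paid to it, summed over
    # heads and query positions (a common pruning heuristic).
    importance = attn_probs.sum(axis=(0, 1))          # shape (num_tokens,)
    n_keep = max(1, int(num_tokens * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # keep original order
    return hidden[keep], keep

# Example with random data: 16 tokens, 4 heads, hidden size 32.
h = np.random.randn(16, 32)
a = np.random.rand(4, 16, 16)
pruned_h, kept_indices = dynamic_prune_by_attention(h, a)
print(pruned_h.shape, kept_indices)
```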

  • Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Learned Token Pruning for Transformers, 14 August 2022, https://dl.acm.org/doi/abs/10.1145/3534678.3539260, PDF: https://dl.acm.org/doi/pdf/10.1145/3534678.3539260
  • Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin & Yanzhi Wang, SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning, Nov 2022, LNCS,volume 13671, https://link.springer.com/chapter/10.1007/978-3-031-20083-0_37, Code: https://github.com/PeiyanFlying/SPViT
  • Peiyan Dong; Mengshu Sun; Alec Lu; Yanyue Xie; Kenneth Liu; Zhenglun Kong; Xin Meng; Zhengang Li; Heatvit: Hardware-efficient adaptive token pruning for vision transformers, 2023, IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2023, DOI: 10.1109/HPCA56546.2023.10071047, https://ieeexplore.ieee.org/abstract/document/10071047
  • SaiT: Sparse Vision Transformers through Adaptive Token Pruning Ling Li, David Thorsley, Joseph Hassoun Oct 2022, https://arxiv.org/abs/2210.05832
  • Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh, Dynamicvit: Efficient vision transformers with dynamic token sparsification, 2021, Advances in Neural Information Processing Systems 34 (NeurIPS 2021), https://proceedings.neurips.cc/paper_files/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html, PDF: https://proceedings.neurips.cc/paper_files/paper/2021/file/747d3443e319a22747fbb873e8b2f9f2-Paper.pdf
  • J Li, LL Zhang, J Xu, Y Wang, S Yan, Y Xia, Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference, 2023, https://arxiv.org/abs/2306.14393
  • Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang & Xiaohui Xie , PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation, ECCV 2022: Computer Vision, pp 424–442, LNCS volume 13665, https://link.springer.com/chapter/10.1007/978-3-031-20065-6_25
  • Xiangcheng Liu, Tianyi Wu, Guodong Guo, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Sep 2022, https://arxiv.org/abs/2209.13802
  • Luca Soldaini and Alessandro Moschitti. 2020. The Cascade Transformer: An application for efficient answer sentence selection. In Proceedings of ACL, pages 5697–5708, https://arxiv.org/abs/2005.02534
  • Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Technique is the "Length adaptive transformer" or LAT)
  • Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  • Yuang Liu, Qiang Zhou, Jing Wang, Zhibin Wang, Fan Wang, Jun Wang, Wei Zhang, Dynamic Token-Pass Transformers for Semantic Segmentation, August 2023, DOI: 10.48550/arXiv.2308.01944, https://ui.adsabs.harvard.edu/abs/2023arXiv230801944L/abstract, PDF: https://arxiv.org/pdf/2308.01944.pdf
  • Mohsen Fayyaz, Soroush Abbasi Kouhpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, and Juergen Gall. ATS: Adaptive token sampling for efficient vision transformers. In ECCV, July 2022, https://arxiv.org/abs/2111.15667v1
  • Hongjie Wang, Bhishma Dedhia, Niraj K. Jha, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, May 2023, https://arxiv.org/abs/2305.17328
  • Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, Revisiting Token Pruning for Object Detection and Instance Segmentation, June 2023, https://arxiv.org/abs/2306.07050
  • Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Xiangcheng Liu, Tianyi Wu, Guodong Guo, July 2023, https://arxiv.org/abs/2209.13802
  • Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models. arXiv preprint arXiv:2105.14636 (2021), https://arxiv.org/abs/2105.14636v1
  • Victor Sanh, Thomas Wolf, and Alexander M Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683 (2020), https://arxiv.org/abs/2005.07683
  • Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall, Adaptive Token Sampling For Efficient Vision Transformers, July 2022, https://arxiv.org/abs/2111.15667
  • Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. Oct 2020. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior. arXiv preprint arXiv:2010.01791 (2020), https://arxiv.org/abs/2010.01791
  • François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Sep 2021. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838 (2021), https://arxiv.org/abs/2109.04838
  • Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, Apr 2022, https://arxiv.org/abs/2202.07800
  • Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022, https://arxiv.org/abs/2108.01390, Code: https://github.com/YifanXu74/Evo-ViT
  • Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED2: Interpretability-aware redundancy reduction for vision transformers. In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.12620
  • Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. AdaViT: Adaptive tokens for efficient vision transformer. arXiv preprint arXiv:2112.07658, 2021 (revised Oct 2022), https://arxiv.org/abs/2112.07658, Code: https://a-vit.github.io/
  • Hao Yu and Jianxin Wu. A unified pruning framework for vision transformers. arXiv preprint arXiv:2111.15127, Nov 2021, https://arxiv.org/abs/2111.15127
  • Youwei Liang, Chongjian GE, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. EViT: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations (ICLR), Jan 2022, https://openreview.net/forum?id=BjyvwnXXVn_, PDF: https://openreview.net/pdf?id=BjyvwnXXVn_, Code: https://github.com/youweiliang/evit
  • Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.02034, Code: https://github.com/raoyongming/DynamicViT
  • Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 558–567, October 2021, https://arxiv.org/abs/2101.11986
  • Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, and Jan Kautz. Nvit: Vision transformer compression and parameter redistribution. arXiv preprint arXiv:2110.04869, 2021, PDF: https://arxiv.org/pdf/2110.04869v1.pdf
  • Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. Patch slimming for efficient vision transformers. arXiv preprint arXiv:2106.02852, 2021 (revised Apr 2022). https://arxiv.org/abs/2106.02852
  • Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. Unified visual transformer compression. arXiv preprint arXiv:2203.08243, Mar 2022, https://arxiv.org/abs/2203.08243
  • Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems, 34, 2021, https://arxiv.org/abs/2106.04533
  • Mingjian Zhu, Yehui Tang, and Kai Han. Vision transformer pruning. arXiv preprint arXiv:2104.08500, Aug 2021, https://arxiv.org/abs/2104.08500
  • Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016 (revised June 2017), https://arxiv.org/abs/1611.06440
  • Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016, https://arxiv.org/abs/1608.08710
  • Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV 2017, Aug 2017, https://arxiv.org/abs/1708.06519
  • Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017, https://arxiv.org/abs/1707.06342
  • Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. arXiv preprint arXiv:2104.10858, June 2021, https://arxiv.org/abs/2104.10858, Code: https://github.com/zihangJiang/TokenLabeling
  • Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models. arXiv preprint arXiv:2105.14636 (2021), PDF: https://arxiv.org/pdf/2105.14636v1.pdf, Code: https://github.com/yaozhewei/mlpruning.git
  • Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, Transkimmer: Transformer Learns to Layer-wise Skim, May 2022, In ACL 2022, https://arxiv.org/abs/2205.07324 (This paper does per-layer dynamic token pruning.)
  • Learned Token Pruning for Efficient Transformer Inference, Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer, May 11, 2023, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2023-119, Masters Thesis, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.html PDF: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.pdf (Learns threshold-based token pruning parameters; novel approach to token pruning during attention. Also contains good literature survey on token pruning.)
  • Ofir Press, Noah A Smith, and Mike Lewis. “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”. arXiv preprint arXiv:2108.12409, 2021 (revised Apr 2022), https://arxiv.org/abs/2108.12409 (Attention with Linear Biases (ALiBi) paper)
  • Token and Head Adaptive Transformers for Efficient Natural Language Processing Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, https://aclanthology.org/2022.coling-1.404/ (Combination of token pruning and attention head pruning, i.e. length/width pruning combined)
  • Zejiang Hou, Sun-Yuan Kung, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, 2022, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
  • K Luo, H Li, X Zhou, B Huang, An Attention-Based Token Pruning Method for Vision Transformers, International Joint Conference, IJCRS 2022, Suzhou, China, November 11–14, 2022, Proceedings, Nov 2022, Pages 274–288, https://doi.org/10.1007/978-3-031-21244-4_21
  • Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2092-2101, http://openaccess.thecvf.com/content/CVPR2023/html/Wei_Joint_Token_Pruning_and_Squeezing_Towards_More_Aggressive_Compression_of_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860 (Related to length pruning and context length, although not fully token pruning.)
  • Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. Pyramid-BERT: Reducing complexity via successive core-set based token selection. arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
  • Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
  • Y Guan, Z Li, Z Lin, Y Zhu, J Leng, M Guo, Block-skim: Efficient question answering for transformer, Proceedings of the AAAI, 2022, https://doi.org/10.1609/aaai.v36i10.21316, https://ojs.aaai.org/index.php/AAAI/article/view/21316, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/21316/21065
  • Q Tang, B Zhang, J Liu, F Liu, Y Liu, Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation arXiv preprint arXiv:2308.01045, 2023, https://arxiv.org/abs/2308.01045
  • Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805
  • Structural-Jump-LSTM (Hansen et al. 2018) Hansen, C.; Hansen, C.; Alstrup, S.; Simonsen, J. G.; and Lioma, C. 2018. Neural Speed Reading with Structural-Jump-LSTM. In International Conference on Learning Representations, https://arxiv.org/abs/1904.00761
  • Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural Speed Reading via Skim-RNN. In International Conference on Learning Representations, https://arxiv.org/abs/1711.02085
  • Adams Wei Yu, Hongrae Lee, Quoc V. Le. 2017. Learning to Skim Text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), https://arxiv.org/abs/1704.06877
  • Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438, PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
  • L. Denoyer and P. Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014. https://arxiv.org/abs/1410.0510 (Input adaptive method, somewhat related to token pruning.)
  • Hochreiter, S.; and Schmidhuber, J., 1997. Long short-term memory. Neural computation, 9(8): 1735–1780. https://ieeexplore.ieee.org/abstract/document/6795963 (Early paper, somewhat related to token skimming.)
  • Campos, V.; Jou, B.; Giro-i-Nieto, X.; Torres, J.; and Chang, S., 2017. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks. CoRR, abs/1708.06834. https://arxiv.org/abs/1708.06834, Code: https://imatge-upc.github.io/skiprnn-2017-telecombcn/
  • Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 14–26, 2022. https://dl.acm.org/doi/pdf/10.1145/3503222.3507738 (Involves some reordering of tokens.)
  • Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNMT, 2018. https://www.aclweb.org/anthology/W18-2715
  • Xing Shi and Kevin Knight. Speeding up neural machine translation decoding by shrinking run-time vocabulary. In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
  • Gurvan L’Hostis, David Grangier, and Michael Auli. 2016. Vocabulary Selection Strategies for Neural Machine Translation. Arxiv preprint arXiv:1610.00072, https://arxiv.org/abs/1610.00072
  • Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar, 2022, AdapLeR: Speeding up Inference by Adaptive Length Reduction, arXiv preprint arXiv:2203.08991, https://arxiv.org/abs/2203.08991 Code: https://github.com/amodaresi/AdapLeR
  • Hansen, C., Hansen, C., Alstrup, S., Simonsen, J. G., and Lioma, C. (2019). Neural speed reading with structural-jump-LSTM. In International Conference on Learning Representations. https://arxiv.org/abs/1904.00761, https://openreview.net/forum?id=B1xf9j
  • Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
  • Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Does some token pruning but is primarily focused on memory optimization, including with token-based data sharding for allocation to different memory banks.)
  • Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey, Sep 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, https://arxiv.org/abs/2309.08600 (Analysis has some relevant to tokenization and token pruning.)
  • H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Dynamic token pruning for prompt compression.)
  • X Xu, C Li, Y Chen, X Chang, J Liu, S Wang, Oct 2023, No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, arXiv preprint arXiv:2310.05654, https://arxiv.org/pdf/2310.05654.pdf (Suggests "token idling" that allows reuse of pruned tokens in later layers)
  • Yucheng Li. April 2023. Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering. ArXiv preprint abs/2304.12102. https://arxiv.org/abs/2304.12102 (Token pruning for prompt compression.)
  • Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length. NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
  • Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
  • Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)
  • Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
  • Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić, 13 May 2024, Zero-Shot Tokenizer Transfer, https://arxiv.org/abs/2405.07883 (Overcoming the limitation that the tokenizer is fixed for the model, by training the tokenizer to embeddings mapping so as to use different tokenizers, including effective input token pruning reducing tokens in the input with a larger vocabulary.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944
  • Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji, 9 May 2024, Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference, https://arxiv.org/abs/2405.05803 Code: https://github.com/lzhxmu/VTW (Removing all visual tokens in later layers of a multimodal model, which is effectively early exiting or token pruning, but affecting only the vision part of the multimodal Transformer.)
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computations on two dimensions earlier in the inference computations.)
  • Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • M Sponner, B Waschneck, A Kumar , 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys,, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
  • Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He, 16 Mar 2024, DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing, https://arxiv.org/abs/2403.10913 (Attention optimizations in Vision Transformer with pruning of feature maps, and extensive parallelization with consideration of the hardware layer.)
  • Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu, 24 Mar 2024, PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference, https://arxiv.org/abs/2403.16020 (Pruning of patches in input images, which is a form of token or channel pruning.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • J Ainslie, T Lei, M de Jong, S Ontañón, 2023, Colt5: Faster long-range transformers with conditional computation, https://arxiv.org/abs/2303.09752
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
  • Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V Le. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. arXiv preprint arXiv:2006.03236, 2020. https://arxiv.org/abs/2006.03236 Code: https://github.com/laiguokun/Funnel-Transformer
  • Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, July 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, https://arxiv.org/abs/2208.00483
  • Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, 2 Dec 2023, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, https://arxiv.org/abs/2312.01026
  • Hongjie Wang, Bhishma Dedhia, Niraj K. Jha, 2024, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers (CVPR 2024 supplemental material), https://openaccess.thecvf.com/content/CVPR2024/supplemental/Wang_Zero-TPrune_Zero-Shot_Token_CVPR_2024_supplemental.pdf
  • Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin, 31 Mar 2024, A General and Efficient Training for Transformer via Token Expansion, https://arxiv.org/abs/2404.00672 Code: https://github.com/Osilly/TokenExpansion (Token merging to accelerate training.)
  • Moein Heidari, Reza Azad, Sina Ghorbani Kolahi, René Arimond, Leon Niggemeier, Alaa Sulaiman, Afshin Bozorgpour, Ehsan Khodapanah Aghdam, Amirhossein Kazerouni, Ilker Hacihaliloglu, Dorit Merhof, 28 Mar 2024, Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights, https://arxiv.org/abs/2403.19882 Project: https://github.com/mindflow-institue/Awesome-Attention-Mechanism-in-Medical-Imaging (Survey of optimization techniques for Vision Transformers, with particular focus on attention optimizations.)
  • Alessandro Baiocchi, Indro Spinelli, Alessandro Nicolosi, Simone Scardapane, 26 Jan 2024, Adaptive Point Transformer, https://arxiv.org/abs/2401.14845
  • Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen, 5 Mar 2024, MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer, https://arxiv.org/abs/2403.02991
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
  • Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath, 14 Mar 2024, Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, https://arxiv.org/abs/2403.09054 (Reducing KV cache in-memory size and related computations by focusing on a subset of tokens.)
  • Jurek Leonhardt, Henrik Müller, Koustav Rudra, Megha Khosla, Abhijit Anand, Avishek Anand, Nov 2023, Efficient Neural Ranking using Forward Indexes and Lightweight Encoders, https://arxiv.org/abs/2311.01263
  • Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations including a NAS approached based on building blocks and also combined token pruning/merging for input compression.)
  • Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, Nov 2023, Large Language Model Inference with Lexical Shortlisting, https://arxiv.org/abs/2311.09709 (Shortlisting the vocabulary to common words for reduced tokens and embedding matrix size.)
  • Jungmin Yun, Mihyeon Kim, Youngbin Kim, 2023, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification https://aclanthology.org/2023.findings-emnlp.909.pdf
  • Chang Liu, Chongyang Tao, Jianxin Liang, Jiazhan Feng, Tao Shen, 2023, Quzhe Huang, Dongyan Zhao,Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4452–4463, December 6-10, 2023, https://aclanthology.org/2023.findings-emnlp.294.pdf (Explores combining static model compression via knowledge distillation with dynamic adaptive inference via token pruning. This creates a modified distillation algorithm that prepares the model for token pruning during training.)
  • Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2024, Revisiting Token Pruning for Object Detection and Instance Segmentation, IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2024, https://rpg.ifi.uzh.ch/docs/WACV24_Liu.pdf Code: https://github.com/uzh-rpg/svit/ (Examination of token pruning in image processing including re-activating pruned tokens and advanced techniques such as a dynamic pruning rate.)
  • Jinyu Chen, Wenchao Xu, Zicong Hong, Song Guo, Haozhao Wang, Jie Zhang, Deze Zeng, 10 Jan 2024, OTAS: An Elastic Transformer Serving System via Token Adaptation, https://arxiv.org/abs/2401.05031
  • Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu, Oct 2023, Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity, https://arxiv.org/abs/2310.05175
  • Ali Modarressi, Hosein Mohebbi, and Mohammad Taher Pilehvar. 2022. AdapLeR: Speeding up inference by adaptive length reduction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–15, Dublin, Ireland. Association for Computational Linguistics, https://doi.org/10.18653/v1/2022.acl-long.1
  • B Sun, X Ye, Z Wang, H Li, Z Wang 2023 Exploring Coarse-to-Fine Action Token Localization and Interaction for Fine-grained Video Action Recognition, https://dl.acm.org/doi/abs/10.1145/3581783.3612206
  • F Shi, L Wang, Oct 2023 Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning, arXiv preprint arXiv:2310.17177, https://arxiv.org/pdf/2310.17177.pdf
  • M Berchansky, P Izsak, A Caciularu, I Dagan, Oct 2023, Optimizing Retrieval-augmented Reader Models via Token Elimination, https://arxiv.org/pdf/2310.13682.pdf Code: https://github.com/mosheber/token_elimination
  • Zhengyang Zhuge, Peisong Wang, Xingting Yao, Jian Cheng, 2024, Towards Efficient Spiking Transformer: a Token Sparsification Framework for Training and Inference Acceleration, https://openreview.net/pdf?id=yL6hljtjW4
  • Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He, and Li Jiang, 2022, DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture, https://mxhx7199.github.io/files/%5BDATE-22%5DDTQAtten_preprint.pdf
  • David Spuler, March 2024, Chapter 49. Length Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • P Dong, M Sun, A Lu, Y Xie, K Liu, 2023, Heatvit: Hardware-efficient adaptive token pruning for vision transformers, https://ieeexplore.ieee.org/abstract/document/10071047/ https://arxiv.org/pdf/2211.08110
  • T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context autoencoder for context compression in a large language model,” arXiv preprint arXiv:2307.06945, 2023. https://arxiv.org/abs/2307.06945
  • Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang, 17 Jun 2024 (v2), SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM, https://arxiv.org/abs/2406.06571
  • Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods that use attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
  • Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Yun-Chia Yu; Mao-Chi Weng; Ming-Guang Lin; An-Yeu Andy Wu, May 2024, Retraining-free Constraint-aware Token Pruning for Vision Transformer on Edge Devices, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 19-22 May 2024, https://ieeexplore.ieee.org/abstract/document/10558603
  • Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
  • Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, Tushar Sharma, 4 Jul 2024, ALPINE: An adaptive language-agnostic pruning method for language models for code, https://arxiv.org/abs/2407.04147
  • Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
  • Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, Yong Wang, 16 Jul 2024 (v2), GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation, https://arxiv.org/abs/2407.10756
  • Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
  • Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
  • David Gu, July 18, 2024, Text Compression for Efficient Language Generation, Master’s Thesis, Distributed Computing Group, Computer Engineering and Networks Laboratory, ETH Zürich, https://pub.tik.ee.ethz.ch/students/2023-HS/MA-2023-19.pdf (Training and inference at the sentence level, including caching of embeddings per sentence, which also has the side-effect of compressing the input prompts and reducing computation analogously to token pruning.)
  • Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
  • Haode Qi, Cheng Qian, Jian Ni, Pratyush Singh, Reza Fazeli, Gengyu Wang, Zhongzheng Shu, Eric Wayne, Juergen Bross, 21 Aug 2024, Practical token pruning for foundation models in few-shot conversational virtual assistant systems, https://arxiv.org/abs/2408.11799 (Using token pruning for faster intent classification.)
  • Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji, 20 Aug 2024, HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments, https://arxiv.org/abs/2408.10945
  • Yufei Huang, Xu Han, Maosong Sun, 12 Aug 2024, FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection, https://arxiv.org/abs/2408.06333 https://github.com/thunlp/FastFiD (Sentence-level pruning after encoding on the input text dimension.)
  • Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193
  • Li, Yanyu, Aug 2024, Accelerating large scale generative AI : a comprehensive study, Ph.D. Thesis, Northeastern University, Boston, Massachusetts, USA, https://hdl.handle.net/2047/D20669654 https://repository.library.northeastern.edu/files/neu:ms35wj107 https://repository.library.northeastern.edu/files/neu:ms35wj107/fulltext.pdf
  • CE Song, A Moradifirouzabadi, T Rosing, M Kang, 2024, Efficient Transformer Acceleration via Reconfiguration for Encoder and Decoder Models and Sparsity-Aware Algorithm Mapping, PDF: https://dl.acm.org/doi/pdf/10.1145/3665314.3670798
  • Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiravathukal, James C. Davis, Yung-Hsiang Lu, 11 Sep 2024, Token Turing Machines are Efficient Vision Models, https://arxiv.org/abs/2409.07613
  • Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou, 16 Sep 2024, Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models, https://arxiv.org/abs/2409.10197 https://github.com/ywh187/FitPrune
  • Wang, X, Sep 2024, KELTP: Keyword-Enhanced Learned Token Pruning for Knowledge-Grounded Dialogue. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_16 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_16 (Adaptive removal of low-attention tokens during inference.)
  • Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma, 27 Sep 2024, Token Caching for Diffusion Transformer Acceleration, https://arxiv.org/abs/2409.18523
  • Yaniv Leviathan, Matan Kalman, Yossi Matias, 3 Oct 2024, Selective Attention Improves Transformer, https://arxiv.org/abs/2410.02703 (Allowing adjacent tokens to predict whether required for attention.)
  • Yuxin Li, Yiheng Li, Xulei Yang, Mengying Yu, Zihang Huang, Xiaojun Wu, Chai Kiat Yeo, 9 Oct 2024, Learning Content-Aware Multi-Modal Joint Input Pruning via Bird's-Eye-View Representation, https://arxiv.org/abs/2410.07268
  • K. Shi et al., "Fitop-Trans: Maximizing Transformer Pipeline Efficiency through Fixed-Length Token Pruning on FPGA," 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), Torino, Italy, 2024, pp. 243-249, doi: 10.1109/FPL64840.2024.00041. https://ieeexplore.ieee.org/abstract/document/10705515/
  • Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhua, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
  • Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin, 22 Oct 2024, PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction, https://arxiv.org/abs/2410.17247
  • Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
  • X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
  • Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Christopher Keith, Michael Robinson, Francis Duncan , Allan Worthington, Joseph Wilson, Soa Harris, October 22nd, 2024, Optimizing large language models: A novel approach through dynamic token pruning, https://doi.org/10.21203/rs.3.rs-5293588/v1 https://assets-eu.researchsquare.com/files/rs-5293588/v1_covered_52b30393-790e-4ab2-8bee-e6e7d4b16895.pdf?c=1729565713 (Reduces the cost of the unembedding step in the early exit tests by dynamically pruning vocabularies, thereby dynamically reducing the size of the unembedding matrix used in early exit testing.)
  • Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248

Token Skipping

Token skipping is a type of token pruning. The idea is to skip selected tokens in the input in an adaptive manner. This is similar to dynamic token pruning or token merging.
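
One possible realization, sketched below under the assumption that a "skipped" token bypasses a sub-layer via its residual connection rather than being removed from the sequence entirely; this applies most directly to per-token components such as the FFN, since attention mixes information across tokens. The scoring threshold and helper name are illustrative assumptions.

```python
import numpy as np

def sublayer_with_token_skipping(hidden, token_scores, sublayer_fn, threshold=0.5):
    """Apply sublayer_fn only to tokens whose score exceeds the threshold.

    Skipped tokens are passed through unchanged (residual copy-through),
    so the sequence length is preserved, unlike full token pruning.
    """
    hidden = np.asarray(hidden, dtype=float)
    out = hidden.copy()
    active = np.asarray(token_scores) >= threshold
    if active.any():
        out[active] = sublayer_fn(hidden[active])
    return out

# Example: a toy "FFN" that just scales its inputs.
h = np.random.randn(8, 16)
scores = np.random.rand(8)
print(sublayer_with_token_skipping(h, scores, lambda x: 2.0 * x).shape)  # (8, 16)
```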

  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
  • Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
  • Foozhan Ataiefard, Walid Ahmed, Habib Hajimolahoseini, Saina Asani, Farnoosh Javadi, Mohammad Hassanpour, Omar Mohamed Awad, Austin Wen, Kangling Liu, Yang Liu, 27 Jan 2024, SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection, https://arxiv.org/abs/2401.15293

Token Dropping

  • Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193 (Modifies its computation depending on the difficulty of each input token.)
  • M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023 Sharcs: Efficient transformers through routing with dynamic width sub-networks https://arxiv.org/pdf/2310.12126.pdf (Direct queries to subnetworks with different widths.)
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 8 Jul 2024 (v2), Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965
  • Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou, 24 Mar 2022, Token Dropping for Efficient BERT Pretraining, https://arxiv.org/abs/2203.13240
  • Huaao Zhang, Shigui Qiu, Xiangyu Duan, Min Zhang, 21 Oct 2020, Token Drop mechanism for Neural Machine Translation, https://arxiv.org/abs/2010.11018
  • Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao, 24 May 2023, Revisiting Token Dropping Strategy in Efficient BERT Pretraining, https://arxiv.org/abs/2305.15273
  • Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He, 17 Nov 2022, Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers, https://arxiv.org/abs/2211.11586
  • Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji, 20 Aug 2024, HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments, https://arxiv.org/abs/2408.10945
  • Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang, 16 Nov 2024, Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model, https://arxiv.org/abs/2411.10803

Prompt Compression

Prompt compression is a type of input token pruning. The idea is to reduce the size of the prompt, thereby reducing the number of tokens that must be processed.
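As a toy illustration only, the sketch below compresses a prompt by deleting common filler words. Real prompt-compression systems typically use a small learned model to score token importance; the hard-coded word list here is just a stand-in for such a scorer.

```python
# Toy illustration of prompt compression by dropping low-information words.
# The hard-coded filler-word list is an assumption standing in for a learned
# importance scorer used by real prompt-compression methods.
FILLER_WORDS = {"a", "an", "the", "please", "kindly", "really", "very", "just"}

def compress_prompt(prompt: str) -> str:
    # Keep only words not in the filler list, preserving their order.
    kept = [w for w in prompt.split() if w.lower() not in FILLER_WORDS]
    return " ".join(kept)

if __name__ == "__main__":
    prompt = "Please write a very short summary of the following article"
    print(compress_prompt(prompt))
    # -> "write short summary of following article" (fewer tokens to process)
```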

Token Compression

It's not really clear that "token compression" is a technique much different from token pruning, token merging, or other types of prompt compression. Nevertheless, here are a few papers on this specific topic:

  • Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
  • Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
  • Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun, 3 Aug 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 Code: https://github.com/OpenBMB/MiniCPM-V
  • Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang, 22 Nov 2024, DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, https://arxiv.org/abs/2411.15024
  • Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu, 5 Dec 2024, NVILA: Efficient Frontier Visual Language Models, https://arxiv.org/abs/2412.04468

Context Compression

Context compression is a type of prompt compression and input token pruning. The idea is to reduce the size of the context that is part of the input prompt. Context tokens that are candidates for compression include, for example, (a) the conversational history in a chatbot session, (b) global instructions used at the beginning of every query, or (c) the supplementary document chunks that are prepended in a RAG architecture.
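One simple policy is sketched below: always keep the global instructions, then retain as many of the most recent conversational turns as fit within a token budget, dropping the oldest turns first. The character-based token estimate, the budget value, and the function names are placeholder assumptions, not a description of any particular system.

```python
# Sketch of a simple context-compression policy for a chatbot: keep the
# global instructions, then keep the newest conversation turns that fit
# within a token budget, dropping the oldest turns first.
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)

def compress_context(instructions: str, turns: list[str], budget: int = 200) -> list[str]:
    kept = [instructions]
    remaining = budget - estimate_tokens(instructions)
    recent = []
    for turn in reversed(turns):          # newest turns first
        cost = estimate_tokens(turn)
        if cost > remaining:
            break
        recent.append(turn)
        remaining -= cost
    return kept + list(reversed(recent))  # restore chronological order

if __name__ == "__main__":
    history = [f"Turn {i}: " + "blah " * 30 for i in range(10)]
    context = compress_context("You are a helpful assistant.", history)
    print(f"kept {len(context) - 1} of {len(history)} turns within budget")
```

More sophisticated approaches summarize or merge the dropped turns rather than discarding them outright, trading a little extra computation for better retention of older context.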

Image Token Pruning

Tokenization of images is a two-dimensional process: the image is split into a grid of patches, each of which becomes a token. There are various papers applying the ideas of "token pruning" to image inputs, which effectively means ignoring some regions of the image.
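The sketch below shows the basic idea under a ViT-style assumption of 16x16 patches, each forming one token: low-variance, background-like patches are pruned. Published methods generally use attention scores or learned saliency rather than raw patch variance; the variance heuristic and function names here are only illustrative.

```python
# Minimal sketch of image token (patch) pruning, assuming a ViT-style setup
# where the image is split into a grid of patches and each patch is one token.
# Low-variance (flat, background-like) patches are dropped; the variance score
# is an illustrative stand-in for attention-based or learned importance.
import numpy as np

def image_to_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w, c = image.shape
    grid_h, grid_w = h // patch, w // patch
    patches = image[:grid_h * patch, :grid_w * patch].reshape(
        grid_h, patch, grid_w, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch * patch * c)  # one row per patch token

def prune_patches(patches: np.ndarray, keep_ratio: float = 0.5):
    scores = patches.var(axis=1)                    # flat patches score low
    k = max(1, int(len(patches) * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-k:])     # preserve patch order
    return patches[keep_idx], keep_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3))
    patches = image_to_patches(img)                 # 14 x 14 = 196 patch tokens
    kept, idx = prune_patches(patches, keep_ratio=0.25)
    print(f"kept {len(kept)} of {len(patches)} patch tokens")
```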

  • Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
  • Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim, 20 Mar 2024, vid-TLDR: Training Free Token merging for Light-weight Video Transformer, https://arxiv.org/abs/2403.13347, Code: https://github.com/mlvlab/vid-TLDR (Token merging in video with a focus on the background of the image.)
  • Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
  • Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length. NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
  • Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)
  • Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu, 24 Mar 2024, PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference, https://arxiv.org/abs/2403.16020 (Pruning of patches in input images, which is a form of token or channel pruning.)
  • Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2024, Revisiting Token Pruning for Object Detection and Instance Segmentation, IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2024, https://rpg.ifi.uzh.ch/docs/WACV24_Liu.pdf Code: https://github.com/uzh-rpg/svit/ (Examination of token pruning in image processing including re-activating pruned tokens and advanced techniques such as a dynamic pruning rate.)
  • Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang, 22 Nov 2024, DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, https://arxiv.org/abs/2411.15024
  • Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248

Vision Token Pruning

The idea of pruning parts of an image, or "patches" of the image input, can also be applied to vision models. There are various papers on token pruning for vision LLMs:

More Research on Pruning Types

More AI Pruning Research

Read more about other types of pruning: