Aussie AI

Token Pruning

Last Updated 23 May, 2025

by David Spuler, Ph.D.

What is Token Pruning?

Token pruning is a type of model "length pruning" that aims to address the cost of processing input sequences and the related embeddings. It is closely related to "embeddings pruning", but is orthogonal to depth pruning (e.g. layer pruning) or width pruning (e.g. attention head pruning).

Types of token pruning along the lengthwise dimension include:

Input token pruning
Dynamic token pruning
Prompt compression
Context compression
Token merging
Token skipping

There is a lot of overlap between some of the terminology used in research papers about this space (i.e., token pruning). For example, I'm not sure there's much difference between these ones:

Token skipping
Token dropping
Input token pruning

However, token merging is a distinct technique, which merges two tokens into one, rather than just skipping one of them. And both "prompt compression" and "context compression" are arguably a generalization of all of these multiple techniques. Another different but related area is "KV cache token pruning" to reduce the in-memory size of the cached data.

KV Caching and Token Pruning

There are analogous optimization techniques on the lengthwise input token dimension of KV cache data. Read more about these KV cache research areas:

KV cache token pruning
Prefix KV cache
Session KV cache (multi-turn KV caching)
Substring KV cache (Lengthwise-fused KV caching)
KV cache compression
KV cache sparsity
KV caching (overall)

Pruning Dimensions

Pruning of LLMs can be done on four dimensions:

Lengthwise pruning (e.g., token pruning)
Width pruning (e.g., attention head pruning, channel pruning, filter pruning)
Depth pruning (e.g., layer pruning, early exiting, layer fusion)
Embeddings pruning

And these four approaches are orthogonal, so there's also:

Input Token Pruning

Token pruning refers to removing some of the tokens with a low probability, based on some evaluation. This reduces the vocabulary size proportionally to the tokens pruned, thereby reducing the overall model size. This technique suffers the trade-off of accuracy loss in the model, as it is difficult to predict which tokens to prune. Probabilities and weights for a pruned token are lost, no longer affecting output of other tokens, and the pruned token itself cannot be output. Token pruning can nevertheless be effective in use cases such as summarization or concept classification, since some of the common, small words in the input sequence are obviously less important.

Token pruning effectively zeros the weights associated with that token. Hence, it should be noted that various types of "weight pruning" and "sparsification" of matrices are somewhat related to token pruning, as they can sometimes effectively reduce a token's effect on inference to zero.

Another related technique is that avoiding attention logic on some tokens has been a method researched to speed up transformer attention (i.e. to reduce the quadratic dependence on input length). This is effectively attention-specific token pruning, where other weights for the token may still be used.

Research papers on token pruning:

Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, Oct 2021, https://arxiv.org/abs/2111.00230, https://doi.org/10.48550/arXiv.2111.00230
Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Joseph Hassoun, and Kurt Keutzer, Learned token pruning for transformers. arXiv preprint arXiv:2107.00910, 2021, https://arxiv.org/abs/2107.00910
Hanrui Wang, Zhekai Zhang, and Song Han, SpAtten: Efficient sparse attention architecture with cascade token and head pruning, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021, https://arxiv.org/abs/2012.09852
Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma, Power-bert: Accelerating bert inference via progressive word-vector elimination, In International Conference on Machine Learning, pages 3690–3699, PMLR, 2020, https://arxiv.org/abs/2001.08950, https://doi.org/10.48550/arXiv.2001.08950
Deming Ye, Yankai Lin, Yufei Huang, and Maosong Sun, 2021, TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. arXiv preprint arXiv:2105.11618 (2021), https://arxiv.org/abs/2105.11618
Ofir Press, Noah A Smith, and Mike Lewis, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409 (2021), https://arxiv.org/abs/2108.12409
Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant, A Study on Token Pruning for ColBERT, Dec 2021, https://arxiv.org/abs/2112.06540
Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V Le. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. Proceedings of NeurIPS, June 2020, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer
S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)

Token Merging

Token merging is a type of token pruning achieved by merging two (or more) sequential tokens into a single token. This causes token input compression and reduces overall tokens to be processed, thereby speeding up inference (or training).

Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim, 20 Mar 2024, vid-TLDR: Training Free Token merging for Light-weight Video Transformer, https://arxiv.org/abs/2403.13347, Code: https://github.com/mlvlab/vid-TLDR (Token merging in video with a focus on the background of the image.)
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung-Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh T. Nguyen, Mathias Niepert, 25 May 2024, Accelerating Transformers with Spectrum-Preserving Token Merging, https://arxiv.org/abs/2405.16148
Maxim Bonnaerens, Joni Dambre, 17 Aug 2023 (v2), Learned Thresholds Token Merging and Pruning for Vision Transformers, https://arxiv.org/abs/2307.10780
Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi, 27 May 2023, PuMer: Pruning and Merging Tokens for Efficient Vision Language Models, https://arxiv.org/abs/2305.17530
Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia, 19 May 2023, Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding, https://arxiv.org/abs/2305.11392
Daniel Bolya, Judy Hoffman, 30 Mar 2023, Token Merging for Fast Stable Diffusion, https://arxiv.org/abs/2303.17604
Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme, 24 Feb 2022, Learning to Merge Tokens in Vision Transformers, https://arxiv.org/abs/2202.12015
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, 2 Dec 2023, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, https://arxiv.org/abs/2312.01026
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang, 25 Apr 2024, TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning, https://arxiv.org/abs/2404.16635
Daniel Kienzle, Marco Kantonis, Robin Schön, Rainer Lienhart, 23 May 2024, Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation, https://arxiv.org/abs/2405.14467
Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin, 31 Mar 2024, A General and Efficient Training for Transformer via Token Expansion, https://arxiv.org/abs/2404.00672 Code: https://github.com/Osilly/TokenExpansion (Token merging to accelerate training.)
Xu, L., Wang, L. & Guo, Z., 2024, ATFTrans: attention-weighted token fusion transformer for robust and efficient object tracking. Neural Comput & Applic 36, 7043–7056 (2024). https://doi.org/10.1007/s00521-024-09444-0 https://link.springer.com/article/10.1007/s00521-024-09444-0
Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations including a NAS approached based on building blocks and also combined token pruning/merging for input compression.)
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang, 18 Jun 2024, D2O:Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, https://arxiv.org/abs/2406.13035 (Per-layer KV cache eviction strategies with token merging applied to the KV cache.)
Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion
Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
Yancheng Wang, Yingzhen Yang, 21 Jul 2024, Efficient Visual Transformer by Learnable Token Merging, https://arxiv.org/abs/2407.15219 Code: https://github.com/Statistical-Deep-Learning/LTM
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
J Hong, G Lee, J Cho, Accelerating Multilingual Language Model for Excessively Tokenized Languages, Findings of the Association for Computational Linguistics: ACL 2024, pages 11095–11111 August 11-16, 2024, https://arxiv.org/abs/2401.10660 https://aclanthology.org/2024.findings-acl.660/ https://aclanthology.org/2024.findings-acl.660.pdf
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You, 17 Dec 2024, Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training, https://arxiv.org/abs/2412.12496
Seungdong Yoa, Seungjun Lee, Hyeseung Cho, Bumsoo Kim, Woohyung Lim, 21 Dec 2024, ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition, https://arxiv.org/abs/2412.16491
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Karim Haroun, Thibault Allenet, Karim Ben Chehida, Jean Martinet, 2025, Dynamic hierarchical token merging for vision transformers, VISAPP-2025- 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Feb 2025, Porto, Portugal. hal 04885469 https://hal.science/hal-04885469/document
J. Shin, M. Kang, Y. Han, J. Park and L. -S. Kim, "AToM: Adaptive Token Merging for Efficient Acceleration of Vision Transformer," in IEEE Transactions on Computers, doi: 10.1109/TC.2025.3540638. https://ieeexplore.ieee.org/abstract/document/10880106/
Yu Yang, Yue Zhou, Xiaofang Hu, Shukai Duan, 2025, KFF: K-Feature Fusion Token Merging for Vision Transformer, Expert Systems with Applications, 128206, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2025.128206, https://www.sciencedirect.com/science/article/abs/pii/S0957417425018263

Dynamic Token Pruning

Dynamic token pruning is where the choice of which tokens to discard is made during the inference algorithm. This is a form of dynamic length pruning of the model.

Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Learned Token Pruning for Transformers, 14 August 2022, https://dl.acm.org/doi/abs/10.1145/3534678.3539260, PDF: https://dl.acm.org/doi/pdf/10.1145/3534678.3539260
Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin & Yanzhi Wang, SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning, Nov 2022, LNCS,volume 13671, https://link.springer.com/chapter/10.1007/978-3-031-20083-0_37, Code: https://github.com/PeiyanFlying/SPViT
Peiyan Dong; Mengshu Sun; Alec Lu; Yanyue Xie; Kenneth Liu; Zhenglun Kong; Xin Meng; Zhengang Li; Heatvit: Hardware-efficient adaptive token pruning for vision transformers, 2023, IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2023, DOI: 10.1109/HPCA56546.2023.10071047, https://ieeexplore.ieee.org/abstract/document/10071047
SaiT: Sparse Vision Transformers through Adaptive Token Pruning Ling Li, David Thorsley, Joseph Hassoun Oct 2022, https://arxiv.org/abs/2210.05832
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh, Dynamicvit: Efficient vision transformers with dynamic token sparsification, 2021, Advances in Neural Information Processing Systems 34 (NeurIPS 2021), https://proceedings.neurips.cc/paper_files/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html, PDF: https://proceedings.neurips.cc/paper_files/paper/2021/file/747d3443e319a22747fbb873e8b2f9f2-Paper.pdf
J Li, LL Zhang, J Xu, Y Wang, S Yan, Y Xia, Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference, 2023, https://arxiv.org/abs/2306.14393
Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang & Xiaohui Xie , PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation, ECCV 2022: Computer Vision, pp 424–442, LNCS volume 13665, https://link.springer.com/chapter/10.1007/978-3-031-20065-6_25
Xiangcheng Liu, Tianyi Wu, Guodong Guo, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Sep 2022, https://arxiv.org/abs/2209.13802
Luca Soldaini and Alessandro Moschitti. 2020. The Cascade Transformer: An application for efficient answer sentence selection. In Proceedings of ACL, pages 5697–5708, https://arxiv.org/abs/2005.02534
Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Technique is the "Length adaptive transformer" or LAT)
Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
Yuang Liu, Qiang Zhou, Jing Wang, Zhibin Wang, Fan Wang, Jun Wang, Wei Zhang, Dynamic Token-Pass Transformers for Semantic Segmentation, August 2023, DOI: 10.48550/arXiv.2308.01944, https://ui.adsabs.harvard.edu/abs/2023arXiv230801944L/abstract, PDF: https://arxiv.org/pdf/2308.01944.pdf
Mohsen Fayyaz, Soroush Abbasi Kouhpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, and Juergen Gall. ATS: Adaptive token sampling for efficient vision transformers. In ECCV, July 2022, https://arxiv.org/abs/2111.15667v1
Hongjie Wang, Bhishma Dedhia, Niraj K. Jha, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, May 2023, https://arxiv.org/abs/2305.17328
Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, Revisiting Token Pruning for Object Detection and Instance Segmentation, June 2023, https://arxiv.org/abs/2306.07050
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Xiangcheng Liu, Tianyi Wu, Guodong Guo, July 2023, https://arxiv.org/abs/2209.13802
Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models. arXiv preprint arXiv:2105.14636 (2021), https://arxiv.org/abs/2105.14636v1
Victor Sanh, Thomas Wolf, and Alexander M Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683 (2020), https://arxiv.org/abs/2005.07683
Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall, Adaptive Token Sampling For Efficient Vision Transformers, July 2022, https://arxiv.org/abs/2111.15667
Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. Oct 2020. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior. arXiv preprint arXiv:2010.01791 (2020), https://arxiv.org/abs/2010.01791
François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Sep 2021. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838 (2021), https://arxiv.org/abs/2109.04838
Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, Apr 2022, https://arxiv.org/abs/2202.07800
Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022, https://arxiv.org/abs/2108.01390, Code: https://github.com/YifanXu74/Evo-ViT
Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED2: Interpretability-aware redundancy reduction for vision transformers. In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.12620
Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. AdaViT: Adaptive tokens for efficient vision transformer. arXiv preprint arXiv:2112.07658, 2021 (revised Oct 2022), https://arxiv.org/abs/2112.07658, Code: https://a-vit.github.io/
Hao Yu and Jianxin Wu. A unified pruning framework for vision transformers. arXiv preprint arXiv:2111.15127, Nov 2021, https://arxiv.org/abs/2111.15127
Youwei Liang, Chongjian GE, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. EViT: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations (ICLR), Jan 2022, https://openreview.net/forum?id=BjyvwnXXVn_, PDF: https://openreview.net/pdf?id=BjyvwnXXVn_, Code: https://github.com/youweiliang/evit
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.02034, Code: https://github.com/raoyongming/DynamicViT
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 558–567, October 2021, https://arxiv.org/abs/2101.11986
Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, and Jan Kautz. Nvit: Vision transformer compression and parameter redistribution. arXiv preprint arXiv:2110.04869, 2021, PDF: https://arxiv.org/pdf/2110.04869v1.pdf
Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. Patch slimming for efficient vision transformers. arXiv preprint arXiv:2106.02852, 2021 (revised Apr 2022). https://arxiv.org/abs/2106.02852
Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. Unified visual transformer compression. arXiv preprint arXiv:2203.08243, Mar 2022, https://arxiv.org/abs/2203.08243
Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems, 34, 2021, https://arxiv.org/abs/2106.04533
Mingjian Zhu, Yehui Tang, and Kai Han. Vision transformer pruning. arXiv preprint arXiv:2104.08500, Aug 2021, https://arxiv.org/abs/2104.08500
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016 (revised June 2017), https://arxiv.org/abs/1611.06440
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016, https://arxiv.org/abs/1608.08710
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV 2017, Aug 2017, https://arxiv.org/abs/1708.06519
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017, https://arxiv.org/abs/1707.06342
Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. arXiv preprint arXiv:2104.10858, June 2021, https://arxiv.org/abs/2104.10858, Code: https://github.com/zihangJiang/TokenLabeling
Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models. arXiv preprint arXiv:2105.14636 (2021), PDF: https://arxiv.org/pdf/2105.14636v1.pdf, Code: https://github.com/yaozhewei/mlpruning.git
Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, Transkimmer: Transformer Learns to Layer-wise Skim, May 2022, In AC, https://arxiv.org/abs/2205.07324 (This paper does per-layer dynamic token pruning.)
Learned Token Pruning for Efficient Transformer Inference, Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer, May 11, 2023, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2023-119, Masters Thesis, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.html PDF: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.pdf (Learns threshold-based token pruning parameters; novel approach to token pruning during attention. Also contains good literature survey on token pruning.)
Ofir Press, Noah A Smith, and Mike Lewis. “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”. arXiv preprint arXiv:2108.12409, 2021 (revised Apr 2022), https://arxiv.org/abs/2108.12409 (Attention with Linear Biases (ALiBi) paper)
Token and Head Adaptive Transformers for Efficient Natural Language Processing Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, https://aclanthology.org/2022.coling-1.404/ (Combination of token pruning and attention head pruning, i.e. length/width pruning combined)
Zejiang Hou, Sun-Yuan Kung, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, 2022, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
K Luo, H Li, X Zhou, B Huang, An Attention-Based Token Pruning Method for Vision Transformers, International Joint Conference, IJCRS 2022, Suzhou, China, November 11–14, 2022, Proceedings, Nov 2022, Pages 274–288, https://doi.org/10.1007/978-3-031-21244-4_21
Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2092-2101, http://openaccess.thecvf.com/content/CVPR2023/html/Wei_Joint_Token_Pruning_and_Squeezing_Towards_More_Aggressive_Compression_of_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860 (Related to length pruning and context length, although not fully token pruning.)
Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. Pyramid-BERT: Reducing complexity via successive core-set based token selection. arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
Y Guan, Z Li, Z Lin, Y Zhu, J Leng, M Guo, Block-skim: Efficient question answering for transformer, Proceedings of the AAAI, 2022, https://doi.org/10.1609/aaai.v36i10.21316, https://ojs.aaai.org/index.php/AAAI/article/view/21316, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/21316/21065
Q Tang, B Zhang, J Liu, F Liu, Y Liu, Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation arXiv preprint arXiv:2308.01045, 2023, https://arxiv.org/abs/2308.01045
Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805
Structural-Jump-LSTM (Hansen et al. 2018) Hansen, C.; Hansen, C.; Alstrup, S.; Simonsen, J. G.; and Lioma, C. 2018. Neural Speed Reading with Structural-Jump-LSTM. In International Conference on Learning Representations, https://arxiv.org/abs/1904.00761
Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural Speed Reading via Skim-RNN. In International Conference on Learning Representations, https://arxiv.org/abs/1711.02085
Adams Wei Yu, Hongrae Lee, Quoc V. Le. 2017. Learning to Skim Text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), https://arxiv.org/abs/1704.06877
Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438, PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
L. Denoyer and P. Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014. https://arxiv.org/abs/1410.0510 (Input adaptive method, somewhat related to token pruning.)
Hochreiter, S.; and Schmidhuber, J., 1997. Long short-term memory. Neural computation, 9(8): 1735–1780. https://ieeexplore.ieee.org/abstract/document/6795963 (Early paper, somewhat related to token skimming.)
Campos, V.; Jou, B.; Giro-i-Nieto, X.; Torres, J.; and Chang, S., 2017. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks. CoRR, abs/1708.06834. https://arxiv.org/abs/1708.06834, Code: https://imatge-upc.github.io/skiprnn-2017-telecombcn/
Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 14–26, 2022. https://dl.acm.org/doi/pdf/10.1145/3503222.3507738 (Involves some reordering of tokens.)
Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU. In Proc. of WNG, 2018. https://www.aclweb.org/anthology/W18-2715">URL https://www.aclweb.org/anthology/W18-2715
Xing Shi and Kevin Knight. Speeding up neural machine translation decoding by shrinking run-time vocabulary. In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
Gurvan L’Hostis, David Grangier, and Michael Auli. 2016. Vocabulary Selection Strategies for Neural Machine Translation. Arxiv preprint arXiv:1610.00072, https://arxiv.org/abs/1610.00072
Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar, 2022, AdapLeR: Speeding up Inference by Adaptive Length Reduction, arXiv preprint arXiv:2203.08991, https://arxiv.org/abs/2203.08991 Code: https://github.com/amodaresi/AdapLeR
Hansen, C., Hansen, C., Alstrup, S., Simonsen, J. G., and Lioma, C. (2019). Neural speed reading with structural-jump-LSTM. In International Conference on Learning Representations. https://arxiv.org/abs/1904.00761, https://openreview.net/forum?id=B1xf9j
Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Does some token pruning but is primarily focused on memory optimization, including with token-based data sharding for allocation to different memory banks.)
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey, Sep 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, https://arxiv.org/abs/2309.08600 (Analysis has some relevant to tokenization and token pruning.)
H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Dynamic token pruning for prompt compression.)
X Xu, C Li, Y Chen, X Chang, J Liu, S Wang, Oct 2023, No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, arXiv preprint arXiv:2310.05654, https://arxiv.org/pdf/2310.05654.pdf (Suggests "token idling" that allows reuse of pruned tokens in later layers)
Yucheng Li. April 2023. Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering. ArXiv preprint abs/2304.12102. https://arxiv.org/abs/2304.12102 (Token pruning for prompt compression.)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length. NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić, 13 May 2024, Zero-Shot Tokenizer Transfer, https://arxiv.org/abs/2405.07883 (Overcoming the limitation that the tokenizer is fixed for the model, by training the tokenizer to embeddings mapping so as to use different tokenizers, including effective input token pruning reducing tokens in the input with a larger vocabulary.)
Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji, 9 May 2024, Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference, https://arxiv.org/abs/2405.05803 Code: https://github.com/lzhxmu/VTW (Removing all visual tokens in later layers of a multimodal model, which is effectively early exiting or token pruning, but affecting only the vision part of the multimodal Transformer.)
Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computations on two dimensions earlier in the inference computations.)
Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture of depths is a layer-wise per-token limit to attention head computations, which is like width pruning with dynamic depth.)
David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens can be in the attention computations to give a type of mixed lengthwise pruning combined with a dynamic width pruning or slimmable network approach.)
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
M Sponner, B Waschneck, A Kumar , 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys,, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He, 16 Mar 2024, DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing, https://arxiv.org/abs/2403.10913 (Attention optimizations in Vision Transformer with pruning of feature maps, and extensive parallelization with consideration of the hardware layer.)
Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu, 24 Mar 2024, PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference, https://arxiv.org/abs/2403.16020 (Pruning of patches in input images, which is a form of token or channel pruning.)
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
J Ainslie, T Lei, M de Jong, S Ontañón, 2023, Colt5: Faster long-range transformers with conditional computation, https://arxiv.org/abs/2303.09752
Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V Le. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. arXiv preprint arXiv:2006.03236, 2020. https://arxiv.org/abs/2006.03236 Code: https://github.com/laiguokun/Funnel-Transformer
Ji Xin, Raphael Tang, Zhiying Jiang, Yaoliang Yu, Jimmy Lin, July 2022, Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers, arXiv preprint arXiv:2208.00483, https://arxiv.org/abs/2208.00483
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, 2 Dec 2023, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, https://arxiv.org/abs/2312.01026
BAP Matrix, 2024, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, https://openaccess.thecvf.com/content/CVPR2024/supplemental/Wang_Zero-TPrune_Zero-Shot_Token_CVPR_2024_supplemental.pdf
Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin, 31 Mar 2024, A General and Efficient Training for Transformer via Token Expansion, https://arxiv.org/abs/2404.00672 Code: https://github.com/Osilly/TokenExpansion (Token merging to accelerate training.)
Moein Heidari, Reza Azad, Sina Ghorbani Kolahi, René Arimond, Leon Niggemeier, Alaa Sulaiman, Afshin Bozorgpour, Ehsan Khodapanah Aghdam, Amirhossein Kazerouni, Ilker Hacihaliloglu, Dorit Merhof, 28 Mar 2024, Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights, https://arxiv.org/abs/2403.19882 Project: https://github.com/mindflow-institue/Awesome-Attention-Mechanism-in-Medical-Imaging (Survey of optimization techniques for Vision Transformers, with particular focus on attention optimizations.)
Alessandro Baiocchi, Indro Spinelli, Alessandro Nicolosi, Simone Scardapane, 26 Jan 2024, Adaptive Point Transformer, https://arxiv.org/abs/2401.14845
Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen, 5 Mar 2024, MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer, https://arxiv.org/abs/2403.02991
Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath, 14 Mar 2024, Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, https://arxiv.org/abs/2403.09054 (Reducing KV cache in-memory size and related computations by focusing on a subset of tokens.)
Jurek Leonhardt, Henrik Müller, Koustav Rudra, Megha Khosla, Abhijit Anand, Avishek Anand, Nov 2023, Efficient Neural Ranking using Forward Indexes and Lightweight Encoders, https://arxiv.org/abs/2311.01263
Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations including a NAS approached based on building blocks and also combined token pruning/merging for input compression.)
Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, Nov 2023, Large Language Model Inference with Lexical Shortlisting, https://arxiv.org/abs/2311.09709 (Shortlisting the vocabulary to common words for reduced tokens and embedding matrix size.)
Jungmin Yun, Mihyeon Kim, Youngbin Kim, 2023, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification https://aclanthology.org/2023.findings-emnlp.909.pdf
Chang Liu, Chongyang Tao, Jianxin Liang, Jiazhan Feng, Tao Shen, 2023, Quzhe Huang, Dongyan Zhao,Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4452–4463, December 6-10, 2023, https://aclanthology.org/2023.findings-emnlp.294.pdf (Explores combining static model compression via knowledge distillation with dynamic adaptive inference via token pruning. This creates a modified distillation algorithm that prepares the model for token pruning during training.)
Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2024, Revisiting Token Pruning for Object Detection and Instance Segmentation, IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2024, https://rpg.ifi.uzh.ch/docs/WACV24_Liu.pdf Code: https://github.com/uzh-rpg/svit/ (Examination of token pruning in image processing including re-activating pruned tokens and advanced techniques such as a dynamic pruning rate.)
Jinyu Chen, Wenchao Xu, Zicong Hong, Song Guo, Haozhao Wang, Jie Zhang, Deze Zeng, 10 Jan 2024, OTAS: An Elastic Transformer Serving System via Token Adaptation, https://arxiv.org/abs/2401.05031
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu, Oct 2023, Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity, https://arxiv.org/abs/2310.05175
Ali Modarressi, Hosein Mohebbi, and Mohammad Taher Pilehvar. 2022. AdapLeR: Speeding up inference by adaptive length reduction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–15, Dublin, Ireland. Association for Computational Linguistics, https://doi.org/10.18653/v1/2022.acl-long.1
B Sun, X Ye, Z Wang, H Li, Z Wang 2023 Exploring Coarse-to-Fine Action Token Localization and Interaction for Fine-grained Video Action Recognition, https://dl.acm.org/doi/abs/10.1145/3581783.3612206
F Shi, L Wang, Oct 2023 Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning, arXiv preprint arXiv:2310.17177, https://arxiv.org/pdf/2310.17177.pdf
M Berchansky, P Izsak, A Caciularu, I Dagan, Oct 2023, Optimizing Retrieval-augmented Reader Models via Token Elimination, https://arxiv.org/pdf/2310.13682.pdf Code: https://github.com/mosheber/token_elimination
Zhengyang Zhuge, Peisong Wang, Xingting Yao1, Jian Cheng, 2024, Towards Efficient Spiking Transformer: a Token Sparsification Framework for Training and Inference Acceleration, https://openreview.net/pdf?id=yL6hljtjW4
Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He, and Li Jiang, 2022, DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture, https://mxhx7199.github.io/files/%5BDATE-22%5DDTQAtten_preprint.pdf
David Spuler, March 2024, Chapter 49. Length Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
P Dong, M Sun, A Lu, Y Xie, K Liu, 2023, Heatvit: Hardware-efficient adaptive token pruning for vision transformers, https://ieeexplore.ieee.org/abstract/document/10071047/ https://arxiv.org/pdf/2211.08110
T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context autoencoder for context compression in a large language model,” arXiv preprint arXiv:2307.06945, 2023. https://arxiv.org/abs/2307.06945
Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang, 17 Jun 2024 (v2), SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM, https://arxiv.org/abs/2406.06571
Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe, 18 Jun 2024, Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters, https://arxiv.org/abs/2406.12335 (Extensions of KV cache token pruning methods tjat ise attention scores to find pivotal tokens, generalized to also consider L1 vector norms of value vectors.)
Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch, June 20, 2024, The Ups and Downs of Large Language Model Inference, with Vocabulary Trimming by Language Heuristics, School of Informatics, University of Edinburgh, Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 148–153 https://aclanthology.org/2024.insights-1.17.pdf
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
Yun-Chia Yu; Mao-Chi Weng; Ming-Guang Lin; An-Yeu Andy Wu, May 2024, Retraining-free Constraint-aware Token Pruning for Vision Transformer on Edge Devices, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 19-22 May 2024, https://ieeexplore.ieee.org/abstract/document/10558603
Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, Tushar Sharma, 4 Jul 2024, ALPINE: An adaptive language-agnostic pruning method for language models for code, https://arxiv.org/abs/2407.04147
Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, Yong Wang, 16 Jul 2024 (v2), GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation, https://arxiv.org/abs/2407.10756
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
David Gu, July 18, 2024, Text Compression for Efficient Language Generation, Master’s Thesis, Distributed Computing Group, Computer Engineering and Networks Laboratory, ETH Zürich, https://pub.tik.ee.ethz.ch/students/2023-HS/MA-2023-19.pdf (Training and inference at the sentence level, including caching of embeddings per sentence, which also has the side-effect of compressing the input prompts and reducing computation analogously to token pruning.)
Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
Haode Qi, Cheng Qian, Jian Ni, Pratyush Singh, Reza Fazeli, Gengyu Wang, Zhongzheng Shu, Eric Wayne, Juergen Bross, 21 Aug 2024, Practical token pruning for foundation models in few-shot conversational virtual assistant systems, https://arxiv.org/abs/2408.11799 (Using token pruning for faster intent classification.)
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji, 20 Aug 2024, HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments, https://arxiv.org/abs/2408.10945
Yufei Huang, Xu Han, Maosong Sun, 12 Aug 2024, FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection, https://arxiv.org/abs/2408.06333 https://github.com/thunlp/FastFiD (Sentence-level pruning after encoding on the input text dimension.)
Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193
Li, Yanyu, Aug 2024, Accelerating large scale generative AI : a comprehensive study, Ph.D. Thesis, Northeastern University, Boston, Massachusetts, USA, https://hdl.handle.net/2047/D20669654 https://repository.library.northeastern.edu/files/neu:ms35wj107 https://repository.library.northeastern.edu/files/neu:ms35wj107/fulltext.pdf
CE Song, A Moradifirouzabadi, T Rosing, M Kang, 2024, Efficient Transformer Acceleration via Reconfiguration for Encoder and Decoder Models and Sparsity-Aware Algorithm Mapping, PDF: https://dl.acm.org/doi/pdf/10.1145/3665314.3670798
Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiravathukal, James C. Davis, Yung-Hsiang Lu, 11 Sep 2024, Token Turing Machines are Efficient Vision Models, https://arxiv.org/abs/2409.07613
Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou, 16 Sep 2024, Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models, https://arxiv.org/abs/2409.10197 https://github.com/ywh187/FitPrune
Wang, X, Sep 2024, KELTP: Keyword-Enhanced Learned Token Pruning for Knowledge-Grounded Dialogue. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_16 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_16 (Adaptive removal of low-attention tokens during inference.)
Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma, 27 Sep 2024, Token Caching for Diffusion Transformer Acceleration, https://arxiv.org/abs/2409.18523
Yaniv Leviathan, Matan Kalman, Yossi Matias, 3 Oct 2024, Selective Attention Improves Transformer, https://arxiv.org/abs/2410.02703 (Allowing adjacent tokens to predict whether required for attention.)
Yuxin Li, Yiheng Li, Xulei Yang, Mengying Yu, Zihang Huang, Xiaojun Wu, Chai Kiat Yeo, 9 Oct 2024, Learning Content-Aware Multi-Modal Joint Input Pruning via Bird's-Eye-View Representation, https://arxiv.org/abs/2410.07268
K. Shi et al., "Fitop-Trans: Maximizing Transformer Pipeline Efficiency through Fixed-Length Token Pruning on FPGA," 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), Torino, Italy, 2024, pp. 243-249, doi: 10.1109/FPL64840.2024.00041. https://ieeexplore.ieee.org/abstract/document/10705515/
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhua, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin, 22 Oct 2024, PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction, https://arxiv.org/abs/2410.17247
Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Christopher Keith, Michael Robinson, Francis Duncan , Allan Worthington, Joseph Wilson, Soa Harris, October 22nd, 2024, Optimizing large language models: A novel approach through dynamic token pruning, https://doi.org/10.21203/rs.3.rs-5293588/v1 https://assets-eu.researchsquare.com/files/rs-5293588/v1_covered_52b30393-790e-4ab2-8bee-e6e7d4b16895.pdf?c=1729565713 (Reduces the cost of the unembedding step in the early exit tests by dynamically pruning vocabularies, thereby dynamically reducing the size of the unembedding matrix used in early exit testing.)
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
OpenAI, Dec 2024, OpenAI o1 and new tools for developers, https://openai.com/index/o1-and-new-tools-for-developers/ ("Lower latency: o1 uses on average 60% fewer reasoning tokens than o1-preview for a given request.")
Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang, 17 Dec 2024 (v2), SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator, https://arxiv.org/abs/2412.12094 http://sepllm.github.io/
Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen, 16 Dec 2024, SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval, https://arxiv.org/abs/2412.12009 https://speechprune.github.io/
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, https://arxiv.org/abs/2412.13437 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, (A survey of not just deployment, but many inference optimization techniques.)
Xiaohu Huang, Hao Zhou, Kai Han, 20 Dec 2024, PruneVid: Visual Token Pruning for Efficient Video Large Language Models, https://arxiv.org/abs/2412.16117
Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, 30 Oct 2021, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, https://arxiv.org/abs/2111.00230
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Tuowei Wang, Xingyu Chen, Kun Li, Ting Cao, Ju Ren, Yaoxue Zhang, 15 Jan 2025, LeMo: Enabling LEss Token Involvement for MOre Context Fine-tuning, https://arxiv.org/abs/2501.09767
Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han, 22 Jan 2025, Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference, https://arxiv.org/abs/2501.12959 (Input token scanning efficiencly using early exit during prefill to prune tokens for the decoding phase.)
Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu, 24 Jan 2025, Dynamic Token Reduction during Generation for Vision Language Models, https://arxiv.org/abs/2501.14204 (Dynamic pruning of visual tokens.)
Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang, 13 Feb 2025, InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU, https://arxiv.org/abs/2502.08910 (Using dynamic token pruning and KV cache data offloading to CPU memory.)
Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang, 17 Feb 2025, Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More, https://arxiv.org/abs/2502.11494 https://github.com/ZichenWen1/DART
Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han, 20 Feb 2025, LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention, https://arxiv.org/abs/2502.14866
Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang, 20 Feb 2025, Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression, https://arxiv.org/abs/2502.14477
Shaibal Saha, Lanyu Xu, 26 Feb 2025, Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies, https://arxiv.org/abs/2503.02891
Z. Zou et al., "PF2A-ViT: Parameter-Free and Feature-Aware Dynamic Token Pruning Accelerator With Complementary Quantization-Encoding for Vision Transformer," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2025.3565468, https://ieeexplore.ieee.org/abstract/document/10979990

Token Skipping

Token skipping is a type of token pruning. The idea is to skip selected tokens in the input in an adaptive manner. This is similar to dynamic token pruning or token merging.

Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
Foozhan Ataiefard, Walid Ahmed, Habib Hajimolahoseini, Saina Asani, Farnoosh Javadi, Mohammad Hassanpour, Omar Mohamed Awad, Austin Wen, Kangling Liu, Yang Liu, 27 Jan 2024, SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection, https://arxiv.org/abs/2401.15293
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067

Token Dropping

Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193 (Modifies its computation depending on the difficulty of each input token.)
M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023 Sharcs: Efficient transformers through routing with dynamic width sub-networks https://arxiv.org/pdf/2310.12126.pdf (Direct queries to subnetworks with different widths.)
Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 8 Jul 2024 (v2), Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965
Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou, 24 Mar 2022. Token Dropping for Efficient BERT Pretraining, https://arxiv.org/abs/2203.13240
Huaao Zhang, Shigui Qiu, Xiangyu Duan, Min Zhang, 21 Oct 2020, Token Drop mechanism for Neural Machine Translation, https://arxiv.org/abs/2010.11018
Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao, 24 May 2023, Revisiting Token Dropping Strategy in Efficient BERT Pretraining, https://arxiv.org/abs/2305.15273
Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He, 17 Nov 2022, Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers, https://arxiv.org/abs/2211.11586
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji, 20 Aug 2024, HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments, https://arxiv.org/abs/2408.10945
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang, 16 Nov 2024, Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model, https://arxiv.org/abs/2411.10803
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Tuowei Wang, Xingyu Chen, Kun Li, Ting Cao, Ju Ren, Yaoxue Zhang, 15 Jan 2025, LeMo: Enabling LEss Token Involvement for MOre Context Fine-tuning, https://arxiv.org/abs/2501.09767
Difan Deng, Marius Lindauer, 20 Feb 2025 (v2), Neural Attention Search, https://arxiv.org/abs/2502.13251 (Deciding whether a token deserves global attention, local attention, or sliding window attention, reducing KV caches.)

Prompt Compression

Prompt compression is a type of input token pruning. The idea is to reduce the size of the prompt, thereby reducing the number of tokens that must be processed.

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, Oct 2023, LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression https://arxiv.org/abs/2310.06839 Code: https://aka.ms/LLMLingua
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu, Dec 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, https://arxiv.org/abs/2310.05736 Code: https://aka.ms/LLMLingua
Yucheng Li, 24 Apr 2023, Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering. https://arxiv.org/abs/2304.12102
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen, 4 Nov 2023, Adapting Language Models to Compress Contexts (AutoCompressors method) https://arxiv.org/abs/2305.14788 Cope: https://github.com/princeton-nlp/AutoCompressors
S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
Alessandro Baiocchi, Indro Spinelli, Alessandro Nicolosi, Simone Scardapane, 26 Jan 2024, Adaptive Point Transformer, https://arxiv.org/abs/2401.14845
Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang, 1 Mar 2024 (v2), A Survey on Effective Invocation Methods of Massive LLM Services, https://arxiv.org/abs/2402.03408 (Deployment of LLMs as LLM-as-a-Service or LLMaaS architectures including prompt compression, semantic caching and model selection based on scoring inputs.)
Yichen Jiang, Marco Del Vecchio, Mohit Bansal, Anders Johannsen, March 17-22, 2024 , Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage, Findings of the Association for Computational Linguistics: EACL 2024, pages 2162–2174, https://aclanthology.org/2024.findings-eacl.143.pdf
Amrit Nagarajan, Anand Raghunathan, Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks, PDF: https://www.researchgate.net/profile/Amrit_Nagarajan/publication/375911248_Input_Compression_with_Positional_Consistency_for_Efficient_Training_and_Inference_of_Transformer_Neural_Networks/links/656212a63fa26f66f4281d32/Input-Compression-with-Positional-Consistency-for-Efficient-Training-and-Inference-of-Transformer-Neural-Networks.pdf Code: https://github.com/amrnag/ICPC
Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song, Dec 2023, Compressed Context Memory For Online Language Model Interaction, https://arxiv.org/abs/2312.03414 Code: https://github.com/snu-mllab/context-memory
Jungmin Yun, Mihyeon Kim, Youngbin Kim, 2023, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification https://aclanthology.org/2023.findings-emnlp.909.pdf
Maged S. Al-Shaibani and Irfan Ahmad, 2023, Consonant is all you need: a compact representation of English text for efficient NLP, https://aclanthology.org/2023.findings-emnlp.775.pdf (Tokens in a text are reduced by using a consonant-only representation, removing vowels, which is effectively a type of prompt compression.)
Jinyu Chen, Wenchao Xu, Zicong Hong, Song Guo, Haozhao Wang, Jie Zhang, Deze Zeng, 10 Jan 2024, OTAS: An Elastic Transformer Serving System via Token Adaptation, https://arxiv.org/abs/2401.05031
Iulia Brezeanu, Jan 5, 2024, How to Cut RAG Costs by 80% Using Prompt Compression, Towards Data Science, https://towardsdatascience.com/how-to-cut-rag-costs-by-80-using-prompt-compression-877a07c6bedb
T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context autoencoder for context compression in a large language model,” arXiv preprint arXiv:2307.06945, 2023. https://arxiv.org/abs/2307.06945
J Mu, XL Li, N Goodman, 2023, Learning to compress prompts with gist tokens, https://arxiv.org/abs/2304.08467
Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang, 17 Jun 2024 (v2), SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM, https://arxiv.org/abs/2406.06571
Cangqing Wang, Yutian Yang, Ruisi Li, Dan Sun, Ruicong Cai, Yuzhu Zhang, Chengqian Fu, Lillian Floyd, 18 Apr 2024 (v2), Adapting LLMs for Efficient Context Processing through Soft Prompt Compression, https://arxiv.org/abs/2404.04997
Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami, 11 Jul 2024, Characterizing Prompt Compression Methods for Long Context Inference, https://arxiv.org/abs/2407.08892
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
David Gu, July 18, 2024, Text Compression for Efficient Language Generation, Master’s Thesis, Distributed Computing Group, Computer Engineering and Networks Laboratory, ETH Zürich, https://pub.tik.ee.ethz.ch/students/2023-HS/MA-2023-19.pdf (Training and inference at the sentence level, including caching of embeddings per sentence, which also has the side-effect of compressing the input prompts and reducing computation analogously to token pruning.)
James Groeneveld, Aug 1, 2024, Prompt Design at Character.AI, Character.AI blog, https://research.character.ai/prompt-design-at-character-ai/
Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 8 Jul 2024 (v2), Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965
Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang, 24 Nov 2022, DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention, https://arxiv.org/abs/2211.16368
Barys Liskavets, Maxim Ushakov, Shuvendu Roy, Mark Klibanov, Ali Etemad, Shane Luke, 2 Sep 2024, Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference, https://arxiv.org/abs/2409.01227
Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung, 15 Oct 2024, Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability, https://arxiv.org/abs/2410.11786
Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier, 17 Oct 2024 (v2), Prompt Compression for Large Language Models: A Survey, https://arxiv.org/abs/2410.12388
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu, 5 Dec 2024, NVILA: Efficient Frontier Visual Language Models, https://arxiv.org/abs/2412.04468
Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang, 17 Dec 2024 (v2), SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator, https://arxiv.org/abs/2412.12094 http://sepllm.github.io/
Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami, 2 Dec 2024, Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index, https://arxiv.org/abs/2412.01690
Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yukun Yan, Shuo Wang, Ge Yu, 25 Feb 2024, Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression, https://arxiv.org/abs/2402.16058 https://github.com/OpenMatch/Gist-COCO
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou, 11 Oct 2024 (v2), Compressing Lengthy Context With UltraGist, https://arxiv.org/abs/2405.16635 https://github.com/namespace-Pt/UltraGist
Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
J Köpke, A Safan, 2024, Efficient llm-based conversational process modeling, Business Process Management Workshops, https://isys.uni-klu.ac.at/PDF/BPM_2024_paper_1442.pdf (Examines and improves the token costs of prompt strategies in conversational sessions.)
Huanxuan Liao, Shizhu He, Yupu Hao, Xiang Li, Yuanzhe Zhang, Jun Zhao, Kang Liu, Jan 2025, SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Proceedings of the 31st International Conference on Computational Linguistics, pages 3203–3221 January 19–24, 2025, https://aclanthology.org/2025.coling-main.215.pdf
Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han, 22 Jan 2025, Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference, https://arxiv.org/abs/2501.12959 (Input token scanning efficiencly using early exit during prefill to prune tokens for the decoding phase.)
Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067
Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng, 16 Feb 2025, Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention, https://arxiv.org/abs/2502.11089
Pengfei He, Shaowei Wang, Tse-Hsun Chen, 19 Feb 2025, CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs, https://arxiv.org/abs/2502.14925
Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez, 24 Mar 2025, Understanding and Improving Information Preservation in Prompt Compression for LLMs, https://arxiv.org/abs/2503.19114

Token Compression

It's not really clear that "token compression" is a technique much different from token pruning, token merging, or other types of prompt compression. Nevertheless, here are a few papers on this specific topic:

Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun, 3 Aug 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 Code: https://github.com/OpenBMB/MiniCPM-V
Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang, 22 Nov 2024, DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, https://arxiv.org/abs/2411.15024
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu, 5 Dec 2024, NVILA: Efficient Frontier Visual Language Models, https://arxiv.org/abs/2412.04468
Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai, 12 Dec 2024, SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding, https://arxiv.org/abs/2412.09604 (Reducing the precision of input images via token folding, with a special decoder at the end to ensure output is high precision.)
OpenAI, Dec 2024, OpenAI o1 and new tools for developers, https://openai.com/index/o1-and-new-tools-for-developers/ ("Lower latency: o1 uses on average 60% fewer reasoning tokens than o1-preview for a given request.")
Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang, 17 Dec 2024 (v2), SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator, https://arxiv.org/abs/2412.12094 http://sepllm.github.io/
Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin, 22 Dec 2024, Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers, https://arxiv.org/abs/2412.16822
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
J Köpke, A Safan, 2024, Efficient llm-based conversational process modeling, Business Process Management Workshops, https://isys.uni-klu.ac.at/PDF/BPM_2024_paper_1442.pdf (Examines and improves the token costs of prompt strategies in conversational sessions.)
Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng, 16 Feb 2025, Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention, https://arxiv.org/abs/2502.11089
Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)

Context Compression

Context compression is a type of prompt compression and input token pruning. The idea is to reduce the size of the context that is part of the input prompt. Context tokens that are candidates for compression include, for example, (a) the conversational history in a chatbot session, (b) global instructions used at the beginning of every query, or (c) the supplementary document chunks that are prepended in a RAG architecture.

S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu, 5 Dec 2024, NVILA: Efficient Frontier Visual Language Models, https://arxiv.org/abs/2412.04468
13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
Jeffrey Cheng, Benjamin Van Durme, 17 Dec 2024, Compressed Chain of Thought: Efficient Reasoning Through Dense Representations, https://arxiv.org/abs/2412.13171 (Context compression applied to interim CoT reasoning steps.)
Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou, 23 Dec 2024, A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression, https://arxiv.org/abs/2412.17483
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou, 11 Oct 2024 (v2), Compressing Lengthy Context With UltraGist, https://arxiv.org/abs/2405.16635 https://github.com/namespace-Pt/UltraGist
J Köpke, A Safan, 2024, Efficient llm-based conversational process modeling, Business Process Management Workshops, https://isys.uni-klu.ac.at/PDF/BPM_2024_paper_1442.pdf (Examines and improves the token costs of prompt strategies in conversational sessions.)
H Liao, S He, Y Xu, Y Zhang, S Liu, K Liu, J Zhao, Jan 2025, Awakening Augmented Generation: Learning to Awaken Internal Knowledge of Large Language Models for Question Answering, Proceedings of the 31st International Conference on Computational Linguistics, pages 1333–1352, January 19–24, 2025, https://aclanthology.org/2025.coling-main.89.pdf https://github.com/Xnhyacinth/IAG (Attempts to perform RALM based only on parametric knowledge, without any external sources, thereby optimizing away RAG steps.)
Jeonghun Cho, Gary Geunbae Lee, 23 Jan 2025, K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor, https://arxiv.org/abs/2501.13567
Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant, 27 Jan 2025, Provence: efficient and robust context pruning for retrieval-augmented generation, https://arxiv.org/abs/2501.16214
Maxime Louis, Hervé Déjean, Stéphane Clinchant, 7 Jan 2025, PISCO: Pretty Simple Compression for Retrieval-Augmented Generation, https://arxiv.org/abs/2501.16075
Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong, 20 Feb 2025, ParallelComp: Parallel Long-Context Compressor for Length Extrapolation, https://arxiv.org/abs/2502.14317
Chau Minh Pham, Yapei Chang, Mohit Iyyer, 20 Feb 2025, CLIPPER: Compression enables long-context synthetic data generation, https://arxiv.org/abs/2502.14854
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang, 21 Feb 2025, LightThinker: Thinking Step-by-Step Compression, https://arxiv.org/abs/2502.15589 https://github.com/zjunlp/LightThinker (Faster CoT by compressing the text of intermediate reasoning steps with gist tokens.)
Zhanghao Hu, Hanqi Yan, Qingling Zhu, Zhenyi Shen, Yulan He, Lin Gui, 3 Mar 2025, Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering, https://arxiv.org/abs/2503.01606

Image Token Pruning

Tokenization of images is a two-dimensional process. There are various papers on applying the ideas of "token pruning" to image inputs, which effectively means ignoring some parts of the image.

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan, 25 Mar 2024, LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models, https://arxiv.org/abs/2403.15388 Code: https://llava-prumerge.github.io/ (Compresses input images based on redundant sections.)
Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim, 20 Mar 2024, vid-TLDR: Training Free Token merging for Light-weight Video Transformer, https://arxiv.org/abs/2403.13347, Code: https://github.com/mlvlab/vid-TLDR (Token merging in video with a focus on the background of the image.)
Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan, 26 Jun 2024, LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, https://arxiv.org/abs/2406.18139 (KV cache compression in text and multimodal inference, prioritizing eviction of text over image tokens, and using new ways to merge evicted KV cache data into the retained KV cache, including averaging, pivotal tokens, and weighted averages, which is relevant to token merging and KV cache fusion.)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length. NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
M Sponner, B Waschneck, A Kumar , 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys,, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
Tanvir Mahmud, Burhaneddin Yaman, Chun-Hao Liu, Diana Marculescu, 24 Mar 2024, PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference, https://arxiv.org/abs/2403.16020 (Pruning of patches in input images, which is a form of token or channel pruning.)
Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2024, Revisiting Token Pruning for Object Detection and Instance Segmentation, IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2024, https://rpg.ifi.uzh.ch/docs/WACV24_Liu.pdf Code: https://github.com/uzh-rpg/svit/ (Examination of token pruning in image processing including re-activating pruned tokens and advanced techniques such as a dynamic pruning rate.)
Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang, 22 Nov 2024, DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, https://arxiv.org/abs/2411.15024
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu, 28 Dec 2024, ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming, https://arxiv.org/abs/2412.20105
Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin, 22 Dec 2024, Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers, https://arxiv.org/abs/2412.16822
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)

Vision Token Pruning

The ideas of pruning parts of an image, or "patches" of an image input, can also be applied to vision models. There are various papers on vision LLM token pruning:

Li, Yanyu, Aug 2024, Accelerating large scale generative AI : a comprehensive study, Ph.D. Thesis, Northeastern University, Boston, Massachusetts, USA, https://hdl.handle.net/2047/D20669654 https://repository.library.northeastern.edu/files/neu:ms35wj107 https://repository.library.northeastern.edu/files/neu:ms35wj107/fulltext.pdf
Maxim Bonnaerens, Joni Dambre, 17 Aug 2023 (v2), Learned Thresholds Token Merging and Pruning for Vision Transformers, https://arxiv.org/abs/2307.10780
Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi, 27 May 2023, PuMer: Pruning and Merging Tokens for Efficient Vision Language Models, https://arxiv.org/abs/2305.17530
Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme, 24 Feb 2022, Learning to Merge Tokens in Vision Transformers, https://arxiv.org/abs/2202.12015
Maxim Bonnaerens, Nov 2023, Resource-Efficient Deep Learning for Computer Vision, Ph.D. thesis, Ghent University, https://biblio.ugent.be/publication/01HEMGWENRT8C255N2RD9KAEJC/file/01HEMGZ9JYP8NXPSQJZM14ACT9 (Examines various vision Transformer optimizations including a NAS approached based on building blocks and also combined token pruning/merging for input compression.)
Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin & Yanzhi Wang, SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning, Nov 2022, LNCS,volume 13671, https://link.springer.com/chapter/10.1007/978-3-031-20083-0_37, Code: https://github.com/PeiyanFlying/SPViT
SaiT: Sparse Vision Transformers through Adaptive Token Pruning Ling Li, David Thorsley, Joseph Hassoun Oct 2022, https://arxiv.org/abs/2210.05832
Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang & Xiaohui Xie , PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation, ECCV 2022: Computer Vision, pp 424–442, LNCS volume 13665, https://link.springer.com/chapter/10.1007/978-3-031-20065-6_25
Mohsen Fayyaz, Soroush Abbasi Kouhpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, and Juergen Gall. ATS: Adaptive token sampling for efficient vision transformers. In ECCV, July 2022, https://arxiv.org/abs/2111.15667v1
Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, Apr 2022, https://arxiv.org/abs/2202.07800
Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022, https://arxiv.org/abs/2108.01390, Code: https://github.com/YifanXu74/Evo-ViT
Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED2: Interpretability-aware redundancy reduction for vision transformers. In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.12620
Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. AdaViT: Adaptive tokens for efficient vision transformer. arXiv preprint arXiv:2112.07658, 2021 (revised Oct 2022), https://arxiv.org/abs/2112.07658, Code: https://a-vit.github.io/
Hao Yu and Jianxin Wu. A unified pruning framework for vision transformers. arXiv preprint arXiv:2111.15127, Nov 2021, https://arxiv.org/abs/2111.15127
Youwei Liang, Chongjian GE, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. EViT: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations (ICLR), Jan 2022, https://openreview.net/forum?id=BjyvwnXXVn_, PDF: https://openreview.net/pdf?id=BjyvwnXXVn_, Code: https://github.com/youweiliang/evit
Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.02034, Code: https://github.com/raoyongming/DynamicViT
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 558–567, October 2021, https://arxiv.org/abs/2101.11986
Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, and Jan Kautz. Nvit: Vision transformer compression and parameter redistribution. arXiv preprint arXiv:2110.04869, 2021, PDF: https://arxiv.org/pdf/2110.04869v1.pdf
Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. Patch slimming for efficient vision transformers. arXiv preprint arXiv:2106.02852, 2021 (revised Apr 2022). https://arxiv.org/abs/2106.02852
Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. Unified visual transformer compression. arXiv preprint arXiv:2203.08243, Mar 2022, https://arxiv.org/abs/2203.08243
Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems, 34, 2021, https://arxiv.org/abs/2106.04533
Mingjian Zhu, Yehui Tang, and Kai Han. Vision transformer pruning. arXiv preprint arXiv:2104.08500, Aug 2021, https://arxiv.org/abs/2104.08500
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017, https://arxiv.org/abs/1707.06342
Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. arXiv preprint arXiv:2104.10858, June 2021, https://arxiv.org/abs/2104.10858, Code: https://github.com/zihangJiang/TokenLabeling
Zejiang Hou, Sun-Yuan Kung, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, 2022, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
K Luo, H Li, X Zhou, B Huang, An Attention-Based Token Pruning Method for Vision Transformers, International Joint Conference, IJCRS 2022, Suzhou, China, November 11–14, 2022, Proceedings, Nov 2022, Pages 274–288, https://doi.org/10.1007/978-3-031-21244-4_21
Q Tang, B Zhang, J Liu, F Liu, Y Liu, Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation arXiv preprint arXiv:2308.01045, 2023, https://arxiv.org/abs/2308.01045
X Xu, C Li, Y Chen, X Chang, J Liu, S Wang, Oct 2023, No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, arXiv preprint arXiv:2310.05654, https://arxiv.org/pdf/2310.05654.pdf (Suggests "token idling" that allows reuse of pruned tokens in later layers)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length. NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji, 9 May 2024, Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference, https://arxiv.org/abs/2405.05803 Code: https://github.com/lzhxmu/VTW (Removing all visual tokens in later layers of a multimodal model, which is effectively early exiting or token pruning, but affecting only the vision part of the multimodal Transformer.)
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He, 16 Mar 2024, DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing, https://arxiv.org/abs/2403.10913 (Attention optimizations in Vision Transformer with pruning of feature maps, and extensive parallelization with consideration of the hardware layer.)
Moein Heidari, Reza Azad, Sina Ghorbani Kolahi, René Arimond, Leon Niggemeier, Alaa Sulaiman, Afshin Bozorgpour, Ehsan Khodapanah Aghdam, Amirhossein Kazerouni, Ilker Hacihaliloglu, Dorit Merhof, 28 Mar 2024, Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights, https://arxiv.org/abs/2403.19882 Project: https://github.com/mindflow-institue/Awesome-Attention-Mechanism-in-Medical-Imaging (Survey of optimization techniques for Vision Transformers, with particular focus on attention optimizations.)
Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen, 5 Mar 2024, MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer, https://arxiv.org/abs/2403.02991
Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2024, Revisiting Token Pruning for Object Detection and Instance Segmentation, IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2024, https://rpg.ifi.uzh.ch/docs/WACV24_Liu.pdf Code: https://github.com/uzh-rpg/svit/ (Examination of token pruning in image processing including re-activating pruned tokens and advanced techniques such as a dynamic pruning rate.)
P Dong, M Sun, A Lu, Y Xie, K Liu, 2023, Heatvit: Hardware-efficient adaptive token pruning for vision transformers, https://ieeexplore.ieee.org/abstract/document/10071047/ https://arxiv.org/pdf/2211.08110
Yun-Chia Yu; Mao-Chi Weng; Ming-Guang Lin; An-Yeu Andy Wu, May 2024, Retraining-free Constraint-aware Token Pruning for Vision Transformer on Edge Devices, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 19-22 May 2024, https://ieeexplore.ieee.org/abstract/document/10558603
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji, 20 Aug 2024, HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments, https://arxiv.org/abs/2408.10945
Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiravathukal, James C. Davis, Yung-Hsiang Lu, 11 Sep 2024, Token Turing Machines are Efficient Vision Models, https://arxiv.org/abs/2409.07613
Foozhan Ataiefard, Walid Ahmed, Habib Hajimolahoseini, Saina Asani, Farnoosh Javadi, Mohammad Hassanpour, Omar Mohamed Awad, Austin Wen, Kangling Liu, Yang Liu, 27 Jan 2024, SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection, https://arxiv.org/abs/2401.15293
Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang, 28 Aug 2024, Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models, https://arxiv.org/abs/2408.15518 https://huggingface.co/NexaAIDev/Dolphin (Using vision transformer architecture to process longer text.)
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang, 16 Nov 2024, Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model, https://arxiv.org/abs/2411.10803
Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang, 22 Nov 2024, DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, https://arxiv.org/abs/2411.15024
Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang, 2 Dec 2024, [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster, https://arxiv.org/abs/2412.01818 https://github.com/Theia-4869/FasterVLM
Benjamin Bergner, Christoph Lippert, Aravindh Mahendran, 1 Dec 2024, Token Cropr: Faster ViTs for Quite a Few Tasks, https://arxiv.org/abs/2412.00965
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang, 30 Nov 2024, ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models, https://arxiv.org/abs/2412.00447 https://yxxxb.github.io/ATP-LLaVA-page/
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang, 4 Dec 2024, AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning, https://arxiv.org/abs/2412.03248
Xiaohu Huang, Hao Zhou, Kai Han, 20 Dec 2024, PruneVid: Visual Token Pruning for Efficient Video Large Language Models, https://arxiv.org/abs/2412.16117
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhua, 11 Oct 2024, ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression, https://arxiv.org/abs/2410.08584
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
Q. Dong, S. Zhang and Z. Wang, "An Efficient Window-Based Vision Transformer Accelerator via Mixed-Granularity Sparsity," in IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2025.3527541. https://ieeexplore.ieee.org/abstract/document/10844888
Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu, 24 Jan 2025, Dynamic Token Reduction during Generation for Vision Language Models, https://arxiv.org/abs/2501.14204 (Dynamic pruning of visual tokens.)
Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang, 20 Feb 2025, PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models, https://arxiv.org/abs/2502.14504
Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, Houqiang Li, 11 Mar 2025, Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models, https://arxiv.org/abs/2503.08019
Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu, 18 Mar 2025, Growing a Twig to Accelerate Large Vision-Language Models, https://arxiv.org/abs/2503.14075
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan, 29 Mar 2025 (v2), TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model, https://arxiv.org/abs/2503.18278
Z. Zou et al., "PF2A-ViT: Parameter-Free and Feature-Aware Dynamic Token Pruning Accelerator With Complementary Quantization-Encoding for Vision Transformer," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2025.3565468, https://ieeexplore.ieee.org/abstract/document/10979990

More Research on Pruning Types

Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning