Aussie AI

Dynamic Token Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Dynamic Token Pruning

Dynamic token pruning is where the choice of which tokens to discard is made at runtime, during the inference algorithm. This is a form of dynamic length pruning of the model.
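
As a simple illustration, here is a minimal C++ sketch of threshold-based dynamic token pruning. It assumes a hypothetical per-token importance score (e.g., accumulated attention weight) has already been computed for the current layer, and discards tokens scoring under a threshold so that later layers process a shorter sequence. This is a sketch of the general technique, not the algorithm from any particular paper below.

    // Minimal sketch of dynamic (runtime) token pruning. Assumes a hypothetical
    // per-token "importance" score (e.g., accumulated attention weight) has been
    // computed for the current layer. Tokens below a threshold are discarded,
    // shortening the sequence that later layers must process.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Token {
        int id;                        // token id in the vocabulary
        std::vector<float> embedding;  // hidden-state vector for this token
        float importance;              // e.g., summed attention received this layer
    };

    // Remove low-importance tokens in place; returns the number pruned.
    size_t prune_tokens(std::vector<Token>& tokens, float threshold) {
        size_t before = tokens.size();
        tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                         [threshold](const Token& t) { return t.importance < threshold; }),
                     tokens.end());
        return before - tokens.size();
    }

    int main() {
        // Toy sequence with made-up importance scores.
        std::vector<Token> seq = {
            {101, {0.1f, 0.2f}, 0.90f},
            {205, {0.3f, 0.1f}, 0.05f},  // low importance: will be pruned
            {309, {0.2f, 0.4f}, 0.40f},
        };
        size_t pruned = prune_tokens(seq, /*threshold=*/0.1f);
        printf("Pruned %zu tokens, %zu remain\n", pruned, seq.size());
        return 0;
    }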

Another related lengthwise technique is to avoid the attention logic on some tokens, which has been researched as a way to speed up Transformer attention (i.e., to reduce its quadratic dependence on input length). This is effectively attention-specific token pruning, where the other weights for the token may still be used. See Chapter 20 for more on long context research.
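
As a rough C++ sketch of that idea, the code below computes single-query scaled dot-product attention but skips any token flagged in a per-token skip mask, so skipped tokens contribute nothing to the attention scores or the weighted sum while still remaining in the sequence for the rest of the layer. The mask and the toy vectors are illustrative placeholders, not any specific published method.

    // Sketch of attention-specific token pruning: tokens flagged as "skipped"
    // are excluded from the attention computation (scores and weighted sum),
    // but stay in the sequence for the rest of the layer (e.g., the FFN).
    // The skip mask here is an illustrative placeholder, not a published method.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Single-head, single-query scaled dot-product attention over kept tokens only.
    std::vector<float> attend(const std::vector<float>& query,
                              const std::vector<std::vector<float>>& keys,
                              const std::vector<std::vector<float>>& values,
                              const std::vector<bool>& skip) {
        const size_t d = query.size();
        std::vector<float> weights(keys.size(), 0.0f);
        float denom = 0.0f;
        for (size_t i = 0; i < keys.size(); ++i) {
            if (skip[i]) continue;             // attention logic skipped for this token
            float dot = 0.0f;
            for (size_t j = 0; j < d; ++j) dot += query[j] * keys[i][j];
            weights[i] = std::exp(dot / std::sqrt((float)d));
            denom += weights[i];
        }
        std::vector<float> out(d, 0.0f);
        if (denom == 0.0f) return out;         // all tokens skipped: nothing to attend to
        for (size_t i = 0; i < keys.size(); ++i) {
            if (skip[i]) continue;
            float w = weights[i] / denom;      // softmax weight over kept tokens only
            for (size_t j = 0; j < d; ++j) out[j] += w * values[i][j];
        }
        return out;
    }

    int main() {
        std::vector<std::vector<float>> keys   = {{1, 0}, {0, 1}, {1, 1}};
        std::vector<std::vector<float>> values = {{2, 0}, {0, 2}, {1, 1}};
        std::vector<bool> skip = {false, true, false};  // middle token skipped by attention
        std::vector<float> out = attend({1, 0}, keys, values, skip);
        printf("attention output: %.3f %.3f\n", out[0], out[1]);
        return 0;
    }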

Research papers on dynamic token pruning:

  1. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, 14 August 2022, Learned Token Pruning for Transformers, https://dl.acm.org/doi/abs/10.1145/3534678.3539260, PDF: https://dl.acm.org/doi/pdf/10.1145/3534678.3539260
  2. Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin & Yanzhi Wang, Nov 2022, SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning, LNCS volume 13671, https://link.springer.com/chapter/10.1007/978-3-031-20083-0_37, Code: https://github.com/PeiyanFlying/SPViT
  3. Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, 2023, HeatViT: Hardware-efficient adaptive token pruning for vision transformers, IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2023, DOI: 10.1109/HPCA56546.2023.10071047, https://ieeexplore.ieee.org/abstract/document/10071047
  4. Ling Li, David Thorsley, Joseph Hassoun Oct 2022, SaiT: Sparse Vision Transformers through Adaptive Token Pruning, https://arxiv.org/abs/2210.05832
  5. Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh, 2021, Dynamicvit: Efficient vision transformers with dynamic token sparsification, Advances in Neural Information Processing Systems 34 (NeurIPS 2021), https://proceedings.neurips.cc/paper_files/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html, PDF: https://proceedings.neurips.cc/paper_files/paper/2021/file/747d3443e319a22747fbb873e8b2f9f2-Paper.pdf
  6. J Li, LL Zhang, J Xu, Y Wang, S Yan, Y Xia, 2023, Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference, https://arxiv.org/abs/2306.14393
  7. Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang & Xiaohui Xie, 2022, PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation, ECCV 2022: Computer Vision, pp 424–442, LNCS volume 13665, https://link.springer.com/chapter/10.1007/978-3-031-20065-6_25
  8. Xiangcheng Liu, Tianyi Wu, Guodong Guo, 2022, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Sep 2022, https://arxiv.org/abs/2209.13802
  9. Luca Soldaini and Alessandro Moschitti. 2020. The Cascade Transformer: An application for efficient answer sentence selection, In Proceedings of ACL, pages 5697–5708, https://arxiv.org/abs/2005.02534
  10. Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Technique is the “Length adaptive transformer” or LAT)
  11. Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  12. Yuang Liu, Qiang Zhou, Jing Wang, Zhibin Wang, Fan Wang, Jun Wang, Wei Zhang, 2023, Dynamic Token-Pass Transformers for Semantic Segmentation, August 2023, DOI: 10.48550/arXiv.2308.01944, https://ui.adsabs.harvard.edu/abs/2023arXiv230801944L/abstract, PDF: https://arxiv.org/pdf/2308.01944.pdf
  13. Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, and Juergen Gall. 2022, ATS: Adaptive token sampling for efficient vision transformers, In ECCV, July 2022, https://arxiv.org/abs/2111.15667v1
  14. Hongjie Wang, Bhishma Dedhia, Niraj K. Jha, 2023, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, May 2023, https://arxiv.org/abs/2305.17328
  15. Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2023, Revisiting Token Pruning for Object Detection and Instance Segmentation, June 2023, https://arxiv.org/abs/2306.07050
  16. Xiangcheng Liu, Tianyi Wu, Guodong Guo, July 2023, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, https://arxiv.org/abs/2209.13802
  17. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models, arXiv preprint arXiv:2105.14636 (2021), https://arxiv.org/abs/2105.14636v1
  18. Victor Sanh, Thomas Wolf, and Alexander M Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning, arXiv preprint arXiv:2005.07683 (2020), https://arxiv.org/abs/2005.07683
  19. Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall, 2022, Adaptive Token Sampling For Efficient Vision Transformers, July 2022, https://arxiv.org/abs/2111.15667
  20. Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. Oct 2020. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, arXiv preprint arXiv:2010.01791 (2020), https://arxiv.org/abs/2010.01791
  21. François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Sep 2021. Block pruning for faster transformers, arXiv preprint arXiv:2109.04838 (2021), https://arxiv.org/abs/2109.04838
  22. Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  23. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022, Not all patches are what you need: Expediting vision transformers via token reorganizations, arXiv preprint arXiv:2202.07800, Apr 2022, https://arxiv.org/abs/2202.07800
  24. Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. 2022, Evo-ViT: Slow-fast token evolution for dynamic vision transformer, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022, https://arxiv.org/abs/2108.01390, Code: https://github.com/YifanXu74/Evo-ViT
  25. Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. 2021, IA-RED2: Interpretability-aware redundancy reduction for vision transformers, In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.12620
  26. Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. 2022, AdaViT: Adaptive tokens for efficient vision transformer, arXiv preprint arXiv:2112.07658, 2021 (revised Oct 2022), https://arxiv.org/abs/2112.07658, Code: https://a-vit.github.io/
  27. Hao Yu and Jianxin Wu. 2021, A unified pruning framework for vision transformers, arXiv preprint arXiv:2111.15127, Nov 2021, https://arxiv.org/abs/2111.15127
  28. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022, EViT: Expediting vision transformers via token reorganizations, In International Conference on Learning Representations (ICLR), Jan 2022, https://openreview.net/forum?id=BjyvwnXXVn_, PDF: https://openreview.net/pdf?id=BjyvwnXXVn_, Code: https://github.com/youweiliang/evit
  29. Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021, Dynamicvit: Efficient vision transformers with dynamic token sparsification, In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.02034, Code: https://github.com/raoyongming/DynamicViT
  30. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan. 2021, Tokens-to-token vit: Training vision transformers from scratch on imagenet, In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 558–567, October 2021, https://arxiv.org/abs/2101.11986
  31. Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, and Jan Kautz. 2021, Nvit: Vision transformer compression and parameter redistribution, arXiv preprint arXiv:2110.04869, 2021, PDF: https://arxiv.org/pdf/2110.04869v1.pdf
  32. Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. 2022, Patch slimming for efficient vision transformers, arXiv preprint arXiv:2106.02852, 2021 (revised Apr 2022). https://arxiv.org/abs/2106.02852
  33. Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. 2022, Unified visual transformer compression, arXiv preprint arXiv:2203.08243, Mar 2022, https://arxiv.org/abs/2203.08243
  34. Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. 2021, Chasing sparsity in vision transformers: An end-to-end exploration, Advances in Neural Information Processing Systems, 34, 2021, https://arxiv.org/abs/2106.04533
  35. Mingjian Zhu, Yehui Tang, and Kai Han. 2021, Vision transformer pruning, arXiv preprint arXiv:2104.08500, Aug 2021, https://arxiv.org/abs/2104.08500
  36. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440, 2016 (revised June 2017), https://arxiv.org/abs/1611.06440
  37. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016, Pruning filters for efficient convnets, arXiv preprint arXiv:1608.08710, 2016, https://arxiv.org/abs/1608.08710
  38. Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017, Learning efficient convolutional networks through network slimming, In ICCV 2017, Aug 2017, https://arxiv.org/abs/1708.06519
  39. Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, ThiNet: A filter level pruning method for deep neural network compression, In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017, https://arxiv.org/abs/1707.06342
  40. Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. 2021, All tokens matter: Token labeling for training better vision transformers, arXiv preprint arXiv:2104.10858, June 2021, https://arxiv.org/abs/2104.10858, Code: https://github.com/zihangJiang/TokenLabeling
  41. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models, arXiv preprint arXiv:2105.14636 (2021), PDF: https://arxiv.org/pdf/2105.14636v1.pdf, Code: https://github.com/yaozhewei/mlpruning.git
  42. Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, 2022, Transkimmer: Transformer Learns to Layer-wise Skim, May 2022, In ACL 2022, https://arxiv.org/abs/2205.07324 (This paper does per-layer dynamic token pruning.)
  43. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer, May 11, 2023, Learned Token Pruning for Efficient Transformer Inference, Master's Thesis, Technical Report No. UCB/EECS-2023-119, Electrical Engineering and Computer Sciences, University of California, Berkeley, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.html, PDF: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.pdf (Learns threshold-based token pruning parameters; a novel approach to token pruning during attention. Also contains a good literature survey on token pruning.)
  44. Ofir Press, Noah A Smith, and Mike Lewis. 2022, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409, 2021 (revised Apr 2022), https://arxiv.org/abs/2108.12409 (Attention with Linear Biases (ALiBi) paper)
  45. Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, Token and Head Adaptive Transformers for Efficient Natural Language Processing, https://aclanthology.org/2022.coling-1.404/ (Combination of token pruning and attention head pruning, i.e. length/width pruning combined)
  46. Zejiang Hou, Sun-Yuan Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
  47. K Luo, H Li, X Zhou, B Huang, 2022, An Attention-Based Token Pruning Method for Vision Transformers, International Joint Conference, IJCRS 2022, Suzhou, China, November 11–14, 2022, Proceedings, Nov 2022, Pages 274–288, https://doi.org/10.1007/978-3-031-21244-4_21
  48. Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2092-2101, http://openaccess.thecvf.com/content/CVPR2023/html/Wei_Joint_Token_Pruning_and_Squeezing_Towards_More_Aggressive_Compression_of_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  49. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019, Transformer-xl: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860 (Related to length pruning and context length, although not fully token pruning.)
  50. Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. 2022, Pyramid-BERT: Reducing complexity via successive core-set based token selection, arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
  51. Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2022, Token merging: Your vit but faster, arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
  52. Y Guan, Z Li, Z Lin, Y Zhu, J Leng, M Guo, 2022, Block-skim: Efficient question answering for transformer, Proceedings of the AAAI, 2022, https://doi.org/10.1609/aaai.v36i10.21316, https://ojs.aaai.org/index.php/AAAI/article/view/21316, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/21316/21065
  53. Q Tang, B Zhang, J Liu, F Liu, Y Liu, 2023, Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation, arXiv preprint arXiv:2308.01045, 2023, https://arxiv.org/abs/2308.01045
  54. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, 2023, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805
  55. Hansen, C.; Hansen, C.; Alstrup, S.; Simonsen, J. G.; and Lioma, C. 2018. Neural Speed Reading with Structural-Jump-LSTM, In International Conference on Learning Representations, https://arxiv.org/abs/1904.00761
  56. Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural Speed Reading via Skim-RNN, In International Conference on Learning Representations, https://arxiv.org/abs/1711.02085
  57. Adams Wei Yu, Hongrae Lee, Quoc V. Le. 2017. Learning to Skim Text, In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), https://arxiv.org/abs/1704.06877
  58. Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438, PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
  59. L. Denoyer and P. Gallinari. 2014, Deep sequential neural network, arXiv preprint arXiv:1410.0510, 2014. https://arxiv.org/abs/1410.0510 (Input adaptive method, somewhat related to token pruning.)
  60. Hochreiter, S.; and Schmidhuber, J., 1997. Long short-term memory, Neural computation, 9(8): 1735–1780. https://ieeexplore.ieee.org/abstract/document/6795963 (Early paper, somewhat related to token skimming.)
  61. Campos, V.; Jou, B.; Giro-i-Nieto, X.; Torres, J.; and Chang, S., 2017. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, CoRR, abs/1708.06834. https://arxiv.org/abs/1708.06834, Code: https://imatge-upc.github.io/skiprnn-2017-telecombcn/
  62. Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. 2022, DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration, In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 14–26, 2022. https://dl.acm.org/doi/pdf/10.1145/3503222.3507738 (Involves some reordering of tokens.)
  63. Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNG, 2018. https://www.aclweb.org/anthology/W18-2715
  64. Xing Shi and Kevin Knight. 2017, Speeding up neural machine translation decoding by shrinking run-time vocabulary, In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
  65. Gurvan L’Hostis, David Grangier, and Michael Auli. 2016. Vocabulary Selection Strategies for Neural Machine Translation, arXiv preprint arXiv:1610.00072, https://arxiv.org/abs/1610.00072
  66. Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar, 2022, AdapLeR: Speeding up Inference by Adaptive Length Reduction, arXiv preprint arXiv:2203.08991, https://arxiv.org/abs/2203.08991 Code: https://github.com/amodaresi/AdapLeR
  67. Hansen, C., Hansen, C., Alstrup, S., Simonsen, J. G., and Lioma, C. (2019). Neural speed reading with structural-jump-LSTM, In International Conference on Learning Representations. https://arxiv.org/abs/1904.00761, https://openreview.net/forum?id=B1xf9j
  68. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
  69. Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Does some token pruning but is primarily focused on memory optimization, including with token-based data sharding for allocation to different memory banks.)
  70. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey, Sep 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, https://arxiv.org/abs/2309.08600 (Analysis has some relevance to tokenization and token pruning.)
  71. H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Dynamic token pruning for prompt compression.)
  72. X Xu, C Li, Y Chen, X Chang, J Liu, S Wang, Oct 2023, No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, arXiv preprint arXiv:2310.05654, https://arxiv.org/pdf/2310.05654.pdf (Suggests “token idling” that allows reuse of pruned tokens in later layers)
  73. Yucheng Li. April 2023. Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering, ArXiv preprint abs/2304.12102. https://arxiv.org/abs/2304.12102 (Token pruning for prompt compression.)
  74. Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length, NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
  75. Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens, arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
  76. Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)

For more research papers on dynamic token pruning, see https://www.aussieai.com/research/token-pruning#dynamic.

 
