Aussie AI

Dynamic Token Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Dynamic Token Pruning

Dynamic token pruning is where the choice of which tokens to discard is made at runtime, during the inference algorithm. This is a form of dynamic length pruning of the model.
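
As a simple illustration, here is a minimal C++ sketch of threshold-based dynamic token pruning. It assumes a hypothetical per-token importance score (e.g., accumulated attention weight) has already been computed for the current layer, and discards tokens scoring under a threshold so that later layers process a shorter sequence. This is a sketch of the general technique, not the algorithm from any particular paper below.

    // Minimal sketch of dynamic (runtime) token pruning. Assumes a hypothetical
    // per-token "importance" score (e.g., accumulated attention weight) has been
    // computed for the current layer. Tokens below a threshold are discarded,
    // shortening the sequence that later layers must process.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Token {
        int id;                        // token id in the vocabulary
        std::vector<float> embedding;  // hidden-state vector for this token
        float importance;              // e.g., summed attention received this layer
    };

    // Remove low-importance tokens in place; returns the number pruned.
    size_t prune_tokens(std::vector<Token>& tokens, float threshold) {
        size_t before = tokens.size();
        tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                         [threshold](const Token& t) { return t.importance < threshold; }),
                     tokens.end());
        return before - tokens.size();
    }

    int main() {
        // Toy sequence with made-up importance scores.
        std::vector<Token> seq = {
            {101, {0.1f, 0.2f}, 0.90f},
            {205, {0.3f, 0.1f}, 0.05f},  // low importance: will be pruned
            {309, {0.2f, 0.4f}, 0.40f},
        };
        size_t pruned = prune_tokens(seq, /*threshold=*/0.1f);
        printf("Pruned %zu tokens, %zu remain\n", pruned, seq.size());
        return 0;
    }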

Another related lengthwise technique is to avoid the attention logic on some tokens, which has been researched as a way to speed up Transformer attention (i.e., to reduce its quadratic dependence on input length). This is effectively attention-specific token pruning, where the other weights for the token may still be used. See Chapter 20 for more on long context research.
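
As a rough C++ sketch of that idea, the code below computes single-query scaled dot-product attention but skips any token flagged in a per-token skip mask, so skipped tokens contribute nothing to the attention scores or the weighted sum while still remaining in the sequence for the rest of the layer. The mask and the toy vectors are illustrative placeholders, not any specific published method.

    // Sketch of attention-specific token pruning: tokens flagged as "skipped"
    // are excluded from the attention computation (scores and weighted sum),
    // but stay in the sequence for the rest of the layer (e.g., the FFN).
    // The skip mask here is an illustrative placeholder, not a published method.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Single-head, single-query scaled dot-product attention over kept tokens only.
    std::vector<float> attend(const std::vector<float>& query,
                              const std::vector<std::vector<float>>& keys,
                              const std::vector<std::vector<float>>& values,
                              const std::vector<bool>& skip) {
        const size_t d = query.size();
        std::vector<float> weights(keys.size(), 0.0f);
        float denom = 0.0f;
        for (size_t i = 0; i < keys.size(); ++i) {
            if (skip[i]) continue;             // attention logic skipped for this token
            float dot = 0.0f;
            for (size_t j = 0; j < d; ++j) dot += query[j] * keys[i][j];
            weights[i] = std::exp(dot / std::sqrt((float)d));
            denom += weights[i];
        }
        std::vector<float> out(d, 0.0f);
        if (denom == 0.0f) return out;         // all tokens skipped: nothing to attend to
        for (size_t i = 0; i < keys.size(); ++i) {
            if (skip[i]) continue;
            float w = weights[i] / denom;      // softmax weight over kept tokens only
            for (size_t j = 0; j < d; ++j) out[j] += w * values[i][j];
        }
        return out;
    }

    int main() {
        std::vector<std::vector<float>> keys   = {{1, 0}, {0, 1}, {1, 1}};
        std::vector<std::vector<float>> values = {{2, 0}, {0, 2}, {1, 1}};
        std::vector<bool> skip = {false, true, false};  // middle token skipped by attention
        std::vector<float> out = attend({1, 0}, keys, values, skip);
        printf("attention output: %.3f %.3f\n", out[0], out[1]);
        return 0;
    }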

Research papers on dynamic token pruning:

  1. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, 14 August 2022, Learned Token Pruning for Transformers, https://dl.acm.org/doi/abs/10.1145/3534678.3539260, PDF: https://dl.acm.org/doi/pdf/10.1145/3534678.3539260
  2. Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin & Yanzhi Wang, Nov 2022, SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning, LNCS volume 13671, https://link.springer.com/chapter/10.1007/978-3-031-20083-0_37, Code: https://github.com/PeiyanFlying/SPViT
  3. Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, 2023, HeatViT: Hardware-efficient adaptive token pruning for vision transformers, IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2023, DOI: 10.1109/HPCA56546.2023.10071047, https://ieeexplore.ieee.org/abstract/document/10071047
  4. Ling Li, David Thorsley, Joseph Hassoun Oct 2022, SaiT: Sparse Vision Transformers through Adaptive Token Pruning, https://arxiv.org/abs/2210.05832
  5. Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh, 2021, Dynamicvit: Efficient vision transformers with dynamic token sparsification, Advances in Neural Information Processing Systems 34 (NeurIPS 2021), https://proceedings.neurips.cc/paper_files/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html, PDF: https://proceedings.neurips.cc/paper_files/paper/2021/file/747d3443e319a22747fbb873e8b2f9f2-Paper.pdf
  6. J Li, LL Zhang, J Xu, Y Wang, S Yan, Y Xia, 2023, Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference, https://arxiv.org/abs/2306.14393
  7. Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang & Xiaohui Xie, 2022, PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation, ECCV 2022: Computer Vision, pp 424–442, LNCS volume 13665, https://link.springer.com/chapter/10.1007/978-3-031-20065-6_25
  8. Xiangcheng Liu, Tianyi Wu, Guodong Guo, 2022, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, Sep 2022, https://arxiv.org/abs/2209.13802
  9. Luca Soldaini and Alessandro Moschitti. 2020. The Cascade Transformer: An application for efficient answer sentence selection, In Proceedings of ACL, pages 5697–5708, https://arxiv.org/abs/2005.02534
  10. Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501–6511, https://arxiv.org/abs/2010.07003, Code: https://github.com/clovaai/length-adaptive-transformer (Technique is the “Length adaptive transformer” or LAT)
  11. Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  12. Yuang Liu, Qiang Zhou, Jing Wang, Zhibin Wang, Fan Wang, Jun Wang, Wei Zhang, 2023, Dynamic Token-Pass Transformers for Semantic Segmentation, August 2023, DOI: 10.48550/arXiv.2308.01944, https://ui.adsabs.harvard.edu/abs/2023arXiv230801944L/abstract, PDF: https://arxiv.org/pdf/2308.01944.pdf
  13. Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, and Juergen Gall. 2022, ATS: Adaptive token sampling for efficient vision transformers, In ECCV, July 2022, https://arxiv.org/abs/2111.15667v1
  14. Hongjie Wang, Bhishma Dedhia, Niraj K. Jha, 2023, Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, May 2023, https://arxiv.org/abs/2305.17328
  15. Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza, 2023, Revisiting Token Pruning for Object Detection and Instance Segmentation, June 2023, https://arxiv.org/abs/2306.07050
  16. Xiangcheng Liu, Tianyi Wu, Guodong Guo, July 2023, Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, https://arxiv.org/abs/2209.13802
  17. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models, arXiv preprint arXiv:2105.14636 (2021), https://arxiv.org/abs/2105.14636v1
  18. Victor Sanh, Thomas Wolf, and Alexander M Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning, arXiv preprint arXiv:2005.07683 (2020), https://arxiv.org/abs/2005.07683
  19. Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall, 2022, Adaptive Token Sampling For Efficient Vision Transformers, July 2022, https://arxiv.org/abs/2111.15667
  20. Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. Oct 2020. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, arXiv preprint arXiv:2010.01791 (2020), https://arxiv.org/abs/2010.01791
  21. François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Sep 2021. Block pruning for faster transformers, arXiv preprint arXiv:2109.04838 (2021), https://arxiv.org/abs/2109.04838
  22. Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Apr 2023, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  23. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022, Not all patches are what you need: Expediting vision transformers via token reorganizations, arXiv preprint arXiv:2202.07800, Apr 2022, https://arxiv.org/abs/2202.07800
  24. Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. 2022, Evo-ViT: Slow-fast token evolution for dynamic vision transformer, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022, https://arxiv.org/abs/2108.01390, Code: https://github.com/YifanXu74/Evo-ViT
  25. Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. 2021, IA-RED2: Interpretability-aware redundancy reduction for vision transformers, In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.12620
  26. Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. 2022, AdaViT: Adaptive tokens for efficient vision transformer, arXiv preprint arXiv:2112.07658, 2021 (revised Oct 2022), https://arxiv.org/abs/2112.07658, Code: https://a-vit.github.io/
  27. Hao Yu and Jianxin Wu. 2021, A unified pruning framework for vision transformers, arXiv preprint arXiv:2111.15127, Nov 2021, https://arxiv.org/abs/2111.15127
  28. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022, EViT: Expediting vision transformers via token reorganizations, In International Conference on Learning Representations (ICLR), Jan 2022, https://openreview.net/forum?id=BjyvwnXXVn_, PDF: https://openreview.net/pdf?id=BjyvwnXXVn_, Code: https://github.com/youweiliang/evit
  29. Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021, Dynamicvit: Efficient vision transformers with dynamic token sparsification, In Advances in Neural Information Processing Systems (NeurIPS), Oct 2021, https://arxiv.org/abs/2106.02034, Code: https://github.com/raoyongming/DynamicViT
  30. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E.H. Tay, Jiashi Feng, and Shuicheng Yan. 2021, Tokens-to-token vit: Training vision transformers from scratch on imagenet, In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 558–567, October 2021, https://arxiv.org/abs/2101.11986
  31. Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, and Jan Kautz. 2021, Nvit: Vision transformer compression and parameter redistribution, arXiv preprint arXiv:2110.04869, 2021, PDF: https://arxiv.org/pdf/2110.04869v1.pdf
  32. Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. 2022, Patch slimming for efficient vision transformers, arXiv preprint arXiv:2106.02852, 2021 (revised Apr 2022). https://arxiv.org/abs/2106.02852
  33. Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, and Zhangyang Wang. 2022, Unified visual transformer compression, arXiv preprint arXiv:2203.08243, Mar 2022, https://arxiv.org/abs/2203.08243
  34. Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. 2021, Chasing sparsity in vision transformers: An end-to-end exploration, Advances in Neural Information Processing Systems, 34, 2021, https://arxiv.org/abs/2106.04533
  35. Mingjian Zhu, Yehui Tang, and Kai Han. 2021, Vision transformer pruning, arXiv preprint arXiv:2104.08500, Aug 2021, https://arxiv.org/abs/2104.08500
  36. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440, 2016 (revised June 2017), https://arxiv.org/abs/1611.06440
  37. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016, Pruning filters for efficient convnets, arXiv preprint arXiv:1608.08710, 2016, https://arxiv.org/abs/1608.08710
  38. Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017, Learning efficient convolutional networks through network slimming, In ICCV 2017, Aug 2017, https://arxiv.org/abs/1708.06519
  39. Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017, ThiNet: A filter level pruning method for deep neural network compression, In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017, https://arxiv.org/abs/1707.06342
  40. Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. 2021, All tokens matter: Token labeling for training better vision transformers, arXiv preprint arXiv:2104.10858, June 2021, https://arxiv.org/abs/2104.10858, Code: https://github.com/zihangJiang/TokenLabeling
  41. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. 2021. MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models, arXiv preprint arXiv:2105.14636 (2021), PDF: https://arxiv.org/pdf/2105.14636v1.pdf, Code: https://github.com/yaozhewei/mlpruning.git
  42. Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, 2022, Transkimmer: Transformer Learns to Layer-wise Skim, May 2022, In ACL 2022, https://arxiv.org/abs/2205.07324 (This paper does per-layer dynamic token pruning.)
  43. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer, May 11, 2023, Learned Token Pruning for Efficient Transformer Inference, Master's Thesis, Technical Report No. UCB/EECS-2023-119, Electrical Engineering and Computer Sciences, University of California, Berkeley, http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.html, PDF: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-119.pdf (Learns threshold-based token pruning parameters; a novel approach to token pruning during attention. Also contains a good literature survey on token pruning.)
  44. Ofir Press, Noah A Smith, and Mike Lewis. 2022, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409, 2021 (revised Apr 2022), https://arxiv.org/abs/2108.12409 (Attention with Linear Biases (ALiBi) paper)
  45. Chonghan Lee, Md Fahim Faysal Khan, Rita Brugarolas Brufau, Ke Ding, Vijaykrishnan Narayanan, Oct 2022, Token and Head Adaptive Transformers for Efficient Natural Language Processing, https://aclanthology.org/2022.coling-1.404/ (Combination of token pruning and attention head pruning, i.e. length/width pruning combined)
  46. Zejiang Hou, Sun-Yuan Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/html/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.html (Multi-dimensional pruning.)
  47. K Luo, H Li, X Zhou, B Huang, 2022, An Attention-Based Token Pruning Method for Vision Transformers, International Joint Conference, IJCRS 2022, Suzhou, China, November 11–14, 2022, Proceedings, Nov 2022, Pages 274–288, https://doi.org/10.1007/978-3-031-21244-4_21
  48. Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang, 2023, Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2092-2101, http://openaccess.thecvf.com/content/CVPR2023/html/Wei_Joint_Token_Pruning_and_Squeezing_Towards_More_Aggressive_Compression_of_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10716, Code: https://github.com/megvii-research/TPS-CVPR2023
  49. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019, Transformer-xl: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860 (Related to length pruning and context length, although not fully token pruning.)
  50. Xin Huang, Ashish Khetan, Rene Bidart, and Zohar Karnin. 2022, Pyramid-BERT: Reducing complexity via successive core-set based token selection, arXiv preprint arXiv:2203.14380, 2022. https://arxiv.org/abs/2203.14380
  51. Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2022, Token merging: Your vit but faster, arXiv preprint arXiv:2210.09461, 2022, https://arxiv.org/abs/2210.09461 (Token merging idea is similar to token pruning.)
  52. Y Guan, Z Li, Z Lin, Y Zhu, J Leng, M Guo, 2022, Block-skim: Efficient question answering for transformer, Proceedings of the AAAI, 2022, https://doi.org/10.1609/aaai.v36i10.21316, https://ojs.aaai.org/index.php/AAAI/article/view/21316, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/21316/21065
  53. Q Tang, B Zhang, J Liu, F Liu, Y Liu, 2023, Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation, arXiv preprint arXiv:2308.01045, 2023, https://arxiv.org/abs/2308.01045
  54. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, 2023, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805
  55. Hansen, C.; Hansen, C.; Alstrup, S.; Simonsen, J. G.; and Lioma, C. 2018. Neural Speed Reading with Structural-Jump-LSTM, In International Conference on Learning Representations, https://arxiv.org/abs/1904.00761
  56. Seo, M.; Min, S.; Farhadi, A.; and Hajishirzi, H. 2018. Neural Speed Reading via Skim-RNN, In International Conference on Learning Representations, https://arxiv.org/abs/1711.02085
  57. Adams Wei Yu, Hongrae Lee, Quoc V. Le. 2017. Learning to Skim Text, In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), https://arxiv.org/abs/1704.06877
  58. Y Wang, K Chen, H Tan, K Guo, 2023, Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems, Rome, Italy May 2023, Pages 233–248, https://doi.org/10.1145/3552326.3587438, PDF: https://cse.hkust.edu.hk/~kaichen/papers/tabi-eurosys23.pdf
  59. L. Denoyer and P. Gallinari. 2014, Deep sequential neural network, arXiv preprint arXiv:1410.0510, 2014. https://arxiv.org/abs/1410.0510 (Input adaptive method, somewhat related to token pruning.)
  60. Hochreiter, S.; and Schmidhuber, J., 1997. Long short-term memory, Neural computation, 9(8): 1735–1780. https://ieeexplore.ieee.org/abstract/document/6795963 (Early paper, somewhat related to token skimming.)
  61. Campos, V.; Jou, B.; Giro-i-Nieto, X.; Torres, J.; and Chang, S., 2017. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks, CoRR, abs/1708.06834. https://arxiv.org/abs/1708.06834, Code: https://imatge-upc.github.io/skiprnn-2017-telecombcn/
  62. Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. 2022, DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration, In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 14–26, 2022. https://dl.acm.org/doi/pdf/10.1145/3503222.3507738 (Involves some reordering of tokens.)
  63. Jean Senellart, Dakun Zhang, Bo Wang, Guillaume Klein, Jean-Pierre Ramatchandirin, Josep Crego, and Alexander Rush. 2018, OpenNMT system description for WNMT 2018: 800 words/sec on a single-core CPU, In Proc. of WNG, 2018. https://www.aclweb.org/anthology/W18-2715
  64. Xing Shi and Kevin Knight. 2017, Speeding up neural machine translation decoding by shrinking run-time vocabulary, In Proc. of ACL, 2017. https://aclanthology.org/P17-2091/, PDF: http://xingshi.me/data/pdf/ACL2017short.pdf
  65. Gurvan L’Hostis, David Grangier, and Michael Auli. 2016. Vocabulary Selection Strategies for Neural Machine Translation, arXiv preprint arXiv:1610.00072, https://arxiv.org/abs/1610.00072
  66. Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar, 2022, AdapLeR: Speeding up Inference by Adaptive Length Reduction, arXiv preprint arXiv:2203.08991, https://arxiv.org/abs/2203.08991 Code: https://github.com/amodaresi/AdapLeR
  67. Hansen, C., Hansen, C., Alstrup, S., Simonsen, J. G., and Lioma, C. (2019). Neural speed reading with structural-jump-LSTM, In International Conference on Learning Representations. https://arxiv.org/abs/1904.00761, https://openreview.net/forum?id=B1xf9j
  68. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input dependent matching of weight clusters from tokens is vaguely similar to token pruning or length pruning.)
  69. Minxuan Zhou; Weihong Xu; Jaeyoung Kang; Tajana Rosing, 2022, TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/document/9773212 PDF: https://par.nsf.gov/servlets/purl/10345536 (Does some token pruning but is primarily focused on memory optimization, including with token-based data sharding for allocation to different memory banks.)
  70. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey, Sep 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, https://arxiv.org/abs/2309.08600 (Analysis has some relevance to tokenization and token pruning.)
  71. H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Dynamic token pruning for prompt compression.)
  72. X Xu, C Li, Y Chen, X Chang, J Liu, S Wang, Oct 2023, No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, arXiv preprint arXiv:2310.05654, https://arxiv.org/pdf/2310.05654.pdf (Suggests “token idling” that allows reuse of pruned tokens in later layers)
  73. Yucheng Li. April 2023. Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering, ArXiv preprint abs/2304.12102. https://arxiv.org/abs/2304.12102 (Token pruning for prompt compression.)
  74. Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G., May 2021, Not all images are worth 16x16 words: Dynamic vision transformers with adaptive sequence length, NeurIPS 2021, https://arxiv.org/abs/2105.15075, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer, Code: https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore
  75. Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens, arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
  76. Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a subset of image inputs, which is analogous to token pruning.)

For more research papers on dynamic token pruning, see https://www.aussieai.com/research/token-pruning#dynamic.

 
