Token Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Token pruning is a type of model “length pruning” that aims to reduce the cost of processing input sequences and their associated embeddings. It is closely related to “embeddings pruning”, but is orthogonal to depth pruning (e.g. layer pruning) and width pruning (e.g. attention head pruning).

Input token pruning refers to removing some of the tokens that have a low probability of affecting the output, based on some evaluation metric. It is also called “prompt compression” because the user's input prompt is shortened by removing the less important tokens. From the model's perspective, this shortens the input sequence in proportion to the number of tokens pruned, thereby reducing the overall inference computation. This technique suffers a trade-off of accuracy loss in the model, as it is difficult to predict which tokens can be safely pruned.
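
As a concrete illustration, the sketch below prunes every token whose importance score falls under a threshold. This is a minimal C++ example, assuming that a per-token importance score has already been computed elsewhere (e.g. derived from attention weights or a small learned scoring module); the TokenSeq type, the prune_tokens function, and the sample scores are illustrative placeholders rather than part of any particular library.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct TokenSeq {
        std::vector<int> ids;       // token ids of the input sequence
        std::vector<float> scores;  // per-token importance scores (assumed precomputed)
    };

    // Keep only the tokens whose importance score meets the threshold.
    TokenSeq prune_tokens(const TokenSeq& in, float threshold)
    {
        TokenSeq out;
        for (std::size_t i = 0; i < in.ids.size(); ++i) {
            if (in.scores[i] >= threshold) {
                out.ids.push_back(in.ids[i]);
                out.scores.push_back(in.scores[i]);
            }
        }
        return out;  // shorter sequence; pruned tokens no longer contribute
    }

    int main()
    {
        // Hypothetical token ids and importance scores for a short prompt.
        TokenSeq seq { { 101, 2023, 2003, 1037, 2307, 2154, 102 },
                       { 0.9f, 0.2f, 0.1f, 0.05f, 0.8f, 0.7f, 0.9f } };
        TokenSeq pruned = prune_tokens(seq, 0.5f);
        std::cout << "Kept " << pruned.ids.size()
                  << " of " << seq.ids.size() << " tokens\n";
        return 0;
    }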

The probabilities and weights for a pruned token are lost, so it no longer affects the output computed for the other tokens. Token pruning and prompt compression can nevertheless be effective in use cases such as summarization or concept classification, since many of the common, small words in the input sequence are obviously less important.
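
In the same spirit, a very crude form of prompt compression is simply dropping common stopwords from the prompt text before it is tokenized. The sketch below does exactly that with a tiny, hypothetical stopword list; real prompt compression methods use learned importance scores rather than a fixed word list, so this is only a rough illustration of the idea.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_set>

    // Remove common "small" words from a prompt before tokenization.
    // The stopword list here is a tiny hypothetical sample.
    std::string compress_prompt(const std::string& prompt)
    {
        static const std::unordered_set<std::string> stopwords = {
            "a", "an", "the", "of", "to", "is", "and", "in", "on"
        };
        std::istringstream in(prompt);
        std::string word, out;
        while (in >> word) {
            if (stopwords.count(word) == 0) {  // keep only non-stopwords
                if (!out.empty()) out += ' ';
                out += word;
            }
        }
        return out;
    }

    int main()
    {
        std::string prompt = "summarize the main points of the report in a short paragraph";
        // Prints: "summarize main points report short paragraph"
        std::cout << compress_prompt(prompt) << "\n";
        return 0;
    }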

Research papers on token pruning:

  1. Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, 2021, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, Oct 2021, https://arxiv.org/abs/2111.00230, https://doi.org/10.48550/arXiv.2111.00230
  2. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Joseph Hassoun, and Kurt Keutzer, 2021, Learned token pruning for transformers, arXiv preprint arXiv:2107.00910, 2021, https://arxiv.org/abs/2107.00910
  3. Hanrui Wang, Zhekai Zhang, and Song Han, 2021, SpAtten: Efficient sparse attention architecture with cascade token and head pruning, In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021, https://arxiv.org/abs/2012.09852
  4. Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma, 2020, PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination, In International Conference on Machine Learning, pages 3690–3699, PMLR, 2020, https://arxiv.org/abs/2001.08950, https://doi.org/10.48550/arXiv.2001.08950
  5. Deming Ye, Yankai Lin, Yufei Huang, and Maosong Sun, 2021, TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference, arXiv preprint arXiv:2105.11618 (2021), https://arxiv.org/abs/2105.11618
  6. Ofir Press, Noah A Smith, and Mike Lewis, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, arXiv preprint arXiv:2108.12409 (2021), https://arxiv.org/abs/2108.12409
  7. Carlos Lassance, Maroua Maachou, Joohee Park, Stéphane Clinchant, 2021, A Study on Token Pruning for ColBERT, Dec 2021, https://arxiv.org/abs/2112.06540
  8. Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V Le, 2020, Funnel-Transformer: Filtering out sequential redundancy for efficient language processing, Proceedings of NeurIPS, June 2020, https://arxiv.org/abs/2006.03236, Code: https://github.com/laiguokun/Funnel-Transformer
  9. S Ren, Q Jia, KQ Zhu, 2023, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, arXiv preprint arXiv:2310.08152, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression

For more research papers on token pruning, see https://www.aussieai.com/research/token-pruning.

 
