Aussie AI
Salient Tokens
-
Last Updated 24 February 2025
-
by David Spuler, Ph.D.
Salient tokens are an LLM optimization strategy that focuses inference cost on the important or "salient" tokens. Unimportant, non-salient tokens are pruned, reducing overall computation and concentrating it on the tokens that matter most. Non-salient token pruning can be applied to the regular input tokens during inference, to the entries in the KV cache, or to both.
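As a concrete illustration, here is a minimal sketch of salient-token pruning applied to a KV cache, assuming saliency is estimated from accumulated attention weights (one common heuristic in the papers below). The function and variable names are illustrative only, not taken from any specific paper or library.

```python
# Sketch: prune non-salient tokens from a KV cache, scoring saliency by
# how much attention each cached token has received from recent queries.
# This is a simplified single-head example under assumed shapes.
import torch

def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Keep only the most salient tokens in the KV cache.

    keys, values: [num_tokens, head_dim] cached K/V tensors.
    attn_weights: [num_queries, num_tokens] attention probabilities from
                  recent decoding steps, used as the saliency signal.
    keep_ratio:   fraction of cached tokens to retain.
    """
    # Saliency score: total attention mass each cached token received.
    saliency = attn_weights.sum(dim=0)                    # [num_tokens]
    num_keep = max(1, int(keys.size(0) * keep_ratio))
    # Select the top-scoring tokens; re-sort to preserve sequence order.
    keep_idx = torch.topk(saliency, num_keep).indices.sort().values
    return keys[keep_idx], values[keep_idx], keep_idx

# Toy example: a cache of 8 tokens with 4-dim heads, 3 recent queries.
torch.manual_seed(0)
K = torch.randn(8, 4)
V = torch.randn(8, 4)
A = torch.softmax(torch.randn(3, 8), dim=-1)
K2, V2, kept = prune_kv_cache(K, V, A, keep_ratio=0.5)
print("kept token indices:", kept.tolist())  # the 4 most-attended tokens
```

The same top-k-by-saliency idea applies to input token pruning during inference; only the scoring signal and the point in the pipeline where pruning happens differ between methods.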
Related research topics include:
- Token pruning (input token pruning)
- KV cache token pruning
- Gist tokens
- Dynamic token pruning
- Token merging
- Prompt compression
- Context compression
- Token skipping
- Token dropping
- Length pruning
Research on Salient Tokens
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
- Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 23 May 2024, ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification, https://arxiv.org/abs/2405.14256
- Hyesong Choi, Hyejin Park, Kwang Moo Yi, Sungmin Cha, Dongbo Min, 12 Apr 2024, Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training, https://arxiv.org/abs/2404.08327
- Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, 30 Oct 2021, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, https://arxiv.org/abs/2111.00230
- Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang, 17 Feb 2025, Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More, https://arxiv.org/abs/2502.11494 https://github.com/ZichenWen1/DART