Aussie AI

Token Reduction

Last Updated 18 September, 2025

by David Spuler, Ph.D.

What is Token Reduction?

Token reduction is an LLM inference optimization method that aims to speed up LLMs by reducing the number of tokens processed. This is relevant to any inference process, but is particularly applicable to optimizing multi-step reasoning algorithms, such as Chain-of-Thought, because their interim reasoning steps generate long sequences of tokens.

Token reduction methods have a long history of research in single-step inference optimizations. Some examples where input texts have many tokens include RAG chunks and the conversational history context in chatbot sessions. Hence, reducing the total number of tokens will improve speed in several situations:

RAG optimizations (e.g., retriever returns fewer chunks, or use smaller chunks)
Multi-step reasoning optimizations

One of the simpler ways to reduce tokens is to use prompt engineering techniques. There are two main ways:

Write a concise prompt for the LLM (fewer input tokens).
Politely ask the LLM to "be concise" as part of the prompt (fewer output tokens).

There are various technical ways to do token reduction as part of inference optimization, and some of the many sub-techniques include:

Research on Token Reduction

Some of the general papers on token reduction strategies:

Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
OpenAI, Dec 2024, OpenAI o1 and new tools for developers, https://openai.com/index/o1-and-new-tools-for-developers/ ("Lower latency: o1 uses on average 60% fewer reasoning tokens than o1-preview for a given request.")
Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen, https://arxiv.org/abs/2412.13437 18 Dec 2024, Deploying Foundation Model Powered Agent Services: A Survey, (A survey of not just deployment, but many inference optimization techniques.)
Jeffrey Cheng, Benjamin Van Durme, 17 Dec 2024, Compressed Chain of Thought: Efficient Reasoning Through Dense Representations, https://arxiv.org/abs/2412.13171 (Context compression applied to interim CoT reasoning steps.)
Devmallya Karar, Oct 4, 2024, Chain-Of-Thought ( CoT ) in Large Language Models prompting and Concise CoT with Code implementation using Python and PyTorch, https://medium.com/@devmallyakarar/chain-of-thought-cot-in-large-language-models-prompting-and-concise-cot-with-code-82821f9a832d
Cobus Greyling, Jan 24, 2024, Concise Chain-of-Thought (CCoT) Prompting, Traditional CoT comes at a cost of increased output token usage, CCoT prompting is a prompt-engineering technique which is aimed at reducing LLM response verbosity & inference time. https://cobusgreyling.substack.com/p/concise-chain-of-thought-ccot-prompting
Matthew Renze, Erhan Guven 19 Oct 2024 (v3), The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, https://arxiv.org/abs/2401.05618 https://github.com/matthewrenze/jhu-concise-cot (The original paper on Concise CoT.)
Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami, 2 Dec 2024, Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index, https://arxiv.org/abs/2412.01690
Waleed Kadous, May 17, 2023, Numbers every LLM Developer should know, https://www.anyscale.com/blog/num-every-llm-developer-should-know (Includes discussion of "be concise" prompting.)
Sachin Kumar, Sep 17, 2024, Hidden Chain-of-Thought decoding: faster and efficient CoT decoding to improve reasoning of LLMs, https://medium.com/@techsachin/hidden-chain-of-thought-decoding-faster-and-efficient-cot-decoding-to-improve-reasoning-of-llms-d95584bc9346 (Token reduction in CoT by compressing language tokens into an internal "hidden" concise token representation.)
Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi, 14 Aug 2024 (v2), CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference, https://arxiv.org/abs/2310.10845
Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu, 31 Dec 2024 (v2), Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria, https://arxiv.org/abs/2412.21006? (Remove redundant sentences in the reasoning steps for token efficiency.)
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo, 10 Oct 2024, Automatic Curriculum Expert Iteration for Reliable LLM Reasoning, https://arxiv.org/abs/2410.07627 (Efficiency of bailing out with "I don't know" or refusals versus continuing reasoning steps.)
Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou, 25 Aug 2024, Path-Consistency: Prefix Enhancement for Efficient Inference in LLM, https://arxiv.org/abs/2409.01281 (Uses the confidence calculations from earlier branches of the reasoning to improve efficiency.)
Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 23 Dec 2024, Deliberation in Latent Space via Differentiable Cache Augmentation, https://arxiv.org/abs/2412.17747 (Augmenting the KV cache with reasoning information so that decoding will mimic multi-step reasoning with fewer tokens required for intermediate steps.)
Sania Nayab, Giulio Rossolini, Giorgio Buttazzo, Nicolamaria Manes, Fabrizio Giacomelli, 29 Jul 2024, Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost, https://arxiv.org/abs/2407.19825
Wenqing Chen, Weicheng Wang, Zhixuan Chu, Kui Ren, Zibin Zheng, and Zhichao Lu. 2024. Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14162–14167, Bangkok, Thailand. Association for Computational Linguistics. https://aclanthology.org/2024.findings-acl.842/ (Generate multiple paraphrased answers, which can reduce tokens as fewer are needed.)
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis, 13 Jan 2025, A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion, https://arxiv.org/abs/2501.07451 (Survey of adaptive inference optimizations: early exit, dynamic routing, token skimming.)
Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
J Köpke, A Safan, 2024, Efficient llm-based conversational process modeling, Business Process Management Workshops, https://isys.uni-klu.ac.at/PDF/BPM_2024_paper_1442.pdf (Examines and improves the token costs of prompt strategies in conversational sessions.)
Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 22 Jan 2025, O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
By Asif Razzaq, January 24, 2025, Berkeley Sky Computing Lab Introduces Sky-T1-32B-Flash: A New Reasoning Language Model that Significantly Reduces Overthinking, Slashing Inference Costs on Challenging Questions by up to 57%, https://www.marktechpost.com/2025/01/24/berkeley-sky-computing-lab-introduces-sky-t1-32b-flash-a-new-reasoning-language-model-that-significantly-reduces-overthinking-slashing-inference-costs-on-challenging-questions-by-up-to-57/
Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei, 24 Jan 2025, Chain-of-Retrieval Augmented Generation, https://arxiv.org/abs/2501.14342 (Combines RAG with multi-step reasoning such as Chain-of-Thought, with a method to control token cost.)
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343 (Combine RAG and multi-step inference, controlling token cost via budgeting allocations.)
Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu, 24 Jan 2025, Dynamic Token Reduction during Generation for Vision Language Models, https://arxiv.org/abs/2501.14204 (Dynamic pruning of visual tokens.)
Zihui Zhao, Yingxin Li, Yang Li, 29 Jan 2025, Learning Free Token Reduction for Multi-Modal LLM, https://arxiv.org/abs/2501.17391
Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang, 29 Jan 2025, Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, https://arxiv.org/abs/2501.17974 (CoT optimization using an inference budget.)
Mayi Xu, Yongqi Li, Ke Sun, and Tieyun Qian, 2024, Adaption-of-thought: Learning question difficulty improves large language models for reasoning, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5468–5495, 2024, https://aclanthology.org/2024.emnlp-main.313/ https://aclanthology.org/2024.emnlp-main.313.pdf
Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Jan 2025, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, https://arxiv.org/abs/2501.18585
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
Mohammed Karimkhan Pathan, February 3, 2025, Open-source revolution: How DeepSeek-R1 challenges OpenAI’s o1 with superior processing, cost efficiency, https://venturebeat.com/ai/open-source-revolution-how-deepseek-r1-challenges-openais-o1-with-superior-processing-cost-efficiency/
Ben Dickson, February 5, 2025, Not every AI prompt deserves multiple seconds of thinking: how Meta is teaching models to prioritize, https://venturebeat.com/ai/not-every-ai-prompt-deserves-multiple-seconds-of-thinking-how-meta-is-teaching-models-to-prioritize/
Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Junxue Zhang, Kai Chen, 1 Feb 2025, Enhancing Token Filtering Efficiency in Large Language Model Training with Collider, https://arxiv.org/abs/2502.00340 (Token reduction in training.)
Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang, 13 Feb 2025, CoT-Valve: Length-Compressible Chain-of-Thought Tuning, https://arxiv.org/abs/2502.09601 https://github.com/horseee/CoT-Valve
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, (authors omitted), 22 Jan 2025, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs/2501.12599 (Includes a "length penalty" to address token reduction.)
Jin, M., Yu, Q., Shu, D., Zhao, H., Hua, W., Meng, Y., Zhang, Y., and Du, M. The impact of reasoning step length on large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 1830 1842, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024. https://aclanthology.org/2024.findings-acl.108/ (Shows that token reduction does reduce accuracy in reasoning.)
Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067
Marthe Ballon, Andres Algaba, Vincent Ginis, 21 Feb 2025, The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer, https://arxiv.org/abs/2502.15631
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang, 21 Feb 2025, LightThinker: Thinking Step-by-Step Compression, https://arxiv.org/abs/2502.15589 https://github.com/zjunlp/LightThinker (Faster CoT by compressing the text of intermediate reasoning steps with gist tokens.)
Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei, 25 Feb 2025, Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, https://arxiv.org/abs/2502.18080 (Trying to generate the "shortest correct response" by examining the lengths needed for CoT.)
Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
Dr. Ashish Bamania, March 3rd, 2025, Chain-of-Draft (CoD) Is The New King Of Prompting Techniques: A deep dive into the novel Chain-of-Draft (CoD) Prompting that outperforms Chain-of-Thought (CoT) Prompting while reducing LLM inference cost and latency like never before, https://levelup.gitconnected.com/chain-of-draft-cod-is-the-new-king-of-prompting-techniques-d9dc17f12051
Michael Nuñez, March 3, 2025, Less is more: How ‘chain of draft’ could cut AI costs by 90% while improving performance, https://venturebeat.com/ai/less-is-more-how-chain-of-draft-could-cut-ai-costs-by-90-while-improving-performance/
Reddit, March 01, 2025, Chain of Draft: Streamlining LLM Reasoning with Minimal Token Generation, https://www.reddit.com/r/artificial/comments/1j04ezf/chain_of_draft_streamlining_llm_reasoning_with/
Sulbha Jain, March 02, 2025, Chain of Draft: Thinking Faster by Writing Less — Paper Review, https://medium.com/@sulbha.jindal/chain-of-draft-thinking-faster-by-writing-less-paper-review-20e57bfc867a
Ajith Vallath Prabhakar, March 2, 2025, Chain of Draft: The Breakthrough Prompting Technique That Makes LLMs Think Faster With Less, https://ajithp.com/2025/03/02/chain-of-draft-llm-prompting/
The Decoder, Mar 2, 2025, Chain of Draft Prompts lets LLMs think cheaper with fewer words, https://the-decoder.com/chain-of-draft-prompts-lets-llms-think-cheaper-with-fewer-words/
Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He, 28 Feb 2025, CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation, https://arxiv.org/abs/2502.21074
Ayeong Lee, Ethan Che, Tianyi Peng, 3 Mar 2025, How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach, https://arxiv.org/abs/2503.01141
Anonymous authors, Mar 2025, Pencil: Long Throughts with Short Memory, ICLR 2025 review, https://openreview.net/pdf?id=KRI2Fmffqr (Using a "memory" during reasoning steps to rewrite intermediate CoT steps in shorter form.)
Pranjal Aggarwal, Sean Welleck, 6 Mar 2025, L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, https://arxiv.org/abs/2503.04697 https://www.cmu-l3.github.io/l1 (Efficient CoT by controlling the length of intermediate step outputs.)
Simon A. Aytes, Jinheon Baek, Sung Ju Hwang, 7 Mar 2025, Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching, https://arxiv.org/abs/2503.05179 https://www.github.com/SimonAytes/SoT (Compressing intermediate steps in CoT.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu, 23 Mar 2025 (v2), Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, https://arxiv.org/abs/2503.16419
Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Zhifang Sui, 16 May 2025, SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning, https://arxiv.org/abs/2505.11274
Chengyu Huang, Zhengxin Zhang, Claire Cardie, 16 May 2025, HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization, https://arxiv.org/abs/2505.11225
Ren Zhuang, Ben Wang, Shuifa Sun, 17 May 2025 (v2), Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping, https://arxiv.org/abs/2505.08392
Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak, 14 May 2025 (v2), Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement, https://arxiv.org/abs/2505.07961
Muzhi Dai, Chenxu Yang, Qingyi Si, 17 May 2025 (v2), S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models, https://arxiv.org/abs/2505.07686
Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong, 21 May 2025 (v2), Scalable Chain of Thoughts via Elastic Reasoning, https://arxiv.org/abs/2505.05315 https://github.com/SalesforceAIResearch/Elastic-Reasoning
Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang, 8 May 2025, ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning, https://arxiv.org/abs/2505.04881
Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen, 21 May 2025 (v2), Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models, https://arxiv.org/abs/2505.03469
Jingyang Yi, Jiazheng Wang, Sida Li, 16 May 2025 (v2), ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning, https://arxiv.org/abs/2504.21370
Guanghao Li, Wenhao Jiang, Mingfeng Chen, Yan Li, Hao Yu, Shuting Dong, Tao Ren, Ming Tang, Chun Yuan, 30 May 2025, SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought, https://arxiv.org/abs/2505.24181
Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu, 28 May 2025, AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models, https://arxiv.org/abs/2505.22662
Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun, 28 May 2025, CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models, https://arxiv.org/abs/2505.22017
Jinyan Su, Claire Cardie, 23 May 2025, Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards, https://arxiv.org/abs/2505.18298
Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang, 26 May 2025, Efficient Reasoning via Chain of Unconscious Thought, https://arxiv.org/abs/2505.19756
Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun, 25 May 2025, The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training, https://arxiv.org/abs/2505.19217
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik, 23 May 2025, Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality, https://arxiv.org/abs/2505.18227
Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty, 23 May 2025, First Finish Search: Efficient Test-Time Scaling in Large Language Models, https://arxiv.org/abs/2505.18149 (Running multiple parallel decoding steps but stopping when the fastest and usually shortest one completes.)
Tomasz Tunguz, Jul 8, 2025, The Hungry, Hungry AI Model, https://tomtunguz.com/input-output-ratio/
Joonwon Jang, Jaehee Kim, Wonbin Kweon, Seonghyeon Lee, Hwanjo Yu, July 2025, Verbosity-Aware Rationale Reduction: Sentence-Level Rationale Reduction for Efficient and Effective Reasoning, Findings of the Association for Computational Linguistics: ACL 2025, pages 20769–20784 July 27- August 1, 2025, https://aclanthology.org/2025.findings-acl.1068.pdf
Shu Yang, Junchao Wu, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang, 24 Jun 2025, Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs, https://arxiv.org/abs/2506.19492
Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng, 12 Jun 2025, Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty, https://arxiv.org/abs/2506.10446
Wei Huang, Yizhe Xiong, Xin Ye, Zhijie Deng, Hui Chen, Zijia Lin, Guiguang Ding, 23 May 2025, Fast Quiet-STaR: Thinking Without Thought Tokens, https://arxiv.org/abs/2505.17746 https://github.com/huangwei200012/Fast-Quiet-STaR
Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, Xinchao Wang, 23 May 2025, VeriThinker: Learning to Verify Makes Reasoning Model Efficient, https://arxiv.org/abs/2505.17941 https://github.com/czg1225/VeriThinker
Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang, 15 Apr 2025, Efficient Reasoning Models: A Survey, https://arxiv.org/abs/2504.10903
Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim, 20 May 2025, Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning, https://arxiv.org/abs/2505.13866 https://github.com/jiwonsong-dev/ReasoningPathCompression
Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu, 5 Aug 2025, Compressing Chain-of-Thought in LLMs via Step Entropy, https://arxiv.org/abs/2508.03346
Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu, 8 Aug 2025, Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal, https://arxiv.org/abs/2508.05988 https://github.com/Zengwh02/ASAP
Andrew Brown, Muhammad Roman, Barry Devereux, 8 Aug 2025, A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges, https://arxiv.org/abs/2508.06401
Michael Nuñez, August 14, 2025, That ‘cheap’ open-source AI model is actually burning through your compute budget, https://venturebeat.com/ai/that-cheap-open-source-ai-model-is-actually-burning-through-your-compute-budget/ (Open-source models use more tokens.)
Tim, Nous Research, Aug 14, 2025, Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark, https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/
Kwonyoung Kim, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn, 26 Aug 2025, Faster Parameter-Efficient Tuning with Token Redundancy Reduction, https://arxiv.org/abs/2503.20282

Token Efficiency in Reasoning and CoT

Token reduction is one of the main methods to improve the efficiency of reasoning models, especially using Chain-of-Thought (CoT) algorithms. There are various ways to improve token counts in CoT at the high level by skipping steps or pruning paths, and there are also various low-level methods.

Blog articles on reasoning efficiency:

More research information on general efficiency optimization techniques for reasoning models:

Efficiency optimizations to Chain-of-Thought that aim to reduce token processing include: