Aussie AI

Chain-of-Thought Efficiency Optimization

Last Updated 24 June, 2025

by David Spuler, Ph.D.

Chain-of-Thought efficiency optimization is the improvement of the latency or throughput for the CoT algorithm. There are a variety of methods to reduce the total number of tokens processed during CoT sequences, and various other types of LLM inference optimization techniques can also be used with Chain-of-Thought. There are literally 500 inference optimization techniques, most of which could be used orthogonally to optimize the individual LLM inference steps, but not all offer any particular extra speedup across Chain-of-Thought's multiple steps.

Chain-of-Thought Token Reduction

Chain-of-Thought token reduction is the optimization of reducing the number of tokens in the prompt steps for reasoning. The main cause of slowness in Chain-of-Thought is processing too many tokens in the interim reasoning steps. Some of the high-level improvements to token counts include:

Fewer steps of inference (e.g., "step skipping").
Quitting ineffective pathways early (e.g., "early stopping")
Avoiding redundant or duplicative pathways (e.g., "path reduction").
Shortening each step (e.g., "redundant sentence pruning").

Some of the additional modifications to the CoT algorithm for token reduction include:

Concise Chain-of-Thought (CCoT) — request the LLM to think concisely!
Hidden Token Chain-of-Thought (HCot) — use non-language tokens in the internal reasoning steps.
Continuous Chain-of-Thought (Coconut) — reasoning steps in the embedding space.
CoT Reasoning Decoding — use a tree of multiple paths during decoding (like beam search) to mimic the reasoning steps of Chain-of-Thought in a single inference step.

At a lower-level of optimizations, some of the existing LLM token reduction methods, which are general to any type of LLM inference, may be considered on any or all of the CoT inference steps:

CoT Step Skipping

CoT step skipping is the LLM speed optimization that "skips" some of the internal reasoning steps, thereby reducing costs and improving latency. More steps are associated with better reasoning, but also increase token costs, so the aim is to skip some steps for efficiency, while minimizing the loss of reasoning accuracy.

Research papers on CoT step skipping optimizations:

DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng, 13 Oct 2024, Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces, https://arxiv.org/abs/2410.09918
Yuntian Deng, Yejin Choi, Stuart Shieber, 23 May 2024, From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step, https://arxiv.org/abs/2405.14838
Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu, 31 Dec 2024 (v2), Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria, https://arxiv.org/abs/2412.21006? (Remove redundant sentences in the reasoning steps for token efficiency.)
Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Ji ayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. 2024. Can language models learn to skip steps? In The Thirty-eighth Annual Conference on Neural In formation Processing Systems. https://arxiv.org/abs/2411.01855
Wenqing Chen, Weicheng Wang, Zhixuan Chu, Kui Ren, Zibin Zheng, and Zhichao Lu. 2024. Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14162–14167, Bangkok, Thailand. Association for Computational Linguistics. https://aclanthology.org/2024.findings-acl.842/ (Generate multiple paraphrased answers, which can reduce tokens as fewer are needed.)
Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki, 2 Feb 2025, Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, https://arxiv.org/abs/2502.01694
Emanuele Marconato, Stefano Teso, Antonio Vergari, and Andrea Passerini. 2024. Not all neuro-symbolic concepts are created equal: Analysis and mitiga tion of reasoning shortcuts. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2305.19951
Mengru Ding, Hanmeng Liu, Zhizhang Fu, Jian Song, Wenbo Xie, Yue Zhang, 4 Jun 2024, Break the Chain: Large Language Models Can be Shortcut Reasoners, https://arxiv.org/abs/2406.06580
Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Yue Xing, Jiliang Tang, Qi He, 18 Feb 2025, Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, https://arxiv.org/abs/2502.13260
Siyuan Wang, Enda Zhao, Zhongyu Wei, Xiang Ren, 21 Feb 2025, Stepwise Informativeness Search for Improving LLM Reasoning, https://arxiv.org/abs/2502.15335
Y Fu, J Chen, Y Zhuang, Z Fu, I Stoica, H Zhang, Mar 2025, Reasoning Without Self-Doubt: More Efficient Chain-of-Thought Through Certainty Probing, ICLR 2025 review, https://openreview.net/pdf?id=wpK4IMJfdX (Shortening CoT reasoning paths by "probing" the LLM in the middle of the sequence to examine if it already has the final answer, thereby avoiding unnecessary extra reasoning steps.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Ren Zhuang, Ben Wang, Shuifa Sun, 17 May 2025 (v2), Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping, https://arxiv.org/abs/2505.08392
Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang, 3 Jun 2025, OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation, https://arxiv.org/abs/2506.02397 https://github.com/AgenticIR-Lab/OThink-R1
Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu, 23 May 2025, Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens, https://arxiv.org/abs/2505.18237
Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong, 23 May 2025, Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning, https://arxiv.org/abs/2505.17829

Chain-of-Thought Early Stopping

Chain-of-Thought early stopping is an inference-time optimization that aims to drop out of chains of computation early. The technique of early stopping or dropout is a well-known optimization in training, which is here being used in a different way in multi-step inference. Note that this optimization is different to early exiting of layers in inference, which affects a single inference step, whereas this method involves optimizations over multiple inference steps.

Research papers on early stopping in CoT:

Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/

CoT Path Reduction

CoT path reduction is a reasoning LLM efficiency optimization by avoiding redundant or incorrect paths in the chain of reasoning steps. It has also been called "path pruning." Reducing the paths is a high-level way of reducing the token count in processing CoT sequences.

Research papers on CoT path reduction or path pruning:

Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu, 31 Dec 2024 (v2), Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria, https://arxiv.org/abs/2412.21006? (Remove redundant sentences in the reasoning steps for token efficiency.)
Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok, 10 Jan 2025, Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models, https://arxiv.org/abs/2501.05752 (CoT optimization by avoiding redundant paths that have identical semantics.)
Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 22 Jan 2025, O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang, 29 Jan 2025, Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, https://arxiv.org/abs/2501.17974 (CoT optimization using an inference budget.)
Mayi Xu, Yongqi Li, Ke Sun, and Tieyun Qian, 2024, Adaption-of-thought: Learning question difficulty improves large language models for reasoning, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5468–5495, 2024, https://aclanthology.org/2024.emnlp-main.313/ https://aclanthology.org/2024.emnlp-main.313.pdf
Zhi Zhou, Tan Yuhao, Zenan Li, Yuan Yao, Lan-Zhe Guo, Xiaoxing Ma, Yu-Feng Li, 1 Feb 2025, Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning, https://arxiv.org/abs/2502.00511
Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki, 2 Feb 2025, Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation, https://arxiv.org/abs/2502.01694
Yang Li, 4 Feb 2025, Policy Guided Tree Search for Enhanced LLM Reasoning, https://arxiv.org/abs/2502.06813
Emanuele Marconato, Stefano Teso, Antonio Vergari, and Andrea Passerini. 2024. Not all neuro-symbolic concepts are created equal: Analysis and mitiga tion of reasoning shortcuts. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2305.19951
Mengru Ding, Hanmeng Liu, Zhizhang Fu, Jian Song, Wenbo Xie, Yue Zhang, 4 Jun 2024, Break the Chain: Large Language Models Can be Shortcut Reasoners, https://arxiv.org/abs/2406.06580
Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao, 27 Feb 2025 (v2), Dynamic Parallel Tree Search for Efficient LLM Reasoning, https://arxiv.org/abs/2502.16235
Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang, 3 Mar 2025, Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding, https://arxiv.org/abs/2503.01422
Y Fu, J Chen, Y Zhuang, Z Fu, I Stoica, H Zhang, Mar 2025, Reasoning Without Self-Doubt: More Efficient Chain-of-Thought Through Certainty Probing, ICLR 2025 review, https://openreview.net/pdf?id=wpK4IMJfdX (Shortening CoT reasoning paths by "probing" the LLM in the middle of the sequence to examine if it already has the final answer, thereby avoiding unnecessary extra reasoning steps.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu, 23 Mar 2025 (v2), Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, https://arxiv.org/abs/2503.16419
Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang, 3 Jun 2025, OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation, https://arxiv.org/abs/2506.02397 https://github.com/AgenticIR-Lab/OThink-R1

Constrained Chain-of-Thought

Constrained Chain-of-Thought is an optimization to the multi-step inference sequence by imposing a "constraint" on the LLM. There are different types of constraints, such as simply asking the LLM to be concise, or more complex constraints that limit either the output or progression to the next step of reasoning.

Research papers on constrained CoT:

Shiv Sakhuja, 25 Sep 2024, Chain-of-Thought (CoT) Prompting Explained: 7 Techniques for Optimizing AI Performance, https://hub.athina.ai/athina-originals/guides-chain-of-thought-cot-prompting-explained-7-techniques-for-optimizing-ai-performance/
Zizheng Lin, Chunkit Chan, Yangqiu Song, Xin Liu, 20 Sep 2024, Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models, https://arxiv.org/abs/2409.13490?
Renato Vukovic, David Arps, Carel van Niekerk, Benjamin Matthias Ruppik, Hsien-Chin Lin, Michael Heck, Milica Gašić, 5 Aug 2024, Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding, https://arxiv.org/abs/2408.02361
Wachara Fungwacharakorn, Nguyen Ha Thanh, May Myo Zin, Ken Satoh, 16 Oct 2024, Layer-of-Thoughts Prompting (LoT): Leveraging LLM-Based Retrieval with Constraint Hierarchies, https://arxiv.org/abs/2410.12153

Coconut

Coconut is an optimization for LLM reasoning algorithms, such as Chain-of-Thought, whereby the model reasons in "latent space" rather than with language tokens. The name derives from "Chain of Continuous Thought" (Coconut) in the original paper. Coconut type algorithms can be both more accurate and more efficient with fewer tokens to process.

Research papers on Coconut

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian, 9 Dec 2024, Training Large Language Models to Reason in a Continuous Latent Space, https://arxiv.org/abs/2412.06769 (Performing reasoning in a model trained to operate in the embedding vector space, rather than more directly in the token space.)
Ben Congdon, December 14, 2024, Chain of Continuous Thoughts, https://benjamincongdon.me/blog/2024/12/14/Chain-of-Continuous-Thoughts/
Lance Eliot, Dec 18, 2024, Chain Of Continuous Thought Promises Mighty Boost For LLMs And Generative AI By Blowing Up The Fixation On Tokens, https://www.forbes.com/sites/lanceeliot/2024/12/18/chain-of-continuous-thought-promises-mighty-boost-for-llms-and-generative-ai-by-blowing-up-the-fixation-on-tokens/
Alex McFarland, December 16, 2024, Meta’s COCONUT: The AI Method That Thinks Without Language, https://www.unite.ai/metas-coconut-the-ai-method-that-thinks-without-language/
Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak, 29 May 2025, Continuous Chain of Thought Enables Parallel Exploration and Reasoning, https://arxiv.org/abs/2505.23648

Adaptive Inference Time Compute

Adaptive inference time compute refers to LLM inference optimization techniques that "adapt" the amount of computation for each inference step. This allows the LLM to decide adaptively, based on the input query, whether or not to continue thinking, or to allocate more resources. This use of adaptive computations is more general than the many "adaptive inference" optimizations that are possible with LLM inference optimization (e.g., layer skipping, early exit, slimming, etc.). By applying these ideas to the multi-step inference reasoning process, this technique allows fewer tokens to be processed overall for multi-step inference.

Research papers on adaptive inference time compute in multi-step inference:

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo, 10 Oct 2024, Automatic Curriculum Expert Iteration for Reliable LLM Reasoning, https://arxiv.org/abs/2410.07627 (Efficiency of bailing out with "I don't know" or refusals versus continuing reasoning steps.)
Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang, 10 Feb 2025, ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, https://arxiv.org/abs/2502.06772 https://github.com/Gen-Verse/ReasonFlux (RALM-like retrieval of reasoning prompt templates at inference time.)
Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao, 16 May 2025, Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL, https://arxiv.org/abs/2505.10832
Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong, 21 May 2025 (v2), Scalable Chain of Thoughts via Elastic Reasoning, https://arxiv.org/abs/2505.05315 https://github.com/SalesforceAIResearch/Elastic-Reasoning

CoT Token Reduction

Research papers on token reductions in Chain-of-Thought:

OpenAI, Dec 2024, OpenAI o1 and new tools for developers, https://openai.com/index/o1-and-new-tools-for-developers/ ("Lower latency: o1 uses on average 60% fewer reasoning tokens than o1-preview for a given request.")
Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
Jeffrey Cheng, Benjamin Van Durme, 17 Dec 2024, Compressed Chain of Thought: Efficient Reasoning Through Dense Representations, https://arxiv.org/abs/2412.13171 (Context compression applied to interim CoT reasoning steps.)
Devmallya Karar, Oct 4, 2024, Chain-Of-Thought ( CoT ) in Large Language Models prompting and Concise CoT with Code implementation using Python and PyTorch, https://medium.com/@devmallyakarar/chain-of-thought-cot-in-large-language-models-prompting-and-concise-cot-with-code-82821f9a832d
Cobus Greyling, Jan 24, 2024, Concise Chain-of-Thought (CCoT) Prompting, Traditional CoT comes at a cost of increased output token usage, CCoT prompting is a prompt-engineering technique which is aimed at reducing LLM response verbosity & inference time. https://cobusgreyling.substack.com/p/concise-chain-of-thought-ccot-prompting
Matthew Renze, Erhan Guven 19 Oct 2024 (v3), The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, https://arxiv.org/abs/2401.05618 https://github.com/matthewrenze/jhu-concise-cot (The original paper on Concise CoT.)
Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami, 2 Dec 2024, Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index, https://arxiv.org/abs/2412.01690
Waleed Kadous, May 17, 2023, Numbers every LLM Developer should know, https://www.anyscale.com/blog/num-every-llm-developer-should-know (Includes discussion of "be concise" prompting.)
Sachin Kumar, Sep 17, 2024, Hidden Chain-of-Thought decoding: faster and efficient CoT decoding to improve reasoning of LLMs, https://medium.com/@techsachin/hidden-chain-of-thought-decoding-faster-and-efficient-cot-decoding-to-improve-reasoning-of-llms-d95584bc9346 (Token reduction in CoT by compressing language tokens into an internal "hidden" concise token representation.)
Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi, 14 Aug 2024 (v2), CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference, https://arxiv.org/abs/2310.10845
Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu, 31 Dec 2024 (v2), Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria, https://arxiv.org/abs/2412.21006? (Remove redundant sentences in the reasoning steps for token efficiency.)
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo, 10 Oct 2024, Automatic Curriculum Expert Iteration for Reliable LLM Reasoning, https://arxiv.org/abs/2410.07627 (Efficiency of bailing out with "I don't know" or refusals versus continuing reasoning steps.)
Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou, 25 Aug 2024, Path-Consistency: Prefix Enhancement for Efficient Inference in LLM, https://arxiv.org/abs/2409.01281 (Uses the confidence calculations from earlier branches of the reasoning to improve efficiency.)
Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 23 Dec 2024, Deliberation in Latent Space via Differentiable Cache Augmentation, https://arxiv.org/abs/2412.17747 (Augmenting the KV cache with reasoning information so that decoding will mimic multi-step reasoning with fewer tokens required for intermediate steps.)
Sania Nayab, Giulio Rossolini, Giorgio Buttazzo, Nicolamaria Manes, Fabrizio Giacomelli, 29 Jul 2024, Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost, https://arxiv.org/abs/2407.19825
Wenqing Chen, Weicheng Wang, Zhixuan Chu, Kui Ren, Zibin Zheng, and Zhichao Lu. 2024. Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14162–14167, Bangkok, Thailand. Association for Computational Linguistics. https://aclanthology.org/2024.findings-acl.842/ (Generate multiple paraphrased answers, which can reduce tokens as fewer are needed.)
Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 22 Jan 2025, O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
By Asif Razzaq, January 24, 2025, Berkeley Sky Computing Lab Introduces Sky-T1-32B-Flash: A New Reasoning Language Model that Significantly Reduces Overthinking, Slashing Inference Costs on Challenging Questions by up to 57%, https://www.marktechpost.com/2025/01/24/berkeley-sky-computing-lab-introduces-sky-t1-32b-flash-a-new-reasoning-language-model-that-significantly-reduces-overthinking-slashing-inference-costs-on-challenging-questions-by-up-to-57/
Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei, 24 Jan 2025, Chain-of-Retrieval Augmented Generation, https://arxiv.org/abs/2501.14342 (Combines RAG with multi-step reasoning such as Chain-of-Thought, with a method to control token cost.)
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343 (Combine RAG and multi-step inference, controlling token cost via budgeting allocations.)
Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang, 29 Jan 2025, Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, https://arxiv.org/abs/2501.17974 (CoT optimization using an inference budget.)
Mayi Xu, Yongqi Li, Ke Sun, and Tieyun Qian, 2024, Adaption-of-thought: Learning question difficulty improves large language models for reasoning, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5468–5495, 2024, https://aclanthology.org/2024.emnlp-main.313/ https://aclanthology.org/2024.emnlp-main.313.pdf
Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Jan 2025, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, https://arxiv.org/abs/2501.18585
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
Mohammed Karimkhan Pathan, February 3, 2025, Open-source revolution: How DeepSeek-R1 challenges OpenAI’s o1 with superior processing, cost efficiency, https://venturebeat.com/ai/open-source-revolution-how-deepseek-r1-challenges-openais-o1-with-superior-processing-cost-efficiency/
Ben Dickson, February 5, 2025, Not every AI prompt deserves multiple seconds of thinking: how Meta is teaching models to prioritize, https://venturebeat.com/ai/not-every-ai-prompt-deserves-multiple-seconds-of-thinking-how-meta-is-teaching-models-to-prioritize/
Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang, 13 Feb 2025, CoT-Valve: Length-Compressible Chain-of-Thought Tuning, https://arxiv.org/abs/2502.09601 https://github.com/horseee/CoT-Valve
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, (authors omitted), 22 Jan 2025, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs/2501.12599 (Includes a "length penalty" to address token reduction.)
Jin, M., Yu, Q., Shu, D., Zhao, H., Hua, W., Meng, Y., Zhang, Y., and Du, M. The impact of reasoning step length on large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 1830 1842, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024. https://aclanthology.org/2024.findings-acl.108/ (Shows that token reduction does reduce accuracy in reasoning.)
Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067
Marthe Ballon, Andres Algaba, Vincent Ginis, 21 Feb 2025, The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer, https://arxiv.org/abs/2502.15631
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang, 21 Feb 2025, LightThinker: Thinking Step-by-Step Compression, https://arxiv.org/abs/2502.15589 https://github.com/zjunlp/LightThinker (Faster CoT by compressing the text of intermediate reasoning steps with gist tokens.)
Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei, 25 Feb 2025, Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, https://arxiv.org/abs/2502.18080 (Trying to generate the "shortest correct response" by examining the lengths needed for CoT.)
Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
Dr. Ashish Bamania, March 3rd, 2025, Chain-of-Draft (CoD) Is The New King Of Prompting Techniques: A deep dive into the novel Chain-of-Draft (CoD) Prompting that outperforms Chain-of-Thought (CoT) Prompting while reducing LLM inference cost and latency like never before, https://levelup.gitconnected.com/chain-of-draft-cod-is-the-new-king-of-prompting-techniques-d9dc17f12051
Michael Nuñez, March 3, 2025, Less is more: How ‘chain of draft’ could cut AI costs by 90% while improving performance, https://venturebeat.com/ai/less-is-more-how-chain-of-draft-could-cut-ai-costs-by-90-while-improving-performance/
Reddit, March 01, 2025, Chain of Draft: Streamlining LLM Reasoning with Minimal Token Generation, https://www.reddit.com/r/artificial/comments/1j04ezf/chain_of_draft_streamlining_llm_reasoning_with/
Sulbha Jain, March 02, 2025, Chain of Draft: Thinking Faster by Writing Less — Paper Review, https://medium.com/@sulbha.jindal/chain-of-draft-thinking-faster-by-writing-less-paper-review-20e57bfc867a
Ajith Vallath Prabhakar, March 2, 2025, Chain of Draft: The Breakthrough Prompting Technique That Makes LLMs Think Faster With Less, https://ajithp.com/2025/03/02/chain-of-draft-llm-prompting/
The Decoder, Mar 2, 2025, Chain of Draft Prompts lets LLMs think cheaper with fewer words, https://the-decoder.com/chain-of-draft-prompts-lets-llms-think-cheaper-with-fewer-words/
Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He, 28 Feb 2025, CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation, https://arxiv.org/abs/2502.21074
Ayeong Lee, Ethan Che, Tianyi Peng, 3 Mar 2025, How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach, https://arxiv.org/abs/2503.01141
Anonymous authors, Mar 2025, Pencil: Long Throughts with Short Memory, ICLR 2025 review, https://openreview.net/pdf?id=KRI2Fmffqr (Using a "memory" during reasoning steps to rewrite intermediate CoT steps in shorter form.)
Pranjal Aggarwal, Sean Welleck, 6 Mar 2025, L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, https://arxiv.org/abs/2503.04697 https://www.cmu-l3.github.io/l1 (Efficient CoT by controlling the length of intermediate step outputs.)
Simon A. Aytes, Jinheon Baek, Sung Ju Hwang, 7 Mar 2025, Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching, https://arxiv.org/abs/2503.05179 https://www.github.com/SimonAytes/SoT (Compressing intermediate steps in CoT.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu, 23 Mar 2025 (v2), Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, https://arxiv.org/abs/2503.16419
Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Zhifang Sui, 16 May 2025, SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning, https://arxiv.org/abs/2505.11274
Ren Zhuang, Ben Wang, Shuifa Sun, 17 May 2025 (v2), Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping, https://arxiv.org/abs/2505.08392
Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak, 14 May 2025 (v2), Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement, https://arxiv.org/abs/2505.07961
Muzhi Dai, Chenxu Yang, Qingyi Si, 17 May 2025 (v2), S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models, https://arxiv.org/abs/2505.07686
Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong, 21 May 2025 (v2), Scalable Chain of Thoughts via Elastic Reasoning, https://arxiv.org/abs/2505.05315 https://github.com/SalesforceAIResearch/Elastic-Reasoning
Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang, 8 May 2025, ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning, https://arxiv.org/abs/2505.04881
Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen, 21 May 2025 (v2), Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models, https://arxiv.org/abs/2505.03469
Jingyang Yi, Jiazheng Wang, Sida Li, 16 May 2025 (v2), ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning, https://arxiv.org/abs/2504.21370
Guanghao Li, Wenhao Jiang, Mingfeng Chen, Yan Li, Hao Yu, Shuting Dong, Tao Ren, Ming Tang, Chun Yuan, 30 May 2025, SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought, https://arxiv.org/abs/2505.24181
Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu, 28 May 2025, AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models, https://arxiv.org/abs/2505.22662
Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun, 28 May 2025, CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models, https://arxiv.org/abs/2505.22017
Jinyan Su, Claire Cardie, 23 May 2025, Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards, https://arxiv.org/abs/2505.18298
Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang, 26 May 2025, Efficient Reasoning via Chain of Unconscious Thought, https://arxiv.org/abs/2505.19756
Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun, 25 May 2025, The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training, https://arxiv.org/abs/2505.19217
Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty, 23 May 2025, First Finish Search: Efficient Test-Time Scaling in Large Language Models, https://arxiv.org/abs/2505.18149 (Running multiple parallel decoding steps but stopping when the fastest and usually shortest one completes.)

Concise Chain-of-Thought

Concise Chain-of-Thought is the token reduction optimization of prompting the LLM to be "concise" in all its interim reasoning steps. The end result is shorter, more compressed hidden chains of reasoning steps.

Research papers on Concise CoT:

Devmallya Karar, Oct 4, 2024, Chain-Of-Thought ( CoT ) in Large Language Models prompting and Concise CoT with Code implementation using Python and PyTorch, https://medium.com/@devmallyakarar/chain-of-thought-cot-in-large-language-models-prompting-and-concise-cot-with-code-82821f9a832d
Cobus Greyling, Jan 24, 2024, Concise Chain-of-Thought (CCoT) Prompting, Traditional CoT comes at a cost of increased output token usage, CCoT prompting is a prompt-engineering technique which is aimed at reducing LLM response verbosity & inference time. https://cobusgreyling.substack.com/p/concise-chain-of-thought-ccot-prompting
Matthew Renze, Erhan Guven 19 Oct 2024 (v3), The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, https://arxiv.org/abs/2401.05618 https://github.com/matthewrenze/jhu-concise-cot (The original paper on Concise CoT.)
Waleed Kadous, May 17, 2023, Numbers every LLM Developer should know, https://www.anyscale.com/blog/num-every-llm-developer-should-know (Includes discussion of "be concise" prompting.)
Sania Nayab, Giulio Rossolini, Giorgio Buttazzo, Nicolamaria Manes, Fabrizio Giacomelli, 29 Jul 2024, Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost, https://arxiv.org/abs/2407.19825
Junjie Liu, Shaotian Yan, Chen Shen, Liang Xie, Wenxiao Wang, Jieping Ye, 13 Jun 2024 (v4), Concise and Organized Perception Facilitates Reasoning in Large Language Models, https://arxiv.org/abs/2310.03309
Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun, 27 Feb 2025, Self-Training Elicits Concise Reasoning in Large Language Models, https://arxiv.org/abs/2502.20122 https://github.com/TergelMunkhbat/concise-reasoning
Ayeong Lee, Ethan Che, Tianyi Peng, 3 Mar 2025, How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach, https://arxiv.org/abs/2503.01141
Pranjal Aggarwal, Sean Welleck, 6 Mar 2025, L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, https://arxiv.org/abs/2503.04697 https://www.cmu-l3.github.io/l1 (Efficient CoT by controlling the length of intermediate step outputs.)
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang, 13 Mar 2025 (v2), InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models, https://arxiv.org/abs/2503.06692
Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
Chengyu Huang, Zhengxin Zhang, Claire Cardie, 16 May 2025, HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization, https://arxiv.org/abs/2505.11225
Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong, 21 May 2025 (v2), Scalable Chain of Thoughts via Elastic Reasoning, https://arxiv.org/abs/2505.05315 https://github.com/SalesforceAIResearch/Elastic-Reasoning
Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang, 8 May 2025, ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning, https://arxiv.org/abs/2505.04881
Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh, 27 May 2025, Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models, https://arxiv.org/abs/2505.21765

Hidden Chain-of-Thought

Hidden Chain-of-Thought is the optimization of using non-language tokens for the vocabulary of the interim reasoning steps. This means that the model calculates using special "reasoning tokens" for the hidden steps, and only emits human-readable language at the last step. This is an interesting way to make the reasoning more accurate, and can also be more efficient because these internal representations use fewer tokens.

Reearch papers on Hidden CoT include:

Sachin Kumar, Sep 17, 2024, Hidden Chain-of-Thought decoding: faster and efficient CoT decoding to improve reasoning of LLMs, https://medium.com/@techsachin/hidden-chain-of-thought-decoding-faster-and-efficient-cot-decoding-to-improve-reasoning-of-llms-d95584bc9346 (Token reduction in CoT by compressing language tokens into an internal "hidden" concise token representation.)
Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561
Aryasomayajula Ram Bharadwaj, 5 Dec 2024, Understanding Hidden Computations in Chain-of-Thought Reasoning, https://arxiv.org/abs/2412.04537
Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu, 31 Jan 2025, Efficient Reasoning with Hidden Thinking, https://arxiv.org/abs/2501.19201
Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He, 28 Feb 2025, CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation, https://arxiv.org/abs/2502.21074
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)

Chain-of-Thought Decoding

Chain-of-Thought decoding is the use of the LLM decoding algorithm's multiple possible pathways to mimic the multiple reasoning steps in CoT. This goes beyond the typical "tree decoding" such as beam search decoding, to a more advanced usage of the multiple decoded token sequences or paths. The disadvantage is that this method is not as accurate as full CoT, but the advantage is greater efficiency, because only one step of inference is needed.

Research papers on CoT decoding methods with a tree of pathways:

Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
xjdr-alt, Dec 2024, entropix: Entropy Based Sampling and Parallel CoT Decoding, https://github.com/xjdr-alt/entropix (Parallel decoding attempts to get something similar to CoT.)
Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, 4 Jun 2024 (v2), Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263 (Use of Jacobi parallel decoding with Chain-of-Thought.)
Renato Vukovic, David Arps, Carel van Niekerk, Benjamin Matthias Ruppik, Hsien-Chin Lin, Michael Heck, Milica Gašić, 5 Aug 2024, Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding, https://arxiv.org/abs/2408.02361
Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
Yuntian Deng, Yejin Choi, Stuart Shieber, 23 May 2024, From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step, https://arxiv.org/abs/2405.14838
Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov, 24 Jul 2024 (v3), Distilling System 2 into System 1, https://arxiv.org/abs/2407.06023
Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian, 9 Dec 2024, Training Large Language Models to Reason in a Continuous Latent Space, https://arxiv.org/abs/2412.06769 (Performing reasoning in a model trained to operate in the embedding vector space, rather than more directly in the token space.)
Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 23 Dec 2024, Deliberation in Latent Space via Differentiable Cache Augmentation, https://arxiv.org/abs/2412.17747 (Augmenting the KV cache with reasoning information so that decoding will mimic multi-step reasoning with fewer tokens required for intermediate steps.)
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan, 21 Apr 2024 (v3), Think before you speak: Training Language Models With Pause Tokens, https://arxiv.org/abs/2310.02226 (Inserting extra "pause tokens" that trigger the LLM to perform extra reasoning during the decoding phase.)
Yuval Shalev, Amir Feder, Ariel Goldstein, 19 Jun 2024, Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning, https://arxiv.org/abs/2406.13858 (Using embeddings from intermediate model layers in decoding to mimic reasoning pathways.)
Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson, 14 Oct 2024 (v2), Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries, https://arxiv.org/abs/2406.12775 (Backpatching prior layers using embeddings from the current activations to mimic multi-step reasoning.)
Jacob Pfau, William Merrill, Samuel R. Bowman, 24 Apr 2024, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, https://arxiv.org/abs/2404.15758 (Use of dummy "filler tokens" similar to "pause tokens" or "reasoning tokens" to aid multi-step reasoning in decoding.)
Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman, 18 Mar 2024 (v2), Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking, https://arxiv.org/abs/2403.09629 (Introduces answers between a start-of-thought and end-of-thought meta-token for reasoning.)
Haoran Wang, Kai Shu, Jan 2025, MakeEveryTokenCount: ASystematic Survey on Decoding Methods for Foundation Model, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
Phuc Phan, Hieu Tran, Long Phan, 23 Aug 2024 (v2), Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation, https://arxiv.org/abs/2402.14874
Maxime Peyrard, Martin Josifoski, Robert West, 21 Mar 2024, The Era of Semantic Decoding, https://arxiv.org/abs/2403.14562
Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement
Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein, 7 Feb 2025, Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, https://arxiv.org/abs/2502.05171
G Lu, L Peng, L Li, 2025, CoT-Decoding: Complex Reasoning via Chain-of-Thought Decoding, https://epubs.siam.org/doi/pdf/10.1137/1.9781611978520.44

Overthinking

Overthinking is using too many interim reasoning steps to solve simple LLM queries. An extreme example is simple queries like "2 plus 3" that can be solved in a single shot, but often elicit complex chains of reasoning in naive Chain-of-Thought implementations. The model usually gets to the correct answer, but the cost of overthinking is processing too many tokens. Hence, avoiding overthinking is important for token reduction and CoT efficiency.

Research papers on overthinking in LLM reasoning:

Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
By Asif Razzaq, January 24, 2025, Berkeley Sky Computing Lab Introduces Sky-T1-32B-Flash: A New Reasoning Language Model that Significantly Reduces Overthinking, Slashing Inference Costs on Challenging Questions by up to 57%, https://www.marktechpost.com/2025/01/24/berkeley-sky-computing-lab-introduces-sky-t1-32b-flash-a-new-reasoning-language-model-that-significantly-reduces-overthinking-slashing-inference-costs-on-challenging-questions-by-up-to-57/
G Wang, S Zhang, T Zhan, Z Shen, J Li, X Hu, X Sun, Jan 2025, Unlocking the Mysteries of OpenAI o1: A Survey of the Reasoning Abilities of Large Language Models, https://openreview.net/pdf?id=J0ADLa2rNp
Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
Salvatore Raieli, Feb 2025, The LLMs’ Dilemma: Thinking Too Much OR Too Little? Exploring the fine line between deep reasoning and computational overkill in large language models., https://levelup.gitconnected.com/the-llms-dilemma-thinking-too-much-or-too-little-619a7532a47e
Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez, https://www.arxiv.org/abs/2502.08235 12 Feb 2025, The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks,
Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei, 25 Feb 2025, Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, https://arxiv.org/abs/2502.18080 (Trying to generate the "shortest correct response" by examining the lengths needed for CoT.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu, 23 Mar 2025 (v2), Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, https://arxiv.org/abs/2503.16419
Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen, 21 May 2025 (v2), Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models, https://arxiv.org/abs/2505.03469
Parshin Shojaee, Maxwell Horton, Iman Mirzadeh, Samy Bengio, Keivan Alizadeh, June 2025, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, Apple, https://machinelearning.apple.com/research/illusion-of-thinking https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
Dr. Ashish Bamania, June 2025, Apple’s New Research Shows That LLM Reasoning Is Completely Broken: A deep dive into Apple research that exposes the flawed thinking process in state-of-the-art Reasoning LLMs, https://ai.gopubby.com/apples-new-research-shows-that-llm-reasoning-is-completely-broken-47b5be71a06a
Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang, 3 Jun 2025, OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation, https://arxiv.org/abs/2506.02397 https://github.com/AgenticIR-Lab/OThink-R1
Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh, 27 May 2025, Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models, https://arxiv.org/abs/2505.21765
Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong, 28 May 2025, Mitigating Overthinking in Large Reasoning Models via Manifold Steering, https://arxiv.org/abs/2505.22411 https://github.com/Aries-iai/Manifold_Steering
Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu, 23 May 2025, Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens, https://arxiv.org/abs/2505.18237
Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun, 25 May 2025, The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training, https://arxiv.org/abs/2505.19217

CoT and Prompt Sequence Optimizations

The generated text for each of the CoT steps in the chain often have some overlap, and there may be token similarity across multiple steps, and between multiple answer "pathways" at one step. Hence, this means:

Common token subsequences that are re-processed.
Input token sequences may appear verbatim in output.
Similar token sequences in multiple reasoning pathways.

Any token sequences that were processed in a prior step, or another pathway at the current step, also have had computations performed, so there are activations (embedding signals) and potentially logits (with extra processing via an unembedding matrix). The question is how to use all this extra token sequence information as a speed optimization?

There are some LLM optimizations that make use of similar types of existing information, and these could potentially be adapted to optimize Chain-of-Thought. Some of the LLM inference optimizations that can make use of this appearance of identical sequences in the input and output, or in alternative pathways at the current step, or in two subsequent inputs, include:

Prompt lookup decoding (a type of speculative decoding)
Fused substring KV caching (re-using the KV cache of sub-sequences)
Activation sparsification (using knowledge of prior texts to help sparsify)

Few research papers yet. I haven't seen many CoT-specific research papers on these LLM inference optimizations.

CoT Sparsity

CoT sparsity is the use of sparsification optimizations in multi-step inference reasoning. With Chain-of-Thought and other multi-step reasoning algorithms, there are multiple pathways or "thoughts," and each has their computed activations. This gives a wealth of extra information whereby sparsity optimizations could be used to focus the next inference steps on particular tokens and embedding signals. However, relatively little research has examined sparsity with a focus on reasoning algorithms.

Research papers on sparsity applied to Chain-of-Thought:

Libo Wang, 11 Dec 2024 (v4), Reducing Reasoning Costs -- The Path of Optimization for Chain of Thought via Sparse Attention Mechanism, https://arxiv.org/abs/2411.09111 https://github.com/brucewang123456789/GeniusTrail.git (Sparse attention optimization applied to CoT.)

CoT and Grammatical Error Correction (GEC)

An interesting point about Chain-of-Thought is that some of the interim steps in the "chain" of text bear the hallmarks of "editing" or "Grammatical Error Correction (GEC)". Features of GEC include:

Similar length of input and output.
Subsequences that are identical to input appear in the output.
Output text is similar to input, with minor changes.

Hence, speed optimizations for GEC may be relevant to optimizing the interim CoT steps, such as:

GEC research
Aggressive decoding
Edit decoding
Early exiting methods (using expected inputs in the classifiers)

Research papers? However, I'm yet to see a paper on CoT and GEC optimizations combined.

CoT and Knowledge Distillation

Research papers on the combination of Knowledge Distillation and CoT:

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
Yuntian Deng, Yejin Choi, Stuart Shieber, 23 May 2024, From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step, https://arxiv.org/abs/2405.14838
Phuc Phan, Hieu Tran, Long Phan, 23 Aug 2024 (v2), Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation, https://arxiv.org/abs/2402.14874
Huanxuan Liao, Shizhu He, Yupu Hao, Xiang Li, Yuanzhe Zhang, Jun Zhao, Kang Liu, Jan 2025, SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models, Proceedings of the 31st International Conference on Computational Linguistics, pages 3203–3221 January 19–24, 2025, https://aclanthology.org/2025.coling-main.215.pdf
Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He, 28 Feb 2025, CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation, https://arxiv.org/abs/2502.21074
Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, Jiaxin Huang, 25 Feb 2025, Efficient Test-Time Scaling via Self-Calibration, https://arxiv.org/abs/2503.00031

Multimodal CoT

Research papers on multimodal CoT include:

Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong, 5 Dec 2024 (v3), Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models, https://arxiv.org/abs/2402.07754
Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li, 29 Nov 2024, Interleaved-Modal Chain-of-Thought, https://arxiv.org/abs/2411.19488
Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, Jiebo Luo, 5 Jan 2024, CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs, https://arxiv.org/abs/2401.02582
Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang, 8 Jan 2025, URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, https://arxiv.org/abs/2501.04686
Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu, 10 Jan 2025, Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models, https://arxiv.org/abs/2501.05662 (Optimize multimodal CoT by breaking down prompts into smaller sub-goals.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei, 16 Mar 2025, Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey, https://arxiv.org/abs/2503.12605

General Research on Chain-of-Thought Efficiency Optimization

Some of the papers on the general area of Chain-of-Thought efficiency:

OpenAI, Dec 2024, OpenAI o1 and new tools for developers, https://openai.com/index/o1-and-new-tools-for-developers/ ("Lower latency: o1 uses on average 60% fewer reasoning tokens than o1-preview for a given request.")
Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, 4 Jun 2024 (v2), Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263
Jeffrey Cheng, Benjamin Van Durme, 17 Dec 2024, Compressed Chain of Thought: Efficient Reasoning Through Dense Representations, https://arxiv.org/abs/2412.13171 (Context compression applied to interim CoT reasoning steps.)
Libo Wang, 11 Dec 2024 (v4), Reducing Reasoning Costs -- The Path of Optimization for Chain of Thought via Sparse Attention Mechanism, https://arxiv.org/abs/2411.09111 https://github.com/brucewang123456789/GeniusTrail.git
Sachin Kumar, Sep 17, 2024, Hidden Chain-of-Thought decoding: faster and efficient CoT decoding to improve reasoning of LLMs, https://medium.com/@techsachin/hidden-chain-of-thought-decoding-faster-and-efficient-cot-decoding-to-improve-reasoning-of-llms-d95584bc9346
Devmallya Karar, Oct 4, 2024, Chain-Of-Thought ( CoT ) in Large Language Models prompting and Concise CoT with Code implementation using Python and PyTorch, https://medium.com/@devmallyakarar/chain-of-thought-cot-in-large-language-models-prompting-and-concise-cot-with-code-82821f9a832d
Cobus Greyling, Jan 24, 2024, Concise Chain-of-Thought (CCoT) Prompting, Traditional CoT comes at a cost of increased output token usage, CCoT prompting is a prompt-engineering technique which is aimed at reducing LLM response verbosity & inference time. https://cobusgreyling.substack.com/p/concise-chain-of-thought-ccot-prompting
Matthew Renze, Erhan Guven 19 Oct 2024 (v3), The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, https://arxiv.org/abs/2401.05618 https://github.com/matthewrenze/jhu-concise-cot (The original paper on Concise CoT.)
Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami, 2 Dec 2024, Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index, https://arxiv.org/abs/2412.01690
David Spuler, Dec 21st, 2024, Multi-Step Reasoning Inference Optimization, Aussie AI Blog, https://www.aussieai.com/blog/reasoning-inference-optimization
Dian Yu, Yuheng Zhang, Jiahao Xu, Tian Liang, Linfeng Song, Zhaopeng Tu, Haitao Mi, Dong Yu, 22 Dec 2024, Teaching LLMs to Refine with Tools, https://arxiv.org/abs/2412.16871
Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin, 31 Oct 2024 (v2), Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2406.09136 https://github.com/sail-sg/CPO
Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi, 14 Aug 2024 (v2), CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference, https://arxiv.org/abs/2310.10845
Shiv Sakhuja, 25 Sep 2024, Chain-of-Thought (CoT) Prompting Explained: 7 Techniques for Optimizing AI Performance, https://hub.athina.ai/athina-originals/guides-chain-of-thought-cot-prompting-explained-7-techniques-for-optimizing-ai-performance/
Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Ji ayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. 2024. Can language models learn to skip steps? In The Thirty-eighth Annual Conference on Neural In formation Processing Systems. https://arxiv.org/abs/2411.01855
Mayi Xu, Yunfeng Ning, Yongqi Li, Jianhao Chen, Jintao Wen, Yao Xiao, Shen Zhou, Birong Pan, Zepeng Bao, Xin Miao, Hankun Kang, Ke Sun, Tieyun Qian, 2 Jan 2025, Reasoning based on symbolic and parametric knowledge bases: a survey, https://arxiv.org/abs/2501.01030 (Extensive survey of reasoning from CoT to knowledge graphs to table-based reasoning.)
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo, 10 Oct 2024, Automatic Curriculum Expert Iteration for Reliable LLM Reasoning, https://arxiv.org/abs/2410.07627 (Efficiency of bailing out with "I don't know" or refusals versus continuing reasoning steps.)
Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou, 25 Aug 2024, Path-Consistency: Prefix Enhancement for Efficient Inference in LLM, https://arxiv.org/abs/2409.01281 (Uses the confidence calculations from earlier branches of the reasoning to improve efficiency.)
Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 23 Dec 2024, Deliberation in Latent Space via Differentiable Cache Augmentation, https://arxiv.org/abs/2412.17747 (Augmenting the KV cache with reasoning information so that decoding will mimic multi-step reasoning with fewer tokens required for intermediate steps.)
Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao, 8 Feb 2024 (v3), Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning, https://arxiv.org/abs/2310.03094 (Efficient CoT using smaller models.)
Wenqing Chen, Weicheng Wang, Zhixuan Chu, Kui Ren, Zibin Zheng, and Zhichao Lu. 2024. Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14162–14167, Bangkok, Thailand. Association for Computational Linguistics. https://aclanthology.org/2024.findings-acl.842/ (Generate multiple paraphrased answers, which can reduce tokens as fewer are needed.)
Zhen Li, Yupeng Su, Runming Yang, Zhongwei Xie, Ngai Wong, Hongxia Yang, 6 Jan 2025, Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning, https://arxiv.org/abs/2501.03035 (Analysis of quantization's effect on CoT and math reasoning.)
Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
Janelle Teng, Dec 24, 2024, Unwrapping OpenAI’s o3, https://nextbigteng.substack.com/p/unwrapping-openai-o3-reasoning-model ("...it costs a whopping $17-$20 per task to run o3 in low-compute mode...o3 and other CoT models are currently expensive at inference")
Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok, 10 Jan 2025, Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models, https://arxiv.org/abs/2501.05752 (CoT optimization by avoiding redundant paths that have identical semantics.)
Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu, 10 Jan 2025, Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models, https://arxiv.org/abs/2501.05662 (Optimize multimodal CoT by breaking down prompts into smaller sub-goals.)
Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White, 30 Dec 2024, Aviary: training language agents on challenging scientific tasks, https://arxiv.org/abs/2412.21154 (Using smaller models combined with multi-step reasoning to compete with big models with 100x less inference cost.)
Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 22 Jan 2025, O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383 (INT2 KV caching with special handling of outliers, RoPE, and attention sinks, and the resulting architecture works in Chain-of-Thought.)
Jianing Sun, Zhichao Zhang, Xiaopu Wang, Xinyuan Ji, Yizhi Zhang, October 17, 2024; revised December 23, 2024, Fallback Prompting Guides Large Language Models for Accurate Responses in Complex Reasoning, https://iecscience.org/uploads/jpapers/202501/cGU5HbpaB6LvuhKhZHeAPUzYGglrSN2xxASwBlPH.pdf
Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang, 29 Jan 2025, Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, https://arxiv.org/abs/2501.17974 (CoT optimization using an inference budget.)
Mayi Xu, Yongqi Li, Ke Sun, and Tieyun Qian, 2024, Adaption-of-thought: Learning question difficulty improves large language models for reasoning, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5468–5495, 2024, https://aclanthology.org/2024.emnlp-main.313/ https://aclanthology.org/2024.emnlp-main.313.pdf
G Wang, S Zhang, T Zhan, Z Shen, J Li, X Hu, X Sun, Jan 2025, Unlocking the Mysteries of OpenAI o1: A Survey of the Reasoning Abilities of Large Language Models, https://openreview.net/pdf?id=J0ADLa2rNp
Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu, 31 Dec 2024 (v2), Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria, https://arxiv.org/abs/2412.21006
Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Jan 2025, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, https://arxiv.org/abs/2501.18585
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
Ben Dickson, February 5, 2025, Not every AI prompt deserves multiple seconds of thinking: how Meta is teaching models to prioritize, https://venturebeat.com/ai/not-every-ai-prompt-deserves-multiple-seconds-of-thinking-how-meta-is-teaching-models-to-prioritize/
Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica, 11 Feb 2025, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! https://arxiv.org/abs/2502.07374 https://github.com/NovaSky-AI/SkyThought (Learning to reason with SFT and LoRA.)
Daman Arora, Andrea Zanette, 11 Feb 2025 (v2), Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 https://github.com/Zanette-Labs/efficient-reasoning
Jin, M., Yu, Q., Shu, D., Zhao, H., Hua, W., Meng, Y., Zhang, Y., and Du, M. The impact of reasoning step length on large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 1830 1842, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024. https://aclanthology.org/2024.findings-acl.108/ (Shows that token reduction does reduce accuracy in reasoning.)
Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067
Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Yue Xing, Jiliang Tang, Qi He, 18 Feb 2025, Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, https://arxiv.org/abs/2502.13260
Marthe Ballon, Andres Algaba, Vincent Ginis, 21 Feb 2025, The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer, https://arxiv.org/abs/2502.15631
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang, 21 Feb 2025, LightThinker: Thinking Step-by-Step Compression, https://arxiv.org/abs/2502.15589 https://github.com/zjunlp/LightThinker (Faster CoT by compressing the text of intermediate reasoning steps with gist tokens.)
Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
Ayeong Lee, Ethan Che, Tianyi Peng, 3 Mar 2025, How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach, https://arxiv.org/abs/2503.01141
Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, Jiaxin Huang, 25 Feb 2025, Efficient Test-Time Scaling via Self-Calibration, https://arxiv.org/abs/2503.00031
Anonymous authors, Mar 2025, Pencil: Long Throughts with Short Memory, ICLR 2025 review, https://openreview.net/pdf?id=KRI2Fmffqr (Using a "memory" during reasoning steps to rewrite intermediate CoT steps in shorter form.)
Y Fu, J Chen, Y Zhuang, Z Fu, I Stoica, H Zhang, Mar 2025, Reasoning Without Self-Doubt: More Efficient Chain-of-Thought Through Certainty Probing, ICLR 2025 review, https://openreview.net/pdf?id=wpK4IMJfdX (Shortening CoT reasoning paths by "probing" the LLM in the middle of the sequence to examine if it already has the final answer, thereby avoiding unnecessary extra reasoning steps.)
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
Ashraf Eassa, Anjali Shah, Huizi Mao, Hao Lu, Erin Ho, Justin Xin and Omri Almog, Mar 18, 2025, NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance, https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/
Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
Yijiong Yu, 26 Mar 2025, Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence, https://arxiv.org/abs/2503.20533 https://github.com/yuyijiong/parallel-decoding-in-one-sequence
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, Xia Hu, 23 Mar 2025 (v2), Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, https://arxiv.org/abs/2503.16419
Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Zhifang Sui, 16 May 2025, SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning, https://arxiv.org/abs/2505.11274
Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao, 16 May 2025, Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL, https://arxiv.org/abs/2505.10832
Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi, 21 May 2025 (v2), Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models, https://arxiv.org/abs/2505.10446
Muzhi Dai, Chenxu Yang, Qingyi Si, 17 May 2025 (v2), S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models, https://arxiv.org/abs/2505.07686
Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong, 21 May 2025 (v2), Scalable Chain of Thoughts via Elastic Reasoning, https://arxiv.org/abs/2505.05315 https://github.com/SalesforceAIResearch/Elastic-Reasoning
Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen, 21 May 2025 (v2), Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models, https://arxiv.org/abs/2505.03469
Jingyang Yi, Jiazheng Wang, Sida Li, 16 May 2025 (v2), ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning, https://arxiv.org/abs/2504.21370
Guanghao Li, Wenhao Jiang, Mingfeng Chen, Yan Li, Hao Yu, Shuting Dong, Tao Ren, Ming Tang, Chun Yuan, 30 May 2025, SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought, https://arxiv.org/abs/2505.24181
Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu, 28 May 2025, AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models, https://arxiv.org/abs/2505.22662
Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun, 28 May 2025, CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models, https://arxiv.org/abs/2505.22017
Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh, 27 May 2025, Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models, https://arxiv.org/abs/2505.21765
Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, Xian Wu, 23 May 2025, Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens, https://arxiv.org/abs/2505.18237
Jinyan Su, Claire Cardie, 23 May 2025, Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards, https://arxiv.org/abs/2505.18298
Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang, 26 May 2025, Efficient Reasoning via Chain of Unconscious Thought, https://arxiv.org/abs/2505.19756
Jiaxuan Gao, Shu Yan, Qixin Tan, Lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, Yi Wu, 8 Jun 2025, How Far Are We from Optimal Reasoning Efficiency? https://arxiv.org/abs/2506.07104

Reasoning Inference Optimization (RIO)

Other research papers on the general area of reasoning optimization, not necessarily specific to CoT methods. See: Reasoning Inference Optimization (RIO).

Reasoning and CoT Efficiency Topics

Blog articles on reasoning efficiency:

More research information on general efficiency optimization techniques for reasoning models:

Efficiency optimizations to Chain-of-Thought include: