Aussie AI

Reasoning Inference Optimization

  • Last Updated 7 March, 2025
  • by David Spuler, Ph.D.

What is Reasoning Inference Optimization?

Reasoning Inference Optimization is the application of LLM inference optimization techniques to speed up reasoning models, such as Chain-of-Thought (CoT) algorithms. Although all of the standard LLM inference optimization techniques (more than 500 exist) can be applied to the inference steps in these multi-step reasoning methods, there are also additional optimizations that are specific to multi-step inference.

The distinguishing feature of CoT and other reasoning algorithms is that they work on a sequence of texts, iteratively refining the answer until a best answer is chosen. Generally speaking, more steps are better than fewer, which gives rise to the "inference scaling law," whereby more inference computation increases the effective intelligence of the model. Hence, there is a trade-off between speed and capability.
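The trade-off above can be sketched as a step-budgeted refinement loop. This is a minimal sketch, not any real API: `generate_step` and `score` are hypothetical stand-ins for one LLM inference call and a verifier, passed in as callables.

```python
def multi_step_reasoning(question, generate_step, score,
                         max_steps=8, target_score=0.9):
    """Iteratively refine an answer over multiple inference steps.

    More steps cost more compute but tend to yield a better answer
    (the inference scaling law); exiting early once the answer is
    good enough trades a little capability for speed.
    """
    draft = ""
    best, best_score = None, float("-inf")
    for _ in range(max_steps):
        draft = generate_step(question, draft)   # one LLM inference call
        s = score(question, draft)               # e.g., a verifier model
        if s > best_score:
            best, best_score = draft, s
        if best_score >= target_score:
            break                                # early exit saves compute
    return best
```

Raising `max_steps` or `target_score` buys capability at the cost of latency; lowering them does the reverse.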

The question, then, is whether there are any cross-step optimizations that make use of prior texts in the sequence, whether from continued paths or abandoned ones. Some of the inference optimization techniques that may have particular applicability to multi-step reasoning algorithms include:

  • High-level reasoning algorithm changes (e.g., fewer steps, cutting short reasoning paths earlier, etc.)
  • Concise Chain-of-Thought (CCoT) — a particular enhancement to CoT whereby the model is prompted to be "concise" in its thoughts.
  • Token reduction methods — many of these techniques seem particularly applicable to Chain-of-Thought, which produces long token sequences in its reasoning steps.
  • Prompt lookup decoding — use token sequences in prior reasoning steps for faster and more accurate drafting in speculative decoding.
  • Early exit enhancements — use text sequences from prior steps as part of the exiting decision, to allow more precise exiting with reduced accuracy loss.
  • Activation sparsification — CoT algorithms carry a lot of context, effectively early drafts of the final output, which gives a strong indication of which signals are relevant to the final answer, so dynamic sparsification of activations may be effective.
  • Fused substring KV caching — where a subsequence has been calculated by inference in a previous step, the KV cache for this substring can be reused (a generalization of prefix KV caching).
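As one illustration of a cross-step optimization, prompt lookup decoding over prior reasoning steps amounts to a longest-suffix-match search: find where the current context's suffix appeared in an earlier step, and propose the tokens that followed it as a speculative draft. The `lookup_draft` helper and the token lists below are hypothetical; a real system would still verify the drafted tokens with the main model.

```python
def lookup_draft(context, prior_steps, max_ngram=4, draft_len=3):
    """Propose draft tokens for speculative decoding by copying from
    prior reasoning steps.

    Searches earlier step token lists for the longest suffix of the
    current context (up to max_ngram tokens) and returns the tokens
    that followed the match as a speculative draft.
    """
    for n in range(min(max_ngram, len(context)), 0, -1):
        suffix = context[-n:]
        for step_tokens in prior_steps:
            for i in range(len(step_tokens) - n):
                if step_tokens[i:i + n] == suffix:
                    return step_tokens[i + n:i + n + draft_len]
    return []  # no match: fall back to normal autoregressive decoding
```

Because reasoning steps tend to repeat phrases from earlier drafts, the match rate (and hence the speculative speedup) should be higher here than in generic text generation.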

Another factor is that each step of a multi-step inference algorithm takes an input text and produces an output text. What does this sound like? Well, it's editing and revising! Hence, research on editing and revising text may also be applicable.
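The editing-and-revising view can be sketched as a revision loop that stops once the text converges, which is itself a cross-step optimization (no budget is wasted on steps that change nothing). Here `revise` is a hypothetical stand-in for one LLM inference call.

```python
def refine_until_stable(text, revise, max_rounds=6):
    """Apply a revision function repeatedly, stopping early once the
    output stops changing or the round budget is exhausted."""
    for _ in range(max_rounds):
        revised = revise(text)
        if revised == text:      # converged: further steps add no value
            break
        text = revised
    return text
```

A practical variant would compare semantic similarity rather than exact equality, since LLM outputs rarely repeat verbatim.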

Research on Reasoning Inference Optimization

Research papers on speeding up reasoning algorithms:

  • OpenAI, Dec 2024, OpenAI o1 and new tools for developers, https://openai.com/index/o1-and-new-tools-for-developers/ ("Lower latency: o1 uses on average 60% fewer reasoning tokens than o1-preview for a given request.")
  • 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561 (Compressing the interim token sequences in Chain-of-Thought.)
  • Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
  • Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou, 16 Dec 2024, C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness, https://arxiv.org/abs/2412.11664 (Token pruning and prompt compression for Chain-of-Thought.)
  • Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, 4 Jun 2024 (v2), Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263
  • Jeffrey Cheng, Benjamin Van Durme, 17 Dec 2024, Compressed Chain of Thought: Efficient Reasoning Through Dense Representations, https://arxiv.org/abs/2412.13171 (Context compression applied to interim CoT reasoning steps.)
  • Libo Wang, 11 Dec 2024 (v4), Reducing Reasoning Costs -- The Path of Optimization for Chain of Thought via Sparse Attention Mechanism, https://arxiv.org/abs/2411.09111 https://github.com/brucewang123456789/GeniusTrail.git
  • Sachin Kumar, Sep 17, 2024, Hidden Chain-of-Thought decoding: faster and efficient CoT decoding to improve reasoning of LLMs, https://medium.com/@techsachin/hidden-chain-of-thought-decoding-faster-and-efficient-cot-decoding-to-improve-reasoning-of-llms-d95584bc9346
  • Devmallya Karar, Oct 4, 2024, Chain-Of-Thought ( CoT ) in Large Language Models prompting and Concise CoT with Code implementation using Python and PyTorch, https://medium.com/@devmallyakarar/chain-of-thought-cot-in-large-language-models-prompting-and-concise-cot-with-code-82821f9a832d
  • Cobus Greyling, Jan 24, 2024, Concise Chain-of-Thought (CCoT) Prompting (traditional CoT comes at the cost of increased output token usage; CCoT prompting is a prompt-engineering technique aimed at reducing LLM response verbosity and inference time), https://cobusgreyling.substack.com/p/concise-chain-of-thought-ccot-prompting
  • Matthew Renze, Erhan Guven, 19 Oct 2024 (v3), The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, https://arxiv.org/abs/2401.05618 https://github.com/matthewrenze/jhu-concise-cot (The original paper on Concise CoT.)
  • Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami, 2 Dec 2024, Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index, https://arxiv.org/abs/2412.01690
  • David Spuler, Dec 21st, 2024, Multi-Step Reasoning Inference Optimization, Aussie AI Blog, https://www.aussieai.com/blog/reasoning-inference-optimization
  • Dian Yu, Yuheng Zhang, Jiahao Xu, Tian Liang, Linfeng Song, Zhaopeng Tu, Haitao Mi, Dong Yu, 22 Dec 2024, Teaching LLMs to Refine with Tools, https://arxiv.org/abs/2412.16871
  • Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin, 31 Oct 2024 (v2), Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2406.09136 https://github.com/sail-sg/CPO
  • Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
  • Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi, 14 Aug 2024 (v2), CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference, https://arxiv.org/abs/2310.10845
  • Shiv Sakhuja, 25 Sep 2024, Chain-of-Thought (CoT) Prompting Explained: 7 Techniques for Optimizing AI Performance, https://hub.athina.ai/athina-originals/guides-chain-of-thought-cot-prompting-explained-7-techniques-for-optimizing-ai-performance/
  • Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. 2024. Can language models learn to skip steps? In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://arxiv.org/abs/2411.01855
  • Mayi Xu, Yunfeng Ning, Yongqi Li, Jianhao Chen, Jintao Wen, Yao Xiao, Shen Zhou, Birong Pan, Zepeng Bao, Xin Miao, Hankun Kang, Ke Sun, Tieyun Qian, 2 Jan 2025, Reasoning based on symbolic and parametric knowledge bases: a survey, https://arxiv.org/abs/2501.01030 (Extensive survey of reasoning from CoT to knowledge graphs to table-based reasoning.)
  • Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
  • Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo, 10 Oct 2024, Automatic Curriculum Expert Iteration for Reliable LLM Reasoning, https://arxiv.org/abs/2410.07627 (Efficiency of bailing out with "I don't know" or refusals versus continuing reasoning steps.)
  • Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
  • Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li, 24 Aug 2024, Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning, https://arxiv.org/abs/2408.13457
  • Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
  • Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
  • Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou, 25 Aug 2024, Path-Consistency: Prefix Enhancement for Efficient Inference in LLM, https://arxiv.org/abs/2409.01281 (Uses the confidence calculations from earlier branches of the reasoning to improve efficiency.)
  • Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
  • Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 23 Dec 2024, Deliberation in Latent Space via Differentiable Cache Augmentation, https://arxiv.org/abs/2412.17747 (Augmenting the KV cache with reasoning information so that decoding will mimic multi-step reasoning with fewer tokens required for intermediate steps.)
  • Sania Nayab, Giulio Rossolini, Giorgio Buttazzo, Nicolamaria Manes, Fabrizio Giacomelli, 29 Jul 2024, Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost, https://arxiv.org/abs/2407.19825
  • Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao, 8 Feb 2024 (v3), Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning, https://arxiv.org/abs/2310.03094 (Efficient CoT using smaller models.)
  • Wenqing Chen, Weicheng Wang, Zhixuan Chu, Kui Ren, Zibin Zheng, and Zhichao Lu. 2024. Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14162–14167, Bangkok, Thailand. Association for Computational Linguistics. https://aclanthology.org/2024.findings-acl.842/ (Generate multiple paraphrased answers, which can reduce tokens as fewer are needed.)
  • Zhen Li, Yupeng Su, Runming Yang, Zhongwei Xie, Ngai Wong, Hongxia Yang, 6 Jan 2025, Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning, https://arxiv.org/abs/2501.03035 (Analysis of quantization's effect on CoT and math reasoning.)
  • Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
  • Janelle Teng, Dec 24, 2024, Unwrapping OpenAI’s o3, https://nextbigteng.substack.com/p/unwrapping-openai-o3-reasoning-model ("...it costs a whopping $17-$20 per task to run o3 in low-compute mode...o3 and other CoT models are currently expensive at inference")
  • Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok, 10 Jan 2025, Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models, https://arxiv.org/abs/2501.05752 (CoT optimization by avoiding redundant paths that have identical semantics.)
  • Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu, 10 Jan 2025, Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models, https://arxiv.org/abs/2501.05662 (Optimize multimodal CoT by breaking down prompts into smaller sub-goals.)
  • Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White, 30 Dec 2024, Aviary: training language agents on challenging scientific tasks, https://arxiv.org/abs/2412.21154 (Using smaller models combined with multi-step reasoning to compete with big models with 100x less inference cost.)
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 22 Jan 2025, O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
  • Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383 (INT2 KV caching with special handling of outliers, RoPE, and attention sinks, and the resulting architecture works in Chain-of-Thought.)
  • Jianing Sun, Zhichao Zhang, Xiaopu Wang, Xinyuan Ji, Yizhi Zhang, October 17, 2024; revised December 23, 2024, Fallback Prompting Guides Large Language Models for Accurate Responses in Complex Reasoning, https://iecscience.org/uploads/jpapers/202501/cGU5HbpaB6LvuhKhZHeAPUzYGglrSN2xxASwBlPH.pdf
  • Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang, 29 Jan 2025, Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization, https://arxiv.org/abs/2501.17974 (CoT optimization using an inference budget.)
  • Mayi Xu, Yongqi Li, Ke Sun, and Tieyun Qian, 2024, Adaption-of-thought: Learning question difficulty improves large language models for reasoning, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5468–5495, 2024, https://aclanthology.org/2024.emnlp-main.313/ https://aclanthology.org/2024.emnlp-main.313.pdf
  • G Wang, S Zhang, T Zhan, Z Shen, J Li, X Hu, X Sun, Jan 2025, Unlocking the Mysteries of OpenAI o1: A Survey of the Reasoning Abilities of Large Language Models, https://openreview.net/pdf?id=J0ADLa2rNp
  • Joonwon Jang, Jaehee Kim, Wonbin Kweon, Hwanjo Yu, 31 Dec 2024 (v2), Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale via Principled Criteria, https://arxiv.org/abs/2412.21006
  • Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Jan 2025, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, https://arxiv.org/abs/2501.18585
  • Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
  • Mohammed Karimkhan Pathan, February 3, 2025, Open-source revolution: How DeepSeek-R1 challenges OpenAI’s o1 with superior processing, cost efficiency, https://venturebeat.com/ai/open-source-revolution-how-deepseek-r1-challenges-openais-o1-with-superior-processing-cost-efficiency/
  • Ben Dickson, February 5, 2025, Not every AI prompt deserves multiple seconds of thinking: how Meta is teaching models to prioritize, https://venturebeat.com/ai/not-every-ai-prompt-deserves-multiple-seconds-of-thinking-how-meta-is-teaching-models-to-prioritize/
  • Zhi Zhou, Tan Yuhao, Zenan Li, Yuan Yao, Lan-Zhe Guo, Xiaoxing Ma, Yu-Feng Li, 1 Feb 2025, Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning, https://arxiv.org/abs/2502.00511
  • Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica, 11 Feb 2025, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! https://arxiv.org/abs/2502.07374 https://github.com/NovaSky-AI/SkyThought (Learning to reason with SFT and LoRA.)
  • Daman Arora, Andrea Zanette, 11 Feb 2025 (v2), Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 https://github.com/Zanette-Labs/efficient-reasoning
  • Jin, M., Yu, Q., Shu, D., Zhao, H., Hua, W., Meng, Y., Zhang, Y., and Du, M. The impact of reasoning step length on large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 1830–1842, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024. https://aclanthology.org/2024.findings-acl.108/ (Shows that token reduction does reduce accuracy in reasoning.)
  • Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li, 17 Feb 2025, TokenSkip: Controllable Chain-of-Thought Compression in LLMs, https://arxiv.org/abs/2502.12067
  • Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Yue Xing, Jiliang Tang, Qi He, 18 Feb 2025, Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models, https://arxiv.org/abs/2502.13260
  • XYZ Labs, Feb 23, 2025, Open Reasoner Zero: A Breakthrough in AI Training Efficiency Matches DeepSeek with Just 1/30th of Training Steps. Major AI Figures Including Kai-Fu Lee, Harry Shum, and Xiangyu Zhang Unveil Revolutionary Open-Source Training Method. https://xyzlabs.substack.com/p/open-reasoner-zero-a-breakthrough
  • Marthe Ballon, Andres Algaba, Vincent Ginis, 21 Feb 2025, The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer, https://arxiv.org/abs/2502.15631
  • Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang, 21 Feb 2025, LightThinker: Thinking Step-by-Step Compression, https://arxiv.org/abs/2502.15589 https://github.com/zjunlp/LightThinker (Faster CoT by compressing the text of intermediate reasoning steps with gist tokens.)
  • Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Ayeong Lee, Ethan Che, Tianyi Peng, 3 Mar 2025, How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach, https://arxiv.org/abs/2503.01141
  • Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, Jiaxin Huang, 25 Feb 2025, Efficient Test-Time Scaling via Self-Calibration, https://arxiv.org/abs/2503.00031
