Aussie AI

LLM Reasoning Research

  • Last Updated 21 March, 2025
  • by David Spuler, Ph.D.

Reasoning is a key part of intelligence, and much work is ongoing to improve higher-level reasoning of AI models. Examples include solving mathematical problems or performing multi-step planning such as booking a holiday.

There are two main categories of methods to improve reasoning ability:

  • Training methods ("white box reasoning")
  • Multi-step inference methods ("black box reasoning")


Training-Based Reasoning

White box reasoning means training the weights inside an LLM so that it performs better on reasoning tasks. Historically, the first idea for creating smarter models was always to train an LLM with better data and better techniques, and this has improved raw results on "reasoning" and "generalization" tasks.

Lately, this has given rise to Large Reasoner Model (LRM) architectures, of two main types: trained reasoning models that still give an answer in one step, and multi-step inference models that use multiple steps and "test time compute" to give better answers to complex questions.

The single-shot inference types of reasoning models rely on prompt engineering to get the LLM to perform its reasoning steps. Many of the basic prompt engineering ideas are applicable here:

  • Basic step prompting ("Let's think step by step")
  • Emotional prompting
  • Roles/personas
  • CoT prompting
  • Zero-shot CoT prompting
  • Echo prompting ("Let's repeat the question")
  • Self-consistency
  • Self-ask (followup questions)
  • Exemplars (In-Context Learning)

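As an illustration, basic step prompting and exemplar-based (few-shot) prompting can be combined in a small helper. This is a minimal sketch: `build_cot_prompt` is a hypothetical name, and a real system would send the resulting string to an LLM API.

```python
def build_cot_prompt(question, exemplars=None):
    """Assemble a Chain-of-Thought prompt string.

    With exemplars, this is few-shot CoT (in-context learning); without
    them, it is zero-shot CoT using the classic trigger phrase.
    """
    parts = []
    for q, worked_answer in (exemplars or []):
        parts.append(f"Q: {q}\nA: {worked_answer}")
    # The "Let's think step by step" suffix is the zero-shot CoT trigger.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

# Zero-shot CoT: just the question plus the step-by-step trigger.
print(build_cot_prompt("What is 17 * 24?"))
```

The same helper covers both list items above: passing worked examples turns the zero-shot prompt into a few-shot one.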
The major LRMs use more advanced meta-prompts for reasoning, whether single-step or multi-step, but these prompts are commercially sensitive and not usually available. Interestingly, the meta-prompt for the single-step DeepSeek R1 reasoning model was disclosed in their paper (https://arxiv.org/abs/2501.12948):

    A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
    The assistant first thinks about the reasoning process in the mind and then provides the user
    with the answer. The reasoning process and answer are enclosed within <think> </think> and
    <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
    <answer> answer here </answer>. User: PROMPT. Assistant:
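
Given this tag format, the model's raw output can be split into its reasoning trace and its final answer with a few lines of Python. This is a hypothetical sketch assuming the `<think>`/`<answer>` convention from the meta-prompt above:

```python
import re

def split_reasoning(output):
    """Split a reasoning model's raw output into (reasoning, answer),
    assuming the <think>/<answer> tag format of the R1 meta-prompt."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        # Fall back to the whole output if no answer tags were emitted.
        answer.group(1).strip() if answer else output.strip(),
    )

raw = "<think> 2+2 is 4 </think> <answer> 4 </answer>"
print(split_reasoning(raw))  # ('2+2 is 4', '4')
```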

Fine-tuning on a more specialized subset of relevant data is a particular submethod of this area. There has been much improvement here, both in the capabilities of high-end SOTA models and, at the other end of the spectrum, in Small Language Models (SLMs). See more about training methods, but note that there has not yet been much research on fine-tuning for reasoning capabilities.

Inference-Based Reasoning

Black box reasoning wraps multiple steps of inference around an LLM. The idea is to treat the LLM as a "black box" and use additional LLM calls to improve its reasoning abilities. These are called "few-shot," "many-shot," or "multi-step" reasoning methods.

Chain-of-thought is the best known of these methods, having been adopted by OpenAI for the "o1" models released in September 2024. However, multi-step reasoning is a longstanding area of research, with much overlap with prompt engineering techniques. The literature describes numerous methods for making multiple LLM calls of this kind:

  • Chain-of-thought (CoT)
  • Self-reflection
  • Skeleton-of-thought
  • Best-of-N (BoN) method
  • Majority voting
  • Self-consistency decoding
  • Programmatic prompting
  • Tree-of-Thoughts (ToT) prompting
  • Chain-of-Symbols (CoS) prompting
  • Graph-of-Thoughts (GoT)
  • Algorithm-of-Thoughts (AoT)
  • Buffer of Thoughts
  • Least-to-Most prompting
  • Chain-of-Table prompting
  • Thread-of-Thought (ThoT) prompting
  • System 2 Attention (S2A) prompting
  • Chain-of-Verification (CoVe) prompting
  • ReAct prompting (reason-and-act)
  • Rephrase-and-Respond (RaR) prompting
  • Chain-of-Knowledge (CoK) prompting
  • Contrastive Chain-of-Thought (CCoT) prompting
  • Program of Thoughts (PoT) prompting
  • Structured Chain-of-Thought (SCoT) prompting
  • Chain-of-Code (CoC) prompting
  • Take a Step Back prompting

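Several of these methods (Best-of-N, majority voting, self-consistency) share a simple skeleton: sample several reasoning chains and aggregate their final answers. A minimal sketch, assuming the per-chain sampling has already been done elsewhere (e.g. by repeated LLM calls at nonzero temperature):

```python
from collections import Counter

def self_consistency(sample_answers):
    """Pick the most common final answer across N sampled reasoning
    chains (majority voting / self-consistency decoding)."""
    counts = Counter(a.strip() for a in sample_answers)
    answer, _count = counts.most_common(1)[0]
    return answer

# Five sampled chains, three of which converge on "42".
chains = ["42", "41", "42", "43", "42"]
print(self_consistency(chains))  # 42
```

The LLM-specific work is in generating the chains and extracting each final answer; the aggregation step itself is this simple.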
Also related to these areas are various other ways to have the LLM give a "better" answer, even if it is not strictly improved reasoning. The simplest ideas include prompt engineering techniques to give the LLM a better query, RAG architectures and Retrieval-Augmented Language Models (RALM) to supply more relevant source data, and dynamic tool usage integrations that generalize the LLM's capabilities to handle answers requiring computation. Also relevant is research on improving answers by fixing specific LLM limitations such as hallucinations, difficulties with mathematical problem solving, and weaknesses at language wordplay.

Long Answers versus Multiple Inference Steps

One of the nuances in the distinction between zero-shot reasoner models and multiple steps of inference is the simplest of ideas: output longer answers. Large Reasoner Models with a single-step architecture, such as DeepSeek R1, mimic the steps of reasoning by repeatedly extending the answers with re-phrased reasoning steps about the problem. This is analogous to multi-step inference reasoning, but the model is "talking to itself" about how to reason through the problem, all in one step of inference.

In effect, the sequence of multiple outputs in chained multi-step reasoning is merged into a single output stream of text. The model decides whether another step is required as part of the normal decoding phase. The output from these single-step reasoner models is a readable sequence showing how the model thought through a problem. Hence, reaching a final answer can require a very long output token sequence, which can be costly, and it is important not to restrict the "max tokens" setting in these cases.

Inference costs are obviously higher for producing an extended answer with many of the intermediate thoughts written into it. However, the token counts in multi-step inference are also high. Whether a single-inference model's long answer uses more or fewer tokens than a multi-step implementation of Chain-of-Thought is not really clear (need some papers!), but the reasoning ability is high for either approach.
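
The trade-off can be sketched with back-of-the-envelope token accounting. The cost model here is an illustrative assumption (each multi-step call re-sends the growing context, with no prompt caching), not a measurement:

```python
def single_step_cost(reasoning_tokens, answer_tokens):
    """One inference call: the chain-of-thought and the final answer
    share a single output stream."""
    return reasoning_tokens + answer_tokens

def multi_step_cost(steps, tokens_per_step, context_overhead):
    """N separate calls: each step re-sends the growing context, so
    input tokens accumulate on top of the generated tokens."""
    total = 0
    context = context_overhead
    for _ in range(steps):
        total += context + tokens_per_step  # input + output for this call
        context += tokens_per_step          # prior step joins the context
    return total

# Illustrative numbers: ~300 reasoning tokens either way, 50-token answer.
print(single_step_cost(300, 50))    # one long answer: 350 tokens
print(multi_step_cost(3, 100, 50))  # three chained calls: 750 tokens
```

Under these toy assumptions the multi-step version processes more total tokens because each step re-reads the accumulated context; real deployments with KV caching or prompt caching would narrow that gap.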

Survey Papers on LLM Reasoning

Survey and review papers on reasoning:

  • Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Alhassan Mumuni, Fuseini Mumuni, 6 Jan 2025, Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches, https://arxiv.org/abs/2501.03151
  • Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
  • Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
  • Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
  • Jie Huang and Kevin Chen-Chuan Chang. July 2023. Towards Reasoning in Large Language Models: A Survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.findings-acl.67/
  • Seungpil Lee, Woochang Sim, Donghyeon Shin, Wongyu Seo, Jiwon Park, Seokki Lee, Sanha Hwang, Sejin Kim, and Sundong Kim. Jan 2025. Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus. ACM Trans. Intell. Syst. Technol. https://doi.org/10.1145/3712701 https://dl.acm.org/doi/10.1145/3712701 https://dl.acm.org/doi/pdf/10.1145/3712701
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Mohit Sewak, Ph.D., January 29, 2025, Achieving General Intelligence (AGI) and Super Intelligence (ASI): Pathways, Uncertainties, and Ethical Concerns, https://towardsai.net/p/l/achieving-general-intelligence-agi-and-super-intelligence-asi-pathways-uncertainties-and-ethical-concerns
  • Avinash Patil, 5 Feb 2025, Advancing Reasoning in Large Language Models: Promising Methods and Approaches, https://arxiv.org/abs/2502.03671
  • Hieu Minh "Jord" Nguyen, 10 Feb 2025, A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks, https://arxiv.org/abs/2502.06470
  • Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
  • Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, Zhouchen Lin, 24 Feb 2025 (v2), Empowering LLMs with Logical Reasoning: A Comprehensive Survey, https://arxiv.org/abs/2502.15652
  • Cameron R. Wolfe, Feb 18, 2025, Demystifying Reasoning Models: Understanding reasoning models and their relation to standard LLMs... https://cameronrwolfe.substack.com/p/demystifying-reasoning-models
  • Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, Cheng-Lin Liu, 25 Feb 2025 (v2), From System 1 to System 2: A Survey of Reasoning Large Language Models, https://arxiv.org/abs/2502.17419
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
  • Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
  • Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei, 16 Mar 2025, Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey, https://arxiv.org/abs/2503.12605
  • Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Asif Ekbal, 13 Mar 2025, Thinking Machines: A Survey of LLM based Reasoning Strategies, https://arxiv.org/abs/2503.10814

Reasoning Theory

Papers about the deeper theory of what "reasoning" means:

  • Eghbal Hosseini, Colton Casto, Noga Zaslavsky, Colin Conwell, Mark Richardson, Evelina Fedorenko, Dec 2024, Universality of representation in biological and artificial neural networks, bioRxiv 2024.12.26.629294; doi: https://doi.org/10.1101/2024.12.26.629294 https://www.biorxiv.org/content/10.1101/2024.12.26.629294
  • Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen, 17 Jan 2025, Evolving Deeper LLM Thinking, https://arxiv.org/abs/2501.09891 (An alternative search strategy broad/deep, compared to CoT and reflection.)
  • G Bao, H Zhang, C Wang, L Yang, Y Zhang, Jan 2025, How Likely Do LLMs with CoT Mimic Human Reasoning? Proceedings of the 31st International Conference on Computational Linguistics, pages 7831–7850, January 19–24, 2025, https://aclanthology.org/2025.coling-main.524.pdf
  • Santosh Kumar Radha, Oktay Goktas, 23 Jan 2025, On the Reasoning Capacity of AI Models and How to Quantify It, https://arxiv.org/abs/2501.13833
  • Alireza Amiri, Xinting Huang, Mark Rofin, Michael Hahn, 4 Feb 2025, Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, https://arxiv.org/abs/2502.02393
  • Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou, 3 Feb 2025, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 (OpenAI's paper on o3 that has similar conclusions to what DeepSeek showed about Reinforcement Learning for reasoning models, namely that "scaling general-purpose reinforcement learning" still works.)
  • Xinhao Yao, Ruifeng Ren, Yun Liao, Yong Liu, 7 Feb 2025, Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, https://arxiv.org/abs/2502.04667
  • Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
  • Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu, 27 Feb 2025, SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers, https://arxiv.org/abs/2502.20545
  • Yijiong Yu, 16 Jan 2025 (v4), Do LLMs Really Think Step-by-step In Implicit Reasoning? https://arxiv.org/abs/2411.15862 https://github.com/yuyijiong/if_step_by_step_implicit_CoT
  • Marius Jahrens, Thomas Martinetz, 12 Mar 2025, Why LLMs Cannot Think and How to Fix It, https://arxiv.org/abs/2503.09211
  • Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo, 17 Mar 2025, ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs, https://arxiv.org/abs/2503.12918
  • Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Asif Ekbal, 13 Mar 2025, Thinking Machines: A Survey of LLM based Reasoning Strategies, https://arxiv.org/abs/2503.10814

Reasoning Model Evaluation

Papers about testing LLMs (and overall systems) for their reasoning abilities:

  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Santosh Kumar Radha, Oktay Goktas, 23 Jan 2025, On the Reasoning Capacity of AI Models and How to Quantify It, https://arxiv.org/abs/2501.13833
  • Ben Dickson, January 31, 2025, Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks, https://venturebeat.com/ai/beyond-benchmarks-how-deepseek-r1-and-o1-perform-on-real-world-tasks/
  • Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong, 27 Feb 2025, FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving, https://arxiv.org/abs/2502.20238
  • Avinash Patil, 5 Feb 2025, Advancing Reasoning in Large Language Models: Promising Methods and Approaches, https://arxiv.org/abs/2502.03671
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)

Large Reasoning Models (LRMs)

Large Reasoning Models (LRMs) are large-scale LLMs that have been trained for advanced reasoning capabilities. Their architecture may be training-only, but increasingly these architectures include multi-step inference or "test time compute" reasoning capabilities such as Chain-of-Thought.

Papers on large reasoning models:

  • Ignacio de Gregorio, Dec 2024, Uncovering OpenAI’s Frontier AI Strategy, https://medium.com/@ignacio.de.gregorio.noblejas/uncovering-openais-frontier-ai-strategy-a02e0aa5320e
  • Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou, 9 Jan 2025, Search-o1: Agentic Search-Enhanced Large Reasoning Models, https://arxiv.org/abs/2501.05366 https://github.com/sunnynexus/Search-o1 (RAG retrieval and agentic methods applied to Large Reasoning Models.)
  • Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
  • OpenAI, September 12, 2024 Learning to reason with LLMs. We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user. https://openai.com/index/learning-to-reason-with-llms/
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Jie Huang and Kevin Chen-Chuan Chang. July 2023. Towards Reasoning in Large Language Models: A Survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.findings-acl.67/
  • Seungpil Lee, Woochang Sim, Donghyeon Shin, Wongyu Seo, Jiwon Park, Seokki Lee, Sanha Hwang, Sejin Kim, and Sundong Kim. Jan 2025. Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus. ACM Trans. Intell. Syst. Technol. https://doi.org/10.1145/3712701 https://dl.acm.org/doi/10.1145/3712701 https://dl.acm.org/doi/pdf/10.1145/3712701
  • Demis Hassabis, Jan 2025, X post: Announcing Gemini 2.0 Flash https://x.com/demishassabis/status/1881844417746632910 (Gemini 2.0 Flash from Google is a Large Reasoning Model with a 1M ultra-long context.)
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Alberto Romero, Jan 2025, DeepSeek, a little-known Chinese startup, released R1 yesterday, https://substack.com/@thealgorithmicbridge/note/c-87664591-
  • DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, et al. (100+ additional authors not shown), 22 Jan 2025, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 (The DeepSeek R1 large reasoning model.)
  • G Wang, S Zhang, T Zhan, Z Shen, J Li, X Hu, X Sun, Jan 2025, Unlocking the Mysteries of OpenAI o1: A Survey of the Reasoning Abilities of Large Language Models, https://openreview.net/pdf?id=J0ADLa2rNp
  • Ben Dickson, January 31, 2025, Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks, https://venturebeat.com/ai/beyond-benchmarks-how-deepseek-r1-and-o1-perform-on-real-world-tasks/
  • Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, Edouardo Honig, Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, Ying Nian Wu, 3 Feb 2025, Scalable Language Models with Posterior Inference of Latent Thought Vectors, https://arxiv.org/abs/2502.01567
  • Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou, 3 Feb 2025, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 (OpenAI's paper on o3 that has similar conclusions to what DeepSeek showed about Reinforcement Learning for reasoning models, namely that "scaling general-purpose reinforcement learning" still works.)
  • DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, Qinqing Zheng, 5 Feb 2025. Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning, https://arxiv.org/abs/2502.03275
  • Cameron R. Wolfe, Feb 18, 2025, Demystifying Reasoning Models: Understanding reasoning models and their relation to standard LLMs... https://cameronrwolfe.substack.com/p/demystifying-reasoning-models
  • Jeremy Kahn, February 28, 2025, OpenAI launches long-awaited GPT-4.5 — but ‘Orion’s’ capabilities already lag competitors, https://fortune.com/2025/02/27/openai-gpt-4-5-orion-launch-sam-altman-benchmarks/
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Asif Razzaq, March 5, 2025, Qwen Releases QwQ-32B: A 32B Reasoning Model that Achieves Significantly Enhanced Performance in Downstream Task, https://www.marktechpost.com/2025/03/05/qwen-releases-qwq-32b-a-32b-reasoning-model-that-achieves-significantly-enhanced-performance-in-downstream-task/ (Features 32B parameters, 32K context length, 64 layers, RoPE, SwiGLU, RMSNorm, and attention enhancements.)

Open Source Reasoning

Open source reasoning projects are those that either (a) use open-source code to implement multi-step inference-based reasoning algorithms such as Chain-of-Thought (on top of any underlying model), or (b) are Large Reasoning Models whose model weights and architectural details have been open-sourced, such as DeepSeek R1.

General Research on Intelligence

What does it mean to be smart? There are various answers to this, and it's a very nuanced question.

Research on intelligence or "smartness" of AI systems:

Chain-of-Thought (CoT) Reasoning

Research papers on chain-of-thought (CoT) for reasoning:

Advanced Chain-of-Thought

Some more research on advanced improvements to multi-step Chain-of-Thought are below. See also CoT efficiency optimizations.

  • Jiaan Wang, Fandong Meng, Yunlong Liang, Jie Zhou, 23 Dec 2024, DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought, https://arxiv.org/abs/2412.17498 https://github.com/krystalan/DRT-o1 (Examines similes and metaphors in literature using long CoT.)
  • Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
  • Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang, 20 Jan 2025, RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? https://arxiv.org/abs/2501.11284 https://huggingface.co/RedStar-Reasoning
  • Yiyao Yu, Yuxiang Zhang, Dongdong Zhang, Xiao Liang, Hengyuan Zhang, Xingxing Zhang, Ziyi Yang, Mahmoud Khademi, Hany Awadalla, Junjie Wang, Yujiu Yang, Furu Wei, 19 Jan 2025, Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, https://arxiv.org/abs/2501.11110
  • Yuanheng Fang, Guoqing Chao, Wenqiang Lei, Shaobo Li, Dianhui Chu, 21 Jan 2025, CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning, https://arxiv.org/abs/2501.12226 (CoT with integration of clustering and prompt optimization techniques.)
  • Jishnu Ray Chowdhury, Cornelia Caragea, 21 Jan 2025, Zero-Shot Verification-guided Chain of Thoughts, https://arxiv.org/abs/2501.13122
  • Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng, 23 Jan 2025, Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step, https://arxiv.org/abs/2501.13926 https://github.com/ZiyuGuo99/Image-Generation-CoT
  • Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei, 24 Jan 2025, Chain-of-Retrieval Augmented Generation, https://arxiv.org/abs/2501.14342 (Combines RAG with multi-step reasoning such as Chain-of-Thought, with a method to control token cost.)
  • Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343 (Combine RAG and multi-step inference, controlling token cost via budgeting allocations.)
  • Jianfeng Pan, Senyou Deng, Shaomang Huang, 4 Feb 2025, CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, https://arxiv.org/abs/2502.02390 (Integrating results from an "associative memory" in CoT reasoning paths at inference time.)
  • Chen, H., Zhu, J., Wang, W. et al. Triplet-based contrastive method enhances the reasoning ability of large language models. J Supercomput 81, 555 (2025). https://doi.org/10.1007/s11227-025-07056-6 https://link.springer.com/article/10.1007/s11227-025-07056-6 (Providing prompt examples that contrast correct and incorrect results to improve CoT reasoning.)

Tree-of-Thought (ToT)

Tree-of-thought is a tree-structured variant of multi-step Chain-of-Thought. Other tree-based versions of CoT are also examined below. Note that the "tree" structure also arises in "CoT decoding algorithms," which are single-step CoT-like inference optimizations based on the inherent tree hierarchy of beam search decoding.
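
A minimal sketch of the breadth-limited search underlying tree-of-thought methods: in practice `propose` (generate candidate next thoughts) and `score` (evaluate a partial chain) would each wrap an LLM call, but here they are toy stand-ins so the control flow is visible:

```python
def tree_of_thought(question, propose, score, width=2, depth=2):
    """Breadth-limited search over partial reasoning chains: expand each
    frontier chain with `propose`, keep the top `width` by `score`."""
    frontier = [[question]]
    for _ in range(depth):
        candidates = [chain + [t] for chain in frontier for t in propose(chain)]
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return frontier[0]  # best chain of thoughts found

# Toy usage: "thoughts" are binary strings; the scorer prefers larger
# binary numbers, so the beam search should end at "111".
propose = lambda chain: [chain[-1] + "0", chain[-1] + "1"]
score = lambda chain: int(chain[-1], 2)
print(tree_of_thought("1", propose, score))  # ['1', '11', '111']
```

With `width=1` this degenerates to greedy chain-of-thought; larger widths trade extra "test time compute" for a better chance of escaping a locally poor reasoning step.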

Research papers on Tree-of-thought include:

Other Tree-Structured CoT Variants

Research papers on other tree-based CoT variants include:

  • Changcheng Li, Xiangyu Wang, Qiuju Chen, Xiren Zhou, Huanhuan Chen, 5 Dec 2024, MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM, https://arxiv.org/abs/2412.03987
  • Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
  • Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
  • Tiesunlong Shen, Jin Wang, Xuejie Zhang, Erik Cambria, Jan 2025, Reasoning with Trees: Faithful Question Answering over Knowledge Graph, Proceedings of the 31st International Conference on Computational Linguistics, pages 3138–3157 January 19–24, 2025, Association for Computational Linguistics, https://aclanthology.org/2025.coling-main.211.pdf
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, 2 Jan 2025, Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, https://arxiv.org/abs/2501.01306
  • Kun-Peng Ning, Jia-Yu Yao, Yu-Yang Liu, Mu-Nan Ning, Li Yuan, 13 Jan 2025, GPT as a Monte Carlo Language Tree: A Probabilistic Perspective, https://arxiv.org/abs/2501.07641
  • G Wang, S Zhang, T Zhan, Z Shen, J Li, X Hu, X Sun, Jan 2025, Unlocking the Mysteries of OpenAI o1: A Survey of the Reasoning Abilities of Large Language Models, https://openreview.net/pdf?id=J0ADLa2rNp
  • Yang Li, 4 Feb 2025, Policy Guided Tree Search for Enhanced LLM Reasoning, https://arxiv.org/abs/2502.06813
  • Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao, 27 Feb 2025 (v2), Dynamic Parallel Tree Search for Efficient LLM Reasoning, https://arxiv.org/abs/2502.16235

Graph Reasoning

Graph reasoning is the use of a graph structure, such as a Knowledge Graph, as part of the reasoning algorithm. There is also a variant of Chain-of-Thought called "Graph-of-Thought" (GoT; dragons, anyone?), which further generalizes the tree-based reasoning hierarchies.
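
As a structural sketch (not any specific GoT implementation): the key generalization over trees is that a thought may aggregate several parent thoughts, forming a DAG rather than a hierarchy. Thoughts can then be evaluated in topological order; `combine` would wrap an LLM call in practice, but here it is a placeholder:

```python
from graphlib import TopologicalSorter

def graph_of_thought(nodes, parents, combine):
    """Evaluate thoughts over a DAG. `parents` maps each node to the set
    of thoughts it depends on; `combine(node, parent_values)` produces
    the node's value once all its parents are resolved."""
    graph = {n: parents.get(n, set()) for n in nodes}
    value = {}
    for n in TopologicalSorter(graph).static_order():
        parent_values = [value[p] for p in graph[n]]
        value[n] = combine(n, parent_values)
    return value

# Toy usage: thought "c" merges the results of thoughts "a" and "b",
# which a strict tree structure could not express.
vals = graph_of_thought({"a", "b", "c"}, {"c": {"a", "b"}},
                        lambda n, ps: n + "".join(sorted(ps)))
print(vals["c"])  # cab
```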

Research papers on graph-based reasoning:

  • Cameron R. Wolfe, Jan 3, 2024, Graph-Based Prompting and Reasoning with Language Models. Understanding graph of thoughts prompting and several variants… https://towardsdatascience.com/graph-based-prompting-and-reasoning-with-language-models-d6acbcd6b3d8
  • Jiarui Ji, Runlin Lei, Jialing Bi, Zhewei Wei, Yankai Lin, Xuchen Pan, Yaliang Li, Bolin Ding, 13 Oct 2024, Dynamic and Textual Graph Generation Via Large-Scale LLM-based Agent Simulation, https://arxiv.org/abs/2410.09824
  • Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu, 7 Oct 2024, Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents, https://arxiv.org/abs/2410.05130
  • Sambhav Khurana, Xiner Li, Shurui Gui, Shuiwang Ji, 29 Oct 2024, A Hierarchical Language Model For Interpretable Graph Reasoning, https://arxiv.org/abs/2410.22372
  • Haoyu Han, Yaochen Xie, Hui Liu, Xianfeng Tang, Sreyashi Nag, William Headden, Hui Liu, Yang Li, Chen Luo, Shuiwang Ji, Qi He, Jiliang Tang, 14 Jan 2025, Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning, https://arxiv.org/abs/2501.07845
  • F. Alotaibi, A. Kulkarni and D. Zhou, "Graph of Logic: Enhancing LLM Reasoning with Graphs and Symbolic Logic," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 5926-5935, doi: 10.1109/BigData62323.2024.10825450. https://ieeexplore.ieee.org/abstract/document/10825450
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang, 12 Feb 2025, GCoT: Chain-of-Thought Prompt Learning for Graphs, https://arxiv.org/abs/2502.08092
  • Han Zhang, Langshi Zhou, Hanfang Yang, 20 Feb 2025, Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection, https://arxiv.org/abs/2502.14932
  • Anastasios Nentidis, Charilaos Akasiadis, Angelos Charalambidis, Alexander Artikis, 26 Feb 2025, Dealing with Inconsistency for Reasoning over Knowledge Graphs: A Survey, https://arxiv.org/abs/2502.19023
  • Avinash Patil, 5 Feb 2025, Advancing Reasoning in Large Language Models: Promising Methods and Approaches, https://arxiv.org/abs/2502.03671
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Wenjie Wu, Yongcheng Jing, Yingjie Wang, Wenbin Hu, Dacheng Tao, 3 Mar 2025, Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning, https://arxiv.org/abs/2503.01642

Skeleton-of-Thought

Skeleton-of-thought is a technique with dual aims of smarter reasoning and faster inference. The idea is to first generate an outline as a list of points, and then have the LLM expand each point in parallel. This yields both a more focused answer for each sub-point and faster inference overall, since the shorter expansions can be generated in parallel.
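The outline-then-expand control flow can be sketched as follows, assuming two hypothetical LLM calls: `outline_llm` produces the skeleton and `expand_llm` expands a single point. Both are stubs here so the structure is runnable; a real implementation would issue model API calls in their place.

```python
from concurrent.futures import ThreadPoolExecutor

def outline_llm(question):
    # Stub: a real call would ask the LLM for a short bullet outline.
    return ["definition", "example", "limitations"]

def expand_llm(question, point):
    # Stub: a real call would expand one outline point into a paragraph.
    return f"[{point} of {question}]"

def skeleton_of_thought(question):
    points = outline_llm(question)
    # Expand every point concurrently; each expansion is short, so total
    # latency is roughly one short generation rather than N long ones.
    with ThreadPoolExecutor() as pool:
        sections = list(pool.map(lambda p: expand_llm(question, p), points))
    return "\n".join(sections)

print(skeleton_of_thought("chain-of-thought"))
```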

Research on skeleton-of-thought reasoning includes:

Reflection

Reflection, or self-reflection, is a type of reasoning where the LLM takes an extra step to "reflect" on its own answers. This is a multi-step reasoning method in which the LLM is prompted to critique and improve its own output. There are variants of self-reflection aimed at improving training and others aimed at improving inference.
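The draft-critique-revise loop at the heart of inference-time self-reflection can be sketched as below. The `llm` function is a hypothetical stub standing in for a real model call; the prompts are illustrative only.

```python
def llm(prompt):
    # Stub: echoes a canned response; a real call would query a model.
    return f"response({len(prompt)} chars)"

def reflect(question, rounds=2):
    # Draft an initial answer, then alternate critique and revision.
    answer = llm(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        critique = llm(f"Critique this answer to '{question}': {answer}")
        answer = llm(f"Revise the answer to '{question}' "
                     f"using this critique: {critique}")
    return answer
```

Each round costs two extra inference calls, which is the usual trade-off of reflection: better answers for more "test time compute."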

Research papers on reflection:

LLM as Judge

LLM as Judge is the method of improving outputs by having one LLM "judge" the correctness of another LLM's output, either to evaluate it or to improve it. When the LLM judges its own output, this is known as "self-reflection." When an LLM judges a group of outputs from other LLMs for the same query, and chooses the best, this is called "Best-of-N."
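A judging step can be sketched as scoring each candidate answer and keeping the highest-scoring one. The `judge_llm` stub below is hypothetical: it scores by answer length purely so the code runs, whereas a real judge would be prompted with a rubric (e.g., "Rate this answer from 1-10 for correctness") and its numeric verdict parsed from the response.

```python
def judge_llm(question, answer):
    # Stub scoring: prefers longer answers. A real judge would be an
    # LLM call returning a quality score for (question, answer).
    return len(answer)

def pick_best(question, candidates):
    # Score every candidate and return the argmax.
    scores = [judge_llm(question, c) for c in candidates]
    return candidates[scores.index(max(scores))]

print(pick_best("What is 2+2?", ["4", "4, because 2+2=4", "five"]))
# → 4, because 2+2=4
```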

Research papers on LLM-as-Judge areas:

  • Cameron R. Wolfe, Ph.D., Dec 02, 2024, Finetuning LLM Judges for Evaluation: The Prometheus suite, JudgeLM, PandaLM, AutoJ, and more..., https://cameronrwolfe.substack.com/p/finetuned-judge
  • Tom Schaul, 25 Nov 2024, Boundless Socratic Learning with Language Games, https://arxiv.org/abs/2411.16905
  • Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber, 16 Oct 2024 (v2), Agent-as-a-Judge: Evaluate Agents with Agents, https://arxiv.org/abs/2410.10934
  • Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu, 10 Dec 2024 (v2), LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579 https://github.com/CSHaitao/Awesome-LLMs-as-Judges
  • Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
  • Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
  • Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain, 31 Dec 2024, MLLM-as-a-Judge for Image Safety without Human Labeling, https://arxiv.org/abs/2501.00192
  • Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu, 10 Jan 2025, Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models, https://arxiv.org/abs/2501.05662 (Optimize multimodal CoT by breaking down prompts into smaller sub-goals.)
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Yafu Li, Zhilin Wang, Tingchen Fu, Ganqu Cui, Sen Yang, Yu Cheng, 21 Jan 2025, From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, https://arxiv.org/abs/2501.11877 (Fine-tune an LLM to accept multiple candidate answers and output a final one.)
  • Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang, 30 Jan 2025, Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge, https://arxiv.org/abs/2501.18099
  • Yubo Wang, Xiang Yue, Wenhu Chen, 30 Jan 2025 (v2), Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate, https://arxiv.org/abs/2501.17703
  • Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler, 31 Jan 2025, Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment, https://arxiv.org/abs/2501.19309 (Using "LLM as Judge" methods to speed up speculative decoding via higher acceptance rates.)
  • Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen, 18 Feb 2025, Theorem Prover as a Judge for Synthetic Data Generation, https://arxiv.org/abs/2502.13137
  • Avinash Patil, 5 Feb 2025, Advancing Reasoning in Large Language Models: Promising Methods and Approaches, https://arxiv.org/abs/2502.03671

System 2

System 2 is the slower reasoning mode of the human brain, which multi-step reasoning algorithms try to emulate. This is the conscious brain and its capability for rational reasoning, usually in a slow and step-by-step fashion, which reasoning algorithms such as Chain-of-Thought aim to copy. By comparison, System 1 is the sensory processing and intuitive type of brain functions, including the "subconscious" brain, which is massively parallel and innate, but also lacking in rationality and explainability, much like a raw neural network.

Research papers on LLMs and System 2 thinking modes:

  • Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, Aman Chadha, 5 Feb 2024, A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, https://arxiv.org/abs/2402.07927
  • Akash Bajwa, Oct 07, 2024, Inference Time Scaling Laws: AI Megacycle Of System 1 And System 2 Applications, https://akashbajwa.substack.com/p/inference-time-scaling-laws
  • Latent Space, Nov 05, 2024, Inference, Fast and Slow. When System 1/System 2 analogies are not enough: The 6 types of LLM inference https://www.latent.space/p/inference-fast-and-slow
  • Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov, 24 Jul 2024 (v3), Distilling System 2 into System 1, https://arxiv.org/abs/2407.06023
  • DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng, 13 Oct 2024, Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces, https://arxiv.org/abs/2410.09918
  • Cheng Yang, Chufan Shi, Siheng Li, Bo Shui, Yujiu Yang, Wai Lam, 29 Dec 2024, LLM2: Let Large Language Models Harness System 2 Reasoning, https://arxiv.org/abs/2412.20372
  • Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, 2 Jan 2025, Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, https://arxiv.org/abs/2501.01306
  • Scott C. Lowe, 29 Oct 2024 (v2), System 2 Reasoning Capabilities Are Nigh, https://arxiv.org/abs/2410.03662
  • Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
  • Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
  • Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
  • Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  • Bilgehan Sel, Ruoxi Jia, Ming Jin, 23 Jan 2025, LLMs Can Plan Only If We Tell Them, https://arxiv.org/abs/2501.13545
  • Kounianhua Du, Hanjing Wang, Jianxing Liu, Jizheng Chen, Xinyi Dai, Yasheng Wang, Ruiming Tang, Yong Yu, Jun Wang, Weinan Zhang, 18 Feb 2025, Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation, https://arxiv.org/abs/2502.12492
  • Alireza S. Ziabari, Nona Ghazizadeh, Zhivar Sourati, Farzan Karimi-Malekabadi, Payam Piray, Morteza Dehghani, 18 Feb 2025, Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking, https://arxiv.org/abs/2502.12470
  • Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, Cheng-Lin Liu, 25 Feb 2025 (v2), From System 1 to System 2: A Survey of Reasoning Large Language Models, https://arxiv.org/abs/2502.17419
  • Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo, 17 Mar 2025, ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs, https://arxiv.org/abs/2503.12918

Best of N Reasoning

Best of N (BoN) is an LLM reasoning method where multiple answers are generated and the best one is chosen. The candidate answers can come from multiple samples of a single LLM, or from several different LLMs in an ensemble inference architecture. Usually, the final step is another LLM inference that performs "LLM as Judge" computations to choose the best answer, but non-LLM ranking algorithms can also be used.
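One non-LLM ranking option is a simple majority vote over the final answers (the idea behind self-consistency). The sketch below assumes a hypothetical `sample_llm` stub that stands in for drawing N samples at a nonzero temperature.

```python
from collections import Counter

def sample_llm(question, n):
    # Stub: a real implementation would draw n samples at temperature > 0.
    return ["42", "42", "41", "42", "40"]

def best_of_n(question, n=5):
    answers = sample_llm(question, n)
    # Majority vote: the most frequent final answer wins, with no
    # second LLM call needed for judging.
    return Counter(answers).most_common(1)[0][0]

print(best_of_n("What is 6*7?"))  # → 42
```

Majority voting only works when answers can be compared for equality (e.g., numeric results); free-form text answers generally need an LLM judge instead.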

Research papers on Best-of-N reasoning:

  • Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. Liu, 22 Oct 2024 (v2), A Comparative Study on Reasoning Patterns of OpenAI's o1 Model, https://arxiv.org/abs/2410.13639
  • Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette, 26 Oct 2024, Fast Best-of-N Decoding via Speculative Rejection, https://arxiv.org/abs/2410.20290
  • Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
  • Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust, 18 Dec 2024, Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models, https://arxiv.org/abs/2412.15287
  • Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen, 17 Jan 2025, Evolving Deeper LLM Thinking, https://arxiv.org/abs/2501.09891 (An alternative search strategy broad/deep, compared to CoT and reflection.)
  • Edward Beeching, Lewis Tunstall, Sasha Rush Dec 16, 2024, Scaling Test Time Compute with Open Source Models, https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
  • Yafu Li, Zhilin Wang, Tingchen Fu, Ganqu Cui, Sen Yang, Yu Cheng, 21 Jan 2025, From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, https://arxiv.org/abs/2501.11877 (Fine-tune an LLM to accept multiple candidate answers and output a final one.)
  • Weihua Du, Yiming Yang, Sean Welleck, 7 Feb 2025, Optimizing Temperature for Language Models with Multi-Sample Inference, https://arxiv.org/abs/2502.05234 https://github.com/StigLidu/TURN
  • Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini, 27 Feb 2025, Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing, https://arxiv.org/abs/2502.20592 (Test time computed applied to the multi-document summarization use case.)
  • Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
  • Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, Jiaxin Huang, 25 Feb 2025, Efficient Test-Time Scaling via Self-Calibration, https://arxiv.org/abs/2503.00031
  • Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang, 3 Mar 2025, Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding, https://arxiv.org/abs/2503.01422
  • Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li, 7 Mar 2025, Speculative Decoding for Multi-Sample Inference, https://arxiv.org/abs/2503.05330 (Optimizing speculative decoding when generating multiple answers for a single query, such as for Best-of-N reasoning.)
  • Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi, 20 Feb 2025 (v2), Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification https://arxiv.org/abs/2502.01839 (Wrapping a single model with a Best-of-N approach that self-selects the best answer can significantly improve reasoning rates.)

Program Synthesis

Program synthesis is the reasoning method whereby the LLM can synthesize program code that is then executed to solve a problem. Using a Python interpreter with an LLM is common, but any language can potentially be used, including more abstract mathematical symbolic languages. The virtually unlimited flexibility of programming languages, when combined with LLM pattern-matching power to create code, offers a fertile area for reasoning advancement.
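The basic loop is: ask the LLM for code, run the code, return its result. The sketch below uses a hypothetical `codegen_llm` stub that returns fixed code so the flow is runnable; real systems must sandbox the execution step, since running model-generated code with `exec` is unsafe.

```python
def codegen_llm(task):
    # Stub: a real call would ask the model to "write Python code that ..."
    return "def solve():\n    return sum(i * i for i in range(1, 11))"

def solve_with_code(task):
    code = codegen_llm(task)
    namespace = {}
    exec(code, namespace)   # run the generated program (trusted stub here)
    return namespace["solve"]()

print(solve_with_code("sum of squares of 1..10"))  # → 385
```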

Research papers related to program synthesis and similar symbolic reasoning approaches:

Reasoning Decoding Algorithms

Reasoning decoding algorithms, or Chain-of-Thought decoding algorithms, are methods that perform reasoning within the decoding phase of a single LLM inference, rather than across multiple inference steps. The idea is that alternative decoding pathways, ranked by their logits, can resemble Chain-of-Thought reasoning steps, and these pathways can be explored and combined during inference. This yields an algorithm that reasons better than simpler decoding algorithms, but is more efficient than Chain-of-Thought because it examines multiple pathways within a single inference step.
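A toy version of the branching idea (loosely following the CoT-decoding paper by Wang and Zhou, cited below) is to branch on the top-k first tokens, decode each branch greedily, and keep the path whose continuation has the highest average probability margin between its top two tokens, a proxy for answer confidence. The `next_token_probs` function here is a hand-built mock table, not a real model.

```python
def next_token_probs(prefix):
    # Mock model: maps a prefix tuple to a next-token distribution.
    table = {
        (): {"A": 0.5, "B": 0.3, "C": 0.2},
        ("A",): {"x": 0.55, "y": 0.45},
        ("B",): {"x": 0.95, "y": 0.05},
        ("C",): {"x": 0.6, "y": 0.4},
    }
    return table.get(prefix, {"<eos>": 1.0})

def cot_decode(k=3, steps=1):
    # Branch on the top-k first tokens instead of only the greedy one.
    best = (float("-inf"), None)
    first = sorted(next_token_probs(()).items(), key=lambda kv: -kv[1])[:k]
    for tok, _ in first:
        path, margins = (tok,), []
        for _ in range(steps):
            probs = sorted(next_token_probs(path).values(), reverse=True)
            # Confidence proxy: gap between the top two token probabilities.
            margins.append(probs[0] - (probs[1] if len(probs) > 1 else 0.0))
            top = max(next_token_probs(path), key=next_token_probs(path).get)
            path = path + (top,)
        score = sum(margins) / len(margins)
        if score > best[0]:
            best = (score, path)
    return best[1]

print(cot_decode())  # → ('B', 'x')
```

Note that pure greedy decoding would commit to "A" as the first token, whereas the confidence-ranked branch starting with "B" wins here, which is the effect CoT-decoding exploits.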

Research papers on reasoning-decoding or CoT-decoding:

  • Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
  • xjdr-alt, Dec 2024, entropix: Entropy Based Sampling and Parallel CoT Decoding, https://github.com/xjdr-alt/entropix (Parallel decoding attempts to get something similar to CoT.)
  • Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, 4 Jun 2024 (v2), Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263 (Use of Jacobi parallel decoding with Chain-of-Thought.)
  • Renato Vukovic, David Arps, Carel van Niekerk, Benjamin Matthias Ruppik, Hsien-Chin Lin, Michael Heck, Milica Gašić, 5 Aug 2024, Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding, https://arxiv.org/abs/2408.02361
  • Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber, 2 Nov 2023, Implicit Chain of Thought Reasoning via Knowledge Distillation, https://arxiv.org/abs/2311.01460 (Knowledge distillation applied to optimizing the interim computations in Chain-of-Thought.)
  • Yuntian Deng, Yejin Choi, Stuart Shieber, 23 May 2024, From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step, https://arxiv.org/abs/2405.14838
  • Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov, 24 Jul 2024 (v3), Distilling System 2 into System 1, https://arxiv.org/abs/2407.06023
  • Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
  • Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam, 16 Nov 2023 (v2), Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs, EMNLP 2023, https://arxiv.org/abs/2305.11860 https://www.sample-step-by-step.info/
  • Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian, 9 Dec 2024, Training Large Language Models to Reason in a Continuous Latent Space, https://arxiv.org/abs/2412.06769 (Performing reasoning in a model trained to operate in the embedding vector space, rather than more directly in the token space.)
  • Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 23 Dec 2024, Deliberation in Latent Space via Differentiable Cache Augmentation, https://arxiv.org/abs/2412.17747 (Augmenting the KV cache with reasoning information so that decoding will mimic multi-step reasoning with fewer tokens required for intermediate steps.)
  • Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan, 21 Apr 2024 (v3), Think before you speak: Training Language Models With Pause Tokens, https://arxiv.org/abs/2310.02226 (Inserting extra "pause tokens" that trigger the LLM to perform extra reasoning during the decoding phase.)
  • Yuval Shalev, Amir Feder, Ariel Goldstein, 19 Jun 2024, Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning, https://arxiv.org/abs/2406.13858 (Using embeddings from intermediate model layers in decoding to mimic reasoning pathways.)
  • Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson, 14 Oct 2024 (v2), Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries, https://arxiv.org/abs/2406.12775 (Backpatching prior layers using embeddings from the current activations to mimic multi-step reasoning.)
  • Jacob Pfau, William Merrill, Samuel R. Bowman, 24 Apr 2024, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, https://arxiv.org/abs/2404.15758 (Use of dummy "filler tokens" similar to "pause tokens" or "reasoning tokens" to aid multi-step reasoning in decoding.)
  • Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman, 18 Mar 2024 (v2), Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking, https://arxiv.org/abs/2403.09629 (Introduces answers between a start-of-thought and end-of-thought meta-token for reasoning.)
  • Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
  • Phuc Phan, Hieu Tran, Long Phan, 23 Aug 2024 (v2), Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation, https://arxiv.org/abs/2402.14874
  • Maxime Peyrard, Martin Josifoski, Robert West, 21 Mar 2024, The Era of Semantic Decoding, https://arxiv.org/abs/2403.14562
  • Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li, 17 Jan 2025 (v2), Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities, https://arxiv.org/abs/2501.09686
  • Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement
  • Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein, 7 Feb 2025, Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, https://arxiv.org/abs/2502.05171

Planning (as part of Reasoning)

Knowing how to make a plan is part of intelligence, and planning is a key component of multi-step LLM reasoning. Here are some papers specifically on the "planning" aspect of reasoning:

LLM Long Term Memory

LLM Long Term Memory refers to having the LLM "remember" things that it has learned during inference. By default, an LLM is "stateless" and does not recall facts between queries. Short-term memory can be provided by tracking conversational history as "context" for a query, but long-term memory aims to have the LLM "learn" or "memorize" new facts persistently. Note that this research area concerns the accuracy of outputs, not the memory-usage optimizations that speed up LLM inference.
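A minimal way to bolt long-term memory onto a stateless LLM is a store of facts that persists between queries, with retrieval injecting relevant facts into the prompt. The sketch below is a hypothetical illustration using simple word-overlap retrieval; real systems typically use vector embeddings, and the final LLM call is omitted (the function returns the augmented prompt instead).

```python
class MemoryStore:
    """Persistent fact store shared across otherwise stateless queries."""

    def __init__(self):
        self.facts = []

    def remember(self, fact):
        self.facts.append(fact)

    def retrieve(self, query, top_k=2):
        # Rank stored facts by word overlap with the query (a crude
        # stand-in for embedding-based similarity search).
        words = set(query.lower().split())
        ranked = sorted(self.facts,
                        key=lambda f: len(words & set(f.lower().split())),
                        reverse=True)
        return ranked[:top_k]

def ask(memory, question):
    context = "\n".join(memory.retrieve(question))
    prompt = f"Context:\n{context}\nQuestion: {question}"
    return prompt  # a real system would pass this prompt to the LLM

memory = MemoryStore()
memory.remember("The user's favorite color is green.")
memory.remember("The project deadline is Friday.")
print(ask(memory, "What is the user's favorite color?"))
```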

Research on LLM long term memory:

Agentic Workflow

Agentic workflow has some aspects of reasoning (e.g., planning, multi-step execution) combined with agent technologies. Papers on agentic workflow include:

Temporal Reasoning (Time-Based Logic)

AI models struggle with the concept of time and with any sort of "temporal reasoning" based on time progression or causation over time.

AGI Research

General research on achieving Artificial General Intelligence (AGI):

General Research on Reasoning Techniques

See the long list of AI reasoning research papers.

Reasoning and CoT Efficiency Topics

Blog articles on reasoning efficiency:

More research information on general efficiency optimization techniques for reasoning models:

Efficiency optimizations to Chain-of-Thought include:

More AI Research

Read more about: