arXiv: 2409.12183

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning


This paper demonstrates, through a meta-analysis and new experiments, that Chain-of-Thought (CoT) prompting substantially improves large language model performance on math and symbolic reasoning tasks, but offers little benefit on non-symbolic tasks and falls short of tool-augmented approaches on symbolic ones.

Large Language Model, Reasoning, Prompt Engineering, Symbolic Reasoning, Mathematical Reasoning

Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett

The University of Texas at Austin, Johns Hopkins University, Princeton University

Generated by grok-3

Background Problem

Chain-of-Thought (CoT) prompting has become a popular technique for eliciting reasoning from large language models (LLMs), generating intermediate reasoning steps to improve performance on complex tasks. However, its effectiveness across diverse task types remains unclear, with much of the existing research focused on mathematical reasoning. This paper asks where CoT is actually beneficial, challenging the prevailing assumption that it universally enhances reasoning across all problem domains. Its key contribution is identifying the task categories, primarily math and symbolic reasoning, where CoT provides substantial performance gains, while showing its limited impact on non-symbolic tasks such as commonsense reasoning.

Method

The authors employ a two-pronged approach to evaluate CoT’s effectiveness. First, they conduct a meta-analysis of over 100 papers from major ML and NLP conferences (ICLR, EACL, NAACL 2024), analyzing 1,218 experimental comparisons of CoT versus direct answering (DA) across various tasks. Second, they perform their own experiments on 20 datasets spanning symbolic, mathematical, and non-symbolic reasoning categories, testing 14 contemporary LLMs in zero-shot and few-shot settings. They categorize tasks into symbolic (math, logic, algorithmic) and non-symbolic (commonsense, knowledge) domains, using prompts like ‘think step by step’ for CoT and ‘immediately generate the answer’ for DA. Additionally, for symbolic tasks, they separate planning (generating a formal solution plan) and execution (solving the plan) stages, comparing CoT against tool-augmented approaches (e.g., Python interpreters for math, SMT solvers for logic) to pinpoint where CoT’s benefits lie. The methodology focuses on performance deltas (CoT minus DA accuracy) and uses statistical tests like paired bootstrapping with Bonferroni correction to assess significance.
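The paper's own evaluation code is not reproduced here, but the core comparison is simple to sketch: for each model-dataset pair, score per-question correctness under the CoT and DA prompts, take the accuracy delta, and test it with a paired bootstrap, applying a Bonferroni threshold across the many comparisons. The following is a minimal illustrative sketch; function names, counts, and the example data are assumptions, not the authors' code.

```python
import numpy as np

def paired_bootstrap_delta(cot_correct, da_correct, n_boot=10_000, seed=0):
    """Estimate the CoT-minus-DA accuracy delta and a one-sided bootstrap
    p-value, resampling questions with replacement (paired, since both
    prompts are scored on the same questions)."""
    rng = np.random.default_rng(seed)
    cot = np.asarray(cot_correct, dtype=float)
    da = np.asarray(da_correct, dtype=float)
    n = len(cot)
    observed_delta = cot.mean() - da.mean()
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample question indices
        deltas[b] = cot[idx].mean() - da[idx].mean()
    # one-sided p-value: fraction of resampled deltas at or below zero
    p_value = float(np.mean(deltas <= 0.0))
    return observed_delta, p_value

if __name__ == "__main__":
    # Toy 0/1 correctness vectors for one model on one dataset.
    cot_correct = [1, 1, 0, 1, 1, 0, 1, 1]
    da_correct = [1, 0, 0, 1, 0, 0, 1, 1]
    delta, p = paired_bootstrap_delta(cot_correct, da_correct)
    n_comparisons = 280  # hypothetical: 14 models x 20 datasets
    print(f"delta={delta:.3f}, raw p={p:.4f}, "
          f"significant after Bonferroni: {p < 0.05 / n_comparisons}")
```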

Experiment

The experiments used 20 datasets across categories such as math (GSM8K, MATH), logical reasoning (ContextHub, FOLIO), and non-symbolic tasks (CommonsenseQA, PIQA), evaluated on 14 LLMs including Llama 3.1, GPT-4o, and Claude-3.5. The setup was comprehensive, testing both zero-shot and few-shot prompting, with answer extraction tailored per model-dataset pair to minimize unparseable outputs. Results showed large CoT gains on math and symbolic reasoning tasks (e.g., up to 66.9% on GSM8K and 41.6% on MATH), and the meta-analysis confirmed average improvements of 14.2% for symbolic and 12.3% for math tasks. Gains were negligible or negative on non-symbolic tasks (e.g., commonsense, knowledge): only 32% of improvements in these categories were statistically significant, and those were mostly on math-related slices of MMLU/MMLU Pro, with up to 97.6% of the gain attributable to questions containing ‘=’. Comparing CoT to tool-augmented methods on symbolic tasks showed that while CoT improves execution over DA (e.g., 86.4% vs. 20.1% on GSM8K for Llama 3.1 8B), it still falls short of symbolic solvers (e.g., 80.3% with CoT vs. 94.4% with tools on GSM8K for Llama 3.1 70B). The experimental design was reasonable, covering a wide range of models and tasks, though the high unparseable rate in tool-augmented settings for smaller models (up to 46.8%) suggests possible implementation issues. Overall, the results match the expectation that CoT excels in symbolic domains while highlighting its limitations elsewhere.
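To make the CoT-versus-tools comparison concrete, a plan-and-execute baseline can be sketched as follows: the model writes a small Python program (the plan) and a Python interpreter executes it, so the LLM never performs the arithmetic itself. This is only an illustrative sketch; `call_llm` is a hypothetical wrapper around whatever chat API is in use, and the prompt wording is not taken from the paper.

```python
def solve_with_tool(question: str, call_llm) -> str:
    """Separate planning from execution: the LLM writes a short Python
    program (the plan), and the Python interpreter executes it, in the
    spirit of the paper's tool-augmented baseline for math tasks."""
    plan_prompt = (
        "Write a short Python program that computes the answer to the "
        "following problem and stores it in a variable named `answer`.\n\n"
        f"Problem: {question}\n\nPython program:"
    )
    program = call_llm(plan_prompt)  # hypothetical API wrapper
    namespace: dict = {}
    try:
        exec(program, namespace)  # execution is done by the interpreter, not the LLM
        return str(namespace.get("answer", "UNPARSEABLE"))
    except Exception:
        return "UNPARSEABLE"  # tracked separately, as the unparseable-rate analysis suggests
```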

Further Thoughts

The findings of this paper open up intriguing avenues for future research, particularly in exploring why CoT fails to deliver consistent benefits in non-symbolic domains. Could integrating CoT with multi-agent frameworks, where different agents handle distinct reasoning aspects (e.g., one for planning, another for critique), enhance performance on tasks like commonsense reasoning, as suggested by works on multi-agent debate (Du et al., 2023)? Additionally, the significant performance gap between CoT and tool-augmented approaches prompts a deeper investigation into hybrid systems—could fine-tuning LLMs to better interface with external solvers bridge this gap, especially for real-world applications in AI for Science or Finance where precision is critical? Another thought is the potential impact of data contamination, which the authors acknowledge but do not quantify; cross-referencing their results with studies on memorization in LLMs (e.g., Zhang et al., 2024) could provide a clearer picture of CoT’s true generalization ability. Finally, the heuristic of using ‘=’ as a marker for symbolic reasoning, while effective, might miss other forms of structured reasoning—future work could explore more nuanced linguistic or structural features to predict CoT efficacy, potentially drawing from interpretability research to understand internal model deliberations.
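As a concrete illustration, that slicing heuristic is essentially a substring test per question, which could be extended with additional structural markers. The sketch below is illustrative only: the ‘=’ check mirrors the paper's heuristic for MMLU/MMLU Pro, while the extra markers and the data layout (parallel question and 0/1 correctness lists) are assumptions, not something the paper evaluates.

```python
# Hypothetical extension of the '=' heuristic with a few extra markers.
SYMBOLIC_MARKERS = ("=", "\\frac", "->")

def is_symbolic(question: str, markers=SYMBOLIC_MARKERS) -> bool:
    """Return True if the question contains any structural/symbolic marker."""
    return any(m in question for m in markers)

def cot_gain_by_slice(questions, cot_correct, da_correct):
    """Compare the CoT-minus-DA accuracy gain on symbolic vs. non-symbolic
    slices, given a list of question strings and aligned 0/1 correctness lists."""
    gains = {}
    for name, keep in (("symbolic", True), ("non_symbolic", False)):
        idx = [i for i, q in enumerate(questions) if is_symbolic(q) == keep]
        if not idx:
            gains[name] = None
            continue
        cot_acc = sum(cot_correct[i] for i in idx) / len(idx)
        da_acc = sum(da_correct[i] for i in idx) / len(idx)
        gains[name] = cot_acc - da_acc
    return gains
```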


