This paper introduces Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines long and short CoT datasets to fine-tune non-reasoning LLMs, achieving a 2.3% average accuracy improvement and 47.61% response length reduction on reasoning benchmarks.
Supervised Learning, Fine-tuning, Large Language Model, Reasoning, Efficiency
Bin Yu, Hang Yuan, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen
Harbin Institute of Technology, East China Normal University, Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence
Generated by grok-3
Background Problem
The paper addresses the challenge of transferring reasoning capabilities from large reasoning models (LRMs) to non-reasoning large language models (LLMs) via supervised fine-tuning (SFT) using distilled Chain-of-Thought (CoT) data. A key issue is the ‘overthinking’ problem inherited from teacher models, where student models generate verbose and redundant reasoning chains, leading to inefficiency during inference. The research aims to solve this by enabling non-reasoning models to perform efficient reasoning without sacrificing accuracy, specifically by avoiding the overthinking issue during the distillation stage.
Method
The proposed method, Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), focuses on eliciting efficient reasoning in non-reasoning LLMs. Its core idea is to mix long CoT reasoning trajectories (distilled from LRMs) with their shorter, structure-preserved rewritten versions to train models that balance comprehensive and concise reasoning. The implementation involves three stages:
- Structure-preserved CoT Rewriting: A large language model (Qwen2.5-72B-Instruct) rewrites long CoT trajectories into shorter versions while maintaining logical structure and critical steps using constrained prompts, creating a short CoT dataset.
- Mixture Supervised Fine-Tuning: The long and short CoT datasets are randomly combined into a mixed dataset (s1K-mix), which is used to fine-tune a non-reasoning LLM (Qwen2.5-32B-Instruct) with special delimiter tokens separating the reasoning and answer parts, optimizing for both detailed and brief reasoning patterns.
- Inference-time Balanced Thinking: During inference, the model uses a balanced thinking prompt to generate reasoning chains that balance effectiveness and efficiency, avoiding the extremes of overthinking and oversimplification. (Illustrative sketches of the stages follow this list.)
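To make stage 1 concrete, here is a minimal sketch of the structure-preserved rewriting step. It assumes an OpenAI-compatible endpoint serving Qwen2.5-72B-Instruct; the endpoint URL and prompt wording are illustrative stand-ins, not the paper's exact constrained prompt.

```python
# Sketch of stage 1: structure-preserved CoT rewriting.
# Assumes a local OpenAI-compatible server (URL is hypothetical) hosting
# Qwen2.5-72B-Instruct; the prompt below is illustrative, not the paper's.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

REWRITE_PROMPT = (
    "Rewrite the following reasoning trace to be as short as possible while "
    "preserving its logical structure and every critical step. Do not skip "
    "steps or change the final answer.\n\nReasoning:\n{cot}"
)

def rewrite_cot(long_cot: str) -> str:
    """Return a shorter, structure-preserved version of a long CoT trace."""
    resp = client.chat.completions.create(
        model="Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(cot=long_cot)}],
        temperature=0.0,  # deterministic rewriting for reproducibility
    )
    return resp.choices[0].message.content
```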
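Stages 2 and 3 can be sketched similarly. The `<think>`/`<answer>` delimiters and the balanced-thinking instruction below are placeholders: the paper uses special tokens to separate reasoning from answers, but the exact tokens and prompt wording are not reproduced here.

```python
# Sketch of stages 2-3: building the mixed SFT corpus (s1K-mix) and the
# inference-time balanced-thinking prompt. <think>/<answer> are placeholder
# delimiters, not the paper's actual special tokens.
import random

def format_example(question: str, cot: str, answer: str) -> str:
    # Wrap reasoning and answer in delimiters so the model learns to
    # separate the two parts during supervised fine-tuning.
    return f"{question}\n<think>{cot}</think>\n<answer>{answer}</answer>"

def build_mixture(long_data: list[dict], short_data: list[dict], seed: int = 0) -> list[str]:
    """Randomly interleave long and short CoT examples into one SFT corpus."""
    mixed = [format_example(d["question"], d["cot"], d["answer"])
             for d in long_data + short_data]
    random.Random(seed).shuffle(mixed)
    return mixed

# At inference, a balanced-thinking instruction (wording illustrative) steers
# the fine-tuned model away from both overthinking and oversimplification.
BALANCED_THINKING_PROMPT = (
    "Think through the problem step by step, keeping the reasoning as "
    "concise as the problem allows, then give the final answer."
)
```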
Experiment
The experiments were conducted using the s1K-mix dataset, derived from the s1K-1.1 dataset (1,000 long CoT instances from DeepSeek R1), with short CoT versions created via structure-preserved rewriting (984 instances after filtering out erroneous rewrites). The setup involved fine-tuning Qwen2.5-32B-Instruct into s1-mix-32B, compared against baselines such as s1.1-32B, DeepSeek R1, and the OpenAI o1 series on three reasoning benchmarks: MATH500 (math problems), AIME24 (high school math competition), and GPQA Diamond (PhD-level science questions). The evaluation measured both accuracy and response length, the latter as a proxy for inference efficiency. Results showed s1-mix-32B outperforming s1.1-32B with accuracy improvements of 2.2% (MATH500), 6.7% (AIME24), and 2.0% (GPQA Diamond), a 2.3% average gain, while reducing response length by 47.61%. Ablation studies confirmed the efficacy of structure-preserved rewriting over direct compression and of the mixture strategy over long-only or short-only datasets, with the balanced thinking mode proving optimal at inference. However, the benchmark selection is narrow, focusing on math and science, so the gains may not generalize to other reasoning tasks. The results match the expectation of improved efficiency and accuracy, but the paper lacks qualitative analysis of rewritten CoTs or failure cases.
Further Thoughts
The LS-Mixture SFT approach opens up several avenues for deeper exploration. One intriguing aspect is the potential integration with dynamic mixture ratios: could the proportion of long to short CoT data be adjusted during training based on real-time model performance or task complexity, perhaps using reinforcement learning signals? This could address potential overfitting to either verbose or overly concise reasoning styles. Additionally, the reliance on a specific rewriter model (Qwen2.5-72B-Instruct) raises questions about robustness: how would the method perform with weaker or domain-specific rewriters, and could this introduce biases in the short CoT data? Comparing this approach to other efficiency-focused methods, such as token-budget allocation or reinforcement learning-based compression (as mentioned in related work), might reveal complementary strategies. Furthermore, extending the evaluation to diverse reasoning domains beyond math and science, such as commonsense reasoning or ethical dilemmas, could validate the generalizability of the method. Finally, connecting this to broader AI safety and alignment research, concise reasoning could reduce interpretability challenges in LLMs, but might also risk losing critical intermediate steps needed for transparency: how can this trade-off be managed?