This paper introduces Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs that uses a universal dense process reward to optimize reasoning effectiveness and efficiency, achieving significant accuracy and token efficiency gains on math reasoning benchmarks.
Large Language Model, Reinforcement Learning, Reasoning, Efficiency, Fine-tuning
Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong
University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences; Hong Kong University of Science and Technology
Generated by grok-3
Background Problem
Large Language Models (LLMs) have advanced in reasoning capability, enabling them to tackle complex tasks like coding, device control, and personal assistance. However, current methods often fail to balance reasoning effectiveness with computational efficiency, producing unnecessarily long reasoning chains that waste tokens and sometimes degrade accuracy. The key problem addressed in this paper is the inefficiency of existing reinforcement learning (RL) approaches that rely on sparse outcome rewards: because they provide no feedback on intermediate steps, they encourage redundant computation. The authors aim to solve this by proposing a framework that adaptively adjusts reasoning depth, preserving accuracy while minimizing token usage.
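To make the stated problem concrete, here is a toy sketch (not from the paper; the episode strings and reward values are purely illustrative) of why a sparse outcome reward cannot discourage redundant reasoning: a concise correct chain and a padded correct chain receive exactly the same training signal.

```python
from typing import List

def sparse_outcome_rewards(episodes: List[str], answer_correct: bool) -> List[float]:
    """Outcome-reward RL: only the final step carries reward; every
    intermediate episode gets zero feedback."""
    rewards = [0.0] * len(episodes)
    rewards[-1] = 1.0 if answer_correct else 0.0
    return rewards

short_chain = ["factor the quadratic", "so x = 3"]
long_chain = ["restate the problem", "try x = 1", "try x = 2",
              "factor the quadratic", "so x = 3"]
# Both trajectories earn the same total reward, so nothing penalizes
# the three wasted episodes in the longer chain.
assert sum(sparse_outcome_rewards(short_chain, True)) == \
       sum(sparse_outcome_rewards(long_chain, True))
```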
Method
The proposed method, Learning to Think (L2T), is an information-theoretic reinforcement fine-tuning framework for LLMs. Its core idea is to treat each query-response interaction as a hierarchical session of multiple episodes and to introduce a universal dense process reward that quantifies the episode-wise information gain in the model parameters, thereby optimizing reasoning efficiency. The implementation involves three stages: (1) problem reformulation, where each question-answer pair is segmented into successive reasoning episodes delimited by dedicated markers; (2) dense reward design, where each episode is scored by its information gain on the model parameters minus a compression penalty on redundant tokens, computed with approximations to keep the reward tractable; and (3) reinforcement fine-tuning, where the policy is updated to maximize the cumulative dense reward so that the model reaches correct answers with adaptive, minimal reasoning depth.
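The following minimal sketch mirrors the three stages as summarized here. The episode delimiter, the scalar "fitting gain" per episode, the token-count compression penalty with weight `alpha`, and the REINFORCE-style surrogate loss are all simplifying assumptions for illustration; the paper's actual reward relies on an approximated parameter-information estimate not reproduced in this summary.

```python
from typing import List
import torch

def split_into_episodes(response: str, marker: str = "\n\n") -> List[str]:
    """Stage 1: treat one query-response pair as a session of episodes;
    the delimiter here is a stand-in for the paper's episode markers."""
    return [seg for seg in response.split(marker) if seg.strip()]

def dense_process_rewards(fit_gains: List[float],
                          episode_token_counts: List[int],
                          alpha: float = 1e-3) -> torch.Tensor:
    """Stage 2: per-episode reward = fitting information gain minus a
    compression penalty proportional to the tokens the episode spends."""
    gains = torch.tensor(fit_gains)
    costs = alpha * torch.tensor(episode_token_counts, dtype=torch.float32)
    return gains - costs

def policy_gradient_loss(episode_logprobs: torch.Tensor,
                         rewards: torch.Tensor) -> torch.Tensor:
    """Stage 3: reinforcement fine-tuning with the dense reward; a plain
    REINFORCE surrogate with a mean baseline stands in for the paper's
    RL objective."""
    advantages = rewards - rewards.mean()
    return -(episode_logprobs * advantages.detach()).mean()
```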
Experiment
The experiments were conducted on multiple reasoning benchmarks, primarily math-focused (e.g., AIME, AMC, MATH500, MinervaMATH, Omni-MATH), with additional limited testing on code generation. Two base models, DeepScaleR-1.5B-Preview and DeepSeek-R1-Distill-Qwen-1.5B, were fine-tuned and evaluated on A100 GPU clusters with a maximum token budget of 16,384. The setup compares L2T against outcome-reward RL methods (e.g., GRPO), length penalty approaches, and process-reward baselines (e.g., ReST-MCTS, MRT). Results show L2T achieving state-of-the-art performance, with over 10% accuracy improvement and doubled token efficiency compared to outcome-reward methods, and about 5% accuracy gain with 20% token reduction compared to process-reward baselines. The design of experiments seems reasonable for math reasoning tasks, with varied difficulty tiers to test adaptive reasoning depth. However, the focus on math benchmarks limits insights into broader applicability, and while efficiency gains are evident, the computational cost of calculating dense rewards (even with approximations) isn’t fully detailed for larger models. The results match the expectation of improved efficiency and effectiveness, but potential overfitting to specific task types or lack of failure case analysis raises concerns about robustness.
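As a point of reference for the reported setup, a hypothetical evaluation loop along these lines would measure accuracy and mean generated tokens under the 16,384-token budget. The generation calls follow the Hugging Face transformers API; `is_correct` is a placeholder for task-specific answer checking, not the paper's grader.

```python
from typing import Dict, List
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 16_384  # token budget used in the paper's evaluation

def is_correct(generation: str, answer: str) -> bool:
    # Placeholder check; real math benchmarks need exact-match or
    # symbolic-equivalence logic.
    return answer.strip() in generation

def evaluate(model_name: str, problems: List[Dict[str, str]]) -> Dict[str, float]:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto")
    n_correct, total_tokens = 0, 0
    for p in problems:  # each item: {"question": str, "answer": str}
        inputs = tok(p["question"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS,
                             do_sample=False)
        gen = out[0, inputs["input_ids"].shape[1]:]
        total_tokens += gen.shape[0]
        n_correct += int(is_correct(tok.decode(gen, skip_special_tokens=True),
                                    p["answer"]))
    return {"accuracy": n_correct / len(problems),
            "mean_tokens": total_tokens / len(problems)}
```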
Further Thoughts
The L2T framework’s use of information-theoretic rewards is a compelling direction for addressing efficiency in LLM reasoning, but it prompts deeper questions about its scalability and generalizability. For instance, while the paper focuses on math reasoning, how would this dense reward mechanism perform in domains like legal or medical reasoning, where intermediate steps might not be as linearly progressive or easily quantifiable via parameter information gain? There’s potential for cross-disciplinary inspiration here—could concepts from cognitive science, such as human decision-making under resource constraints, inform more adaptive reward designs? Additionally, the reliance on internal signals might risk reinforcing existing biases in the model’s parameters, a concern not addressed in the paper. Comparing this to recent works on parameter-efficient fine-tuning (e.g., LoRA), one wonders if integrating L2T with such methods could further reduce computational overhead, especially for larger models where even approximated reward calculations might become prohibitive. Finally, exploring failure modes—such as scenarios where the compression penalty overly restricts necessary reasoning depth—could provide valuable insights for refining the balance between effectiveness and efficiency, ensuring the framework’s robustness across diverse real-world applications.
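On the LoRA point specifically, a hedged sketch of what such an integration might look like, using the `peft` library: the base policy is wrapped with low-rank adapters before any L2T-style reinforcement fine-tuning, so only the adapter weights are updated. The rank, scaling, and target modules below are illustrative defaults, not settings evaluated in the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# One of the two base models used in the paper's experiments.
base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank bottleneck dimension
    lora_alpha=32,                        # scaling of the adapter update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base, lora_cfg)
policy.print_trainable_parameters()  # adapters are a small fraction of 1.5B params
# `policy` could then replace the full-parameter model in the RL fine-tuning
# loop, shrinking the optimizer state and, potentially, the cost of any
# per-parameter reward approximations.
```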