arXiv: 2505.08364

Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation

Published: 11:01 AM

This paper introduces Adaptive Difficulty Curriculum Learning (ADCL) and Expert-Guided Self-Reformulation (EGSR) to enhance LLM reasoning by dynamically adjusting training curricula and guiding models to reformulate expert solutions, achieving significant performance improvements over standard RL baselines on mathematical reasoning benchmarks.

Reinforcement Learning, Large Language Model, Reasoning, Curriculum Learning, Knowledge Assimilation

Enci Zhang, Xingang Yan, Wei Lin, Tianxiang Zhang, Qianchun Lu

Wired Product Operation Division, ZTE Corporation, Nanjing, China; School of Electronic and Computer Engineering, Peking University, Shenzhen, China

Generated by grok-3

Background Problem

Large Language Models (LLMs) have made significant strides in complex reasoning tasks, such as mathematical problem-solving, through paradigms like Zero-RL, which uses reinforcement learning (RL) to enhance innate reasoning capabilities. However, two key challenges remain: (1) static curriculum learning (CL) fails to adapt to the model's shifting perception of problem difficulty during training (termed Difficulty Shift), leading to suboptimal learning trajectories, and (2) on-policy RL methods confine models to their pre-existing knowledge, preventing the assimilation of novel reasoning abilities beyond their initial capabilities. Inspired by human learning strategies, this work addresses these issues by introducing adaptive difficulty adjustments and expert-guided knowledge reformulation to improve LLM performance on complex tasks.

Method

The paper proposes two novel strategies to enhance LLM reasoning within the Zero-RL framework:

  1. Adaptive Difficulty Curriculum Learning (ADCL): This method counters the Difficulty Shift phenomenon by dynamically re-estimating the difficulty of upcoming data batches based on the model's current state. Initially, difficulty scores are computed with the base model and the dataset is split into sequential, easy-to-hard batches. During training, after each batch is consumed, the difficulty of the next batch is re-estimated with the current model and that batch is re-sorted, keeping the curriculum aligned with the model's evolving perception of difficulty. Because only the upcoming batch is re-scored, the approach avoids the cost of re-evaluating the entire dataset (see the first sketch after this list).
  2. Expert-Guided Self-Reformulation (EGSR): This strategy addresses the limitations of on-policy RL by incorporating expert guidance without direct imitation. Instead of injecting off-policy expert trajectories (which cause a distributional mismatch), EGSR has the model generate trajectories under expert influence, e.g., with the expert's full solution or final answer included in the prompt, so that the model reformulates the solution within its own conceptual framework. This is supported by a modified GRPO objective that substitutes guided trajectories when the standard rollouts all yield zero reward, maintaining near on-policy stability while expanding capabilities (see the second sketch below).
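A minimal Python sketch of the ADCL loop from item 1, under stated assumptions: `solve_and_check` and `rl_update` are caller-supplied, hypothetical helpers (the paper scores difficulty by accuracy rate over sampled rollouts, but its exact sampling and update code is not reproduced here).

```python
def estimate_difficulty(model, problem, solve_and_check, n_samples: int = 8) -> float:
    """Difficulty = failure rate of the current model on `problem`.
    `solve_and_check(model, problem)` is a caller-supplied function that
    samples one solution and returns True if the final answer is correct."""
    correct = sum(solve_and_check(model, problem) for _ in range(n_samples))
    return 1.0 - correct / n_samples

def adcl_train(model, dataset, solve_and_check, rl_update, num_batches: int = 4):
    """Adaptive Difficulty Curriculum Learning (sketch).
    Score the dataset once with the base model, split it into sequential
    easy-to-hard batches, then re-score and re-sort ONLY the upcoming batch
    with the current model before training on it (3 re-estimations for
    4 batches, matching the reported setting).
    `rl_update(model, batch)` is a caller-supplied RL step, e.g. GRPO."""
    # Initial easy-to-hard ordering using the base model's difficulty scores.
    ordered = sorted(dataset, key=lambda p: estimate_difficulty(model, p, solve_and_check))
    size = len(ordered) // num_batches
    batches = [ordered[i * size:(i + 1) * size] for i in range(num_batches - 1)]
    batches.append(ordered[(num_batches - 1) * size:])  # last batch keeps the remainder

    for i, batch in enumerate(batches):
        if i > 0:
            # Periodic re-estimation: only this local batch is re-scored,
            # so the full dataset never has to be re-evaluated.
            batch.sort(key=lambda p: estimate_difficulty(model, p, solve_and_check))
        model = rl_update(model, batch)
    return model
```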

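And a corresponding sketch of EGSR's rollout-substitution logic from item 2. Here `reward_fn`, `build_guided_prompt`, `model.generate`, and the `problem` object are placeholders for the caller's reward, prompting, and sampling code; the paper additionally modifies the GRPO loss itself, which this sketch does not show.

```python
def egsr_rollouts(model, problem, expert_solution, reward_fn,
                  build_guided_prompt, n_rollouts: int = 8):
    """Expert-Guided Self-Reformulation (sketch).
    Sample on-policy rollouts as usual; only if every rollout earns zero
    reward, re-sample with the expert's solution (or just its final answer)
    placed in the prompt, so the guided trajectories are still produced by
    the current policy rather than copied from the expert."""
    rollouts = [model.generate(problem.prompt) for _ in range(n_rollouts)]
    rewards = [reward_fn(problem, r) for r in rollouts]

    if max(rewards) == 0:  # the plain GRPO group carries no learning signal
        guided_prompt = build_guided_prompt(problem, expert_solution)
        rollouts = [model.generate(guided_prompt) for _ in range(n_rollouts)]
        rewards = [reward_fn(problem, r) for r in rollouts]

    # The (possibly guided) group is then fed to the GRPO update; because the
    # text was generated by the current policy, training stays near on-policy.
    return rollouts, rewards
```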
Experiment

The experiments were conducted using Qwen2.5-7B as the base model on curated datasets (BaseSet-7K and AugSet-10K) derived from high-quality reasoning corpora, with difficulty assessed via accuracy rates. Benchmarks included MATH500, AIME24, AIME25, AMC23, and Minervamath, with performance measured as pass@1 or pass@8 to account for variance on the smaller sets. The setup used the TRL framework with GRPO, a composite reward function balancing format and accuracy, and specific hyperparameters for ADCL (4 batches, 3 re-estimations) and EGSR (guidance via the expert's answer or full solution). Results showed ADCL outperforming a predefined curriculum (e.g., 76.2% vs. 75.4% on MATH500) and EGSR(s,a) surpassing naive off-policy guidance (e.g., 79.4% vs. 66.0% on MATH500). The combined ADCL and EGSR approach achieved the best results, including a 16.66-percentage-point gain on AIME25 over the RL baseline (33.33% vs. 16.67%).

While the setup is reasonable, the small size of some benchmarks and the lack of comparison with other advanced methods limit the robustness of the conclusions. The pass@32 analysis suggests capability expansion, but a deeper failure-case analysis is missing. The results generally match expectations, though the gains may be overstated due to the metric choice and benchmark variance.
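The composite reward is not spelled out above; as a rough illustration only, a format-plus-accuracy reward for boxed math answers might look like the following, where the \boxed{} convention and the 0.1/1.0 weights are assumptions rather than the paper's values.

```python
import re

def composite_reward(completion: str, gold_answer: str,
                     w_format: float = 0.1, w_accuracy: float = 1.0) -> float:
    """Illustrative composite reward (assumed form, not the paper's exact one).
    Format: the completion contains a parseable \\boxed{...} final answer.
    Accuracy: the boxed answer string matches the gold answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    format_ok = match is not None
    accuracy_ok = format_ok and match.group(1).strip() == gold_answer.strip()
    return w_format * float(format_ok) + w_accuracy * float(accuracy_ok)
```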

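For the pass@k numbers, a standard way to compute an unbiased estimate from n sampled completions per problem is the combinatorial estimator below, shown for reference only since the paper's exact evaluation script is not reproduced here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions drawn
    (without replacement) from n samples, of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 8 samples per problem and 2 correct,
# pass@1 = 1 - C(6,1)/C(8,1) = 0.25 and pass@8 = 1.0.
```
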
Further Thoughts

The concepts of ADCL and EGSR open up intriguing avenues for further exploration, particularly in how dynamic adaptation and guided reformulation could apply beyond mathematical reasoning to domains like natural language understanding or even multimodal tasks involving vision and text. For instance, could ADCL's periodic difficulty re-estimation be adapted to handle evolving user intents in conversational AI, where task complexity shifts based on dialogue context? Similarly, EGSR's reformulation approach might be linked to techniques in human-AI collaboration, where models learn from human feedback not by mimicking but by internalizing feedback into their response frameworks, potentially aligning with recent work on RLHF (Reinforcement Learning from Human Feedback).

A critical concern is the scalability of ADCL's re-estimation process for larger datasets or models, as frequent re-sorting could introduce significant computational overhead; exploring lightweight proxies for difficulty estimation (e.g., using smaller surrogate models) could be a valuable direction. Additionally, the reliance on perplexity as a measure of policy alignment in EGSR is questionable, as it may not capture semantic or logical coherence; future work could integrate metrics like BLEU or human evaluation for reasoning quality.

Finally, connecting these strategies to emergent abilities in LLMs, such as in-context learning, might reveal whether adaptive curricula or guided reformulation can accelerate the onset of such capabilities during scaling.


