Posts
All the articles I've posted.
-
Can Large Reasoning Models Self-Train?
This paper proposes Self-Rewarded Training (SRT), which uses model self-consistency to drive reinforcement learning for unsupervised improvement of mathematical reasoning. Early in training it matches supervised methods, but reward hacking causes performance to collapse under prolonged training; the paper also explores mitigation strategies such as early stopping and curriculum learning.
-
Hybrid Latent Reasoning via Reinforcement Learning
This paper proposes HRPO, a reinforcement-learning-based hybrid latent reasoning framework that uses a gating mechanism to combine discrete tokens with continuous hidden states, significantly improving large language models' performance on knowledge and reasoning tasks while reducing reliance on chain-of-thought data.
-
Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning
This paper proposes ConciseR, a two-stage reinforcement learning framework that strengthens reasoning ability with GRPO++ and then optimizes response length with L-GRPO, substantially shortening CoT responses while preserving accuracy and outperforming existing methods across multiple benchmark datasets.
-
On the Generalization vs Fidelity Paradox in Knowledge Distillation
Through a large-scale empirical analysis, this paper shows that knowledge distillation (KD) significantly improves the zero-shot reasoning performance of small language models (by up to 10%) but yields limited gains for larger models, and that these performance gains are decoupled from reasoning fidelity, underscoring the importance of task expertise and moderate parameter tuning.
-
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
This paper introduces MiMo-7B, a 7B-parameter LLM optimized for reasoning via pre-training on reasoning-dense data with multi-token prediction and post-training with RL using test-difficulty-driven rewards, outperforming larger models and OpenAI o1-mini on mathematics and coding benchmarks.