Tag: Reinforcement Learning

All the articles with the tag "Reinforcement Learning".

Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Published: 22 May, 2025 at 11:17 AM

93.37 🤔

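This paper proposes a two-stage training framework. A warm-up phase on the domain-agnostic Knights & Knaves logic puzzles first activates general reasoning ability. RLVR training on a small amount of target-domain data then follows. In resource-constrained settings, this substantially improves both the reasoning performance and the cross-domain generalization of large language models.
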
Distilling LLM Agent into Small Models with Retrieval and Code Tools

Published: 28 May, 2025 at 11:25 AM

93.11 🤔

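This paper proposes the Agent Distillation framework, which distills the interactive behavior of LLM agents into small language models (sLMs). Combined with a first-thought prefix and self-consistent action generation, it gives small models significant gains on factual and mathematical reasoning tasks, approaching or even surpassing larger models trained with CoT distillation.
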
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

Published: 22 May, 2025 at 11:16 AM

92.95 🤔

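This paper proposes the LATENTSEEK framework, which performs test-time instance-level adaptation (TTIA) through policy gradient in latent space. It substantially improves the reasoning ability of large language models and explores a new direction for test-time scaling.
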
Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning

Published: 28 May, 2025 at 11:25 AM

92.52 🤔

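Through theoretical analysis and a Re-distillation technique, this paper identifies the efficiency bottleneck of small-scale SFT in R1-style RL. Using very few samples (<1K), it approaches RL-level performance on the K&K and MATH datasets, substantially improving data efficiency.
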
Reward Reasoning Model

Published: 24 May, 2025 at 11:08 AM

92.11 🤔

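This paper proposes Reward Reasoning Models (RRMs), which run a chain-of-thought reasoning process before producing a reward, adaptively exploiting test-time compute. RRMs significantly improve performance across multiple reward-modeling benchmarks and practical applications, and perform especially well on complex reasoning tasks.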
