Tag: Reinforcement Learning

All the articles with the tag "Reinforcement Learning".

Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs

Published: 20 May, 2025 at 11:09 AM

90.43 🤔

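Through systematic experiments, this paper shows that pure reinforcement learning (RL) training not only improves the complex reasoning ability of large language models but also implicitly develops process reward model (PRM) capability. It proposes the Self-PRM framework to further improve performance, while also revealing its low precision on hard problems.

ShiQ: Bringing back Bellman to LLMs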

Published: 20 May, 2025 at 11:23 AM

89.77 🤔

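This paper proposes the ShiQ algorithm, which starts from the Bellman consistency equations to design a loss function adapted to the characteristics of LLMs. It supports offline, token-level reinforcement learning fine-tuning and optimizes reward better than DPO and CoPG on both single-turn and multi-turn tasks.

Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving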

Published: 22 May, 2025 at 11:12 AM

89.47 🤔

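Using the ZeroTIR framework, this paper trains base large language models with reinforcement learning to spontaneously execute Python code when solving math problems. It reveals a positive correlation between the number of training steps and code usage frequency, response length, and task accuracy (the Agent RL Scaling Law), and significantly outperforms tool-free baselines on math benchmarks.

UFT: Unifying Supervised and Reinforcement Fine-Tuning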

Published: 25 May, 2025 at 11:47 AM

89.30 🤔

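This paper proposes the Unified Fine-Tuning (UFT) framework, which integrates supervised fine-tuning and reinforcement fine-tuning through hint-guided exploration and a hybrid objective function. It performs well across model sizes and reasoning tasks, and the authors prove theoretically that it yields an exponential improvement in sample complexity.

Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning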

Published: 26 May, 2025 at 11:24 AM

89.27 🤔

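Through experiments and theoretical analysis, this paper shows that RLVR improves the accuracy of large language models but not their capability because it is biased toward optimizing easier problems, whereas distillation improves capability only when it introduces new knowledge and otherwise behaves like RLVR.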