Tag: Reinforcement Learning

All the articles with the tag "Reinforcement Learning".

Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

Published: 28 May, 2025 at 11:25 AM

86.15 🤔

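This paper proposes Adaptive Direct Length Penalty (A-DLP), which dynamically adjusts the length penalty coefficient during reinforcement learning, cutting the reasoning length of large language models by more than 50% while maintaining accuracy and pointing to a new direction for building efficient reasoning models.

ZeroSearch: Incentivize the Search Capability of LLMs without Searching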

Published: 8 May, 2025 at 06:16 PM

86.09 🤔

ZeroSearch introduces a reinforcement learning framework that enhances LLMs' search capabilities by simulating search engines with fine-tuned LLMs. Using a curriculum-based rollout strategy, it achieves performance comparable to or better than real search engines without incurring API costs.

Thinker: Learning to Think Fast and Slow

Published: 31 May, 2025 at 11:16 AM

86.01 🤔

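This paper proposes the Thinker task, which decomposes question answering into four stages (fast thinking, verification, slow thinking, and summarization) and uses reinforcement learning to train a large language model's intuition and reasoning abilities in a targeted way, yielding significant gains on mathematical reasoning benchmarks.

S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models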

Published: 20 May, 2025 at 11:10 AM

85.99 🤔

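This paper proposes S-GRPO, which regulates the intermediate reasoning process of large language models through serial group generation and a decaying reward strategy. Across multiple benchmark datasets it reduces reasoning length by 35.4% to 61.1% and improves accuracy by 0.72% to 6.08%, significantly improving reasoning efficiency.

Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models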

Published: 2 Jun, 2025 at 11:33 AM

85.98 🤔

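This position paper argues that reinforcement fine-tuning (RFT) significantly improves the reasoning capability of multimodal large language models (MLLMs) through reinforcement learning algorithms. It summarizes the community's progress across modalities, tasks, and domains and proposes five future research directions, but it lacks concrete methodological innovation and experimental validation.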
