Tag: Reinforcement Learning
All the articles with the tag "Reinforcement Learning".
-
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
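This paper proposes a reward-augmented dataset method that relabels preference pairs so that large language models are conditioned on reward values and learn the full spectrum of response quality, significantly improving the performance of Direct Preference Optimization (DPO) and mitigating two of its limitations: forgetting high-quality rejected responses and indiscriminately learning low-quality chosen responses.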
-
VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
This paper introduces VLM Q-Learning, an offline-to-online reinforcement learning method that fine-tunes Vision-Language Models for interactive decision-making by filtering suboptimal actions with a critic head, achieving significant performance improvements over supervised fine-tuning across multiple multimodal agent tasks.
-
Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
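This paper proposes Nemotron-Research-Tool-N1, which trains tool-calling language models using rule-based reinforcement learning with a binary reward function, significantly improving tool-calling ability without relying on annotated reasoning trajectories; experiments show that it outperforms strong baselines such as GPT-4o on multiple benchmarks.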
-
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
This paper introduces SIMPLEMIX, a simple method to mix on- and off-policy data in language model preference optimization, demonstrating that their complementary strengths (on-policy for reasoning tasks, off-policy for open-ended tasks) lead to a 6.03% average improvement over single-source methods on AlpacaEval 2.0.
-
HAIR: Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning for LLM Alignment
HAIR introduces a novel LLM alignment method using hardness-aware inverse reinforcement learning and introspective reasoning, constructing a balanced safety dataset and training category-specific reward models with GRPO-S, achieving state-of-the-art harmlessness while preserving usefulness across multiple benchmarks.