Tag: Reasoning
All the articles with the tag "Reasoning".
-
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
This paper introduces R1-Reward, which applies reinforcement learning to multimodal reward model training via the StableReinforce algorithm, significantly improving performance and surpassing existing state-of-the-art models on multiple benchmarks while demonstrating strong data efficiency and test-time scalability.
-
On the generalization of language models from in-context learning and finetuning: a controlled study
Through controlled experiments, this paper compares the generalization of language models under in-context learning versus finetuning, finds that in-context learning generalizes more flexibly, and proposes data augmentation methods that significantly improve the generalization of finetuned models.
-
Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data
This paper proposes the Discriminative Finetuning (DFT) framework, which optimizes a large language model's output probabilities through a discriminative probabilistic model without requiring human preference data or a reward model; it significantly outperforms SFT on mathematical reasoning and general language tasks and is competitive with SFT→PO methods.
-
AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models
This paper introduces AdaptMI and AdaptMI+, adaptive methods that use a reward model to detect question difficulty and select skill-based in-context examples for difficult questions, improving small language models' performance on mathematical reasoning tasks while avoiding cognitive overload.
-
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
This paper demonstrates through meta-analysis and experiments that Chain-of-Thought (CoT) prompting significantly enhances large language model performance on math and symbolic reasoning tasks, but offers limited benefit on non-symbolic tasks and underperforms tool-augmented approaches.