Posts
All the articles I've posted.
-
RM-R1: Reward Modeling as Reasoning
This paper proposes RM-R1, a family of reasoning reward models (REASRMS) that recast reward modeling as a reasoning task and are trained with distillation followed by reinforcement learning, achieving state-of-the-art performance on several benchmarks while markedly improving interpretability.
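To make "reward modeling as reasoning" concrete, here is a minimal sketch of a judge prompt that elicits a rubric and a reasoning trace before the verdict; the template and its field names are illustrative assumptions, not RM-R1's actual prompt.

```python
# Illustrative judge template: the model is asked to produce evaluation
# criteria and step-by-step reasoning before its final preference, rather
# than emitting a scalar reward directly. Template wording is an assumption.
JUDGE_TEMPLATE = """You are a reward model. First write evaluation criteria,
then reason step by step about each response, then output your verdict.

Question: {question}
Response A: {response_a}
Response B: {response_b}

Criteria:
Reasoning:
Verdict (A or B):"""

def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    """Fill the template; the result is fed to the judge LLM."""
    return JUDGE_TEMPLATE.format(question=question,
                                 response_a=response_a,
                                 response_b=response_b)
```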
-
Competition Dynamics Shape Algorithmic Phases of In-Context Learning
This paper introduces a synthetic sequence-modeling task based on finite mixtures of Markov chains to unify the study of in-context learning (ICL). It identifies four competing algorithms that explain model behavior and phase transitions, offering insight into ICL's transient nature and phenomenology.
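For a concrete picture of the task, here is a minimal sketch of sampling training sequences from a finite mixture of Markov chains; the vocabulary size, number of chains, and sequence length are assumed values, not the paper's settings.

```python
# Each sequence is drawn from one of K fixed Markov chains ("a finite
# Markov mixture"); a small sequence model trained on such data can then
# be probed for which in-context algorithm it implements.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K, SEQ_LEN = 8, 4, 64  # illustrative sizes, not the paper's

# K random row-stochastic transition matrices define the mixture components.
chains = rng.dirichlet(np.ones(VOCAB), size=(K, VOCAB))

def sample_sequence() -> np.ndarray:
    """Pick a chain uniformly at random, then roll out a token sequence."""
    T = chains[rng.integers(K)]
    seq = [rng.integers(VOCAB)]
    for _ in range(SEQ_LEN - 1):
        seq.append(rng.choice(VOCAB, p=T[seq[-1]]))
    return np.array(seq)

batch = np.stack([sample_sequence() for _ in range(32)])
print(batch.shape)  # (32, 64): a batch of inputs for a sequence model
```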
-
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
This paper proposes the Reinforcement Distillation (REDI) framework, which uses two-stage training to exploit both positive and negative reasoning traces, markedly improving the mathematical reasoning of small language models; Qwen-REDI-1.5B sets a new state of the art for 1.5B models trained on openly available data.
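A minimal sketch of how stage-2 training might combine the two trace polarities, assuming a simple weighted-likelihood loss; REDI's exact objective and weighting are not reproduced here.

```python
# Hedged sketch of the two-stage idea: stage 1 fine-tunes on positive
# (correct) traces only; stage 2 reuses negative (incorrect) traces by
# penalizing their likelihood with a small weight. The loss form and the
# weight `alpha` are illustrative assumptions, not REDI's actual objective.
import torch
import torch.nn.functional as F

def trace_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Mean next-token negative log-likelihood of one trace.

    logits: (T, V) model outputs; tokens: (T,) token ids.
    """
    return F.cross_entropy(logits[:-1], tokens[1:])

def stage2_loss(pos_logits, pos_tokens, neg_logits, neg_tokens,
                alpha: float = 0.1) -> torch.Tensor:
    # Pull probability mass toward positive traces...
    loss_pos = trace_nll(pos_logits, pos_tokens)
    # ...and push it away from negative ones, down-weighted so the
    # negative signal regularizes rather than dominates training.
    loss_neg = trace_nll(neg_logits, neg_tokens)
    return loss_pos - alpha * loss_neg
```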
-
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
This paper proposes R2R, a token-level neural routing method that selectively invokes an LLM to correct divergent tokens in an SLM's reasoning path; with an average of 5.6B activated parameters it surpasses the R1-14B model and delivers a 2.8x wall-clock speedup over R1-32B.
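To illustrate the routing idea, here is a minimal sketch of a per-token routing loop with the models and router stubbed as callables; R2R's actual router is a learned neural module, so the interfaces below are assumptions.

```python
# Per-token routing: the SLM drafts every token, and the router decides,
# token by token, whether the LLM should override the draft. Only flagged
# (divergence-prone) tokens pay for an LLM forward pass.
from typing import Callable, List

def route_generate(slm_next: Callable[[List[int]], int],
                   llm_next: Callable[[List[int]], int],
                   router: Callable[[List[int], int], bool],
                   prompt: List[int], max_new: int = 128) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new):
        draft = slm_next(tokens)       # cheap draft from the small model
        if router(tokens, draft):      # router flags a divergent token
            draft = llm_next(tokens)   # one LLM step to correct it
        tokens.append(draft)
    return tokens
```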
-
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
This paper evaluates the compositional reasoning ability of vision-language models (VLMs) with the ComPABench benchmark, finding that reinforcement learning (RL) outperforms supervised fine-tuning (SFT) on cross-task and out-of-distribution generalization, and proposes RL-Ground, which markedly improves multimodal compositional reasoning performance.