Posts
All the articles I've posted.
-
RM-R1: Reward Modeling as Reasoning
This paper proposes RM-R1, a family of reasoning reward models (REASRMS) that recast reward modeling as a reasoning task and are trained with distillation followed by reinforcement learning, achieving state-of-the-art performance on several benchmarks while markedly improving interpretability.
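To make "reward modeling as reasoning" concrete, here is a minimal sketch of a judge prompt that elicits a rubric and a reasoning trace before the verdict; the template and its field names are illustrative assumptions, not RM-R1's actual prompt.

```python
# Illustrative judge template: the model is asked to produce evaluation
# criteria and step-by-step reasoning before its final preference, rather
# than emitting a scalar reward directly. Template wording is an assumption.
JUDGE_TEMPLATE = """You are a reward model. First write evaluation criteria,
then reason step by step about each response, then output your verdict.

Question: {question}
Response A: {response_a}
Response B: {response_b}

Criteria:
Reasoning:
Verdict (A or B):"""

def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    """Fill the template; the result is fed to the judge LLM."""
    return JUDGE_TEMPLATE.format(question=question,
                                 response_a=response_a,
                                 response_b=response_b)
```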
-
Competition Dynamics Shape Algorithmic Phases of In-Context Learning
This paper introduces a synthetic sequence-modeling task based on finite mixtures of Markov chains to unify the study of in-context learning (ICL). It identifies four competing algorithms that explain model behavior and phase transitions, offering insight into ICL's transient nature and phenomenology.
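For a concrete picture of the task, here is a minimal sketch of sampling training sequences from a finite mixture of Markov chains; the vocabulary size, number of chains, and sequence length are assumed values, not the paper's settings.

```python
# Each sequence is drawn from one of K fixed Markov chains ("a finite
# Markov mixture"); a small sequence model trained on such data can then
# be probed for which in-context algorithm it implements.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K, SEQ_LEN = 8, 4, 64  # illustrative sizes, not the paper's

# K random row-stochastic transition matrices define the mixture components.
chains = rng.dirichlet(np.ones(VOCAB), size=(K, VOCAB))

def sample_sequence() -> np.ndarray:
    """Pick a chain uniformly at random, then roll out a token sequence."""
    T = chains[rng.integers(K)]
    seq = [rng.integers(VOCAB)]
    for _ in range(SEQ_LEN - 1):
        seq.append(rng.choice(VOCAB, p=T[seq[-1]]))
    return np.array(seq)

batch = np.stack([sample_sequence() for _ in range(32)])
print(batch.shape)  # (32, 64): a batch of inputs for a sequence model
```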
-
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
This paper proposes the Reinforcement Distillation (REDI) framework, which uses two-stage training to exploit both positive and negative reasoning traces, markedly improving the mathematical reasoning of small language models; Qwen-REDI-1.5B sets a new state of the art for 1.5B models trained on openly available data.
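A minimal sketch of how stage-2 training might combine the two trace polarities, assuming a simple weighted-likelihood loss; REDI's exact objective and weighting are not reproduced here.

```python
# Hedged sketch of the two-stage idea: stage 1 fine-tunes on positive
# (correct) traces only; stage 2 reuses negative (incorrect) traces by
# penalizing their likelihood with a small weight. The loss form and the
# weight `alpha` are illustrative assumptions, not REDI's actual objective.
import torch
import torch.nn.functional as F

def trace_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Mean next-token negative log-likelihood of one trace.

    logits: (T, V) model outputs; tokens: (T,) token ids.
    """
    return F.cross_entropy(logits[:-1], tokens[1:])

def stage2_loss(pos_logits, pos_tokens, neg_logits, neg_tokens,
                alpha: float = 0.1) -> torch.Tensor:
    # Pull probability mass toward positive traces...
    loss_pos = trace_nll(pos_logits, pos_tokens)
    # ...and push it away from negative ones, down-weighted so the
    # negative signal regularizes rather than dominates training.
    loss_neg = trace_nll(neg_logits, neg_tokens)
    return loss_pos - alpha * loss_neg
```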
-
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
This paper proposes R2R, a token-level neural routing method that selectively invokes an LLM to correct divergent tokens in an SLM's reasoning path; with an average of 5.6B activated parameters it surpasses the R1-14B model and delivers a 2.8x wall-clock speedup over R1-32B.
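To illustrate the routing idea, here is a minimal sketch of a per-token routing loop with the models and router stubbed as callables; R2R's actual router is a learned neural module, so the interfaces below are assumptions.

```python
# Per-token routing: the SLM drafts every token, and the router decides,
# token by token, whether the LLM should override the draft. Only flagged
# (divergence-prone) tokens pay for an LLM forward pass.
from typing import Callable, List

def route_generate(slm_next: Callable[[List[int]], int],
                   llm_next: Callable[[List[int]], int],
                   router: Callable[[List[int], int], bool],
                   prompt: List[int], max_new: int = 128) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new):
        draft = slm_next(tokens)       # cheap draft from the small model
        if router(tokens, draft):      # router flags a divergent token
            draft = llm_next(tokens)   # one LLM step to correct it
        tokens.append(draft)
    return tokens
```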
-
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
This paper evaluates the compositional reasoning ability of vision-language models (VLMs) with the ComPABench benchmark, finding that reinforcement learning (RL) outperforms supervised fine-tuning (SFT) on cross-task and out-of-distribution generalization, and proposes RL-Ground, which markedly improves multimodal compositional reasoning performance.