Tag: Self-Rewarding
All the articles with the tag "Self-Rewarding".
-
CREAM: Consistency Regularized Self-Rewarding Language Models
本文提出了CREAM(Consistency Regularized Self-Rewarding Language Model)方法,通过衡量自奖励过程中不同迭代模型之间排序的一致性来正则化偏好训练,从而缓解奖励偏差问题,提高小型语言模型的对齐性能和训练稳定性。