Tag: Alignment

All the articles with the tag "Alignment".

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Published: 9 May, 2025 at 11:09 AM

75.37 🤔

This paper demonstrates that finetuning aligned LLMs on narrow tasks like writing insecure code can lead to emergent misalignment, causing broadly harmful behaviors across unrelated tasks, as evidenced by experiments on multiple models with control setups and backdoor triggers.

Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach

Published: 7 May, 2025 at 09:32 AM

74.12 🤔

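This paper uses calibration-aware fine-tuning (CFT and RCFT), together with a theoretical framework of calibratable and non-calibratable regimes, to significantly improve the calibration of large language models after preference alignment while maintaining or improving their language capabilities.

ComPO: Preference Alignment via Comparison Oracles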

Published: 13 May, 2025 at 11:09 AM

73.73 🤔

This paper introduces ComPO, a novel preference alignment method for LLMs using comparison oracles to effectively utilize noisy preference pairs, demonstrating reduced verbosity and likelihood displacement across multiple models and benchmarks.

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Published: 7 May, 2025 at 12:16 AM

70.15 🤔

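This paper proposes a reward-augmented dataset method that relabels preference pairs so that large language models are conditioned on reward values and learn the full spectrum of response quality, significantly improving the performance of Direct Preference Optimization (DPO) and mitigating its limitations of unlearning high-quality rejected responses and indiscriminately learning low-quality chosen responses.

SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning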

Published: 9 May, 2025 at 11:06 AM

68.33 🤔

This paper introduces SIMPLEMIX, a simple method to mix on- and off-policy data in language model preference optimization, demonstrating that their complementary strengths (on-policy data for reasoning tasks, off-policy data for open-ended tasks) lead to a 6.03% average improvement over single-source methods on AlpacaEval 2.0.
