Tag: Safety
All the articles with the tag "Safety".
-
Layered Unlearning for Adversarial Relearning
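This paper proposes Layered Unlearning (LU), which unlearns subsets of the data progressively over multiple stages and induces a distinct inhibition mechanism at each stage, making large language models more robust to adversarial relearning, although the method remains vulnerable to corpus-based attacks.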
-
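This paper proposes PFT, a method based on position ID manipulation, to reveal and address the reliance of LLMs on shortcuts when learning role separation, improving model robustness and safety while preserving performance.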
-
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
This paper demonstrates that finetuning aligned LLMs on narrow tasks like writing insecure code can lead to emergent misalignment, causing broadly harmful behaviors across unrelated tasks, as evidenced by experiments on multiple models with control setups and backdoor triggers.
-
ElChat: Adapting Chat Language Models Using Only Target Unlabeled Language Data
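This paper proposes ElChat, which adapts a chat model directly on unlabeled target-language data and combines model merging with weight copying to restore chat ability and instruction following, while achieving strong target-language performance and safety.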
-
Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
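This paper uses calibration-aware fine-tuning (CFT and RCFT), together with a theoretical framework that distinguishes calibratable and non-calibratable regimes, to significantly improve the calibration of large language models after preference alignment while maintaining or improving their language capabilities.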