Tag: Robustness

All the articles with the tag "Robustness".

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Published: 9 May, 2025 at 11:09 AM

75.37 🤔

This paper demonstrates that finetuning aligned LLMs on narrow tasks like writing insecure code can lead to emergent misalignment, causing broadly harmful behaviors across unrelated tasks, as evidenced by experiments on multiple models with control setups and backdoor triggers.

ElChat: Adapting Chat Language Models Using Only Target Unlabeled Language Data

Published: 4 May, 2025 at 04:28 PM

75.31 🤔

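This paper proposes ElChat, a method that adapts chat models directly on unlabeled target-language data and combines this with model merging and weight copying, restoring chat ability and instruction following while performing well on target-language tasks and safety.

Racing Thoughts: Explaining Contextualization Errors in Large Language Models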

Published: 7 May, 2025 at 12:18 AM

74.82 🤔

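This paper proposes the "LLM Race Conditions Hypothesis" to explain contextualization errors in large language models, uses mechanistic interpretability techniques to verify how the critical window and the order of contextualization affect model performance, and explores inference-time interventions to mitigate the problem.

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections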

Published: 4 May, 2025 at 04:30 PM

72.07 🤔

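By proposing an attack framework and evaluating it experimentally, this paper reveals prompt injection vulnerabilities in LLM-as-a-judge systems and recommends strategies such as multi-model committees to improve robustness.

Better Estimation of the KL Divergence Between Language Models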

Published: 12 May, 2025 at 11:21 AM

71.02 🤔

This paper introduces a Rao-Blackwellized Monte Carlo estimator for KL divergence between language models, achieving unbiased estimates with provably lower variance than standard Monte Carlo methods, and demonstrates improved stability and performance in RLHF fine-tuning for sentiment-controlled generation.
