Tag: Alignment
All the articles with the tag "Alignment".
-
HAIR: Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning for LLM Alignment
HAIR is an LLM alignment method that combines hardness-aware inverse reinforcement learning with introspective reasoning. It constructs a balanced safety dataset and trains category-specific reward models with GRPO-S, achieving state-of-the-art harmlessness while preserving usefulness across multiple benchmarks.
-
HyPerAlign: Hypotheses-driven Personalized Alignment
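This paper proposes HyPerAlign, which achieves personalized alignment of LLMs through hypothesis-driven few-shot learning, improving the model's adaptability to individual users and its safety while reducing reliance on fine-tuning.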
-
CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks
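This paper proposes CachePrune, which uses feature attribution based on a DPO loss to identify and prune key neurons in the KV cache, defending against indirect prompt injection attacks while preserving response quality.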
-
Base Models Beat Aligned Models at Randomness and Creativity
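Through experiments on tasks that require unpredictability, such as random number generation, mixed-strategy games, and creative writing, this paper finds that popular alignment techniques impair these capabilities and that base models perform better on such tasks, suggesting a possible trade-off between performance on common benchmarks and the capacity for unpredictability.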