Tag: Alignment

All the articles with the tag "Alignment".

Improving Multilingual Language Models by Aligning Representations through Steering

Published: 26 May, 2025 at 11:22 AM

85.45 🤔

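This paper proposes a representation-steering method that adjusts the layer-wise representations of large language models to improve performance on multilingual tasks. Experiments show that it outperforms basic prompting across a variety of tasks and approaches translation baselines, but it degrades performance on English tasks and yields limited gains for low-resource languages.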
Latent Principle Discovery for Language Model Self-Improvement

Published: 26 May, 2025 at 11:25 AM

85.30 🤔

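This paper proposes the STaPLe algorithm, which uses Monte Carlo EM to automate the discovery and learning of latent principles for language model self-improvement. It significantly improves small-model performance on multiple instruction-following benchmarks and produces a human-interpretable constitution through clustering.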
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

Published: 18 May, 2025 at 11:16 AM

85.12 🤔

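By analyzing how LLM output distributions change before and after alignment, this paper shows that alignment reduces distributional pluralism but achieves Overton pluralism through longer responses, and that base models can effectively imitate aligned-model behavior via in-context learning, supporting the superficial alignment hypothesis.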
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models

Published: 4 May, 2025 at 04:27 PM

78.97 🤔

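This paper proposes Head-Specific Intervention (HSI), which intervenes on the activations of specific attention heads and successfully induces Llama 2 models to bypass safety alignment on AI coordination behavior, outperforming supervised fine-tuning and other intervention strategies.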
Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Published: 12 May, 2025 at 11:14 AM

76.90 🤔

This paper introduces Latent Preference Coding (LPC), a framework that uses discrete latent codes to model multifaceted human preferences, consistently improving the performance of offline alignment algorithms like DPO, SimPO, and IPO across multiple LLMs and benchmarks.
