Tag: Interpretability
All the articles with the tag "Interpretability".
-
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Using a latent variable model and an identifiability analysis, this paper proves that the representations large language models learn through next-token prediction are approximately a linear transformation of the logarithm of the posterior probabilities of latent concepts, supporting the linear representation hypothesis, and proposes a structured sparse autoencoder to improve concept extraction.
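As a compact illustration of the claimed relationship (the notation below is introduced here for illustration and may differ from the paper's): write f(x) for the representation the model learns for context x and c for the vector of latent concepts; the identifiability result says that, approximately,

```latex
f(x) \approx A \log p(c \mid x) + b
```

for some fixed linear map A and offset b, which is exactly the affine form the linear representation hypothesis posits.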
-
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
This paper introduces Gaussian Concept Subspace (GCS), a framework that models concept representations in LLMs as Gaussian distributions rather than single vectors, demonstrating improved robustness, faithfulness, and plausibility over single-vector methods, and applying sampled vectors effectively to emotion steering tasks.
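A minimal sketch of the GCS idea in Python, assuming concept-direction estimates are already available (e.g., from probing); the function names and the diagonal-covariance simplification are ours, not the paper's:

```python
import numpy as np

def fit_concept_gaussian(concept_vectors: np.ndarray):
    """Fit a Gaussian over (n, d) estimated directions for one concept."""
    mu = concept_vectors.mean(axis=0)
    # Diagonal covariance keeps sampling cheap in high dimensions.
    var = concept_vectors.var(axis=0)
    return mu, var

def sample_steering_vector(mu, var, rng=None):
    """Draw one steering vector from the concept's Gaussian subspace."""
    rng = rng or np.random.default_rng()
    return rng.normal(mu, np.sqrt(var))

# Usage: steer a hidden state h toward the concept (e.g., an emotion):
# h_steered = h + alpha * sample_steering_vector(mu, var)
```

Sampling from a distribution, rather than reusing one fixed direction, is what gives GCS its robustness over single-vector steering.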
-
HyPerAlign: Hypotheses-driven Personalized Alignment
This paper proposes HyPerAlign, which achieves personalized alignment of LLMs through hypotheses-driven few-shot learning, improving the model's adaptability to individual users and its safety while reducing reliance on fine-tuning.
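Since the approach is prompting-based rather than fine-tuning-based, its core loop can be caricatured in a few lines. This is a hypothetical sketch: the prompts and function names are illustrative, and `llm` stands for any prompt-to-text callable:

```python
def infer_user_hypotheses(llm, user_examples: list[str]) -> str:
    """Ask the LLM to form explicit hypotheses about one user's preferences."""
    prompt = (
        "Here are writing samples from one user:\n"
        + "\n---\n".join(user_examples)
        + "\nState concise hypotheses about this user's preferences and style."
    )
    return llm(prompt)

def personalized_answer(llm, hypotheses: str, question: str) -> str:
    """Condition generation on the inferred hypotheses instead of fine-tuning."""
    prompt = (
        f"User preference hypotheses:\n{hypotheses}\n\n"
        f"Answer consistently with these hypotheses:\n{question}"
    )
    return llm(prompt)
```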
-
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
This paper uses Sparse Autoencoders to identify and manipulate language-specific features in Large Language Models. It introduces a monolinguality metric, demonstrates the context dependency of these features via code-switching, enhances steering vectors for better control over multilingual generation, and reveals significant language-specific impacts through ablation studies.
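One plausible way to score how language-specific an SAE feature is (the paper defines its own monolinguality metric; this ratio-style version is an assumption for illustration):

```python
import numpy as np

def monolinguality(feature_acts: dict[str, np.ndarray]) -> float:
    """feature_acts maps language -> activations of one SAE feature over
    tokens of that language; assumes at least two languages are given."""
    means = {lang: acts.mean() for lang, acts in feature_acts.items()}
    top = max(means, key=means.get)
    rest = [m for lang, m in means.items() if lang != top]
    # High ratio => the feature fires almost exclusively for one language.
    return float(means[top] / (np.mean(rest) + 1e-8))
```

Features scoring high under such a metric are the natural candidates for the steering and ablation experiments the paper describes.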
-
Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision
This paper proposes Instruct-LF, which combines the instruction-following ability of LLMs with gradient-based statistical models to achieve goal-conditioned latent factor discovery without task supervision, improving downstream task performance and being preferred in human evaluations.
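An illustrative two-stage pipeline in the spirit of Instruct-LF (the stage boundaries and names are assumptions, not the paper's implementation): an LLM scores documents against goal-relevant properties, then a gradient-based factor model compresses the resulting matrix into latent factors:

```python
import numpy as np

def score_matrix(llm_score, docs, properties):
    """llm_score(doc, prop) -> compatibility in [0, 1], supplied by an LLM
    prompted with the user's goal; stubbed here as an arbitrary callable."""
    return np.array([[llm_score(d, p) for p in properties] for d in docs])

def factorize(M, n_factors=5, lr=0.05, steps=500, seed=0):
    """Plain gradient-descent matrix factorization M ~ U @ V.T."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.normal(scale=0.1, size=(n, n_factors))   # document loadings
    V = rng.normal(scale=0.1, size=(m, n_factors))   # property loadings
    for _ in range(steps):
        E = U @ V.T - M          # reconstruction error
        U -= lr * (E @ V) / m    # gradient step on squared loss
        V -= lr * (E.T @ U) / n
    return U, V

# The recovered factors are goal-conditioned because the property set,
# and hence the matrix M, was generated from the user's instruction.
```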