This paper introduces Gaussian Concept Subspace (GCS), a framework that models concept representations in LLMs as Gaussian distributions, demonstrating improved robustness, faithfulness, and plausibility over single-vector methods, with effective application to emotion steering tasks.
Large Language Model, Representation Learning, Embeddings, Interpretability, Human-AI Interaction
Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du
New Jersey Institute of Technology, Wake Forest University, Cisco Research
Generated by grok-3
Background Problem
Large Language Models (LLMs) encode semantic knowledge internally, but our understanding of how concepts are represented remains limited. Current methods use linear probing classifiers to derive a single concept vector, but these vectors vary with the choice of probing dataset and training process, yielding less robust representations. This variability hampers the study of concept relations and weakens inference-time interventions. The paper addresses this by modeling each concept's representation as a subspace described by a Gaussian distribution, aiming to capture multifaceted semantics more robustly.
Method
The proposed Gaussian Concept Subspace (GCS) framework extends linear probing by estimating a multidimensional subspace for each concept in an LLM. It works by: 1) creating multiple probing datasets for a concept by randomly sampling subsets from a larger dataset; 2) training a linear classifier on each subset to obtain a set of ‘observed’ concept vectors; 3) fitting a Gaussian distribution to these vectors, with the mean vector given by their average and a diagonal covariance matrix (assuming the dimensions are independent); 4) sampling vectors from this distribution within 1σ of the mean to represent the concept at varying levels of relevance. This approach aims to provide a more nuanced and robust representation than single-vector methods, capturing the multifaceted nature of concepts in the hidden representation space of LLMs.
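To make the pipeline concrete, here is a minimal sketch of steps 1–4 under stated assumptions: activations come from a single LLM layer, probes are logistic regressions, observed vectors are unit-normalized, and the kσ sampling rule is one simple reading of "sampling within 1σ of the mean". Function names and hyperparameters are illustrative, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_gcs(hidden_states, labels, n_probes=100, subset_size=1000, seed=0):
    """Fit a Gaussian Concept Subspace: train many linear probes on random
    subsets and model the resulting weight vectors as a Gaussian.

    hidden_states: (n_samples, d) activations from one LLM layer
    labels:        (n_samples,) binary concept labels
    Returns (mu, sigma): per-dimension mean and std of the observed vectors.
    """
    rng = np.random.default_rng(seed)
    n = hidden_states.shape[0]
    observed = []
    for _ in range(n_probes):
        idx = rng.choice(n, size=subset_size, replace=False)  # step 1: random subset
        probe = LogisticRegression(max_iter=1000).fit(hidden_states[idx], labels[idx])  # step 2
        w = probe.coef_.ravel()
        observed.append(w / np.linalg.norm(w))  # unit-normalize each observed vector
    observed = np.stack(observed)
    # Step 3: diagonal covariance, i.e., dimensions treated as independent.
    return observed.mean(axis=0), observed.std(axis=0)

def sample_at_sigma(mu, sigma, n_vectors=10, k=1.0, seed=0):
    """Step 4: draw concept vectors at the k-sigma level as mu + k * sigma * z,
    z ~ N(0, I). Smaller k stays closer to the mean (more concept-relevant)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_vectors, mu.shape[0]))
    return mu + k * sigma * z
```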
Experiment
The experiments evaluate GCS across multiple LLMs (Llama-2-7B, Gemma-7B, Llama-2-13B) using synthetic datasets generated by GPT-4o, organized into hierarchical concepts (e.g., movie, sports event, populated place, animal) with 5,000 positive and negative samples per low-level concept. The setup assesses: 1) faithfulness, via cosine similarity and prediction accuracy of sampled versus observed vectors: sampled vectors (especially at 1σ) achieve comparable or better accuracy, with high similarity (above 0.93 among sampled vectors); 2) plausibility, via average cosine similarity and PCA visualization: the learned representations align with human-expected concept hierarchies (e.g., intra-category similarities are higher than inter-category ones); 3) application to emotion steering (generating joyful movie reviews): GCS vectors at 1σ outperform the mean-difference baseline and strike a better balance between joyfulness and coherence than vectors sampled at higher σ levels, though GCS is not consistently more robust than single linear probing vectors. The setup is comprehensive for controlled testing but relies on synthetic data, which may limit real-world applicability. Results match expectations of improved representation but highlight a trade-off in steering fluency at higher σ.
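For the steering application, a common way to apply a concept vector at inference time is to add a scaled copy of it to a layer's hidden states via a forward hook. The sketch below assumes a Hugging Face-style Llama model in PyTorch; the layer index, steering strength `alpha`, and helper names are assumptions, not the paper's reported setup.

```python
import torch

def make_steering_hook(concept_vec, alpha=8.0):
    """Forward hook that adds a scaled concept vector to a decoder layer's
    output hidden states during generation (inference-time intervention)."""
    def hook(module, inputs, output):
        # HF decoder layers typically return a tuple whose first element is
        # the hidden states; plain tensors are handled for other layouts.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * concept_vec.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage (layer index and alpha are illustrative):
# vec = torch.tensor(sample_at_sigma(mu, sigma, n_vectors=1, k=1.0)[0], dtype=torch.float32)
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(vec))
# output_ids = model.generate(**inputs, max_new_tokens=128)
# handle.remove()  # restore unsteered behavior
```

Larger `alpha` or vectors sampled at higher σ push generations further toward the concept but, as the results above suggest, at some cost to fluency.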
Further Thoughts
The GCS framework presents a novel shift toward distributional representations, which could inspire further research into non-linear or other probabilistic models of concept subspaces in LLMs. However, the assumption of dimensional independence in the covariance matrix may overlook intricate dependencies in high-dimensional spaces, a concern also raised in studies of transformer representations (e.g., Elhage et al., 2021 on superposition). Exploring mixture models or correlated Gaussian distributions could improve GCS’s accuracy; a minimal sketch of the correlated variant follows below. Additionally, the reliance on synthetic data from GPT-4o raises questions about applicability to noisy, real-world datasets; future work could incorporate datasets like those used in probing studies (e.g., Ousidhoum et al., 2021) to test robustness. The emotion steering application connects to broader AI alignment efforts (e.g., RLHF), suggesting GCS could be adapted for safety or bias mitigation, though the computational overhead of sampling multiple vectors would need optimization. Finally, comparing GCS with emerging methods such as activation steering (Zou et al., 2023) could clarify its unique contribution to inference-time interventions.
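To illustrate the correlated-Gaussian direction, a minimal sketch: `observed` is the stack of probe weight vectors from the earlier sketch, and the ridge-shrinkage scheme and function names are assumptions, since with only ~100 observed vectors in a ~4096-dimensional space the raw sample covariance is rank-deficient.

```python
import numpy as np

def fit_full_cov_gcs(observed, shrinkage=0.1):
    """Correlated-Gaussian variant: estimate a full covariance over the
    observed probe vectors instead of assuming independent dimensions.
    Shrinkage toward the diagonal keeps the estimate well-conditioned."""
    mu = observed.mean(axis=0)
    cov = np.cov(observed, rowvar=False)
    cov = (1.0 - shrinkage) * cov + shrinkage * np.diag(np.diag(cov))
    return mu, cov

def sample_correlated(mu, cov, n_vectors=10, seed=0):
    """Draw concept vectors from the full-covariance Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, cov, size=n_vectors)
```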