This paper investigates inter-layer communication in Transformer LMs by using SVD to identify low-rank communication channels between attention heads, and demonstrates their causal role in prompt sensitivity through interventions that substantially improve performance on context-retrieval tasks such as the synthetic Laundry List task.
Transformer, Large Language Model, Interpretability, Representation Learning, Reasoning
Jack Merullo, Carsten Eickhoff, Ellie Pavlick
Brown University, University of Tübingen
Generated by grok-3
Background Problem
Transformer language models (LMs) exhibit impressive capabilities but often suffer from arbitrary sensitivities to prompt variations, such as the order or format of the input, leading to unpredictable failures in tasks like context retrieval. This work aims to understand the internal mechanisms behind these sensitivities by investigating how information is passed between layers in LMs, specifically focusing on low-rank communication channels in the residual stream. The key problem addressed is the lack of clarity on how features are represented and routed across layers, which contributes to prompt sensitivity, as exemplified by the synthetic Laundry List task, in which models struggle to recall items from longer lists or from specific positions.
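To make the retrieval setting concrete, here is a hypothetical sketch of a Laundry List-style example; the object vocabulary, wording, and query format are illustrative assumptions, not the paper's exact template.

```python
# Hypothetical Laundry List-style example: the model sees a list of objects and must
# recall the object at a queried position. The format here is assumed for illustration.
import random

OBJECTS = ["apple", "notebook", "candle", "umbrella", "scarf",
           "battery", "mug", "stapler", "ribbon", "flashlight",
           "kettle", "wrench", "pillow", "lantern", "compass",
           "marker", "thermos", "sponge", "magnet", "whistle"]

def ordinal(n: int) -> str:
    """1 -> '1st', 2 -> '2nd', 3 -> '3rd', 11 -> '11th', ..."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def make_laundry_list_example(n_items: int, query_index: int, seed: int = 0):
    """Build one toy retrieval example with n_items objects (up to 20 here);
    the answer is the object at query_index (0-based)."""
    rng = random.Random(seed)
    items = rng.sample(OBJECTS, n_items)
    prompt = (f"I went to the store and bought {', '.join(items)}. "
              f"The {ordinal(query_index + 1)} item I bought was the")
    return prompt, items[query_index]

prompt, answer = make_laundry_list_example(n_items=8, query_index=3)
print(prompt)   # "... The 4th item I bought was the"
print(answer)
```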
Method
The core idea is to identify and analyze low-rank communication channels between attention heads across layers in Transformer LMs, using Singular Value Decomposition (SVD) to decompose weight matrices (e.g., OV and QK) into interpretable subspaces. The method works in three steps: 1) compute a Composition Score (CS), introduced by Elhage et al. (2021), to measure the interaction strength between the weight matrices of different heads; 2) apply SVD to break each matrix into rank-1 components, revealing that dominant low-rank signals (1-2 dimensions) form the communication channels; 3) intervene, either by editing weights (zeroing out SVD components) or by manipulating activations (scaling vectors within the identified subspaces), to test the causal effect of these channels on behavior. The approach targets specific head interactions, such as those between inhibition heads and mover heads in GPT-2 Small, focusing on query and value composition to uncover content-independent signals used for token indexing.
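As a minimal sketch of steps 1 and 2 (matrix shapes and composition conventions are assumptions on my part, not the authors' released code), the Composition Score and the SVD truncation that isolates a candidate low-rank channel can be written as follows.

```python
# Sketch of the Composition Score (Elhage et al., 2021) and SVD-based channel extraction.
# Assumes full d_model x d_model head matrices (e.g., W_OV = W_V @ W_O, W_QK = W_Q @ W_K.T);
# the exact convention (left- vs right-multiplication) may differ from the paper's code.
import numpy as np

def composition_score(W_later: np.ndarray, W_earlier_OV: np.ndarray) -> float:
    """||W_later @ W_earlier_OV||_F / (||W_later||_F * ||W_earlier_OV||_F).
    For Q-composition, W_later is the downstream head's QK matrix;
    for V-composition, it is the downstream head's OV matrix."""
    num = np.linalg.norm(W_later @ W_earlier_OV, ord="fro")
    den = np.linalg.norm(W_later, ord="fro") * np.linalg.norm(W_earlier_OV, ord="fro")
    return float(num / den)

def low_rank_channel(W: np.ndarray, k: int = 1):
    """Truncate W to its top-k rank-1 SVD components: the candidate communication channel."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # rank-k approximation of W
    return W_k, U[:, :k], S[:k], Vt[:k]

# Toy usage with random stand-ins for an upstream OV matrix and a downstream QK matrix.
d_model = 768                                   # GPT-2 Small hidden size
rng = np.random.default_rng(0)
W_OV_up = rng.normal(size=(d_model, d_model))
W_QK_down = rng.normal(size=(d_model, d_model))
print("Q-composition score:", composition_score(W_QK_down, W_OV_up))
W_1, U1, S1, Vt1 = low_rank_channel(W_OV_up, k=1)
print("Top-1 share of spectrum:", S1[0] / np.linalg.svd(W_OV_up, compute_uv=False).sum())
```

In the paper's setting, a high composition score between a pair of heads flags a candidate interaction, and the SVD of the earlier head's output matrix reveals that only one or two such rank-1 components carry the communicated signal.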
Experiment
The experiments are conducted on GPT-2 Small and Pythia-160m using two main tasks: the Indirect Object Identification (IOI) dataset, to study the inhibition mechanism, and the synthetic Laundry List task, to test context retrieval under varying list lengths (3 to 20 items). The setup decomposes attention-head matrices with SVD to identify communication channels, then applies interventions (weight editing and activation scaling) to assess their causal role. Results show that low-rank subspaces (1D for inhibition, 2D for duplicate detection) significantly influence behavior, with interventions improving Laundry List accuracy by over 20% (e.g., from 64% to 86% for 3 objects, and 51% for 8 objects). However, the representational capacity of these channels breaks down around 9-10 items, leading to fragmented indexing and reduced performance; this matches the expectation of a capacity limit but raises concerns about scalability. The experimental design is thorough for the chosen tasks but limited by its reliance on well-studied models and synthetic data, potentially missing real-world complexities.
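The interventions themselves reduce to simple linear algebra. Below is a hedged sketch (NumPy stand-ins rather than hooks into a real model; function names are mine) of the two styles described above: zeroing SVD components of a weight matrix, and scaling activations inside the identified subspace.

```python
# Sketch of the two intervention styles: weight editing and activation scaling.
import numpy as np

def ablate_components(W: np.ndarray, component_ids) -> np.ndarray:
    """Weight edit: reconstruct W with the chosen rank-1 SVD components zeroed out."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = S.copy()
    S[list(component_ids)] = 0.0
    return U @ np.diag(S) @ Vt

def scale_in_subspace(h: np.ndarray, V: np.ndarray, alpha: float) -> np.ndarray:
    """Activation edit: scale the projection of residual-stream vectors h (n x d_model)
    onto the subspace spanned by the orthonormal columns of V (d_model x k) by alpha."""
    proj = (h @ V) @ V.T                       # component of h lying in the channel subspace
    return h + (alpha - 1.0) * proj

# Toy usage: remove the top-1 component of a weight matrix, then halve (alpha=0.5) the
# channel signal carried by a batch of residual-stream vectors.
d_model = 768
rng = np.random.default_rng(0)
W = rng.normal(size=(d_model, d_model))
W_ablated = ablate_components(W, component_ids=[0])
U, _, _ = np.linalg.svd(W, full_matrices=False)
h = rng.normal(size=(5, d_model))              # 5 residual-stream positions
h_scaled = scale_in_subspace(h, U[:, :1], alpha=0.5)
```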
Further Thoughts
The discovery of low-rank communication channels offers a fascinating lens into the emergent structures of Transformer LMs, potentially linking to broader phenomena like emergent abilities in larger models. I’m particularly curious about how these findings might intersect with recent work on scaling laws—do these capacity limits in indexing scale predictably with model size, or are they tied to specific architectural choices in GPT-2? Additionally, the content-independent nature of the signals suggests a parallel to cognitive science concepts like working memory constraints, which could inspire cross-disciplinary research into AI and human cognition. A practical extension could involve applying this decomposition method to newer models that use techniques like RoPE (as noted in the limitations), potentially adapting the composition score for non-linear embeddings. Finally, the risk of ‘subspace illusion’ highlighted by Makelov et al. (2024) warrants deeper investigation—future work could integrate adversarial testing of these subspaces to ensure robustness of interpretations across diverse tasks and models.