arXiv: 2504.02732

Why do LLMs attend to the first token?


This paper argues that attention sinks in LLMs, particularly at the first token, are a useful mechanism for preventing over-mixing of information in deep Transformers. The claim is supported by theoretical analysis and by empirical evidence from Gemma 7B, the LLaMa 3.1 family, and pre-training experiments showing that sinks grow stronger with larger models and longer contexts.

Large Language Model, Transformer, Representation Learning, Pre-training, Long Context

Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu

University of Oxford, National University of Singapore, Google DeepMind

Generated by grok-3

Background Problem

Large Language Models (LLMs) exhibit a phenomenon known as ‘attention sinks,’ where a significant portion of attention is allocated to seemingly unimportant tokens, often the first token (e.g., ⟨bos⟩). This behavior, observed across frontier models, has been linked to issues like quantization difficulties, security vulnerabilities, and streaming attention challenges. While prior works have focused on mitigating attention sinks, this paper investigates why they are useful, hypothesizing that they serve as a mechanism to prevent ‘over-mixing’ of information in deep Transformer architectures, a problem related to rank collapse, representational collapse, and over-squashing, especially as model depth and context length increase.
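To make the phenomenon concrete, here is a minimal sketch of how attention mass on the first token can be inspected in any decoder-only checkpoint that exposes attention weights. The choice of `gpt2` and the example sentence are placeholders of mine, not the models or prompts used in the paper.

```python
# Minimal sketch: measure how much attention each head places on the first token.
# Assumption: "gpt2" is used only as a small public stand-in; the paper studies
# Gemma 7B and the LLaMa 3.1 family.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The first token often receives a surprisingly large share of attention."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
for layer_idx, attn in enumerate(outputs.attentions):
    # Average over query positions (rows) the attention paid to the first key (column 0).
    mass_on_first = attn[0, :, :, 0].mean(dim=-1)  # shape: (num_heads,)
    print(f"layer {layer_idx}: mean attention on first token per head = "
          f"{mass_on_first.mean().item():.3f}")
```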

Method

The core idea is that attention sinks, particularly at the first token, help control over-mixing in Transformers by reducing the sensitivity of token representations to perturbations, thus preventing representational collapse. The authors approach this theoretically by connecting attention sinks to existing concepts like rank collapse and over-squashing, deriving mathematical bounds (e.g., Jacobian norms for perturbation sensitivity) to show how sinks limit information mixing across tokens. They also propose that sinks act as ‘approximate no-ops’ in attention heads, allowing minimal updates to token embeddings by default. Empirically, they analyze attention patterns in pre-trained models like Gemma 7B and LLaMa 3.1, and conduct pre-training experiments on smaller 120M parameter models to study the impact of context length and data packing strategies on sink formation.
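To illustrate the 'approximate no-op' intuition (a toy construction of my own, not the authors' code): when a head routes most of its attention to a sink token whose value vector is close to zero, the residual update it writes is small, so the query token's representation is barely changed and little mixing occurs.

```python
# Toy numerical sketch of the "approximate no-op" intuition.
# Assumption: the sink token carries a near-zero value vector, so attention
# concentrated on it produces a small residual update.
import torch

torch.manual_seed(0)
seq_len, d = 8, 16
values = torch.randn(seq_len, d)
values[0] = 0.01 * torch.randn(d)  # sink token with a near-zero value vector

def attention_update(logits_row):
    """Residual update for one query given its attention logits over all keys."""
    weights = torch.softmax(logits_row, dim=-1)  # (seq_len,)
    return weights @ values                      # (d,)

# Case 1: a "sinking" head -- large logit on position 0, small elsewhere.
sink_logits = torch.full((seq_len,), -2.0)
sink_logits[0] = 4.0
sink_update = attention_update(sink_logits)

# Case 2: a diffuse head -- roughly uniform attention over all positions.
diffuse_logits = torch.zeros(seq_len)
diffuse_update = attention_update(diffuse_logits)

print(f"update norm with sink:    {sink_update.norm():.4f}")
print(f"update norm without sink: {diffuse_update.norm():.4f}")
# The sink-dominated head contributes a much smaller update: it behaves like an
# approximate no-op and limits how much information gets mixed into the query token.
```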

Experiment

The experiments are multifaceted:

1. Perturbation analysis on Gemma 7B shows that the presence of the ⟨bos⟩ token reduces the spread of perturbations across token representations, supporting the over-mixing hypothesis, though it is limited to specific token substitutions (e.g., ‘greatest’ to ‘best’).
2. Analysis of the LLaMa 3.1 family (8B to 405B) reveals stronger attention sinks in larger models (e.g., a 78.29% sink metric for 405B vs. 45.97% for 8B), aligning with the theoretical predictions about model depth, though without direct evidence of causality.
3. Pre-training experiments with 120M-parameter models demonstrate that longer context lengths (up to 2048 tokens) correlate with stronger sink formation, with the sink metric rising from near 0% at short contexts to substantial percentages at longer ones, though these smaller models may not fully mirror frontier-LLM behavior.
4. Data-packing experiments show that fixing ⟨bos⟩ at the first position during pre-training strengthens sink formation (up to a 90.84% sink metric), but omitting ⟨bos⟩ at inference drastically degrades performance (validation loss spikes to 7.78), indicating reliance on specific pre-training choices.

While the results generally match the expectation that sinks mitigate over-mixing, the experimental setup is not fully comprehensive: diverse input types and potential negative impacts of sinks are underexplored, and the evidence remains correlational rather than causal. A sketch of how a sink metric of this kind can be computed is given below.
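For reference, here is one plausible way to compute a sink metric of the kind quoted above, assuming it is the percentage of (layer, head) pairs whose average attention on the first token exceeds a threshold ε; the paper's exact threshold and definition may differ.

```python
# Hedged sketch of a sink metric: the percentage of (layer, head) pairs whose
# average attention on the first key position exceeds a threshold eps.
# Assumption: this formalization mirrors the quoted percentages only approximately.
import torch

def sink_metric(attentions, eps=0.3):
    """attentions: iterable of (batch, heads, seq, seq) attention tensors, one per layer."""
    flags = []
    for attn in attentions:
        # Average over batch and query positions the attention paid to key position 0.
        per_head = attn[..., 0].mean(dim=(0, 2))  # shape: (heads,)
        flags.append(per_head > eps)
    flags = torch.cat(flags)                       # (layers * heads,)
    return 100.0 * flags.float().mean().item()     # percentage of "sinking" heads

# Example with random attention maps; real usage would pass a model's outputs.attentions.
dummy = [torch.softmax(torch.randn(1, 12, 32, 32), dim=-1) for _ in range(4)]
print(f"sink metric: {sink_metric(dummy):.2f}%")
```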

Further Thoughts

The concept of attention sinks as a defense against over-mixing opens up fascinating avenues for future research, particularly in how they might interact with other architectural innovations like mixture-of-depths or sparse attention mechanisms, which also aim to control information flow. Could attention sinks be explicitly engineered or tuned during training to optimize performance for specific tasks, such as long-context reasoning, where over-mixing might be particularly detrimental? Additionally, I wonder if there’s a connection to robustness in adversarial settings—since sinks reduce sensitivity to perturbations, might they inadvertently improve model resilience against adversarial attacks, or conversely, create vulnerabilities by overly focusing on irrelevant tokens? This also ties into recent works on efficient inference (e.g., KV-caching optimizations), where attention sinks have been shown to play a role; understanding their purpose could guide more efficient model designs. Lastly, the reliance on pre-training choices like ⟨bos⟩ positioning suggests a deeper interplay between data preparation and learned attention patterns, which could be explored in the context of curriculum learning or data augmentation strategies to further mitigate collapse phenomena.


