arXiv: 2505.05409

Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It

Published:  at  11:09 AM

This paper introduces geodesic sharpness, a novel measure that uses Riemannian geometry to account for transformer symmetries on a quotient manifold, and demonstrates stronger correlations with generalization than traditional adaptive sharpness across diagonal networks, vision transformers, and language models.

Transformer, Representation Learning, Efficiency, Robustness

Marvin F. da Silva, Felix Dangel, Sageev Oore

Dalhousie University, Vector Institute for Artificial Intelligence

Generated by grok-3

Background Problem

The paper addresses the challenge of predicting generalization in neural networks, particularly transformers, where traditional sharpness measures (which quantify how much the loss changes under small parameter perturbations) correlate only weakly with generalization performance. Existing measures like adaptive sharpness account for simpler symmetries (e.g., element-wise parameter rescaling) but fail to capture the richer, higher-dimensional symmetries of transformer attention, such as the GL(h) symmetry under which query and key projections can be transformed by any invertible h×h matrix without changing the computed function (sketched below). These symmetries create ambiguities in parameter space, where different parameter values represent the same function, obscuring the true relationship between sharpness and generalization. The key problem addressed is developing a symmetry-invariant sharpness measure that better predicts generalization in transformers.
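To make the GL(h) ambiguity concrete, here is a minimal NumPy sketch (illustrative names and sizes, not the paper's code): transforming the query projection by an invertible matrix M and the key projection by its inverse transpose leaves the attention logits unchanged, yet moves the weights to a different point in parameter space. That is exactly the ambiguity a purely Euclidean sharpness measure cannot see.

```python
# Sketch of the GL(h) symmetry of attention: W_Q -> W_Q @ M, W_K -> W_K @ inv(M).T
# for any invertible M leaves the attention logits W_Q @ W_K.T unchanged,
# while the parameters themselves (and any non-invariant sharpness measure) change.
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 8, 4                              # embedding dim and head dim (illustrative)

W_Q = rng.normal(size=(d_model, h))
W_K = rng.normal(size=(d_model, h))
M = rng.normal(size=(h, h)) + np.eye(h)        # generic h x h matrix, invertible w.p. 1

W_Q_t = W_Q @ M
W_K_t = W_K @ np.linalg.inv(M).T

# The function computed by the attention logits is identical ...
print(np.allclose(W_Q @ W_K.T, W_Q_t @ W_K_t.T))                 # True

# ... but the parameters sit at a different point in weight space.
print(np.linalg.norm(W_Q - W_Q_t), np.linalg.norm(W_K - W_K_t))  # both nonzero
```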

Method

The core idea is to redefine sharpness on a quotient manifold that accounts for transformer symmetries by using Riemannian geometry, thus eliminating ambiguities from parameter equivalences. The method, termed ‘geodesic sharpness,’ involves: (1) constructing a quotient manifold by quotienting out symmetries (e.g., GL(h) in attention layers) to represent equivalent parameter configurations as single points; (2) defining sharpness within a geodesic ball on this manifold, where perturbations follow geodesic paths (curved trajectories respecting the manifold’s geometry) rather than Euclidean straight lines; (3) approximating geodesics using Taylor expansions with Christoffel symbols for curvature when analytical solutions are unavailable. This approach generalizes adaptive sharpness by incorporating higher-order curvature terms and symmetry-compatible metrics, ensuring invariance under transformer-specific symmetries.
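Schematically (the notation below is ours and only sketches the construction), adaptive sharpness perturbs the weights along Euclidean straight lines inside a rescaled ball, while geodesic sharpness measures the loss change along geodesics of the quotient manifold, approximated to second order via the Christoffel symbols:

```latex
% Adaptive sharpness: straight-line perturbations in a rescaled Euclidean ball
S^{\mathrm{adapt}}_{\rho}(w) \;=\; \max_{\lVert \delta \oslash |w| \rVert_2 \le \rho} \, L(w + \delta) - L(w)

% Geodesic sharpness: perturbations along geodesics of the quotient manifold
S^{\mathrm{geo}}_{\rho}(w) \;=\; \max_{\lVert v \rVert_{g_w} \le \rho} \, L\big(\gamma_v(1)\big) - L(w),
\qquad
\gamma_v(t) \;\approx\; w + t\,v - \tfrac{t^2}{2}\,\Gamma_w(v, v)

% Here g_w is a symmetry-compatible Riemannian metric on the quotient manifold,
% \gamma_v is the geodesic starting at w with initial velocity v, and
% \Gamma_w(v, v)^k = \Gamma^k_{ij}(w)\, v^i v^j collects the Christoffel symbols
% supplying the curvature correction to the straight-line perturbation.
```

Roughly speaking, dropping the curvature term and choosing the element-wise |w|-rescaled metric collapses the geodesic definition back to the adaptive one, which is the sense in which geodesic sharpness generalizes adaptive sharpness.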

Experiment

Experiments were conducted on diagonal networks with synthetic data, vision transformers fine-tuned on ImageNet-1k, and language models (BERT) fine-tuned on MNLI. For diagonal networks, 50 models were trained on sparse regression tasks, and geodesic sharpness outperformed adaptive sharpness with Kendall rank correlations of -0.83 and -0.86 versus -0.68. For vision transformers (72 CLIP ViT-B/32 models), geodesic sharpness achieved correlations of -0.71 and -0.70 compared to -0.41 for adaptive sharpness, though the negative correlation (sharper models generalize better) was unexpected. For language models (35 BERT models), geodesic sharpness showed positive correlations of 0.28 and 0.38, while adaptive sharpness was essentially uninformative at 0.06. The experimental design was comprehensive, covering diverse architectures and datasets, but the varying correlation signs point to task-specific behavior that is not fully explained. Results generally matched the expectation of stronger correlation with generalization, though computational costs and approximation errors in the geodesic calculations were not analyzed in depth, which could limit practical applicability.
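For reference, the rank correlations quoted above follow the standard recipe: compute one sharpness value and one generalization gap per trained model, then report Kendall's tau between the two. A minimal sketch with synthetic numbers (not the paper's data):

```python
# Minimal sketch of the correlation analysis used to compare sharpness measures:
# given one sharpness value and one generalization gap per trained model,
# Kendall's rank correlation summarizes how well sharpness ranks generalization.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models = 50                                              # e.g. the diagonal-network setup
sharpness = rng.uniform(0.1, 1.0, size=n_models)           # placeholder sharpness values
gen_gap = -0.8 * sharpness + rng.normal(scale=0.1, size=n_models)  # synthetic gaps

tau, p_value = kendalltau(sharpness, gen_gap)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3g})")
# A tau near +/-1 means sharpness ranks the models' generalization almost perfectly;
# a tau near 0 (as reported for adaptive sharpness on BERT) means it is uninformative.
```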

Further Thoughts

The varying sign of correlation between geodesic sharpness and generalization across tasks (negative for vision transformers, positive for language models) is a profound observation that warrants deeper investigation. It suggests that the relationship between sharpness and generalization might be modulated by the nature of the data or task, an area the authors themselves flag for future work. This could tie into broader discussions in representation learning, where the geometry of learned representations might differ significantly between vision and language domains due to inherent data structures (e.g., spatial hierarchies in images versus sequential dependencies in text). A potential connection could be drawn to works on emergent abilities in large language models, where scaling and architecture-specific behaviors influence generalization in unexpected ways. Exploring whether geodesic sharpness could serve as a diagnostic tool for understanding these emergent properties, or integrating data symmetries alongside parameter symmetries, might provide a more holistic view of generalization. Additionally, the computational burden of geodesic calculations raises questions about scalability—could approximations or alternative geometric frameworks (e.g., information geometry) offer similar insights with lower overhead? This paper opens a critical dialogue on the interplay between model architecture, data, and generalization, pushing the field to reconsider fundamental assumptions about sharpness.


