arXiv: 2505.01996

Always Skip Attention

Published:  at  11:06 AM
89.20 🤔

This paper theoretically demonstrates that Self-Attention Blocks in Vision Transformers are ill-conditioned without skip connections, highlights the regularizing role skip connections play, and proposes Token Graying (SVD- and DCT-based) to improve the conditioning of input tokens, achieving modest performance gains in supervised and self-supervised tasks.

Transformer, Classification, Supervised Learning, Self-Supervised Learning, Efficiency

Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, Simon Lucey

Australian Institute for Machine Learning, The University of Adelaide; Data61, CSIRO

Generated by grok-3

Background Problem

The paper investigates a critical issue in Vision Transformers (ViTs): the self-attention mechanism (Self-Attention Block, SAB) is inherently ill-conditioned without skip connections, leading to catastrophic training failures. Unlike other components such as Feedforward Networks (FFNs) or architectures like CNNs, which can perform reasonably without skip connections, SABs uniquely depend on them for stability and convergence during gradient descent. This work aims to theoretically characterize this ill-conditioning, demonstrate the regularizing role of skip connections, and propose a complementary method to improve input token conditioning for better ViT performance.
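
The conditioning claim is easy to probe numerically. Below is a minimal NumPy sketch (my own toy illustration, not code from the paper) that builds a random single-head softmax self-attention block and compares the condition number of its output with and without the identity skip; on data like this the pure attention output is usually far worse conditioned than its input, while adding the skip keeps the result close to the input's conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 64, 32

def self_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention on a token matrix X of shape (n_tokens, d)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-stochastic attention matrix
    return A @ (X @ Wv)

X = rng.standard_normal((n_tokens, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

Y = self_attention(X, Wq, Wk, Wv)
print("kappa(X)          =", np.linalg.cond(X))      # moderate for a random Gaussian matrix
print("kappa(SAB(X))     =", np.linalg.cond(Y))      # usually far larger: attention smooths
                                                     # tokens toward a near-low-rank matrix
print("kappa(X + SAB(X)) =", np.linalg.cond(X + Y))  # the skip keeps it near kappa(X)
```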

Method

The core contribution is a theoretical analysis of why SABs in ViTs are ill-conditioned without skip connections, supported by two propositions. Proposition 4.1 shows that, without skip connections, the condition number of the SAB output embeddings is bounded by the cube of the input matrix's condition number, so conditioning can degrade rapidly across layers. Proposition 4.2 shows that adding the skip connection substantially lowers this condition number, aiding training stability. Building on this analysis, the authors propose Token Graying (TG), a pre-conditioning step for input tokens with two variants: SVD TG, which reconstructs tokens after amplifying the non-maximal singular values to reduce the condition number, and DCT TG, a computationally efficient approximation of SVD TG based on the Discrete Cosine Transform. TG is applied before patch embedding during training and is intended to complement, not replace, skip connections.
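
To make the two variants concrete, here is a rough NumPy/SciPy sketch of how SVD TG and DCT TG could look; the function names and the exact amplification schedule (the `alpha` exponent on singular values and the `gain` on non-DC DCT coefficients) are my own simplifications and may differ from the paper's formulation, which applies the transform before patch embedding inside the training pipeline.

```python
import numpy as np
from scipy.fft import dct, idct

def svd_token_graying(X, alpha=0.5):
    """SVD TG sketch: amplify the non-maximal singular values of the token matrix X.

    Assumed schedule: sigma_i -> sigma_max * (sigma_i / sigma_max) ** alpha with
    0 < alpha < 1, which pushes the smaller singular values toward the largest
    one and therefore shrinks the condition number sigma_max / sigma_min.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_new = s.copy()
    s_new[1:] = s[0] * (s[1:] / s[0]) ** alpha       # leave the top singular value untouched
    return U @ np.diag(s_new) @ Vt

def dct_token_graying(X, gain=2.0):
    """DCT TG sketch: boost the non-DC DCT coefficients along the token axis.

    Assumption: when tokens are highly correlated (as image patches are), the
    dominant singular direction is close to the constant (DC) direction across
    tokens, so amplifying the non-DC frequencies approximates SVD TG at
    fast-transform cost instead of requiring a full SVD.
    """
    C = dct(X, axis=0, norm="ortho")
    C[1:] *= gain                                    # amplify everything except the DC row
    return idct(C, axis=0, norm="ortho")

# Quick check on a deliberately correlated (ill-conditioned) token matrix:
rng = np.random.default_rng(0)
shared = 10.0 * rng.standard_normal(32)              # content shared by all tokens
X = np.ones((64, 1)) * shared + 0.1 * rng.standard_normal((64, 32))
print("raw    :", np.linalg.cond(X))
print("SVD TG :", np.linalg.cond(svd_token_graying(X)))   # strongest reduction, highest cost
print("DCT TG :", np.linalg.cond(dct_token_graying(X)))   # milder reduction, far cheaper
```

In this toy check both variants lower the condition number, with SVD TG giving the larger reduction, which mirrors the accuracy-versus-cost trade-off reported in the experiments below.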

Experiment

The experiments validate the theoretical claims and the proposed Token Graying (TG) methods across several ViT models (ViT-Tiny, ViT-Base, Swin-S, CaiT-S, PVT V2 b3) on Tiny-ImageNet and ImageNet-1K, covering both supervised and self-supervised learning. The setup compares SAB and FFN performance with and without skip connections, showing a catastrophic accuracy drop (up to 22% on CIFAR-10) when SAB skip connections are removed, whereas FFNs and CNN-style architectures such as ConvMixer degrade far less. Both TG variants are evaluated: DCT TG performs comparably to SVD TG (Top-1 accuracy improvements of 0.2-1.8% across models) at far lower computational cost (0.732 days of training versus 4.552 days for SVD TG on ViT-Base). Measured condition numbers of SAB output embeddings drop with TG, in line with the theory. However, the experimental design lacks dataset diversity beyond image classification and does not fully establish generalization across ViT variants, since gains are sometimes marginal (0.1-0.2% for some models). The setup is reasonable for initial validation but not comprehensive enough to support broader claims.
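
The training-time gap between the two variants is also easy to rationalize: a full SVD of an n x d token matrix costs roughly O(n·d·min(n,d)), whereas a DCT computed with fast transforms costs roughly O(n·d·log n). The toy micro-benchmark below (shapes loosely modeled on ViT-Base token matrices, not the paper's actual pipeline) makes the per-step difference visible.

```python
import numpy as np
from scipy.fft import dct
from timeit import timeit

rng = np.random.default_rng(0)
# 32 "images", each a 196 x 768 token matrix (roughly ViT-Base-like shapes)
batch = rng.standard_normal((32, 196, 768))

t_svd = timeit(lambda: np.linalg.svd(batch, full_matrices=False), number=1)
t_dct = timeit(lambda: dct(batch, axis=1, norm="ortho"), number=1)
print(f"batched SVD: {t_svd:.3f} s    batched DCT: {t_dct:.4f} s")
```

On typical hardware the DCT pass is dramatically faster (often by one to two orders of magnitude); since Token Graying is only one part of each training step, this is consistent with the roughly 6x end-to-end training-time gap reported above for ViT-Base.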

Further Thoughts

The insight into the ill-conditioning of Self-Attention Blocks opens up intriguing avenues for future research, particularly in exploring alternative regularization techniques beyond skip connections and Token Graying. For instance, could attention mechanisms be redesigned to inherently mitigate conditioning issues, perhaps by integrating concepts from numerical stability in optimization literature? Additionally, the modest performance gains of TG suggest a need to balance computational cost with impact—could TG be selectively applied to specific layers or tasks where conditioning issues are most pronounced? Connecting this to broader AI research, the conditioning problem might relate to challenges in training large language models (LLMs), where self-attention also plays a central role; investigating whether similar techniques could stabilize LLM training under constrained settings (e.g., low-precision or resource-limited environments) could be valuable. Finally, the limitation in low-precision training hints at a potential intersection with quantization-aware training methods, which could be explored to make TG more practical for edge devices.


