arXiv: 2412.00359

Does Self-Attention Need Separate Weights in Transformers?

Published:  at  11:12 AM

This paper introduces a shared weight self-attention mechanism for transformers, using a single weight matrix with diagonal scaling to reduce parameters by 66.53% in attention blocks, achieving competitive performance on GLUE and improved noise robustness while slightly underperforming on SQuAD tasks compared to standard BERT.

Transformer, Efficiency, Pre-training, Representation Learning

Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Ozmen Garibay, Niloofar Yousefi

University of Central Florida, FL, USA; Nokia Bell Labs, NJ, USA

Generated by grok-3

Background Problem

The research addresses the computational inefficiency and high parameter count of standard self-attention in transformer models, which uses separate weight matrices for Queries, Keys, and Values; on top of attention's quadratic complexity in sequence length, this design inflates the parameter count and complicates handling of long-term dependencies and sequential directionality. The key problem is reducing the parameter size and computational overhead of the self-attention block while maintaining, or improving, performance on natural language understanding tasks.

Method

The proposed method, termed ‘shared weight self-attention,’ replaces the three separate weight matrices traditionally used for Keys (K), Queries (Q), and Values (V) with a single shared weight matrix $W_s$. This matrix generates a unified representation $S = XW_s$, from which Q, K, and V are derived using diagonal transformation matrices $D_q$, $D_k$, and $D_v$: $Q = SD_q$, $K = SD_k$, and $V = SD_v$. These diagonal matrices act as element-wise scaling factors that adapt the shared representation to its different roles in the attention mechanism, which is still computed via the standard softmax-normalized dot product. Sharing $W_s$ and differentiating the roles with lightweight diagonal matrices factorizes the projection weights, aiming to maintain expressiveness while significantly cutting parameter count and computational cost.
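To make the mechanism concrete, here is a minimal single-head PyTorch sketch of the idea as described above; it is an illustration under assumptions, not the authors' implementation. Head splitting, masking, dropout, the output projection, and the exact initialization are omitted, and the class and parameter names (`SharedWeightSelfAttention`, `d_q`, `d_k`, `d_v`) are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedWeightSelfAttention(nn.Module):
    """Single-head sketch of shared-weight self-attention.

    One shared projection W_s produces S = X W_s; the per-role diagonal
    matrices D_q, D_k, D_v are stored as vectors and applied as
    element-wise scales, so Q = S D_q, K = S D_k, V = S D_v.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.w_s = nn.Linear(d_model, d_model, bias=False)  # shared W_s
        # Diagonal transformation matrices kept as vectors of size d_model.
        self.d_q = nn.Parameter(torch.ones(d_model))
        self.d_k = nn.Parameter(torch.ones(d_model))
        self.d_v = nn.Parameter(torch.ones(d_model))
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.w_s(x)                                   # S = X W_s, shape (B, T, d)
        q, k, v = s * self.d_q, s * self.d_k, s * self.d_v
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```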

Experiment

The experiments pre-train a BERT model with shared weight self-attention on BooksCorpus and English Wikipedia (3.2 billion tokens), mirroring the standard BERT-base-uncased configuration (12 layers, 768 hidden dimensions, 12 attention heads). Performance is evaluated on the GLUE benchmark and SQuAD v1.1/v1.2, comparing against standard, symmetric, and pairwise self-attention BERT variants. Results show a 66.53% reduction in attention-block parameters and a 12.94% reduction in total BERT parameters, with competitive performance on GLUE (e.g., 0.87% higher accuracy than standard BERT on MRPC; average accuracy of 79.92% vs. 79.97% for standard) but slight drops on SQuAD (e.g., 0.65% lower EM on v1.1). Robustness tests with 0-40% Gaussian noise show superior stability for the shared model (e.g., MNLI accuracy falls from 80.94% to 75.19%, versus 81.66% to 68.24% for standard BERT). Training time is reduced by 11-30% across tasks. The setup is reasonable for efficiency-focused research, but it lacks testing on larger models or decoder-based tasks, and the slight performance trade-offs suggest the method may not fully match the expressiveness of separate weights in all scenarios.
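As a sanity check on the headline number, the sketch below roughly reproduces the reported 66.53% figure, assuming the count covers only the per-layer Q/K/V projection weights of BERT-base (hidden size 768) and excludes biases and the attention output projection; if the paper counts differently, the exact percentage will shift.

```python
# Back-of-the-envelope check of the reported ~66.53% reduction in
# attention-block parameters, assuming only the Q/K/V projection weights
# of one BERT-base layer (d = 768) are counted, without biases or the
# output projection.
d = 768
standard = 3 * d * d            # separate W_q, W_k, W_v: 1,769,472 weights
shared = d * d + 3 * d          # one shared W_s plus three diagonal vectors: 592,128
print(f"reduction: {1 - shared / standard:.2%}")  # ~66.54%
```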

Further Thoughts

The shared weight self-attention mechanism presents a compelling case for efficiency in transformer models, particularly in resource-constrained environments, but its limitations in decoder models and larger architectures warrant further exploration. I am intrigued by the potential regularization effect of shared weights contributing to noise robustness—could this be leveraged as a general technique for improving model stability in adversarial settings? Additionally, connecting this work to recent trends in parameter-efficient fine-tuning (e.g., Low-Rank Adaptation), I wonder if combining shared weights with techniques like LoRA could further optimize transformer efficiency without sacrificing performance. Another avenue is testing this approach in multimodal systems, where attention mechanisms often handle diverse data types; the shared weight concept might either simplify cross-modal interactions or fail to capture modality-specific nuances. Finally, the reliance on diagonal matrices for differentiation feels like an under-explored area—future work could investigate more complex transformation functions to balance efficiency and expressiveness, potentially drawing inspiration from sparse attention mechanisms like Longformer.
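To make the LoRA thought experiment concrete, the speculative sketch below (not from the paper) freezes a pre-trained shared $W_s$ and fine-tunes only a low-rank update plus the diagonal role vectors; the class name, rank, and initialization are hypothetical choices.

```python
import torch
import torch.nn as nn

class SharedWeightLoRAProjection(nn.Module):
    """Hypothetical combination: frozen shared W_s with a LoRA-style
    low-rank update and trainable diagonal role vectors."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.w_s = nn.Linear(d_model, d_model, bias=False)
        self.w_s.weight.requires_grad_(False)            # frozen pre-trained W_s
        self.lora_a = nn.Parameter(torch.randn(d_model, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, d_model))  # zero-init: update starts as a no-op
        self.d_q = nn.Parameter(torch.ones(d_model))
        self.d_k = nn.Parameter(torch.ones(d_model))
        self.d_v = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor):
        # Shared representation plus low-rank correction, then per-role scaling.
        s = self.w_s(x) + x @ self.lora_a @ self.lora_b
        return s * self.d_q, s * self.d_k, s * self.d_v
```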


