Tag: Transformer
All the articles with the tag "Transformer".
-
A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
This paper proposes the Sliding Layer Merging (SLM) method, which performs depth-wise pruning by dynamically merging consecutive layers of large language models based on CKA similarity. It significantly outperforms existing methods on zero-shot tasks and in inference efficiency, and also explores the potential of combining depth-wise and width-wise pruning.
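The summary mentions a merging criterion driven by CKA similarity between consecutive layers. Below is a minimal sketch of what such a criterion could look like: linear CKA over calibration activations plus a greedy sliding window that groups layers whose outputs stay similar to the window's anchor. The function names, threshold value, and greedy grouping rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_tokens, d_model)."""
    x = x - x.mean(dim=0, keepdim=True)   # center features
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm(p="fro") ** 2    # ||Y^T X||_F^2
    den = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return (num / den).item()

def merge_groups(layer_outputs: list[torch.Tensor], threshold: float = 0.9) -> list[list[int]]:
    """Greedily group consecutive layers whose outputs remain CKA-similar.

    `layer_outputs[i]` holds calibration activations after layer i (hypothetical input).
    Each returned group is a candidate span to collapse into a single layer.
    """
    groups, current = [], [0]
    for i in range(1, len(layer_outputs)):
        if linear_cka(layer_outputs[current[0]], layer_outputs[i]) >= threshold:
            current.append(i)          # still similar to the window's anchor: extend
        else:
            groups.append(current)     # similarity dropped: close the window
            current = [i]
    groups.append(current)
    return groups
```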
-
Why do LLMs attend to the first token?
This paper argues that attention sinks in LLMs, particularly at the first token, serve as a useful mechanism to prevent over-mixing of information in deep Transformers. The argument is supported by theoretical insights and by empirical evidence from Gemma 7B, LLaMa 3.1 models, and pre-training experiments, which show stronger sinks in larger models and with longer contexts.
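A common way to quantify an attention sink is the share of post-softmax attention mass that queries place on the first token. The sketch below computes that quantity for one layer's attention weights; the tensor shapes and the toy random scores are illustrative assumptions, not the paper's measurement setup.

```python
import torch

def sink_fraction(attn: torch.Tensor) -> float:
    """Average attention mass assigned to the first (sink) token.

    `attn` has shape (n_heads, seq_len, seq_len), with each row summing to 1
    (post-softmax attention weights of a single layer).
    """
    return attn[:, :, 0].mean().item()

# Toy usage: random causal attention weights standing in for a real forward pass.
n_heads, seq_len = 8, 16
scores = torch.randn(n_heads, seq_len, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)
print(f"attention mass on token 0: {sink_fraction(attn):.3f}")
```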
-
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
This paper investigates inter-layer communication in Transformer LMs by identifying low-rank communication channels via SVD. Interventions on these channels demonstrate their causal role in prompt sensitivity and significantly improve performance on context retrieval tasks such as the Laundry List task.
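The generic recipe hinted at in the summary is to form the composition of a "writing" head's output matrix with a "reading" head's query matrix and keep the top singular directions as a candidate low-rank channel. The sketch below does exactly that; the shape conventions, the choice of matrices, and the rank are assumptions for illustration, not the paper's exact setup.

```python
import torch

def communication_channel(w_o: torch.Tensor, w_q: torch.Tensor, k: int = 2):
    """Rank-k SVD approximation of how an earlier head's writes are read by a later head.

    One possible convention: w_o is (d_head, d_model), mapping the earlier head's output
    into the residual stream; w_q is (d_model, d_head), mapping the residual stream into
    the later head's query space. Their product is the head-to-head interaction matrix.
    """
    interaction = w_o @ w_q                               # (d_head, d_head)
    u, s, vh = torch.linalg.svd(interaction)
    channel = u[:, :k] @ torch.diag(s[:k]) @ vh[:k, :]    # keep top-k singular directions
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()       # fraction of squared singular mass
    return channel, explained.item()

# Toy usage with random weights standing in for trained head matrices.
d_model, d_head = 64, 16
w_o = torch.randn(d_head, d_model) / d_model ** 0.5
w_q = torch.randn(d_model, d_head) / d_model ** 0.5
channel, frac = communication_channel(w_o, w_q, k=2)
print(f"rank-2 channel captures {frac:.2%} of the interaction")
```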
-
CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
This paper proposes CoLA and its memory-efficient variant CoLA-M, which replace the full-size MLP and projection layers of LLMs with low-rank auto-encoders, achieving a 2x reduction in model size and compute cost while preserving full-rank performance and substantially improving throughput in both training and inference.
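To make the idea concrete, here is a generic low-rank auto-encoder layer in the spirit of the summary: a dense projection is replaced by an encode step down to rank r, a nonlinearity, and a decode step back up, so parameters and FLOPs scale with r rather than with the full matrix. The layer below is an illustrative stand-in, not CoLA's published architecture; dimensions, activation, and rank are assumed values.

```python
import torch
import torch.nn as nn

class LowRankAutoencoderLayer(nn.Module):
    """Illustrative low-rank replacement for a dense projection (d_in -> d_out)."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.encode = nn.Linear(d_in, rank, bias=False)   # compress activations to rank r
        self.act = nn.SiLU()                              # nonlinearity between the factors
        self.decode = nn.Linear(rank, d_out, bias=False)  # expand back to the output width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.act(self.encode(x)))

# Rough parameter comparison for one square projection (hypothetical sizes).
d_model, rank = 4096, 512
dense_params = d_model * d_model
low_rank_params = 2 * d_model * rank
print(f"dense: {dense_params:,}  low-rank: {low_rank_params:,}  "
      f"ratio: {dense_params / low_rank_params:.1f}x")
```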
-
Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration
This paper proposes Layer-wise Optimal Task vector merging (LOT Merging), which optimizes the model merging process by minimizing feature drift. It significantly outperforms training-free baselines on vision and vision-language tasks, improving average accuracy by up to 4.4%.
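The basic mechanics of layer-wise task vector merging can be sketched as follows: for each parameter tensor, a task vector is the difference between a finetuned model and the base model, and the merged weight is the base plus a per-layer weighted sum of task vectors. The drift-minimizing choice of coefficients is the paper's contribution and is not reproduced here; the coefficients, names, and shapes below are illustrative assumptions.

```python
import torch

def layer_wise_task_vector_merge(
    base: dict[str, torch.Tensor],
    finetuned: list[dict[str, torch.Tensor]],
    coeffs: dict[str, list[float]],
) -> dict[str, torch.Tensor]:
    """Merge finetuned models into the base via per-layer weighted task vectors.

    For parameter `name`: t_i = finetuned_i[name] - base[name], and the merged
    weight is base[name] + sum_i coeffs[name][i] * t_i. Coefficients are given
    externally here (LOT Merging would derive them by minimizing feature drift).
    """
    merged = {}
    for name, w_base in base.items():
        task_vectors = [ft[name] - w_base for ft in finetuned]
        merged[name] = w_base + sum(c * t for c, t in zip(coeffs[name], task_vectors))
    return merged

# Toy usage with two "finetuned" models and uniform per-layer coefficients.
base = {"layer0.weight": torch.zeros(4, 4), "layer1.weight": torch.zeros(4, 4)}
ft_a = {k: v + 1.0 for k, v in base.items()}
ft_b = {k: v - 0.5 for k, v in base.items()}
coeffs = {name: [0.5, 0.5] for name in base}
merged = layer_wise_task_vector_merge(base, [ft_a, ft_b], coeffs)
print(merged["layer0.weight"][0, 0].item())   # 0.25 = 0.5*1.0 + 0.5*(-0.5)
```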