Tag: Transformer
All the articles with the tag "Transformer".
-
A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
This paper proposes the Sliding Layer Merging (SLM) method, which performs depth-wise pruning by dynamically merging consecutive layers of large language models based on CKA similarity. It significantly outperforms existing methods on zero-shot tasks and in inference efficiency, and also explores the potential of combining depth-wise and width-wise pruning.
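The summary mentions a merging criterion driven by CKA similarity between consecutive layers. Below is a minimal sketch of what such a criterion could look like: linear CKA over calibration activations plus a greedy sliding window that groups layers whose outputs stay similar to the window's anchor. The function names, threshold value, and greedy grouping rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_tokens, d_model)."""
    x = x - x.mean(dim=0, keepdim=True)   # center features
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm(p="fro") ** 2    # ||Y^T X||_F^2
    den = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return (num / den).item()

def merge_groups(layer_outputs: list[torch.Tensor], threshold: float = 0.9) -> list[list[int]]:
    """Greedily group consecutive layers whose outputs remain CKA-similar.

    `layer_outputs[i]` holds calibration activations after layer i (hypothetical input).
    Each returned group is a candidate span to collapse into a single layer.
    """
    groups, current = [], [0]
    for i in range(1, len(layer_outputs)):
        if linear_cka(layer_outputs[current[0]], layer_outputs[i]) >= threshold:
            current.append(i)          # still similar to the window's anchor: extend
        else:
            groups.append(current)     # similarity dropped: close the window
            current = [i]
    groups.append(current)
    return groups
```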
-
Why do LLMs attend to the first token?
This paper argues that attention sinks in LLMs, particularly at the first token, serve as a useful mechanism to prevent over-mixing of information in deep Transformers. The argument is supported by theoretical insights and by empirical evidence from Gemma 7B, LLaMa 3.1 models, and pre-training experiments, which show stronger sinks in larger models and with longer contexts.
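A common way to quantify an attention sink is the share of post-softmax attention mass that queries place on the first token. The sketch below computes that quantity for one layer's attention weights; the tensor shapes and the toy random scores are illustrative assumptions, not the paper's measurement setup.

```python
import torch

def sink_fraction(attn: torch.Tensor) -> float:
    """Average attention mass assigned to the first (sink) token.

    `attn` has shape (n_heads, seq_len, seq_len), with each row summing to 1
    (post-softmax attention weights of a single layer).
    """
    return attn[:, :, 0].mean().item()

# Toy usage: random causal attention weights standing in for a real forward pass.
n_heads, seq_len = 8, 16
scores = torch.randn(n_heads, seq_len, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)
print(f"attention mass on token 0: {sink_fraction(attn):.3f}")
```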
-
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
This paper investigates inter-layer communication in Transformer LMs by identifying low-rank communication channels via SVD. Interventions on these channels demonstrate their causal role in prompt sensitivity and significantly improve performance on context retrieval tasks such as the Laundry List task.
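The generic recipe hinted at in the summary is to form the composition of a "writing" head's output matrix with a "reading" head's query matrix and keep the top singular directions as a candidate low-rank channel. The sketch below does exactly that; the shape conventions, the choice of matrices, and the rank are assumptions for illustration, not the paper's exact setup.

```python
import torch

def communication_channel(w_o: torch.Tensor, w_q: torch.Tensor, k: int = 2):
    """Rank-k SVD approximation of how an earlier head's writes are read by a later head.

    One possible convention: w_o is (d_head, d_model), mapping the earlier head's output
    into the residual stream; w_q is (d_model, d_head), mapping the residual stream into
    the later head's query space. Their product is the head-to-head interaction matrix.
    """
    interaction = w_o @ w_q                               # (d_head, d_head)
    u, s, vh = torch.linalg.svd(interaction)
    channel = u[:, :k] @ torch.diag(s[:k]) @ vh[:k, :]    # keep top-k singular directions
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()       # fraction of squared singular mass
    return channel, explained.item()

# Toy usage with random weights standing in for trained head matrices.
d_model, d_head = 64, 16
w_o = torch.randn(d_head, d_model) / d_model ** 0.5
w_q = torch.randn(d_model, d_head) / d_model ** 0.5
channel, frac = communication_channel(w_o, w_q, k=2)
print(f"rank-2 channel captures {frac:.2%} of the interaction")
```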
-
CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
This paper proposes CoLA and its memory-efficient variant CoLA-M, which replace the full-size MLP and projection layers of LLMs with low-rank auto-encoders, achieving a 2x reduction in model size and compute cost while preserving full-rank performance and substantially improving throughput in both training and inference.
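To make the idea concrete, here is a generic low-rank auto-encoder layer in the spirit of the summary: a dense projection is replaced by an encode step down to rank r, a nonlinearity, and a decode step back up, so parameters and FLOPs scale with r rather than with the full matrix. The layer below is an illustrative stand-in, not CoLA's published architecture; dimensions, activation, and rank are assumed values.

```python
import torch
import torch.nn as nn

class LowRankAutoencoderLayer(nn.Module):
    """Illustrative low-rank replacement for a dense projection (d_in -> d_out)."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.encode = nn.Linear(d_in, rank, bias=False)   # compress activations to rank r
        self.act = nn.SiLU()                              # nonlinearity between the factors
        self.decode = nn.Linear(rank, d_out, bias=False)  # expand back to the output width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.act(self.encode(x)))

# Rough parameter comparison for one square projection (hypothetical sizes).
d_model, rank = 4096, 512
dense_params = d_model * d_model
low_rank_params = 2 * d_model * rank
print(f"dense: {dense_params:,}  low-rank: {low_rank_params:,}  "
      f"ratio: {dense_params / low_rank_params:.1f}x")
```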
-
Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration
This paper proposes Layer-wise Optimal Task vector merging (LOT Merging), which optimizes the model merging process by minimizing feature drift. It significantly outperforms training-free baselines on vision and vision-language tasks, improving average accuracy by up to 4.4%.
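The basic mechanics of layer-wise task vector merging can be sketched as follows: for each parameter tensor, a task vector is the difference between a finetuned model and the base model, and the merged weight is the base plus a per-layer weighted sum of task vectors. The drift-minimizing choice of coefficients is the paper's contribution and is not reproduced here; the coefficients, names, and shapes below are illustrative assumptions.

```python
import torch

def layer_wise_task_vector_merge(
    base: dict[str, torch.Tensor],
    finetuned: list[dict[str, torch.Tensor]],
    coeffs: dict[str, list[float]],
) -> dict[str, torch.Tensor]:
    """Merge finetuned models into the base via per-layer weighted task vectors.

    For parameter `name`: t_i = finetuned_i[name] - base[name], and the merged
    weight is base[name] + sum_i coeffs[name][i] * t_i. Coefficients are given
    externally here (LOT Merging would derive them by minimizing feature drift).
    """
    merged = {}
    for name, w_base in base.items():
        task_vectors = [ft[name] - w_base for ft in finetuned]
        merged[name] = w_base + sum(c * t for c, t in zip(coeffs[name], task_vectors))
    return merged

# Toy usage with two "finetuned" models and uniform per-layer coefficients.
base = {"layer0.weight": torch.zeros(4, 4), "layer1.weight": torch.zeros(4, 4)}
ft_a = {k: v + 1.0 for k, v in base.items()}
ft_b = {k: v - 0.5 for k, v in base.items()}
coeffs = {name: [0.5, 0.5] for name in base}
merged = layer_wise_task_vector_merge(base, [ft_a, ft_b], coeffs)
print(merged["layer0.weight"][0, 0].item())   # 0.25 = 0.5*1.0 + 0.5*(-0.5)
```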