Tag: Transformer
All the articles with the tag "Transformer".
-
ATLAS: Learning to Optimally Memorize the Context at Test Time
This paper proposes Atlas, a high-capacity long-term memory module that optimizes context memorization with a sliding-window Omega rule and the Muon optimizer, significantly outperforming Transformers and modern RNNs on language modeling and long-context understanding tasks.
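A minimal sketch of the general idea of test-time memorization over a sliding window: a small memory network is updated online so that it maps the keys of the most recent tokens to their values. The class name, plain SGD (standing in for Muon), window size, and other hyperparameters below are illustrative assumptions, not the paper's Omega rule or implementation.

```python
# Illustrative sketch only: a memory module optimized at test time to
# reconstruct the values of the last `window` tokens from their keys.
import torch
import torch.nn as nn

class SlidingWindowMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 256, window: int = 64, lr: float = 1e-2):
        super().__init__()
        # High-capacity memory: a small MLP mapping keys to values.
        self.memory = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.window = window
        self.lr = lr

    @torch.enable_grad()
    def memorize(self, keys: torch.Tensor, values: torch.Tensor, steps: int = 1):
        """Update memory parameters on the most recent `window` (key, value) pairs."""
        k, v = keys[-self.window:], values[-self.window:]
        opt = torch.optim.SGD(self.memory.parameters(), lr=self.lr)  # stand-in for Muon
        for _ in range(steps):
            opt.zero_grad()
            loss = ((self.memory(k) - v) ** 2).mean()  # window-wide reconstruction loss
            loss.backward()
            opt.step()
        return loss.detach()

    def recall(self, queries: torch.Tensor) -> torch.Tensor:
        return self.memory(queries)
```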
-
Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
This paper introduces a fine-tuning strategy for LLMs that exploits the unequal importance of the attention matrices and assigns them customized learning rates to improve efficiency. Theoretical analysis and GLUE benchmark experiments show that fine-tuning only Wq and Wv, with a higher learning rate for Wv, matches or exceeds full fine-tuning while updating far fewer parameters.
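A minimal sketch of this kind of setup, assuming a Hugging Face-style model whose attention projections are named `q_proj` and `v_proj`: freeze everything else and give the value projection a larger learning rate than the query projection. The naming convention and learning-rate values are assumptions, not the paper's prescription.

```python
# Sketch: train only Wq and Wv, with a higher learning rate for Wv.
import torch

def build_optimizer(model, lr_q: float = 1e-5, lr_v: float = 1e-4):
    q_params, v_params = [], []
    for name, p in model.named_parameters():
        if "q_proj" in name:
            q_params.append(p)
        elif "v_proj" in name:
            v_params.append(p)
        else:
            p.requires_grad_(False)  # freeze all other weights
    return torch.optim.AdamW([
        {"params": q_params, "lr": lr_q},
        {"params": v_params, "lr": lr_v},  # larger step size for Wv
    ])
```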
-
How Well Can a Long Sequence Model Model Long Sequences? Comparing Architechtural Inductive Biases on Long-Context Abilities
Through controlled comparisons, this paper shows that although long-sequence models such as Mamba2 theoretically support unbounded context, in practice they face significant limitations on long-context tasks just as Transformers do, degrading especially when the position of the relevant information or the data format changes; the underlying causes call for further study.
-
TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
This paper proposes a framework based on multi-head tensorisation and Tucker decomposition that performs structured denoising and compression of the multi-head attention weights of large language models by enforcing a shared higher-dimensional subspace across heads, significantly improving reasoning ability while achieving compression ratios of up to 247x.
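A minimal sketch of the underlying operation, not the paper's exact procedure: stack per-head weight matrices into a 3-way tensor and apply a Tucker decomposition with TensorLy so the heads share low-rank factor subspaces. The shapes and ranks below are illustrative assumptions.

```python
# Sketch: Tucker-decompose stacked per-head attention weights.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

num_heads, d_model, d_head = 12, 768, 64
W_v = np.random.randn(num_heads, d_model, d_head)  # e.g. per-head value weights

core, factors = tucker(tl.tensor(W_v), rank=[6, 128, 32])  # shared subspaces across heads
W_v_denoised = tl.tucker_to_tensor((core, factors))        # low-rank reconstruction

# Compression ratio: parameters in (core + factors) vs. the original tensor.
orig = int(np.prod(W_v.shape))
comp = int(np.prod(core.shape)) + sum(int(np.prod(f.shape)) for f in factors)
print(f"compression ratio ~ {orig / comp:.1f}x")
```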
-
LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection
LENSLLM introduces a Hessian-based PAC-Bayes framework and an NTK-based scaling model for LLM selection, achieving up to 91.1% selection accuracy and up to an 88.5% reduction in computational cost by modeling fine-tuning dynamics across diverse tasks.
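A minimal sketch of the general selection idea rather than LENSLLM's actual estimator: fit each candidate model's pilot fine-tuning losses to a saturating power law in the number of training samples, extrapolate to the full data budget, and keep the model with the lowest predicted loss. The functional form, parameter names, and use of `curve_fit` here are assumptions.

```python
# Sketch: rank candidate LLMs by extrapolating a fitted fine-tuning loss curve.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    # L(n) ~ a * n^(-b) + c, with c the irreducible loss (assumed form).
    return a * np.power(n, -b) + c

def select_model(observations: dict, full_budget: int) -> str:
    """observations: {model_name: (sample_sizes, test_losses)} from small pilot runs."""
    predicted = {}
    for name, (ns, losses) in observations.items():
        params, _ = curve_fit(scaling_law, np.asarray(ns, float), np.asarray(losses, float),
                              p0=[1.0, 0.5, 0.1], maxfev=10000)
        predicted[name] = scaling_law(full_budget, *params)
    return min(predicted, key=predicted.get)
```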