Tag: Transformer
All the articles with the tag "Transformer".
-
Using the geometric properties of hidden states (separability and alignment), this paper proposes a unified framework that reveals a two-stage mechanism behind in-context learning (ICL) on classification tasks: early layers enhance separability via previous-token heads (PTH), while later layers refine alignment via induction heads (IH). The framework also explains why task vectors are effective.
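A minimal sketch (not the paper's code) of how such layer-wise geometry could be probed, assuming access to hidden states of labeled demonstrations; the concrete metrics here, a Fisher-style separability ratio and the cosine between the class-mean difference and a readout direction, are illustrative stand-ins for the paper's definitions.

```python
import numpy as np

def separability(h, y):
    """Between-class vs. within-class scatter of hidden states h (n, d) with binary labels y."""
    mu0, mu1 = h[y == 0].mean(0), h[y == 1].mean(0)
    between = np.sum((mu1 - mu0) ** 2)
    within = h[y == 0].var(0).sum() + h[y == 1].var(0).sum()
    return between / (within + 1e-8)

def alignment(h, y, readout_dir):
    """Cosine between the class-mean difference and a readout direction (e.g. an unembedding row)."""
    diff = h[y == 1].mean(0) - h[y == 0].mean(0)
    return float(diff @ readout_dir / (np.linalg.norm(diff) * np.linalg.norm(readout_dir) + 1e-8))

# Toy example: synthetic "hidden states" for two classes at one layer.
rng = np.random.default_rng(0)
h = np.concatenate([rng.normal(0.0, 1.0, (32, 64)), rng.normal(0.5, 1.0, (32, 64))])
y = np.array([0] * 32 + [1] * 32)
readout = rng.normal(size=64)
print(separability(h, y), alignment(h, y, readout))
```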
-
An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits
This paper demonstrates that fine-tuning large language models to 1.58-bit ternary weights using extra RMSNorm layers and a gradual quantization schedule achieves superior cross-entropy loss and preserves reasoning performance, enabling deployment on commodity hardware without relying on complex knowledge distillation.
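A hedged sketch of the two core ingredients named in the summary: ternary weight quantization behind a straight-through estimator, and an extra RMSNorm in front of each quantized linear layer. The module names, per-tensor scale rule, and norm placement are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Minimal RMSNorm: root-mean-square normalization with a learned gain."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class TernaryLinear(nn.Module):
    """Linear layer with weights quantized to {-1, 0, +1} and a per-tensor scale,
    preceded by an extra RMSNorm on its input."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.norm = RMSNorm(in_features)              # the "extra" normalization
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def quantize(self, w):
        scale = w.abs().mean()                        # per-tensor scale (assumed rule)
        return torch.clamp(torch.round(w / (scale + 1e-8)), -1, 1) * scale

    def forward(self, x):
        x = self.norm(x)
        w_q = self.quantize(self.weight)
        # Straight-through estimator: forward uses ternary weights, backward sees full precision.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w)

layer = TernaryLinear(64, 128)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 128])
```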
-
Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data
Through large-scale experiments, this paper shows that small pre-trained Transformer models, even when their parameter count is charged against the compressed size, achieve compression ratios competitive with standard compression algorithms on out-of-distribution text, image, and audio data. They perform especially well within the training modality, but cross-modal transfer remains weak.
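A small worked example of the underlying idea, assuming the usual setup where a model's next-byte probabilities drive an arithmetic coder: the achievable compressed size is roughly -sum log2 p(x_t | x_<t) bits, and the model's parameter count is added to that total when comparing against classical compressors. The probabilities below are toy stand-ins, not outputs of a real pretrained Transformer.

```python
import math

def code_length_bits(probs_per_step):
    """Ideal code length (in bits) of a sequence under a model's next-byte probabilities.
    With arithmetic coding, compressed size approaches -sum(log2(p)) over the sequence."""
    return -sum(math.log2(p) for p in probs_per_step)

# Toy stand-in: a model that is fairly confident (p = 0.9) on in-modality bytes
# and nearly uniform (p = 1/256) on out-of-distribution bytes.
in_modality = [0.9] * 1000
out_of_dist = [1 / 256] * 1000
print(code_length_bits(in_modality) / 8, "bytes vs 1000 raw bytes")   # strong compression
print(code_length_bits(out_of_dist) / 8, "bytes vs 1000 raw bytes")   # no compression at all
```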
-
Round and Round We Go! What makes Rotary Positional Encodings useful?
Through theoretical and empirical analysis, this paper shows how Rotary Positional Encodings (RoPE) work in large language models: high frequencies build positional attention patterns while low frequencies carry semantic information. It further proposes p-RoPE, which truncates the lowest frequencies to improve long-context robustness and yields performance gains on a Gemma 2B model.
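A brief sketch of the mechanism described above, assuming the standard RoPE frequency schedule theta_i = base^(-2i/d); the p-RoPE part simply drops the lowest rotary frequencies, keeping a fraction p of the spectrum, and the exact cutoff rule here is an assumption rather than the paper's specification.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE: one rotation frequency per 2-D channel pair, sorted high -> low."""
    return base ** (-np.arange(0, dim, 2) / dim)

def p_rope_frequencies(dim, p, base=10000.0):
    """p-RoPE-style truncation: keep only a fraction p of the frequencies,
    discarding the lowest ones (assumed cutoff rule, for illustration)."""
    freqs = rope_frequencies(dim, base)
    keep = int(np.ceil(p * len(freqs)))
    return freqs[:keep]          # freqs are sorted high -> low, so this drops the low end

def rotate(x, pos, freqs):
    """Rotate the first 2 * len(freqs) channels of vector x by position-dependent angles."""
    out = x.copy()
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0:2 * len(freqs):2], x[1:2 * len(freqs):2]
    out[0:2 * len(freqs):2] = x1 * cos - x2 * sin
    out[1:2 * len(freqs):2] = x1 * sin + x2 * cos
    return out

head_dim = 64
x = np.random.default_rng(0).normal(size=head_dim)
print(rotate(x, pos=5, freqs=rope_frequencies(head_dim)).shape)   # (64,)
print(len(p_rope_frequencies(head_dim, p=0.75)))                  # 24 of 32 frequencies kept
```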
-
Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling
This paper proposes Grouped Cross Attention (GCA), which achieves length generalization in Transformers through differentiable retrieval and dynamic context selection, reaching perfect passkey-retrieval accuracy at a 16M-token context length while substantially reducing compute and memory costs.
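A simplified sketch in the spirit of GCA, not the paper's implementation: the past is split into fixed-size chunks, chunks are scored against the current query, and attention runs only over the tokens of the top-k retrieved chunks, so the attended length stays constant as the context grows. Function names and the chunk-scoring rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def grouped_cross_attention(q, past_k, past_v, chunk_size=16, top_k=2):
    """Retrieve top-k chunks of the past and cross-attend only to their tokens.
    q: (d,), past_k / past_v: (T, d). A simplified stand-in for GCA."""
    T, d = past_k.shape
    n_chunks = T // chunk_size
    k_chunks = past_k[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    v_chunks = past_v[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)

    # Retrieval: score each chunk by its mean key and keep the top-k
    # (the paper makes this retrieval step differentiable).
    chunk_scores = k_chunks.mean(dim=1) @ q
    top = torch.topk(chunk_scores, k=top_k).indices

    # Attention only within the selected chunks: cost scales with top_k * chunk_size, not T.
    k_sel = k_chunks[top].reshape(-1, d)
    v_sel = v_chunks[top].reshape(-1, d)
    attn = F.softmax(k_sel @ q / d ** 0.5, dim=0)
    return attn @ v_sel

torch.manual_seed(0)
past_k, past_v = torch.randn(256, 32), torch.randn(256, 32)
print(grouped_cross_attention(torch.randn(32), past_k, past_v).shape)  # torch.Size([32])
```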