Tag: Pre-training
All the articles with the tag "Pre-training".
-
TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs
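This paper proposes TeLLMe, an energy-efficient ternary LLM accelerator for FPGAs that uses a table-lookup matrix engine and a reverse attention optimization to support both the prefill and decoding stages, achieving up to 9.51 tokens/s throughput and low prefill latency at 7 W of power.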
-
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
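This paper uses synthetic data generated from context-free grammars to study how metadata conditioning affects language model pre-training, finding that it helps on tasks with long prompts but hurts on tasks with short prompts, which reveals a trade-off in latent semantics inference.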
-
Hierarchical Attention Generates Better Proofs
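This paper proposes a hierarchical attention regularization method that guides the attention mechanism of large language models to align with a five-level hierarchy of mathematical reasoning, improving proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while significantly reducing proof complexity.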
-
RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization
RWKVQuant introduces a post-training quantization framework tailored to RWKV models. It uses a coarse-to-fine proxy to hybridize scalar and vector quantization and optimizes codebooks for element-wise operations, achieving ~3-bit quantization with minimal accuracy loss and significant memory and speed improvements.
-
Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
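This paper proposes Mixture of Sparse Attention (MoSA), which uses expert-choice routing to implement content-based sparse attention, significantly improving the language modeling performance of Transformer models under the same compute budget while optimizing resource usage.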