Tag: Pre-training
All the articles with the tag "Pre-training".
-
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
This paper presents MegaScale-Infer, a system that optimizes inference efficiency for large-scale MoE models by disaggregating the parallelism strategies of the attention and FFN modules and by using an efficient M2N communication library, achieving up to a 1.90x throughput improvement.
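Below is a minimal Python sketch of the disaggregation idea described above, not the MegaScale-Infer implementation: M attention workers and N FFN/expert workers form separate pools, and routed token activations are dispatched M-to-N between them. Worker counts, shapes, and function names are illustrative assumptions.

```python
# Minimal sketch (not the MegaScale-Infer implementation): simulate disaggregated
# serving where M attention workers and N FFN/expert workers form separate pools,
# and token activations are routed M-to-N between them.
import numpy as np

M_ATTN_WORKERS = 2    # attention pool size (assumption)
N_EXPERT_WORKERS = 4  # FFN/expert pool size (assumption)
HIDDEN = 8
TOKENS_PER_WORKER = 6

rng = np.random.default_rng(0)

def attention_stage(worker_id):
    """Stand-in for the attention module: produce per-token activations."""
    return rng.normal(size=(TOKENS_PER_WORKER, HIDDEN))

def route(activations):
    """Top-1 router: pick an expert id per token (stand-in for the MoE gate)."""
    logits = rng.normal(size=(activations.shape[0], N_EXPERT_WORKERS))
    return logits.argmax(axis=1)

def expert_stage(expert_id, activations):
    """Stand-in for an FFN expert hosted on its own worker."""
    w = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
    return np.maximum(activations @ w, 0.0)

# M2N dispatch: each attention worker sends its tokens to the experts they were
# routed to; each expert worker batches everything it received before running.
inbox = {e: [] for e in range(N_EXPERT_WORKERS)}
for a in range(M_ATTN_WORKERS):
    acts = attention_stage(a)
    experts = route(acts)
    for e in range(N_EXPERT_WORKERS):
        selected = acts[experts == e]
        if len(selected):
            inbox[e].append(selected)

for e, chunks in inbox.items():
    if chunks:
        batch = np.concatenate(chunks, axis=0)
        out = expert_stage(e, batch)
        print(f"expert worker {e}: processed {out.shape[0]} tokens")
```

Keeping the two pools separate lets each side choose its own parallelism strategy and batch size, which is what motivates the M-to-N communication pattern between them.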
-
TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs
This paper presents TeLLMe, an energy-efficient ternary LLM accelerator for edge FPGAs that supports both the prefilling and decoding stages through a table-lookup matrix engine and a reversed-attention optimization, achieving up to 9.51 tokens/s throughput and low prefill latency within a 7W power budget.
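As a rough software analogue of the table-lookup matrix engine (an illustration of the general LUT idea, not TeLLMe's FPGA design): with ternary weights in {-1, 0, +1}, each group of activations can be reduced to precomputed subset sums, so a weight row's contribution becomes two table lookups per group instead of multiplies. The group size G and all shapes below are assumptions.

```python
# Minimal sketch (not the TeLLMe hardware design): a software analogue of a
# table-lookup matrix engine for ternary weights {-1, 0, +1}. For each group of
# G activations we precompute all 2^G subset sums once, then each row's group
# contribution becomes two lookups (one for the +1 mask, one for the -1 mask).
import numpy as np

G = 4  # activations per lookup group (assumption)

def build_luts(x, g=G):
    """Precompute subset sums for each group of g activations."""
    assert x.size % g == 0
    groups = x.reshape(-1, g)
    luts = np.zeros((groups.shape[0], 1 << g))
    for idx in range(1 << g):
        mask = np.array([(idx >> b) & 1 for b in range(g)], dtype=float)
        luts[:, idx] = groups @ mask
    return luts

def ternary_matvec_lut(w_ternary, x, g=G):
    """y = W @ x using per-group lookups instead of multiplies."""
    luts = build_luts(x, g)
    out = np.zeros(w_ternary.shape[0])
    for row in range(w_ternary.shape[0]):
        w_groups = w_ternary[row].reshape(-1, g)
        acc = 0.0
        for gi, wg in enumerate(w_groups):
            pos_idx = sum(1 << b for b in range(g) if wg[b] == 1)
            neg_idx = sum(1 << b for b in range(g) if wg[b] == -1)
            acc += luts[gi, pos_idx] - luts[gi, neg_idx]
        out[row] = acc
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(3, 16)).astype(float)  # ternary weights
x = rng.normal(size=16)
print(np.allclose(ternary_matvec_lut(W, x), W @ x))  # True
```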
-
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
Using synthetic data generated from context-free grammars, this paper studies the effect of metadata conditioning in language model pre-training, finding that it helps tasks with long prompts but hurts tasks with short prompts, revealing a trade-off in inferring latent semantics.
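A minimal sketch of what metadata conditioning typically means in practice (an assumption about the setup, not the paper's exact pipeline): a metadata tag is prepended to each pre-training sequence and masked out of the loss, so the model can condition on it when present but is never trained to predict it. The tag ids and helper below are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's pipeline): prepend a metadata
# prefix (e.g., a source/grammar tag) to each pre-training sequence and mask the
# prefix out of the loss.
from typing import List, Tuple

IGNORE_INDEX = -100          # label value ignored by typical LM losses
METADATA_TAGS = {"grammar_A": [32000], "grammar_B": [32001]}  # assumed tag ids

def build_example(tokens: List[int], tag: str) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with the metadata prefix masked in the loss."""
    prefix = METADATA_TAGS[tag]
    input_ids = prefix + tokens
    labels = [IGNORE_INDEX] * len(prefix) + tokens
    return input_ids, labels

ids, labels = build_example([5, 8, 13, 21], tag="grammar_A")
print(ids)     # [32000, 5, 8, 13, 21]
print(labels)  # [-100, 5, 8, 13, 21]
```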
-
Hierarchical Attention Generates Better Proofs
This paper proposes a hierarchical attention regularization method that guides the attention of large language models to align with a five-level hierarchy of mathematical reasoning, improving proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while significantly reducing proof complexity.
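A minimal sketch of one way such a regularizer could look (my assumption of the general shape, not the paper's exact formulation): given per-token hierarchy levels, attention mass that falls outside the allowed level structure is penalized, and the penalty would be added to the training loss with a small weight. Level labels and the penalty form are illustrative.

```python
# Minimal sketch (assumed form, not the paper's exact regularizer): build a mask
# of "allowed" query->key pairs (same level or an enclosing, outer level, where a
# smaller number means a higher level) and penalize attention mass outside it.
import numpy as np

def hierarchy_mask(levels):
    """allowed[q, k] = 1 if key k is at the same level as q or an outer one."""
    levels = np.asarray(levels)
    return (levels[None, :] <= levels[:, None]).astype(float)

def hierarchy_penalty(attn, levels):
    """Mean attention mass each query places on disallowed keys."""
    allowed = hierarchy_mask(levels)
    return float((attn * (1.0 - allowed)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
levels = [0, 1, 1, 2, 2, 2, 3, 4]          # toy 5-level hierarchy labels per token
logits = rng.normal(size=(8, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax rows

print(round(hierarchy_penalty(attn, levels), 3))
# In training, this penalty would be added to the LM loss with a small weight.
```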
-
RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization
RWKVQuant introduces a post-training quantization framework tailored to RWKV models: a coarse-to-fine proxy decides between scalar and vector quantization, and codebooks are optimized for element-wise operations, achieving ~3-bit quantization with minimal accuracy loss and significant memory and speed improvements.
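A minimal sketch of the proxy-guided hybrid idea (assumptions throughout, not the RWKVQuant implementation): a coarse proxy flags weight tensors that scalar quantization handles poorly, a finer check measures the actual scalar-quantization error, and only the hard cases fall back to a vector-quantization codebook. The thresholds, proxies, and bit widths below are illustrative.

```python
# Minimal sketch (assumptions throughout, not the RWKVQuant implementation):
# a coarse-to-fine proxy decides, per weight tensor, whether cheap scalar
# (uniform) quantization suffices or a vector-quantization codebook is needed.
import numpy as np

def scalar_quant(w, bits=3):
    """Per-tensor uniform (scalar) quantization."""
    lo, hi = w.min(), w.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale)
    return q * scale + lo

def vector_quant(w, dim=4, k=16, iters=10):
    """Tiny k-means codebook over sub-vectors of length `dim`."""
    vecs = w.reshape(-1, dim)
    codebook = vecs[np.random.default_rng(0).choice(len(vecs), k, replace=False)]
    for _ in range(iters):
        assign = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (assign == c).any():
                codebook[c] = vecs[assign == c].mean(0)
    return codebook[assign].reshape(w.shape)

def quantize_with_proxy(w, coarse_thresh=10.0, fine_thresh=1e-2):
    """Coarse proxy: outlier score; fine proxy: actual scalar-quant error."""
    z = (w - w.mean()) / (w.std() + 1e-8)
    coarse = np.mean(z ** 4)              # heavy tails -> scalar quant struggles
    if coarse < coarse_thresh:
        sq = scalar_quant(w)
        if np.mean((w - sq) ** 2) < fine_thresh:
            return sq, "scalar"
    return vector_quant(w), "vector"

rng = np.random.default_rng(1)
smooth = rng.uniform(-1, 1, size=(64, 64))
spiky = rng.standard_t(df=2, size=(64, 64))
for name, w in [("smooth", smooth), ("spiky", spiky)]:
    wq, mode = quantize_with_proxy(w)
    print(name, mode, round(float(np.mean((w - wq) ** 2)), 5))
```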