Tag: Pre-training
All the articles with the tag "Pre-training".
-
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
This paper introduces TiC-LM, a web-scale benchmark for time-continual LLM pretraining using 114 Common Crawl dumps, demonstrating that replay and autoregressive schedules can match Oracle retraining on general web data with less compute, though trade-offs persist across domains.
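A minimal sketch of what a replay schedule for time-continual pretraining might look like; the dump structure, uniform replay sampler, and replay ratio are illustrative assumptions, not the benchmark's actual implementation:

```python
import random

def build_training_mix(dumps, current_idx, replay_ratio=0.5, batch_size=8, seed=0):
    """Sample a batch that mixes the newest dump with replayed older dumps.

    dumps: list of document lists, ordered oldest -> newest (illustrative).
    replay_ratio: fraction of each batch drawn from previously seen dumps.
    """
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_ratio) if current_idx > 0 else 0
    n_new = batch_size - n_replay

    batch = [rng.choice(dumps[current_idx]) for _ in range(n_new)]
    for _ in range(n_replay):
        past = rng.randrange(current_idx)  # uniform over earlier dumps
        batch.append(rng.choice(dumps[past]))
    rng.shuffle(batch)
    return batch

# Toy usage: three "dumps" of documents, continuing training on the latest one.
dumps = [[f"doc-{t}-{i}" for i in range(100)] for t in range(3)]
print(build_training_mix(dumps, current_idx=2))
```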
-
On the Generalization vs Fidelity Paradox in Knowledge Distillation
Through a large-scale empirical analysis, this paper shows that knowledge distillation (KD) substantially improves the zero-shot reasoning performance of small language models (by up to 10%) but yields limited gains for larger models, and that these performance gains are decoupled from reasoning fidelity, underscoring the importance of task expertise and moderate parameter tuning.
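For reference, a minimal sketch of the standard KD objective being studied: cross-entropy on the labels plus a temperature-scaled KL term toward the teacher. The temperature and mixing weight below are illustrative defaults, not values from the paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD loss: CE on hard labels + KL to the temperature-softened teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # rescale so the KL term's gradients match the CE term
    return alpha * ce + (1 - alpha) * kl

# Toy usage with random logits: batch of 4 examples, vocabulary of 10.
s = torch.randn(4, 10, requires_grad=True)
t = torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(kd_loss(s, t, y).item())
```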
-
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
This paper introduces MiMo-7B, a 7B-parameter LLM optimized for reasoning through innovative pre-training with reasoning-dense data and multi-token prediction, and post-training with RL using test-difficulty-driven rewards, achieving superior performance over larger models and OpenAI o1-mini on mathematics and coding benchmarks.
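A hypothetical sketch of how a test-difficulty-driven reward for code RL could weight unit tests, with harder tests (lower empirical pass rate) earning more credit; the difficulty estimate and scoring rule are assumptions for illustration, not MiMo's exact recipe:

```python
def test_difficulty_reward(passed, pass_rates, eps=1e-6):
    """Weight each unit test by estimated difficulty (1 - empirical pass rate),
    so partially correct solutions to hard problems still receive graded credit.

    passed: list of bools, whether this rollout passed each test.
    pass_rates: fraction of sampled rollouts passing each test (difficulty proxy).
    """
    weights = [1.0 - r + eps for r in pass_rates]  # harder test -> larger weight
    total = sum(weights)
    return sum(w for w, ok in zip(weights, passed) if ok) / total

# Toy usage: three tests; the third is rarely passed and therefore worth more.
print(test_difficulty_reward(passed=[True, True, False], pass_rates=[0.9, 0.8, 0.1]))
print(test_difficulty_reward(passed=[False, False, True], pass_rates=[0.9, 0.8, 0.1]))
```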
-
Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation
Using Gaussian mixture simulations and large-scale language model experiments, this paper shows that knowledge distillation in generative models works by letting the teacher's entropy control the student's precision-recall trade-off, thereby improving sample quality.
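A toy illustration of the entropy/precision-recall intuition with a 1-D Gaussian mixture "teacher": sharpening the mixture weights with a temperature lowers the teacher's entropy, concentrating samples on dominant modes (higher precision) while losing coverage of rare modes (lower recall). The mixture parameters and temperature mechanism here are made-up stand-ins, not the paper's setup:

```python
import numpy as np

def sample_teacher(n, weights, means, std, temperature, rng):
    """Sample a 1-D Gaussian mixture whose entropy is lowered by
    temperature-sharpening the mixture weights."""
    w = np.asarray(weights) ** (1.0 / temperature)
    w /= w.sum()
    comps = rng.choice(len(w), size=n, p=w)
    return rng.normal(loc=np.asarray(means)[comps], scale=std)

rng = np.random.default_rng(0)
means, weights = [-4.0, 0.0, 4.0], [0.6, 0.3, 0.1]

for T in (1.0, 0.3):
    x = sample_teacher(10_000, weights, means, std=1.0, temperature=T, rng=rng)
    rare_mode_mass = np.mean(np.abs(x - 4.0) < 2.0)
    print(f"T={T}: fraction of samples near the rare mode = {rare_mode_mass:.3f}")
```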
-
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
This paper investigates zero RL training on diverse open base models, achieving significant accuracy and response length improvements while identifying key factors like reward design and data difficulty that influence the emergence of reasoning behaviors.
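A minimal sketch of the kind of rule-based reward used in zero RL training on math problems: score 1.0 if the extracted final answer matches the reference, else 0.0. The extraction regex and fallback are illustrative assumptions, not SimpleRL-Zoo's code:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Correctness-only reward: match the model's final answer against the reference."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m is None:  # fall back to the last number mentioned in the response
        nums = re.findall(r"-?\d+(?:\.\d+)?", response)
        if not nums:
            return 0.0
        pred = nums[-1]
    else:
        pred = m.group(1).strip()
    return 1.0 if pred == gold_answer.strip() else 0.0

# Toy usage.
print(rule_based_reward("... so the answer is \\boxed{42}", "42"))  # 1.0
print(rule_based_reward("I think it is 41", "42"))                  # 0.0
```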