Tag: Pre-training
All the articles with the tag "Pre-training".
-
Large Vocabulary Size Improves Large Language Models
This article shows experimentally that a larger vocabulary size significantly improves the performance of monolingual large language models on English and Japanese tasks, and proposes a simple method for swapping the vocabulary during continued training to adapt a model to the target language, further improving performance.
-
Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
This paper introduces a taxonomy of language model memorization into recitation, reconstruction, and recollection, and shows through experiments with Pythia models that different factors drive each category; a taxonomy-based predictive model outperforms baselines at predicting the likelihood of memorization.
-
Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
This paper proposes ALM, a cross-tokenizer distillation method that transfers knowledge between models with different tokenizers via approximate likelihood matching; it is the first to achieve strong results in settings such as subword-to-byte-level transfer and outperforms existing methods across several application cases.
-
Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods
Through theoretical and empirical analysis, this paper shows that model ensembling mitigates overadaptation in supervised fine-tuning by balancing the bias-variance trade-off, improving downstream task performance and reducing forgetting of pre-trained knowledge.
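As a rough illustration of the ensembling idea mentioned above, the sketch below averages the output logits of several checkpoints. The function name `ensemble_logits` and the Hugging Face-style `.logits` attribute are assumptions for illustration only; the paper's exact ensembling scheme may differ.

```python
import torch

def ensemble_logits(models, input_ids):
    """Average next-token logits over a set of checkpoints.

    `models` might hold several SFT checkpoints plus the pre-trained model;
    averaging their predictions is one simple way to trade bias against
    variance. Assumes Hugging Face-style causal LMs whose forward pass
    returns an object with a `.logits` field (an assumption, not the
    paper's setup).
    """
    with torch.no_grad():
        logits = [m(input_ids).logits for m in models]
    return torch.stack(logits).mean(dim=0)
```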
-
Fine-tuning Quantized Neural Networks with Zeroth-order Optimization
This paper proposes Quantized Zeroth-order Optimization (QZO), which fine-tunes quantized neural networks with zeroth-order optimization by perturbing the quantization scale parameters and clipping the directional derivative, reducing memory usage by more than 18x and demonstrating substantial memory efficiency with moderate performance gains on LLMs and Stable Diffusion.
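The summary mentions perturbing the quantization scale parameters and clipping the directional derivative; below is a minimal sketch of what one SPSA-style zeroth-order step over the scales could look like. The function `qzo_step`, the `loss_fn` interface, and the hyperparameter values are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def qzo_step(scales, loss_fn, lr=1e-4, eps=1e-3, clip=1.0):
    """One zeroth-order update on the quantization scale parameters `scales`.

    `loss_fn(scales)` returns the scalar training loss with the frozen
    quantized weights dequantized using `scales`; only forward passes are
    needed, so no gradients are stored for the quantized weights.
    (All names and defaults here are hypothetical.)
    """
    # Random +/-1 perturbation direction (SPSA-style).
    z = (torch.randint(0, 2, scales.shape) * 2 - 1).to(scales.dtype)

    # Two forward passes give a finite-difference directional derivative.
    g = float(loss_fn(scales + eps * z) - loss_fn(scales - eps * z)) / (2 * eps)

    # Clip the directional derivative to keep the low-precision update stable
    # (the clipping step referred to in the summary).
    g = max(-clip, min(clip, g))

    # Move the scales along -z, weighted by the estimated directional derivative.
    return scales - lr * g * z
```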