Tag: Pre-training

All the articles with the tag "Pre-training".

CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation

Published: 24 May, 2025 at 11:14 AM

89.66 🤔

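This paper proposes CoLA and its memory-optimized variant CoLA-M, which replace the full-size MLP and projection layers of LLMs with low-rank auto-encoders, cutting model size and compute cost by 2x while retaining full-rank performance and significantly improving throughput in both training and inference.

Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More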

Published: 23 May, 2025 at 11:13 AM

89.28 🤔

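This paper proposes the MEAP training paradigm, which introduces a random masking strategy into next-token prediction, significantly improving the performance of large language models on key information retrieval and long-context reasoning tasks while preserving computational efficiency and architectural compatibility.

QKV Projections Require a Fraction of Their Memory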

Published: 5 Jun, 2025 at 11:22 AM

89.22 🤔

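This paper proposes PAMM, which approximates the input tensor by randomly selecting representative tokens, greatly reducing the memory footprint of the Q, K, and V projections in attention (by up to 512x) while largely maintaining model performance in both pre-training and fine-tuning.

Model Merging in Pre-training of Large Language Models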

Published: 21 May, 2025 at 11:14 AM

89.09 🤔

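This paper proposes the Pre-trained Model Averaging (PMA) strategy, which merges checkpoints from the pre-training stage to significantly improve large language model performance, predict annealing outcomes, and enhance training stability, offering a new method and practical guidelines for efficient model development.

Self-Data Distillation for Recovering Quality in Pruned Large Language Models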

Published: 17 May, 2025 at 11:19 PM

89.06 🤔

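This paper proposes self-data distilled fine-tuning, which uses the unpruned model to generate a distilled dataset that recovers the quality of pruned large language models. It significantly outperforms standard supervised fine-tuning on the HuggingFace OpenLLM Leaderboard v1, and further improves performance and efficiency through model merging and speculative decoding.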
