Tag: Pre-training

All the articles with the tag "Pre-training".

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Published: 8 May, 2025 at 06:17 PM

87.73 🤔

RADLADS introduces a cost-effective three-step distillation protocol to convert softmax attention transformers into linear attention models using only 350-700M tokens, achieving near-teacher performance on benchmarks and setting a new state-of-the-art for pure RNNs with models up to 72B parameters.
Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition

Published: 18 May, 2025 at 11:16 AM

87.67 🤔

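Using linear probing and neuron activation analysis, this paper reproduces and extends earlier work on the roles of pre-training and fine-tuning in knowledge acquisition for dense retrieval models. It finds that pre-trained knowledge dominates retrieval effectiveness in DPR models and that fine-tuning disperses that knowledge, but these conclusions do not hold for other architectures (such as Contriever and RepLlama) or representation strategies.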
How much do language models memorize?

Published: 3 Jun, 2025 at 11:44 AM

87.61 🤔

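This paper proposes an information-theoretic method for quantifying memorization that separates unintended memorization from generalization. It measures the capacity of GPT-style language models at roughly 3.6 bits per parameter and shows how the ratio of dataset size to model capacity affects double descent and membership inference performance.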
An Analysis for Reasoning Bias of Language Models with Small Initialization

Published: 25 May, 2025 at 11:52 AM

87.56 🤔

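Through theoretical analysis and experimental validation, this paper shows how a small parameter initialization scale, by shaping the embedding space and training dynamics, biases large language models toward reasoning tasks over memorization tasks.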
Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Published: 26 May, 2025 at 11:25 AM

87.52 🤔

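This paper proposes RLKD, a reinforcement learning-based knowledge distillation framework that uses a Generative Structure Reward Model (GSRM) to transfer the implicit multi-branch structure of the teacher model's reasoning to the student model. Experiments show that it significantly outperforms SFT and conventional RL methods on math and question-answering tasks.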
