Tag: Pre-training
All the articles with the tag "Pre-training".
-
Why do LLMs attend to the first token?
This paper argues that attention sinks in LLMs, particularly at the first token, are a useful mechanism to prevent over-mixing of information in deep Transformers, supported by theoretical insights and empirical evidence from Gemma 7B, LLaMa 3.1 models, and pre-training experiments showing stronger sinks with larger models and longer contexts.
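Not from the paper, but as a quick illustration of what an attention sink looks like in practice, the sketch below computes the average attention mass that queries place on the first key position, given one layer's attention weights (the shapes and toy input are made up):

```python
import torch

def first_token_attention_mass(attn: torch.Tensor) -> float:
    """attn: attention weights of shape (heads, query_len, key_len),
    e.g. one layer's softmax output. Returns the average probability
    mass that queries assign to the first key position."""
    # Column 0 holds each query's attention to the first token.
    return attn[:, :, 0].mean().item()

# Toy example: random scores, softmax over keys (causal mask omitted).
scores = torch.randn(8, 16, 16)          # (heads, queries, keys)
attn = scores.softmax(dim=-1)
print(f"mass on first token: {first_token_attention_mass(attn):.3f}")
```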
-
Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking
This paper explores parameter- and memory-efficient LLM pre-training through a survey, a benchmark, and two proposed techniques, weight re-decomposition and momentum reset, which substantially improve the performance of low-rank methods and reduce memory consumption, though they still do not fully match full-rank training.
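The paper's algorithms are not reproduced here; as a rough sketch of the momentum-reset idea alone, the snippet below clears Adam's optimizer state on a fixed schedule, standing in for the moment when a low-rank subspace is refreshed (the model, objective, and refresh interval are all placeholders):

```python
import torch

model = torch.nn.Linear(512, 512)                # stand-in for an LLM block
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
REFRESH_EVERY = 200                              # hypothetical subspace-refresh interval

for step in range(1000):
    loss = model(torch.randn(4, 512)).pow(2).mean()   # dummy objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) % REFRESH_EVERY == 0:
        # Momentum reset: the stored first/second moments refer to the old
        # low-rank subspace, so drop them when the subspace changes.
        optimizer.state.clear()
```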
-
LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades
This paper proposes LoRASuite, a modular method for adapting LoRA weights across large language model upgrades using transformation matrices, layer mapping, and attention-head mapping; it significantly outperforms small-scale LoRA fine-tuning on math and commonsense tasks, even surpasses full-scale retraining in some scenarios, and greatly reduces memory and time consumption.
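A rough sketch of the transformation-matrix step only, not LoRASuite's actual algorithm: fit a linear map between an old and an upgraded base weight by least squares, then use it to transplant the old LoRA factor onto the new model (all shapes are hypothetical):

```python
import torch

def fit_linear_map(W_old: torch.Tensor, W_new: torch.Tensor) -> torch.Tensor:
    """Least-squares solve for T such that W_old @ T ≈ W_new."""
    return torch.linalg.lstsq(W_old, W_new).solution

# Hypothetical shapes: the upgraded model changed its hidden size.
W_old = torch.randn(1024, 768)     # old base weight
W_new = torch.randn(1024, 896)     # corresponding weight after the upgrade
A_old = torch.randn(16, 768)       # old LoRA "A" factor (rank 16)

T = fit_linear_map(W_old, W_new)   # (768, 896)
A_new = A_old @ T                  # transplanted LoRA factor for the new model
print(A_new.shape)                 # torch.Size([16, 896])
```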
-
One-shot Entropy Minimization
This paper proposes one-shot entropy minimization (EM), which significantly improves the performance of large language models on mathematical reasoning tasks using only a single unlabeled example and 10 optimization steps, matching or surpassing conventional reinforcement learning methods.
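A minimal sketch of the entropy-minimization objective, assuming a Hugging Face causal LM; the model name, prompt, and learning rate are illustrative, and unlike the paper the entropy here is taken over the prompt's next-token distributions rather than over sampled responses:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"   # illustrative small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Solve step by step: what is 17 * 24?"   # the single unlabeled example
inputs = tok(prompt, return_tensors="pt")

for step in range(10):                      # ~10 optimization steps
    logits = model(**inputs).logits         # (1, seq_len, vocab)
    probs = logits.softmax(dim=-1)
    log_probs = logits.log_softmax(dim=-1)
    # Token-level entropy of the model's own predictive distribution.
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    entropy.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: entropy = {entropy.item():.4f}")
```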
-
Parallel Scaling Law for Language Models
This paper proposes parallel scaling (PARSCALE), which improves language model capability by increasing the number of parallel computation streams (P) at training and inference time; theory and experiments show that P streams are equivalent to scaling parameters by O(log P), and the method delivers higher inference efficiency in low-resource settings.
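A simplified sketch of the parallel-stream idea, not the paper's exact prefix transforms or dynamic aggregation: each of P streams perturbs the input with its own learned offset, the base network runs once per stream, and the outputs are mixed with learned weights:

```python
import torch
import torch.nn as nn

class ParallelStreams(nn.Module):
    """Wrap a base network with P parallel streams: each stream adds its own
    learned input offset, and the outputs are combined with learned weights.
    (A simplification of PARSCALE's prefix transforms and dynamic aggregation.)"""
    def __init__(self, base: nn.Module, d_model: int, p_streams: int):
        super().__init__()
        self.base = base
        self.offsets = nn.Parameter(torch.zeros(p_streams, d_model))
        self.mix = nn.Parameter(torch.zeros(p_streams))   # aggregation logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); run the base model once per stream.
        outs = torch.stack([self.base(x + off) for off in self.offsets])
        weights = self.mix.softmax(dim=0).view(-1, 1, 1, 1)
        return (weights * outs).sum(dim=0)

base = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
model = ParallelStreams(base, d_model=64, p_streams=4)
y = model(torch.randn(2, 8, 64))
print(y.shape)   # torch.Size([2, 8, 64])
```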