Tag: Pre-training
All the articles with the tag "Pre-training".
-
Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2
This paper demonstrates that applying Elastic Weight Consolidation (EWC) to full-parameter continual pre-training of the Gemma2 2B LLM mitigates catastrophic forgetting on English tasks while improving performance on Lithuanian language benchmarks during autoregressive pre-training on CulturaX data.
-
Latte: Transfering LLMs' Latent-level Knowledge for Few-shot Tabular Learning
The paper introduces Latte, a framework that transfers latent-level knowledge from Large Language Models during training to enhance few-shot tabular learning; it outperforms baselines by leveraging unlabeled data and mitigating overfitting across diverse classification and regression tasks.
-
Large Language Model Compression with Global Rank and Sparsity Optimization
This paper introduces a two-stage LLM compression method using RPCA for low-rank and sparse decomposition and probabilistic pruning via policy gradient, outperforming state-of-the-art techniques at a 50% compression ratio while automatically adapting to layer-wise redundancy without manual thresholds or extensive fine-tuning.
-
LLM-e Guess: Can LLMs' Capabilities Advance Without Hardware Progress?
This paper introduces a framework for classifying algorithmic innovations in LLMs as compute-dependent or compute-independent. Through small-scale GPT-2 experiments, it shows that compute-independent advances such as FlashAttention can yield up to 3.5× compute-equivalent gains even under hardware constraints, challenging the efficacy of hardware-focused AI regulation.
-
Temporal Scaling Law for Large Language Models
This paper proposes the Temporal Scaling Law, which models the loss at each token position during LLM pre-training with a dynamic hyperbolic law. The law accurately predicts the evolution of the overall test loss, supports hyperparameter selection directly on the target model, and reveals the learning dynamics of pre-training.