arXiv: 2505.05946

Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2


This paper demonstrates that Elastic Weight Consolidation (EWC), applied to full-parameter continual pre-training of the Gemma2 2B LLM, mitigates catastrophic forgetting on English tasks while improving performance on Lithuanian language benchmarks during autoregressive pre-training on CulturaX data.

Continual Learning, Large Language Model, Pre-training, Regularization, Language Adaptation

Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas

Neurotechnology, Vilnius, Lithuania

Generated by grok-3

Background Problem

Large Language Models (LLMs) excel in many natural language processing tasks but often underperform in low-resource languages due to imbalanced training data. Continual pre-training to integrate new languages, such as Lithuanian, frequently results in catastrophic forgetting, where performance on previously learned tasks (e.g., English language understanding) degrades. This paper addresses the challenge of mitigating catastrophic forgetting during continual pre-training of the Gemma2 2B parameter model, aiming to maintain performance on English tasks while enhancing capabilities in Lithuanian.

Method

The method employed is Elastic Weight Consolidation (EWC), a regularization-based continual learning approach applied to the full set of parameters of the Gemma2 2B LLM. EWC works by adding a penalty term to the loss function during training on a new task (Lithuanian language pre-training), which discourages significant changes to parameters deemed important for prior tasks (English language understanding) based on Fisher information. The loss function is formulated as $\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{A,i}^*)^2$, where $\mathcal{L}_B(\theta)$ is the loss for the new task, $F_i$ is the Fisher information for parameter $\theta_i$, $\theta_{A,i}^*$ is the corresponding parameter of the model before adaptation, and $\lambda$ controls regularization strength. Since the original training data for Gemma2 is unavailable, Fisher information is estimated using the MMLU dataset as a proxy. The new task involves autoregressive pre-training on 10% of the Lithuanian portion of the CulturaX dataset, with experiments testing various $\lambda$ values to balance learning and forgetting.
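As a concrete illustration of this regularizer, below is a minimal PyTorch sketch of diagonal Fisher estimation on a proxy dataset and the corresponding EWC penalty. It assumes a Hugging Face-style causal LM whose forward pass returns a `.loss` when given `labels`; the helper names (`estimate_fisher`, `ewc_penalty`) are illustrative and not taken from the paper's code.

```python
# Minimal sketch of the EWC penalty described above (illustrative, not the authors' code).
import torch

def estimate_fisher(model, proxy_loader, device="cuda"):
    """Diagonal Fisher information estimated from a proxy dataset
    (the paper uses MMLU because Gemma2's original pre-training data is unavailable)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    n_batches = 0
    for batch in proxy_loader:
        model.zero_grad()
        input_ids = batch["input_ids"].to(device)
        # Autoregressive log-likelihood; labels = inputs for a causal LM.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2  # squared gradients approximate diagonal Fisher
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam):
    """(lambda / 2) * sum_i F_i * (theta_i - theta*_{A,i})^2"""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
```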

Experiment

The experiments were conducted on the Gemma2 2B parameter LLM using a cluster of 8 H100 GPUs, with a total computation time of 24 hours. The setup involved continual pre-training on 10% of the Lithuanian CulturaX dataset as the new task, while evaluating performance on both English and Lithuanian versions of seven language understanding benchmarks (ARC, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, WinoGrande) and two perplexity benchmarks (TruthfulQA for English preservation and a Lithuanian Q/A dataset for new-task impact). Different regularization strengths ($\lambda$ from 0 to $10^{12}$) were tested to observe the trade-off between retaining prior knowledge and learning the new language. Results showed that without EWC ($\lambda = 0$), performance dropped significantly on English tasks, confirming catastrophic forgetting. With optimal $\lambda$ values (between $10^2$ and $10^9$), EWC mitigated forgetting across all English benchmarks and even improved performance on five Lithuanian benchmarks compared to the baseline. Perplexity on English data was preserved with increasing $\lambda$, though excessively high $\lambda$ values led to reduced plasticity and higher perplexity on Lithuanian data, indicating over-regularization. The experimental setup is reasonable for evaluating continual learning effects, but the use of MMLU as a proxy for Fisher information and the limited dataset scope may bias results. Comparisons with other continual learning methods are absent, which limits the assessment of EWC's relative effectiveness.
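To make the experimental procedure concrete, here is a hedged sketch of the EWC-regularized pre-training loop, with a $\lambda$ grid spanning the reported range (0 to $10^{12}$). It reuses the `estimate_fisher` and `ewc_penalty` helpers from the sketch above; the data loader, optimizer settings, exact $\lambda$ grid, and the `load_base_gemma2` loader are assumptions, not details from the paper.

```python
import torch

# Candidate regularization strengths spanning the reported range (0 up to 1e12);
# the exact grid used by the authors is not specified here.
lambdas = [0.0] + [10.0 ** k for k in range(13)]

def continual_pretrain(model, lt_loader, fisher, old_params, lam, lr=1e-5, max_steps=1000):
    """One EWC-regularized autoregressive pre-training run on Lithuanian CulturaX batches."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    device = next(model.parameters()).device
    model.train()
    for step, batch in enumerate(lt_loader):
        input_ids = batch["input_ids"].to(device)
        lm_loss = model(input_ids=input_ids, labels=input_ids).loss   # L_B(theta), new-task loss
        loss = lm_loss + ewc_penalty(model, fisher, old_params, lam)  # add EWC regularizer
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step + 1 >= max_steps:
            break
    return model

# Usage sketch (theta*_A captured from the base model before adaptation):
# old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# fisher = estimate_fisher(model, mmlu_loader)
# for lam in lambdas:
#     model = load_base_gemma2()   # hypothetical loader; reload the base checkpoint per run
#     continual_pretrain(model, lt_loader, fisher, old_params, lam)
```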

Further Thoughts

The application of EWC to full-parameter continual pre-training is a promising direction, especially for low-resource languages, but several aspects warrant further exploration. First, the choice of MMLU as a proxy for Fisher information estimation raises questions about whether it adequately captures the original pre-training distribution of Gemma2—could other datasets or synthetic data generation from the model itself provide a better approximation? Second, the improvement in Lithuanian benchmark performance with EWC suggests a potential synergy between prior knowledge and new task learning, which could be investigated further by analyzing which specific parameters or model components are most preserved or adapted. This might reveal insights into the interplay between different types of knowledge (e.g., factual vs. linguistic) in LLMs. Additionally, comparing EWC with other continual learning methods like LoRA or replay-based approaches could clarify its strengths and weaknesses, especially in terms of computational efficiency and scalability to larger models or datasets. Finally, extending this approach to other low-resource languages with distinct linguistic features (e.g., agglutinative or tonal languages) could test the generalizability of EWC’s benefits, potentially linking to broader research on cross-linguistic transfer learning in LLMs.


