This paper demonstrates that Elastic Weight Consolidation (EWC), applied to full-parameter continual pre-training of the Gemma2 2B LLM, mitigates catastrophic forgetting on English tasks while improving performance on Lithuanian language benchmarks during autoregressive pre-training on CulturaX data.
Continual Learning, Large Language Model, Pre-training, Regularization, Language Adaptation
Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas
Neurotechnology, Vilnius, Lithuania
Background Problem
Large Language Models (LLMs) excel in many natural language processing tasks but often underperform in low-resource languages due to imbalanced training data. Continual pre-training to integrate new languages, such as Lithuanian, frequently results in catastrophic forgetting, where performance on previously learned tasks (e.g., English language understanding) degrades. This paper addresses the challenge of mitigating catastrophic forgetting during continual pre-training of the Gemma2 2B parameter model, aiming to maintain performance on English tasks while enhancing capabilities in Lithuanian.
Method
The method employed is Elastic Weight Consolidation (EWC), a regularization-based continual learning approach applied to the full set of parameters of the Gemma2 2B LLM. EWC adds a penalty term to the loss function during training on a new task (Lithuanian language pre-training) that discourages large changes to parameters deemed important for prior tasks (English language understanding), as measured by Fisher information. The loss function is formulated as $\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta^*_{A,i})^2$, where $\mathcal{L}_B(\theta)$ is the loss for the new task, $F_i$ is the Fisher information for parameter $\theta_i$, $\theta^*_{A,i}$ denotes the parameter values before continual pre-training, and $\lambda$ controls regularization strength. Since the original training data for Gemma2 is unavailable, Fisher information is estimated using the MMLU dataset as a proxy. The new task involves autoregressive pre-training on 10% of the Lithuanian portion of the CulturaX dataset, with experiments testing various $\lambda$ values to balance learning and forgetting.
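A minimal PyTorch sketch of this procedure is given below: estimating a diagonal Fisher approximation from a proxy dataset and adding the quadratic EWC penalty to the new-task loss. The helper names, data loaders (`proxy_loader` over MMLU-style text), and hyperparameters are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of diagonal Fisher estimation and the EWC penalty, assuming a
# Hugging Face causal LM and PyTorch DataLoaders. Illustrative only.
import torch


def estimate_diagonal_fisher(model, proxy_loader, device="cuda"):
    """Approximate diagonal Fisher information F_i from a proxy dataset (e.g., MMLU)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    n_batches = 0
    for batch in proxy_loader:
        model.zero_grad()
        out = model(input_ids=batch["input_ids"].to(device),
                    labels=batch["input_ids"].to(device))
        out.loss.backward()  # gradient of the (proxy) log-likelihood
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2  # squared gradients ~ diagonal Fisher
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}


def ewc_penalty(model, fisher, anchor_params, lam):
    """Compute (lambda/2) * sum_i F_i (theta_i - theta*_i)^2 over trainable parameters."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - anchor_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
```

During continual pre-training, the total loss at each step would then be the Lithuanian autoregressive loss plus `ewc_penalty(model, fisher, anchor_params, lam)`, where `anchor_params` holds detached copies of the original Gemma2 weights.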
Experiment
The experiments were conducted on the Gemma2 2B LLM using a cluster of 8 H100 GPUs, with a total computation time of 24 hours. The setup involved continual pre-training on 10% of the Lithuanian portion of CulturaX as the new task, while evaluating performance on English and Lithuanian versions of seven language understanding benchmarks (ARC, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, Winogrande) and two perplexity benchmarks (TruthfulQA for English preservation and a Lithuanian Q/A dataset for new-task impact). A range of regularization strengths, from $\lambda = 0$ to very large values, was tested to observe the trade-off between retaining prior knowledge and learning the new language. Without EWC ($\lambda = 0$), performance dropped significantly on English tasks, confirming catastrophic forgetting. With $\lambda$ in an intermediate, well-chosen range, EWC mitigated forgetting across all English benchmarks and even improved performance on five Lithuanian benchmarks compared to the baseline. Perplexity on English data was preserved as $\lambda$ increased, though excessively high values reduced plasticity and raised perplexity on Lithuanian data, indicating over-regularization. The experimental setup is reasonable for evaluating continual learning effects, but the use of MMLU as a proxy for Fisher information and the limited dataset scope may bias results. Comparisons with other continual learning methods are absent, which limits the assessment of EWC's relative effectiveness.
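The sweep over regularization strengths could be organized as in the sketch below, which reuses the `ewc_penalty` helper from the Method section; the optimizer settings and the $\lambda$ grid are hypothetical placeholders rather than the paper's reported configuration.

```python
# Illustrative EWC training step and lambda sweep; reuses estimate_diagonal_fisher
# and ewc_penalty from the earlier sketch. Hyperparameters are assumptions.
from torch.optim import AdamW


def train_with_ewc(model, new_task_loader, fisher, anchor_params, lam,
                   lr=1e-5, device="cuda"):
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for batch in new_task_loader:
        optimizer.zero_grad()
        out = model(input_ids=batch["input_ids"].to(device),
                    labels=batch["input_ids"].to(device))
        loss = out.loss + ewc_penalty(model, fisher, anchor_params, lam)
        loss.backward()
        optimizer.step()
    return model


# Hypothetical sweep: lambda = 0 recovers plain continual pre-training
# (and, per the paper, catastrophic forgetting on English benchmarks).
# for lam in [0.0, 1e2, 1e6, 1e9]:
#     adapted = train_with_ewc(copy.deepcopy(base_model), lt_loader,
#                              fisher, anchor_params, lam)
#     evaluate(adapted)  # English + Lithuanian benchmarks and perplexity
```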
Further Thoughts
The application of EWC to full-parameter continual pre-training is a promising direction, especially for low-resource languages, but several aspects warrant further exploration. First, the choice of MMLU as a proxy for Fisher information estimation raises questions about whether it adequately captures the original pre-training distribution of Gemma2—could other datasets or synthetic data generation from the model itself provide a better approximation? Second, the improvement in Lithuanian benchmark performance with EWC suggests a potential synergy between prior knowledge and new task learning, which could be investigated further by analyzing which specific parameters or model components are most preserved or adapted. This might reveal insights into the interplay between different types of knowledge (e.g., factual vs. linguistic) in LLMs. Additionally, comparing EWC with other continual learning methods like LoRA or replay-based approaches could clarify its strengths and weaknesses, especially in terms of computational efficiency and scalability to larger models or datasets. Finally, extending this approach to other low-resource languages with distinct linguistic features (e.g., agglutinative or tonal languages) could test the generalizability of EWC’s benefits, potentially linking to broader research on cross-linguistic transfer learning in LLMs.