This paper introduces TiC-LM, a web-scale benchmark for time-continual LLM pretraining built from 114 Common Crawl dumps, demonstrating that data replay combined with autoregressive schedules can match Oracle retraining on general web data at lower compute, though domain-specific trade-offs persist.
Large Language Model, Continual Learning, Pre-training, Time Series Data, Web Data, Data Replay
Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri
University of Washington, Apple
Generated by grok-3
Background Problem
Large Language Models (LLMs) trained on historical web data, such as Common Crawl (CC), become outdated as new data emerges, degrading performance on content published after their knowledge cutoff. Because retraining from scratch is prohibitively expensive, efficient continual-learning methods are needed to keep LLMs up to date. This paper introduces the TiC-LM benchmark to address the lack of web-scale, time-stratified datasets and evaluations for studying temporal distribution shifts in LLM pretraining, aiming to balance learning new information against retaining past knowledge over long time horizons.
Method
The core idea of TiC-LM is to simulate time-continual pretraining of LLMs using a massive dataset, TiC-CommonCrawl (TiC-CC), derived from 114 monthly dumps of Common Crawl data (May 2013 to July 2024), totaling 2.9T tokens. The methodology involves:
- Dataset Creation: Processing CC data with causality-preserving filters (e.g., deduplicating each month only against the same or earlier months, never future ones) to create time-stratified training splits (a dedup sketch follows this list).
- Continual Learning Setup: Pretraining an initial model on the first month’s data, then updating it monthly with a fixed token budget, optionally replaying data from older months (see the training-loop sketch below).
- Baseline Methods: Testing optimization-based methods (e.g., Cyclic Cosine and Autoregressive schedules), data replay strategies (varying the ratio of new to old data), and regularization techniques (e.g., LwF, EWC) to mitigate forgetting.
- Evaluation: Measuring perplexity on held-out TiC-CC data and domain-specific dynamic evaluations (TiC-Wikipedia, TiC-StackExchange, TiC-CodeDocs) to assess performance across time.
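Below is a minimal sketch of what a causality-preserving deduplication pass could look like: each month is deduplicated only against content seen in the same or earlier months, so later dumps can never influence an earlier training split. The hashing scheme and the `causal_dedup` helper are illustrative assumptions, not the paper's actual processing pipeline.

```python
# Sketch (assumptions, not the paper's pipeline) of causality-preserving dedup:
# a month's documents are only checked against hashes from past/current months.
import hashlib

def causal_dedup(monthly_docs: dict[str, list[str]]) -> dict[str, list[str]]:
    """monthly_docs maps 'YYYY-MM' -> list of documents; months are processed in order."""
    seen_hashes: set[str] = set()
    deduped: dict[str, list[str]] = {}
    for month in sorted(monthly_docs):                # strictly chronological order
        kept = []
        for doc in monthly_docs[month]:
            h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if h not in seen_hashes:                  # the set only ever contains past/current months
                seen_hashes.add(h)
                kept.append(doc)
        deduped[month] = kept
    return deduped

docs = {"2013-05": ["page A", "page B"], "2013-06": ["page B", "page C"]}
print({m: len(d) for m, d in causal_dedup(docs).items()})   # {'2013-05': 2, '2013-06': 1}
```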
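The following sketch shows one way to organize the monthly update loop with a fixed token budget, a replay ratio α, and a cyclic cosine learning-rate schedule restarted each month. All names, budgets, and learning rates here (`plan_token_mixture`, `train_one_month`, etc.) are illustrative assumptions rather than the paper's OpenLM training code.

```python
# Minimal, self-contained sketch of the monthly continual-pretraining loop:
# fixed per-month token budget, replay ratio alpha mixing new and old months,
# and a per-month learning-rate schedule. Training itself is stubbed out.
import math

MONTHS = [f"2013-{m:02d}" for m in range(5, 13)] + ["2014-01"]  # toy stand-in for 114 CC dumps
TOKENS_PER_UPDATE = 1_000_000   # per-month token budget (illustrative, not the paper's)
ALPHA = 0.5                     # fraction of each month's budget drawn from the newest dump

def cyclic_cosine_lr(step: int, steps_per_month: int,
                     peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Cosine decay restarted at the start of every month ("Cyclic Cosine")."""
    t = (step % steps_per_month) / steps_per_month
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def plan_token_mixture(new_month: str, seen_months: list[str],
                       budget: int, alpha: float = ALPHA) -> dict[str, int]:
    """Give alpha * budget tokens to the new month and spread the rest
    uniformly over previously seen months (one simple replay policy)."""
    plan = {new_month: int(alpha * budget)}
    if seen_months:
        per_old = int((1.0 - alpha) * budget / len(seen_months))
        for m in seen_months:
            plan[m] = per_old
    return plan

def train_one_month(model_state: dict, plan: dict[str, int], steps_per_month: int = 100) -> dict:
    """Stub update: records the data plan; a real loop would run forward/backward."""
    for step in range(steps_per_month):
        lr = cyclic_cosine_lr(step, steps_per_month)  # a real loop would pass lr to the optimizer
        model_state["steps"] += 1
    model_state["token_history"].append(plan)
    return model_state

model = {"steps": 0, "token_history": []}  # stand-in for the model pretrained on the first dump
seen = [MONTHS[0]]
for month in MONTHS[1:]:
    plan = plan_token_mixture(month, seen, TOKENS_PER_UPDATE)
    model = train_one_month(model, plan)
    seen.append(month)                     # each monthly checkpoint would be saved and evaluated

print(model["token_history"][-1])          # newest month receives ~alpha of the budget
```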
Experiment
The experiments trained 3B-parameter models with OpenLM on 220B or 440B total tokens, front-loading initial pretraining on the May 2013 dump and splitting the remaining tokens across 113 monthly updates. Continual methods were compared against Oracle models retrained from scratch every two years (1.16T tokens in total). Datasets included TiC-CC for general web data and the domain-specific evaluations TiC-Wikipedia, TiC-StackExchange, and TiC-CodeDocs, with metrics covering in-distribution (ID) performance, backward transfer (retaining old knowledge), and forward transfer (adapting to future data). Replay (α=1/2) combined with Autoregressive schedules at 440B tokens matched or outperformed the Oracle series on TiC-CC while using 62% less compute, though trade-offs remained: replay reduced forgetting on general web data but hurt performance on fast-evolving domains such as StackOverflow. Static downstream tasks (the CORE benchmark) showed a persistent gap to Oracle retraining, suggesting initialization bias or data-access limitations. The experimental design is comprehensive in scale and timestep coverage, but the lack of adaptive strategies and the focus on perplexity rather than accuracy limit practical insights. Results partially met expectations on general web data but highlighted domain-specific challenges.
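To make the three metrics concrete, the sketch below reads them off a matrix of per-month losses indexed by (checkpoint month, evaluation month): the diagonal is in-distribution, entries behind the checkpoint measure backward transfer, and entries ahead of it measure forward transfer. The exact aggregation in the paper (e.g., any normalization against the Oracle series) may differ, so treat this as one common convention rather than the benchmark's definition.

```python
# Hedged sketch: ID / backward / forward transfer from a loss matrix where
# loss[i, j] = loss of the checkpoint trained through month i on eval month j.
import numpy as np

def continual_metrics(loss: np.ndarray) -> dict[str, float]:
    n = loss.shape[0]
    in_dist  = np.mean([loss[i, i] for i in range(n)])                         # diagonal
    backward = np.mean([loss[i, j] for i in range(n) for j in range(n) if j < i])  # old months
    forward  = np.mean([loss[i, j] for i in range(n) for j in range(n) if j > i])  # future months
    return {"ID": float(in_dist), "backward": float(backward), "forward": float(forward)}

# Toy example: 4 checkpoints x 4 evaluation months of log-perplexities.
rng = np.random.default_rng(0)
toy = 3.0 + 0.1 * rng.random((4, 4))
print(continual_metrics(toy))
```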
Further Thoughts
The TiC-LM benchmark opens up intriguing avenues for exploring how LLMs can adapt to temporal shifts, but I believe integrating adaptive replay ratios based on domain evolution rates could significantly enhance performance. For instance, leveraging metadata or content analysis to dynamically adjust the replay ratio for fast-evolving domains like technology forums versus stable ones like mathematical knowledge could address the observed trade-offs. This connects to broader research in domain adaptation and transfer learning, where methods like domain-specific fine-tuning (e.g., Gururangan et al., 2020) could be adapted for continual pretraining. Additionally, the ethical implications of training on unfiltered web data over time warrant deeper investigation—temporal biases (e.g., outdated societal norms in older CC dumps) could propagate into models, an area underexplored here but critical given works like Bender et al. (2021) on LLM biases. Finally, extending this benchmark to include multimodal data (e.g., images or videos from web crawls) could align with emerging trends in foundation models, testing whether continual learning principles hold across modalities.
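To make the adaptive-replay suggestion concrete, here is a toy sketch in which the replay ratio is driven by how much the current model is "surprised" by incoming data; `adaptive_alpha`, its thresholds, and the drift proxy are all hypothetical and not part of TiC-LM.

```python
# Illustrative only: map an estimate of domain drift (excess loss on a sample
# of the incoming month) to the fraction of the token budget spent on new data.
def adaptive_alpha(new_data_loss: float, stable_loss: float = 3.0,
                   alpha_min: float = 0.25, alpha_max: float = 0.9) -> float:
    """new_data_loss: current model's loss on held-out new data.
    stable_loss: loss level treated as "no drift" (hypothetical calibration constant).
    Returns alpha, the share of the monthly budget devoted to new data."""
    drift = max(0.0, new_data_loss - stable_loss)                   # excess loss ~ distribution shift
    return alpha_min + (alpha_max - alpha_min) * min(1.0, drift)    # clamp to [alpha_min, alpha_max]

# A fast-evolving domain (high loss on new data) gets mostly new tokens ...
print(adaptive_alpha(new_data_loss=4.2))   # ~0.90
# ... while a stable domain keeps a replay-heavy mixture.
print(adaptive_alpha(new_data_loss=3.05))  # ~0.28
```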