arXiv: 2406.17746

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon


This paper introduces a taxonomy that splits language model memorization into recitation, reconstruction, and recollection. Experiments with the Pythia model suite show that different factors drive each category, and a predictive model built on the taxonomy outperforms baselines at predicting memorization likelihood.

Large Language Model, Representation Learning, Pre-training, Scaling Laws, Privacy-Preserving Machine Learning, Interpretability

USVSN Sai Prashanth, Alvin Deng, Kyle O’Brien, Jyothir S, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra

EleutherAI, Microsoft, New York University, DatologyAI, Northeastern University, MPI-SWS, IIIT Delhi, Google DeepMind, University of Illinois at Urbana-Champaign, Harvard University, Kempner Institute

Generated by grok-3

Background Problem

Language models (LMs) often memorize training data, emitting exact copies at test time, which raises privacy and copyright concerns and complicates the study of generalization. Existing literature largely treats memorization as a uniform phenomenon, ignoring the diversity of memorized content and its underlying causes. This paper addresses that gap by proposing a taxonomy that categorizes memorization into recitation (highly duplicated sequences), reconstruction (predictable templates), and recollection (rare sequences), aiming to uncover the distinct factors behind each type and to improve predictive understanding of memorization behavior.

Method

The core idea is to dissect LM memorization into three categories based on data properties and simple heuristics. Recitation covers highly duplicated sequences (more than 5 duplicates in the corpus), identified via corpus statistics such as duplicate counts. Reconstruction covers inherently predictable sequences that follow templates (repeating or incrementing patterns), detected with handcrafted heuristics. Recollection covers the remaining memorized sequences, which are neither highly duplicated nor templated and are often rare. The analysis draws on corpus-wide statistics (e.g., duplicate counts, semantic matches), sequence properties (e.g., compressibility via Huffman coding, templating), and model perplexity. A logistic-regression predictor is then trained separately on each taxonomic category to estimate memorization likelihood from features such as perplexity and duplicate count, and is compared against a taxonomy-free baseline and an optimally partitioned model.
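
To make the category boundaries concrete, here is a minimal sketch of the taxonomy assignment under the paper's stated criteria; the function names, the exact templating checks, and the precomputed duplicate counts are assumptions for illustration, not the authors' released code.

```python
DUPLICATE_THRESHOLD = 5  # recitation cutoff from the paper: > 5 corpus duplicates


def is_templated(tokens: list[int]) -> bool:
    """Heuristic check for the two templating patterns the paper names:
    repeating (a short unit tiled across the sequence) and incrementing
    (neighbors differing by a constant step). The exact checks here are
    illustrative assumptions."""
    n = len(tokens)
    # Repeating: some period p tiles the whole sequence.
    for p in range(1, n // 2 + 1):
        if all(tokens[i] == tokens[i % p] for i in range(n)):
            return True
    # Incrementing: a single constant difference between neighbors.
    diffs = {b - a for a, b in zip(tokens, tokens[1:])}
    return len(diffs) == 1


def categorize(tokens: list[int], duplicate_count: int) -> str:
    """Assign a memorized sequence to one taxonomic category; checks are
    ordered, so recitation takes precedence over reconstruction."""
    if duplicate_count > DUPLICATE_THRESHOLD:
        return "recitation"      # highly duplicated in the corpus
    if is_templated(tokens):
        return "reconstruction"  # inherently predictable template
    return "recollection"        # rare: neither duplicated nor templated
```

The taxonomy-based predictor can then be sketched as one logistic regression per category, trained on features like perplexity and duplicate count; the feature set and library choice here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_per_category(features: np.ndarray, memorized: np.ndarray,
                     categories: np.ndarray) -> dict:
    """Fit a separate classifier per taxonomic category so the learned
    weights can differ across recitation, reconstruction, recollection."""
    models = {}
    for cat in np.unique(categories):
        mask = categories == cat
        clf = LogisticRegression(max_iter=1000)
        clf.fit(features[mask], memorized[mask])  # e.g., perplexity, duplicates
        models[cat] = clf
    return models
```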

Experiment

Experiments used the Pythia model suite (70M to 12B parameters) trained on a deduplicated version of The Pile, taking a public list of 32-extractable memorized sequences as the primary data. The setup tracked memorization across model scale and training time, categorizing memorized data under the proposed taxonomy. Memorization increased with model size and training time, with recollection growing fastest (from 4.49% of memorized data at 70M to 11.34% at 12B), suggesting that larger models memorize rarer sequences. The taxonomy-based predictive model outperformed both a generic baseline and an optimally partitioned model on most metrics (e.g., accuracy, calibration), except for low precision on recollection. The experimental design explores scale and time dynamics thoroughly but is limited by the linear assumptions of logistic regression and by the specific k-extractable definition (k = 32), which may miss partial memorization; robustness across other memorization definitions or non-linear models is untested. Results partially match expectations, confirming known trends (e.g., low perplexity correlates with memorization), but the rapid growth of recollection leaves open questions about the underlying mechanism.
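
For concreteness, here is a minimal sketch of the 32-extractable test under the standard k-extractable definition (greedy decoding from a 32-token prompt must exactly reproduce the next 32 training tokens); the helper function is an illustrative assumption, though the checkpoint name is a real Pythia model.

```python
import torch
from transformers import AutoModelForCausalLM

# Any Pythia checkpoint trained on the deduplicated Pile would work here.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped")
model.eval()


def is_k_extractable(token_ids: list[int], k: int = 32) -> bool:
    """True if greedy continuation of the first k tokens exactly matches
    the next k tokens of the training sequence (requires len >= 2k)."""
    prompt = torch.tensor([token_ids[:k]])
    target = token_ids[k:2 * k]
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=k, do_sample=False)
    return out[0, k:].tolist() == target
```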

Further Thoughts

The taxonomy proposed in this paper opens up fascinating avenues for exploring memorization in LMs, particularly in relation to privacy and generalization. The rapid growth of recollection with model size suggests that larger models may inherently develop mechanisms to encode rare, episodic-like data, which could be linked to emergent abilities in foundation models as discussed in recent scaling law studies. This raises a critical question: are these recollection behaviors a byproduct of overparameterization, or do they indicate a form of implicit memory compression not yet understood? Connecting this to AI safety, the memorization of rare sequences (recollection) could pose significant risks in privacy-sensitive applications, warranting further research into mitigation strategies like differential privacy or data deduplication, as explored in works like Kandpal et al. (2022). Additionally, the higher memorization rate of code over natural language suggests structural data properties (e.g., syntax rigidity) might play a role, which could be investigated through cross-domain studies involving multimodal data or other structured formats like mathematical expressions. Future work could also explore non-linear predictive models or alternative memorization definitions to test the taxonomy’s robustness, potentially linking it to broader AI ethics discussions on responsible deployment of large models.


