This paper proposes a three-dimensional content-safety taxonomy and develops two moderation tools, TTP and HarmFormer, to filter harmful content from web-scale LLM pretraining datasets; its analysis reveals substantial residual toxicity in those datasets and, via the HAVOC benchmark, persistent safety gaps in LLM generations.
Large Language Model, Pre-training, Safety, Alignment, Transformer, Responsible AI
Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
Microsoft
Generated by grok-3
Background Problem
Large language models (LLMs) are pretrained on massive web-scale datasets such as Common Crawl, C4, and FineWeb, which contain harmful content including hate speech, misinformation, and explicit material. This unfiltered data risks propagating toxic behaviors, societal biases, and misinformation into LLM outputs, undermining trust and raising ethical concerns. Existing content moderation tools are limited to sentence-level analysis, rely on binary toxic/non-toxic taxonomies, and generalize poorly across harm categories, so they fail to capture nuanced intent in long-form web content. The paper addresses these gaps by providing a detailed analysis of harmful content in pretraining datasets and by developing moderation tools for safer LLM pretraining.
Method
The paper introduces a three-dimensional taxonomy (Safe, Topical, Toxic) across five harm categories (Hate & Violence, Ideological Harm, Sexual Harm, Illegal Activities, Self-Inflicted Harm) that classifies content by intent and severity, distinguishing genuinely harmful content from educational or topical discourse. Two main tools are developed: (1) the Topical and Toxic Prompt (TTP), a prompt-based classifier built on OpenAI’s GPT-4 Omni with few-shot and chain-of-thought prompting, used to filter harmful content from web-scale datasets; and (2) HarmFormer, a Longformer-based transformer with a multi-task architecture of five classification heads, one per harm category, trained on 253,000 annotated web pages from Common Crawl, C4, and FineWeb for long-form text moderation. The methodology also includes two evaluation benchmarks: TTP-Eval (491 annotated web pages) for long-text moderation and HAVOC (10,376 snippets) for assessing LLM toxicity in open-ended generations.
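To illustrate the described multi-task design, the following is a minimal sketch of a HarmFormer-style model, assuming a Longformer encoder with one three-way (Safe/Topical/Toxic) head per harm category and [CLS]-pooled classification over a 1,024-token context; the class name, head layout, and pooling choice are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a HarmFormer-style multi-task classifier (assumed architecture):
# a Longformer encoder with one three-way (Safe/Topical/Toxic) head per harm category.
import torch
import torch.nn as nn
from transformers import LongformerModel, LongformerTokenizerFast

HARM_CATEGORIES = ["hate_violence", "ideological", "sexual", "illegal", "self_inflicted"]
LABELS = ["safe", "topical", "toxic"]  # the paper's three-dimensional taxonomy

class HarmFormerSketch(nn.Module):
    def __init__(self, backbone: str = "allenai/longformer-base-4096"):
        super().__init__()
        self.encoder = LongformerModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # One classification head per harm category, each over the three intent labels.
        self.heads = nn.ModuleDict(
            {cat: nn.Linear(hidden, len(LABELS)) for cat in HARM_CATEGORIES}
        )

    def forward(self, input_ids, attention_mask):
        # Give the [CLS] token global attention, as is standard for Longformer classification.
        global_attention_mask = torch.zeros_like(input_ids)
        global_attention_mask[:, 0] = 1
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
        )
        cls = out.last_hidden_state[:, 0]  # pooled [CLS] representation
        return {cat: head(cls) for cat, head in self.heads.items()}

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = HarmFormerSketch()
batch = tokenizer(["Example web page text ..."], truncation=True,
                  max_length=1024, padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # one logit vector per category
```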
Experiment
The experiments measure harmful content prevalence in Common Crawl (4.1% toxic), C4 (2.1% toxic), and FineWeb (3.9% toxic), showing that substantial toxic content remains despite existing filters. TTP achieves an F1-score of 0.83 on toxic content detection in TTP-Eval, outperforming Perspective API (F1=0.63), though it struggles with ‘reporting’ webpages. HarmFormer, fine-tuned from Longformer with a 1,024-token context, reaches an F1-score of 0.85 on toxic content, with strong performance on Sexual Harm (F1=0.91) and Self-Inflicted Harm (F1=0.88) but weaker results on Hate & Violence (F1=0.51) due to intent misclassification. On the OpenAI Moderation Dataset, TTP (F1=0.80) and HarmFormer (F1=0.73) surpass Llama Guard and Perspective API. The HAVOC benchmark reveals a 26.7% toxicity leakage rate in LLM outputs, with Sexual Harm showing the highest leakage (7.97% aggregated), indicating persistent safety gaps. The experimental setup is comprehensive, covering diverse datasets and harm categories, but reliance on TTP annotations for training data introduces potential bias, and results may not fully generalize given the English-only focus and limited cultural context.
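To make the leakage metric concrete, below is a minimal sketch of how a HAVOC-style leakage rate could be aggregated, assuming each snippet's completion has already been labeled Safe/Topical/Toxic by a moderation classifier (e.g., TTP or HarmFormer); the record schema and the per-category aggregation are assumptions, not the paper's exact protocol.

```python
# Sketch of a leakage-rate aggregation (assumed schema: each record carries the harm
# category of its prompt and the Safe/Topical/Toxic label assigned to the completion).
from collections import Counter
from typing import Iterable, Mapping

def leakage_rate(records: Iterable[Mapping[str, str]]) -> dict:
    """Fraction of completions labeled 'toxic', overall and per harm category."""
    total = 0
    toxic_overall = 0
    toxic_by_category = Counter()
    for rec in records:
        total += 1
        if rec["label"] == "toxic":
            toxic_overall += 1
            toxic_by_category[rec["category"]] += 1
    per_category = {cat: n / total for cat, n in toxic_by_category.items()}
    return {"overall": toxic_overall / total if total else 0.0,
            "per_category": per_category}

# Toy usage: two of three completions leak toxic content.
records = [
    {"category": "sexual", "label": "toxic"},
    {"category": "hate_violence", "label": "safe"},
    {"category": "self_inflicted", "label": "toxic"},
]
print(leakage_rate(records))
```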
Further Thoughts
The three-dimensional taxonomy is a significant step forward in content moderation, as it addresses the critical gap in distinguishing intent, which binary systems often miss. However, the reliance on GPT-4 Omni for TTP raises concerns about cost and accessibility for widespread adoption in Responsible AI practices—could a smaller, open-source model achieve similar results with fine-tuning? Additionally, the high leakage rates in HAVOC, especially for Sexual Harm, suggest that pretraining data filtering alone may not suffice; post-training alignment techniques like RLHF or DPO might need integration to address emergent toxic behaviors. I also wonder about the interplay between dataset ‘cleanliness’ and model performance, as hinted in the limitations—future work could explore whether overly aggressive filtering impacts LLM capabilities in nuanced tasks like medical or legal reasoning, where topical content is vital. Finally, connecting this to broader AI safety research, the HAVOC benchmark could be extended to evaluate multimodal models, where visual and textual harms might compound, posing new challenges for content moderation.