This paper presents a comprehensive analysis of adversarial attacks on deep learning-based spam filters, crafting perturbations at the character, word, sentence, and AI-generated paragraph levels with novel scoring functions such as spam weights; it reveals significant vulnerabilities at every level, with distilBERT showing relative resilience to paragraph-level attacks.
Adversarial Learning, Natural Language Processing, Deep Learning, Spam Detection, Robustness, AI Ethics
Esra Hotoğlu, Sevil Sen, Burcu Can
Hacettepe University, Ankara, Turkey, University of Stirling, Stirling, UK
Generated by grok-3
Background Problem
The rise of deep learning has significantly enhanced email spam filtering, a crucial defense against cyber threats like phishing and malware. However, adversarial attacks, where malicious inputs are crafted to deceive models, pose a growing challenge to these systems. This study addresses the vulnerability of deep learning-based spam filters to such attacks, focusing on how deliberate perturbations at various textual levels (character, word, sentence, and AI-generated paragraphs) can bypass detection. It aims to expose weaknesses in current spam detection systems and contribute to improving their resilience against evolving adversarial tactics.
Method
The core idea is to evaluate the robustness of six deep learning models (LSTM, CNN, Dense, Attention, Transformer, distilBERT) against adversarial attacks on spam filters by crafting perturbations at multiple textual levels. The methodology involves: (1) training these models on three real-world spam datasets (SpamAssassin, Enron Spam, TREC2007) after preprocessing steps such as tokenization and padding; (2) implementing black-box attacks at the character level (e.g., swapping, deletion), word level (e.g., out-of-vocabulary replacement), sentence level (e.g., adding ham/spam sentences), and AI-generated paragraph level using GPT-3.5; and (3) introducing two novel scoring functions, spam weights (SW) and attention weights (AW), alongside the existing Replace One Score (R1S), to identify the text segments most worth manipulating. SW derives spam probabilities from LSTM predictions, AW leverages attention-layer outputs, and R1S computes the change in loss when a token is replaced. These functions prioritize the most impactful words for attack, aiming to maximize misclassification while keeping computation efficient.
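A minimal Python sketch of the spam-weights scoring idea paired with a character-level swap attack, assuming a trained Keras-style binary classifier and a fitted word tokenizer; the helper names (`spam_weights`, `swap_attack`) and the padding details are illustrative assumptions, not the authors' implementation.

```python
import random
import numpy as np

def spam_weights(model, tokens, tokenizer, max_len):
    """Illustrative spam-weight (SW) scoring: rank each word by how much
    removing it lowers the predicted spam probability. Assumes `model` is a
    trained binary classifier (e.g., the LSTM) that outputs P(spam) and
    `tokenizer` is a fitted Keras Tokenizer."""
    def predict(words):
        ids = tokenizer.texts_to_sequences([" ".join(words)])[0]
        ids = np.array([(ids + [0] * max_len)[:max_len]])   # pad/truncate to max_len
        return float(model.predict(ids, verbose=0)[0][0])

    base = predict(tokens)
    scores = []
    for i in range(len(tokens)):
        drop = base - predict(tokens[:i] + tokens[i + 1:])  # larger drop => more spam-indicative
        scores.append((drop, i))
    return sorted(scores, reverse=True)                     # most influential words first

def swap_attack(tokens, ranked, budget=0.3):
    """Character-level swap perturbation: for the top-ranked words (up to a
    `budget` fraction of the email), swap two adjacent inner characters."""
    tokens = list(tokens)
    for _, i in ranked[: max(1, int(budget * len(tokens)))]:
        w = tokens[i]
        if len(w) > 3:
            j = random.randrange(1, len(w) - 2)
            tokens[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return tokens
```

Under the same assumptions, the word-level OOV attack would replace the top-ranked words with tokens outside the model's vocabulary instead of swapping characters, and R1S would score words by the change in model loss rather than the drop in spam probability, as described above.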
Experiment
The experiments were conducted on three datasets (SpamAssassin, Enron Spam, TREC2007) with an 80-20 train-test split, evaluating six deep learning models under attack-free and adversarial conditions. Attacks were tested at the character level (10-50% modification), word level (1-5% of the corpus), sentence level, and paragraph level (1,000 AI-generated emails), each guided by the three scoring functions (SW, AW, R1S). Baseline results showed high accuracy (97-99%) across models in attack-free scenarios, with Attention and CNN performing best. Under attack, accuracy dropped significantly, especially with SW scoring: LSTM saw the largest decline (e.g., 55.38% under the OOV word attack), while Attention was more robust. Character- and word-level attacks such as insertion and OOV replacement were the most effective, reducing accuracy by up to 40%. Sentence-level attacks affected Dense and Transformer more, and paragraph-level attacks severely degraded performance for all models except distilBERT, which retained 71% accuracy. The setup is comprehensive in attack variety but limited by outdated datasets and the small AI-generated sample, which may skew results. The results partially match expectations for attack impact but may overstate vulnerabilities, given the datasets' age and the absence of defense mechanisms.
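A hedged sketch of the evaluation loop implied by this setup: attack-free test accuracy versus accuracy under character-level attacks at 10-50% budgets. The split, thresholding, and helper names are assumptions; `spam_weights` and `swap_attack` refer to the illustrative helpers sketched in the Method section.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate_under_attack(model, emails, labels, tokenizer, max_len,
                          budgets=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Compare attack-free accuracy with accuracy after character-level swap
    attacks at increasing modification budgets (word/sentence levels analogous)."""
    _, X_test, _, y_test = train_test_split(emails, labels,
                                            test_size=0.2, random_state=42)

    def predict_labels(texts):
        seqs = tokenizer.texts_to_sequences(texts)
        ids = np.array([(s + [0] * max_len)[:max_len] for s in seqs])
        return (model.predict(ids, verbose=0)[:, 0] > 0.5).astype(int)

    results = {"attack_free": accuracy_score(y_test, predict_labels(X_test))}
    for b in budgets:
        attacked = []
        for text in X_test:
            tokens = text.split()
            ranked = spam_weights(model, tokens, tokenizer, max_len)
            attacked.append(" ".join(swap_attack(tokens, ranked, budget=b)))
        results[f"swap_{int(b * 100)}pct"] = accuracy_score(y_test, predict_labels(attacked))
    return results
```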
Further Thoughts
The paper’s focus on multi-level adversarial attacks highlights a critical area in AI security, particularly for spam detection, but it also opens broader questions about the adaptability of deep learning models to evolving threats. The reliance on outdated datasets like Enron Spam suggests a disconnect from modern spam tactics, especially AI-generated content, which could be addressed by integrating more recent or synthetic datasets reflecting current trends. The success of distilBERT against paragraph-level attacks points to the potential of pre-trained models in handling complex adversarial inputs, warranting exploration of hybrid architectures that combine pre-training with domain-specific fine-tuning. Additionally, the absence of defense strategies in the study is a missed opportunity; future work could explore adversarial training or robust feature engineering to counter these attacks, drawing on image-domain techniques such as certified defenses. This also ties into ethical AI concerns: as AI-generated spam becomes harder to detect, the risk of misinformation and phishing escalates, necessitating interdisciplinary research with cybersecurity to balance offensive and defensive advancements. Finally, the computational efficiency of spam weights could inspire similar optimizations in other NLP tasks facing adversarial challenges, such as sentiment analysis or fake news detection.