arXiv:2505.04204 (https://arxiv.org/abs/2505.04204)

Cyber Security Data Science: Machine Learning Methods and their Performance on Imbalanced Datasets


This paper systematically evaluates machine learning classifiers and imbalance-learning techniques on two cybersecurity datasets, finding that XGB and RF perform robustly while the effects of sampling and ensembling vary across datasets, which underscores the need for dataset-specific method selection.

Supervised Learning, Classification, Data Augmentation, Efficiency, Robustness, AI in Security

Mateo Lopez-Ledezma, Gissel Velarde

Universidad Privada Boliviana, Bolivia, IU International University of Applied Sciences, Erfurt, Germany

Generated by grok-3

Background Problem

Cybersecurity is increasingly critical due to the rising cost of cybercrime, estimated at 400 billion USD globally. Many cybersecurity applications, such as fraud detection and anomaly detection, are formulated as binary classification problems where the positive class (e.g., fraudulent transactions) is significantly underrepresented compared to the negative class, creating imbalanced datasets. This imbalance poses a challenge for machine learning algorithms, as learning patterns from underrepresented samples is difficult, often leading to poor detection of critical events. This study aims to address the problem of imbalance learning in cybersecurity by evaluating the effectiveness of various machine learning methods and imbalance handling techniques on representative datasets.

Method

The study employs a systematic approach to evaluate machine learning methods on imbalanced datasets in cybersecurity, focusing on fraud detection. The core idea is to compare the performance of single classifiers and imbalance learning techniques to identify effective strategies for handling imbalanced data. The methodology includes three experiments:

  1. Experiment 1 - Single Classifiers: Six classifiers are tested: Random Forests (RF), Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGB), Logistic Regression (LR), Decision Tree (DT), and Gradient Boosting Decision Tree (GBDT). Each classifier is tuned by grid search with 5-fold stratified cross-validation on the training set (from a stratified 80-20 split), optimizing the F1 score (see the first sketch after this list).
  2. Experiment 2 - Sampling Techniques: Four imbalance-handling strategies are compared: Over-Sampling, Under-Sampling, Synthetic Minority Over-sampling Technique (SMOTE), and no sampling. Classifiers are retrained on the resampled training data, and performance is compared with and without re-optimizing hyperparameters on the resampled data (second sketch below).
  3. Experiment 3 - Self-Paced Ensembling (SPE): SPE, an ensemble method that progressively focuses on hard examples, is tested with varying numbers of base classifiers (N = 10, 20, 50) to assess its impact on performance (third sketch below).

Across all experiments, datasets are preprocessed to remove duplicates and handle missing values, and evaluation focuses on Precision, Recall, and F1 score, since accuracy is uninformative on highly imbalanced data.

Experiment

The experiments are conducted on two highly imbalanced datasets: Credit Card (283,726 samples, imbalance ratio 598.84:1, about 0.2% fraud) and PaySim (6,362,620 samples, imbalance ratio 773.70:1, about 0.13% fraud), chosen for their relevance to cybersecurity fraud detection. The setup uses a stratified 80-20 train-test split and 5-fold cross-validation for robustness. Consistent with the summary above, XGB and RF performed robustly on both datasets, while the benefits of sampling and ensembling varied by dataset.

Further Thoughts

The paper provides a valuable empirical comparison of machine learning methods for imbalanced datasets in cybersecurity, but it raises questions about the broader applicability of the findings. For instance, how do these methods perform under evolving cyber threats, where attack patterns change over time? This could be explored by integrating continual learning paradigms to adapt models dynamically. Additionally, the lack of feature importance analysis misses an opportunity to understand which attributes (e.g., transaction amount, time) are most predictive of fraud, potentially linking to interpretability studies in AI ethics. A connection to federated learning could also be insightful, as cybersecurity often involves sensitive data across institutions, and distributed learning might mitigate privacy concerns while handling imbalanced data. Finally, the varied performance across datasets suggests a need for meta-learning approaches to predict optimal methods based on dataset characteristics, which could be a future research direction to address the No Free Lunch Theorem’s implications in this domain.


