Cyber Security Data Science: Machine Learning Methods and their Performance on Imbalanced Datasets

This paper systematically evaluates machine learning classifiers and imbalance learning techniques on two cybersecurity datasets, revealing that XGB and RF perform robustly, while sampling and ensembling effects vary, emphasizing the need for dataset-specific method selection.

Supervised Learning, Classification, Data Augmentation, Efficiency, Robustness, AI in Security

Mateo Lopez-Ledezma, Gissel Velarde

Universidad Privada Boliviana, Bolivia, IU International University of Applied Sciences, Erfurt, Germany

Generated by grok-3

Background Problem

Cybersecurity is increasingly critical due to the rising cost of cybercrime, estimated at 400 billion USD globally. Many cybersecurity applications, such as fraud detection and anomaly detection, are formulated as binary classification problems where the positive class (e.g., fraudulent transactions) is significantly underrepresented compared to the negative class, creating imbalanced datasets. This imbalance poses a challenge for machine learning algorithms, as learning patterns from underrepresented samples is difficult, often leading to poor detection of critical events. This study aims to address the problem of imbalance learning in cybersecurity by evaluating the effectiveness of various machine learning methods and imbalance handling techniques on representative datasets.

Method

The study employs a systematic approach to evaluate machine learning methods on imbalanced datasets in cybersecurity, focusing on fraud detection. The core idea is to compare the performance of single classifiers and imbalance learning techniques to identify effective strategies for handling imbalanced data. The methodology includes three experiments:

Experiment 1 - Single Classifiers: Six classifiers are tested—Random Forests (RF), Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGB), Logistic Regression (LR), Decision Tree (DT), and Gradient Boosting Decision Tree (GBDT). Each classifier undergoes grid search with 5-fold stratified cross-validation on the training set (80-20 split) to optimize the F1 score.
Experiment 2 - Sampling Techniques: Four imbalance learning techniques are applied—Over-Sampling, Under-Sampling, Synthetic Minority Over-sampling Technique (SMOTE), and no sampling. Classifiers are retrained on resampled datasets, and performance is compared with and without optimization on the resampled data.
Experiment 3 - Self-Paced Ensembling (SPE): SPE, an ensemble method focusing on hard examples, is tested with varying numbers of base classifiers (N=10, 20, 50) to assess its impact on performance. Datasets are preprocessed to remove duplicates and handle missing values, and evaluation metrics focus on Precision, Recall, and F1 score due to the inadequacy of accuracy in imbalanced scenarios.

Experiment

The experiments are conducted on two highly imbalanced datasets: Credit Card (283726 samples, imbalance ratio 598.84:1, 0.2% fraud) and PaySim (6362620 samples, imbalance ratio 773.70:1, 0.13% fraud), chosen for their relevance to cybersecurity fraud detection. The setup includes a stratified 80-20 train-test split and 5-fold cross-validation for robustness. Results are as follows:

Experiment 1 (Single Classifiers): XGB and RF consistently outperform others in F1 score on both datasets, with XGB being faster. DT excels on PaySim but not on Credit Card, while LGBM underperforms across both.
Experiment 2 (Sampling Techniques): Sampling effects vary; Over-Sampling often improves recall but reduces precision, SMOTE generally worsens F1 except for GBDT with optimization on resampled data, and Under-Sampling deteriorates performance due to information loss.
Experiment 3 (SPE): SPE with N=20 base classifiers yields optimal results on Credit Card, while on PaySim, performance varies by classifier (e.g., XGB best at N=10, DT at N=50). Increasing N beyond 20 often reduces recall. The experimental design is comprehensive in testing multiple methods, but results do not consistently match expectations (e.g., SMOTE’s negative impact), and the lack of feature analysis or dataset-specific insights limits deeper understanding. Execution times show XGB as the fastest, with Under-Sampling reducing computational cost but at the expense of performance.

Further Thoughts

The paper provides a valuable empirical comparison of machine learning methods for imbalanced datasets in cybersecurity, but it raises questions about the broader applicability of the findings. For instance, how do these methods perform under evolving cyber threats, where attack patterns change over time? This could be explored by integrating continual learning paradigms to adapt models dynamically. Additionally, the lack of feature importance analysis misses an opportunity to understand which attributes (e.g., transaction amount, time) are most predictive of fraud, potentially linking to interpretability studies in AI ethics. A connection to federated learning could also be insightful, as cybersecurity often involves sensitive data across institutions, and distributed learning might mitigate privacy concerns while handling imbalanced data. Finally, the varied performance across datasets suggests a need for meta-learning approaches to predict optimal methods based on dataset characteristics, which could be a future research direction to address the No Free Lunch Theorem’s implications in this domain.