Balancing the Scales: Using GANs and Class Balance for Superior Malware Detection

Attaullah Buriro; Flaminia Luccio
In press

Abstract

Securing network infrastructure requires accurate detection and categorization of malware. Although existing methods report high accuracy, their effectiveness has mostly been validated on a limited subset of malware families or samples. These analyses often focus on the families with the most samples, which can lead to biased and unrepresentative classification results. To address this gap, our study aims to improve the accuracy and robustness of malware detection and categorization systems by investigating the impact of dataset size, class balance, and data augmentation techniques on classifier performance. We demonstrate the efficacy of our approach on a comparatively larger dataset, the Blue Hexagon Open Dataset for Malware AnalysiS (BODMAS), comprising 134k samples. Our analysis, which uses 85 malware families with at least 50 samples each, achieves a highest accuracy of 92.28% with Random Forest as the classifier on the original imbalanced dataset. By employing Generative Adversarial Networks to generate synthetic samples and obtain balanced class distributions, our approach improves the classifier's accuracy to 99.35%.
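The two dataset-preparation steps named in the abstract (keeping families with at least 50 samples, then balancing class sizes with synthetic samples) can be illustrated with a minimal sketch. It only counts labels and does not implement the GAN or the classifier. The function names and toy family names are hypothetical, and balancing every family up to the size of the largest one is an assumption, since the abstract does not state the target size.

```python
from collections import Counter

MIN_SAMPLES = 50  # threshold stated in the abstract


def filter_families(labels, min_samples=MIN_SAMPLES):
    """Return per-family counts for families with at least min_samples samples."""
    counts = Counter(labels)
    return {fam: n for fam, n in counts.items() if n >= min_samples}


def synthetic_needed(kept_counts):
    """Synthetic samples each family would need to match the largest family.

    Assumption: the balancing target is the largest family's size.
    """
    if not kept_counts:
        return {}
    target = max(kept_counts.values())
    return {fam: target - n for fam, n in kept_counts.items()}


# Toy labels with hypothetical family names (not taken from the dataset).
labels = ["fam_a"] * 120 + ["fam_b"] * 60 + ["fam_c"] * 10
kept = filter_families(labels)
deficit = synthetic_needed(kept)
print(kept)
print(deficit)
```

In this toy input, the family with 10 samples falls below the threshold and is dropped, and the per-family deficit is the number of samples a generator would have to supply for that family.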
Proceedings of the 40th ACM/SIGAPP Symposium On Applied Computing (SAC 2025)
Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5092489