Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

This repository contains code and resources related to the research paper titled "Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy" by Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, and Durga Prasad Mohapatra from the National Institute of Technology Rourkela, Odisha, India.

Abstract

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach effectively captures distinctive linguistic patterns among malware families and benign samples, enabling finer-grained classification. We delve into n-gram size selection, feature representation, and classification algorithms. Evaluating our proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. With our n-gram approach, we achieved an accuracy of up to 99.02% (XGBoost and LightGBM), using a hybrid feature selection technique to address high dimensionality. The hybrid feature selection technique reduces the feature set to only 1.6% of the original features.

Architecture

We present our proposed framework for malware classification using NLP and machine learning techniques. The architecture involves:

acquiring malware hashes from VirusShare,
querying VirusTotal for a JSON file with antivirus scan results to determine the malware class,
downloading samples of distinct malware categories,
running dynamic analysis in the Cuckoo sandbox to extract API call sequences in JSON format,
partitioning each JSON report into API name, argument, return value, and category text files,
applying n-gram methods that combine API names and arguments,
calculating TF-IDF using the unique n-grams from all categories,
applying hybrid feature selection techniques to obtain a refined feature set,
applying machine learning techniques and adjusting the evaluation criterion on the refined feature set.
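
The n-gram, TF-IDF, feature selection, and classification steps above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the toy traces, the "ApiName:argument" token format, the n-gram range of 1 to 3, and the particular hybrid selection (a chi-squared filter followed by a random-forest importance cut) are all assumptions, since this README does not specify them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel, SelectPercentile, chi2
from sklearn.pipeline import Pipeline

# Hypothetical toy traces: one string per sample, each call written as
# "ApiName:argument" and separated by spaces (format assumed for illustration).
traces = [
    "RegOpenKeyExW:Run RegSetValueExW:updater CreateFileW:temp.exe WriteFile:temp.exe",
    "RegOpenKeyExW:Run RegSetValueExW:svc CreateFileW:drop.exe WriteFile:drop.exe",
    "InternetOpenA:agent InternetOpenUrlA:host CreateFileW:payload.bin WriteFile:payload.bin",
    "InternetOpenA:agent InternetOpenUrlA:mirror CreateFileW:stage.bin WriteFile:stage.bin",
    "CreateFileW:report.docx ReadFile:report.docx CloseHandle:report.docx",
    "CreateFileW:notes.txt ReadFile:notes.txt CloseHandle:notes.txt",
]
labels = ["Trojan", "Trojan", "Downloader", "Downloader", "Benign", "Benign"]

pipeline = Pipeline([
    # Treat every whitespace-separated call as one token and build
    # 1- to 3-grams over the call sequence, weighted by TF-IDF.
    ("tfidf", TfidfVectorizer(token_pattern=r"\S+", lowercase=False,
                              ngram_range=(1, 3))),
    # Filter stage (assumed): keep the half of the n-grams with the
    # highest chi-squared score against the class labels.
    ("filter", SelectPercentile(chi2, percentile=50)),
    # Embedded stage (assumed): keep n-grams whose random-forest
    # importance is at least the median importance.
    ("embedded", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median")),
    # Final classifier trained on the reduced feature set.
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipeline.fit(traces, labels)

n_total = len(pipeline.named_steps["tfidf"].vocabulary_)
n_kept = int(pipeline.named_steps["embedded"].get_support().sum())
predicted = pipeline.predict(traces)

print(f"kept {n_kept} of {n_total} n-gram features")
print(list(predicted))
```

On real data the selection thresholds would need tuning; the values here only demonstrate how a filter step and a model-based step can be chained to shrink a large n-gram vocabulary.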

Installation

To run the code in this repository, you'll need Python 3.x and the following libraries:

pandas
scikit-learn
numpy

You can install these dependencies using pip:

pip install pandas scikit-learn numpy

Data

The dataset used in this research is not included in this repository due to its size and sensitivity. However, you can request it by email at [email protected].

Dataset Used

S.No	Type	Test Samples	Train Samples	Total Samples
1	Adware	406	1580	1986
2	Backdoor	123	551	674
3	Downloader	495	2002	2497
4	Spyware	190	756	946
5	Trojan	695	2873	3568
6	Worm	277	1080	1357
7	Virus	500	1892	2392
8	Benign	1724	6910	8634
	Total	4410	17644	22054

All data was obtained from VirusShare (Jan 2023 - Jul 2023).
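
As a quick sanity check on the table above, the following sketch recomputes the totals and the share of each class held out for testing. The counts are copied from the table; the variable names are illustrative only.

```python
# (test, train) sample counts per class, copied from the table above.
counts = {
    "Adware": (406, 1580),
    "Backdoor": (123, 551),
    "Downloader": (495, 2002),
    "Spyware": (190, 756),
    "Trojan": (695, 2873),
    "Worm": (277, 1080),
    "Virus": (500, 1892),
    "Benign": (1724, 6910),
}

test_total = sum(test for test, _ in counts.values())
train_total = sum(train for _, train in counts.values())
total = test_total + train_total

# Fraction of each class that is held out for testing.
test_share = {name: test / (test + train)
              for name, (test, train) in counts.items()}

print(f"test={test_total} train={train_total} total={total}")
print(f"overall test share: {test_total / total:.1%}")
for name, share in test_share.items():
    print(f"{name:<11}{share:.1%}")
```

The totals match the last row of the table, and the overall split is close to 80/20. The per-class test share is not perfectly uniform: it ranges from about 18.2% (Backdoor) to about 20.9% (Virus).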

Confusion Matrix

Evaluation Metric Table

S.No	Classifier	Accuracy (%)	$F_1$ (%)	Recall (%)	Precision (%)
1	Decision Tree	98.50	97.16	97.30	97.23
2	k-Nearest Neighbors	94.40	93.16	92.16	92.65
3	Naive Bayes	56.26	65.49	69.98	57.70
4	SVM Linear	97.01	94.77	95.93	95.32
5	SVM Polynomial (degree 3)	40.21	27.12	13.73	9.45
6	SVM Polynomial (degree 4)	40.19	39.19	13.78	9.57
7	SVM RBF	66.91	73.10	55.03	55.17
8	SVM Sigmoid	54.33	36.96	33.48	31.32
9	Random Forest	98.37	97.75	96.35	97.03
10	XGBoost	99.02	98.35	97.74	98.04
11	LightGBM	99.02	98.31	97.60	97.95
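
Metrics of the kind reported in the table can be computed with scikit-learn as sketched below. This README does not state how the per-class scores were averaged, so the macro averaging used here is an assumption, and the `evaluate` helper and the toy labels are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate(y_true, y_pred, average="macro"):
    """Return accuracy, F1, recall, and precision as percentages.

    The averaging mode is an assumption; the README does not specify it.
    """
    acc = accuracy_score(y_true, y_pred)
    pre, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    return {
        "Acc": round(100 * acc, 2),
        "F1": round(100 * f1, 2),
        "Rec": round(100 * rec, 2),
        "Pre": round(100 * pre, 2),
    }


# Hypothetical toy labels: one Trojan sample is misclassified as Worm.
y_true = ["Benign", "Benign", "Trojan", "Trojan", "Worm", "Worm"]
y_pred = ["Benign", "Benign", "Trojan", "Worm", "Worm", "Worm"]

result = evaluate(y_true, y_pred)
print(result)
```

Passing `average="weighted"` instead would weight each class by its number of samples, which can change the reported values noticeably on an imbalanced dataset such as this one.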

Evaluation Metric Graph

Citation

If you use this code or images and find our work helpful in your research, please consider citing our paper:

@inproceedings{gond2024malware,
  title={Malware Classification Leveraging NLP \& Machine Learning for Enhanced Accuracy},
  author={Gond, Bishwajit Prasad and Rajneekant and Kishore, Pushkar and Mohapatra, Durga Prasad},
  booktitle={Proceedings of the 4th International Conference on Machine Learning and Big Data Analytics (ICMLBDA 2024)},
  year={2024}
}

This work is licensed under the Apache License, Version 2.0

[email protected]
