Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

This repository contains code and resources related to the research paper titled "Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy" by Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, and Durga Prasad Mohapatra from the National Institute of Technology Rourkela, Odisha, India.

Abstract

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach effectively captures distinctive linguistic patterns among malware and benign families, enabling finer-grained classification. We delve into n-gram size selection, feature representation, and classification algorithms. Evaluating our proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. With our n-gram approach, we achieved an accuracy of up to 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. This technique reduces the feature set to only 1.6% of the original features.
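The abstract does not name the methods that make up the hybrid feature selection, so the following is only a sketch under stated assumptions: a filter stage (chi-squared scores) followed by an embedded stage (random forest importances), run on synthetic data that stands in for a TF-IDF matrix. The stage choices, the matrix size, and the values of `k` and `max_features` are all illustrative; they are picked so that the reduced set is 1.6% of the original, matching the ratio reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for a TF-IDF matrix: 200 samples x 500 non-negative features.
X = rng.random((200, 500))
y = rng.integers(0, 2, size=200)
# Make the first 5 columns informative so the selectors have something to find.
X[:, :5] += y[:, None]

# Assumed two-stage "hybrid" selection (not confirmed by the source):
# 1) filter: keep the 50 features with the highest chi-squared score,
# 2) embedded: keep the 8 most important of those according to a random forest.
selector = Pipeline([
    ("filter", SelectKBest(chi2, k=50)),
    ("embedded", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf,   # disable the importance threshold ...
        max_features=8,      # ... so exactly 8 features are kept
    )),
])

X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 8)
print(f"kept {X_reduced.shape[1] / X.shape[1]:.1%} of the features")
```

Running the cheap filter first keeps the more expensive model-based stage tractable when the n-gram vocabulary is very large.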

Architecture

We present our proposed framework for malware classification using NLP and machine learning techniques. The architecture involves the following steps:

  1. acquiring malware hashes from VirusShare,
  2. querying VirusTotal for a JSON file of antivirus scan results to determine the malware class,
  3. downloading samples of the distinct malware categories,
  4. running dynamic analysis in the Cuckoo sandbox to extract API call sequences in JSON format,
  5. partitioning each JSON report into API name, argument, return, and category text files,
  6. applying n-gram methods that combine API names and arguments,
  7. calculating TF-IDF using the unique n-grams from all categories,
  8. applying hybrid feature selection techniques to obtain a refined feature set,
  9. applying machine learning techniques and adjusting the evaluation criteria on the refined feature set.
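Steps 6 and 7 can be sketched in plain Python as follows. This is a minimal illustration, not the repository's implementation: the `name:argument` token format, the example traces, and the helper names `ngrams` and `tfidf` are assumptions, and the weighting uses the textbook formula tf × log(N / df), which differs from the smoothed variant that scikit-learn's `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Weight each term per document as (count / doc length) * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical API call traces; each token joins an API name and one argument.
traces = [
    ["NtOpenFile:read", "NtReadFile:buffer", "NtClose:handle"],
    ["NtOpenFile:read", "NtReadFile:buffer", "NtWriteFile:buffer", "NtClose:handle"],
    ["RegOpenKeyExW:run", "RegSetValueExW:run", "NtClose:handle"],
]

bigram_docs = [ngrams(trace, 2) for trace in traces]
weights = tfidf(bigram_docs)

# Print the bigrams of the third trace, highest weight first.
for bigram, weight in sorted(weights[2].items(), key=lambda kv: -kv[1]):
    print(" -> ".join(bigram), round(weight, 3))
```

A bigram that appears in only one trace (such as the registry pair above) receives a higher weight than one shared by several traces, which is what makes these features discriminative between categories.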

Installation

To run the code in this repository, you'll need Python 3.x and the following libraries:

  • pandas
  • scikit-learn
  • numpy

You can install these dependencies using pip:

pip install pandas scikit-learn numpy

Data

The dataset used in this research is not included in this repository due to its size and sensitivity. However, you can request it by email at [email protected].

Dataset Used

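| S.No | Type       | Test Samples | Train Samples | Total Samples |
|------|------------|--------------|---------------|---------------|
| 1    | Adware     | 406          | 1580          | 1986          |
| 2    | Backdoor   | 123          | 551           | 674           |
| 3    | Downloader | 495          | 2002          | 2497          |
| 4    | Spyware    | 190          | 756           | 946           |
| 5    | Trojan     | 695          | 2873          | 3568          |
| 6    | Worm       | 277          | 1080          | 1357          |
| 7    | Virus      | 500          | 1892          | 2392          |
| 8    | Benign     | 1724         | 6910          | 8634          |
|      | Total      | 4410         | 17644         | 22054         |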

All data was obtained from VirusShare (Jan 2023 to Jul 2023).
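As a quick sanity check on the table, the short script below recomputes the totals and shows that the split works out to roughly 80/20 train/test, with benign samples forming the largest class. The counts are copied from the table; the variable names are illustrative.

```python
# (test, train) sample counts per class, copied from the table above.
counts = {
    "Adware": (406, 1580),
    "Backdoor": (123, 551),
    "Downloader": (495, 2002),
    "Spyware": (190, 756),
    "Trojan": (695, 2873),
    "Worm": (277, 1080),
    "Virus": (500, 1892),
    "Benign": (1724, 6910),
}

test_total = sum(test for test, _ in counts.values())
train_total = sum(train for _, train in counts.values())
total = test_total + train_total

print(test_total, train_total, total)           # 4410 17644 22054
print(f"test share: {test_total / total:.1%}")  # test share: 20.0%

# Per-class share of the whole dataset, to make the class imbalance visible.
shares = {name: (test + train) / total for name, (test, train) in counts.items()}
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} {share:6.1%}")
```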

Confusion Matrix

[Figures: 11 confusion-matrix images]

Evaluation Metric Table

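| S.No | Classifier                | Accuracy (%) | $F_1$ (%) | Recall (%) | Precision (%) |
|------|---------------------------|--------------|-----------|------------|---------------|
| 1    | Decision Tree             | 98.50        | 97.16     | 97.30      | 97.23         |
| 2    | k-Nearest Neighbors       | 94.40        | 93.16     | 92.16      | 92.65         |
| 3    | Naive Bayes               | 56.26        | 65.49     | 69.98      | 57.70         |
| 4    | SVM Linear                | 97.01        | 94.77     | 95.93      | 95.32         |
| 5    | SVM Polynomial (degree 3) | 40.21        | 27.12     | 13.73      | 9.45          |
| 6    | SVM Polynomial (degree 4) | 40.19        | 39.19     | 13.78      | 9.57          |
| 7    | SVM RBF                   | 66.91        | 73.10     | 55.03      | 55.17         |
| 8    | SVM Sigmoid               | 54.33        | 36.96     | 33.48      | 31.32         |
| 9    | Random Forest             | 98.37        | 97.75     | 96.35      | 97.03         |
| 10   | XGBoost                   | 99.02        | 98.35     | 97.74      | 98.04         |
| 11   | LightGBM                  | 99.02        | 98.31     | 97.60      | 97.95         |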
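For readers who want to reproduce this kind of table, the sketch below derives accuracy, precision, recall, and $F_1$ from a multi-class confusion matrix. The source does not state how the per-class scores were averaged, so macro averaging is an assumption here, and the 3-class matrix is made-up example data rather than a result from the paper.

```python
def macro_metrics(cm):
    """Return (accuracy, precision, recall, f1) with macro averaging.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

acc, pre, rec, f1 = macro_metrics(cm)
print(f"Acc {acc:.2%}  F1 {f1:.2%}  Rec {rec:.2%}  Pre {pre:.2%}")
```

With scikit-learn installed, `accuracy_score` and `precision_recall_fscore_support(..., average="macro")` compute the same quantities directly from label arrays.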

Evaluation Metric Graph

[Figure: evaluation metrics per classifier]

Citation

If you use this code or images and find our work helpful in your research, please consider citing our paper:

@inproceedings{gond2024malware,
  title={Malware Classification Leveraging NLP \& Machine Learning for Enhanced Accuracy},
  author={Gond, Bishwajit Prasad and Rajneekant and Kishore, Pushkar and Mohapatra, Durga Prasad},
  booktitle={Proceedings of the 4th International Conference on Machine Learning and Big Data Analytics (ICMLBDA 2024)},
  year={2024}
}

Copyright © Bishwajit Prasad Gond, 2024

Apache License
This work is licensed under the Apache License, Version 2.0

[email protected]
