
dikedataset's Introduction

DikeDataset 🗃️


Description 🖌️

DikeDataset is a labeled dataset containing benign and malicious PE and OLE files.

Considering the number, types, and meanings of the labels, DikeDataset can be used to train artificial intelligence algorithms that predict, for a PE or OLE file, its malice and its membership in each malware family. The approaches can range from machine learning (with algorithms such as regressors and soft multi-label classifiers) to deep learning, depending on the requirements.

It is worth mentioning that the numeric labels, with values between 0 and 1, can be transformed into discrete ones to meet the constraints of standard classification. For example, if the upper malice limit for benign files is set to 0.4, a file with a malice of 0.593 is considered malicious.
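The conversion from a numeric malice to a discrete label can be sketched in Python as follows. The limit of 0.4 comes from the example above; the function and constant names are illustrative, and treating a malice exactly equal to the limit as benign is an assumption.

```python
# Malice limit for benign files, taken from the example above.
BENIGN_MALICE_LIMIT = 0.4


def is_malicious(malice: float, limit: float = BENIGN_MALICE_LIMIT) -> bool:
    """Return True when the numeric malice exceeds the benign limit.

    Assumption: a malice exactly equal to the limit still counts as benign.
    """
    return malice > limit


print(is_malicious(0.593))  # True
print(is_malicious(0.25))   # False
```

A different limit can be passed per call, for example is_malicious(0.593, limit=0.6), which makes it easy to study how the class balance changes with the threshold.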

Labels Exploration 🔍

Samples Distribution

[Image: plot with the distribution of samples]

Labels Identification
Name        Type
type        int64
hash        object
malice      float64
generic     float64
trojan      float64
ransomware  float64
worm        float64
backdoor    float64
spyware     float64
rootkit     float64
encrypter   float64
downloader  float64

Mean, Standard Deviation, Minimum and Maximum
stat  malice     generic    trojan     ransomware  worm       backdoor   spyware     rootkit     encrypter  downloader
mean  0.876484   0.412354   0.44581    0.00503229  0.0086457  0.0117696  0.00030322  0.00614807  0.0719921  0.037945
std   0.0779914  0.0779332  0.0891624  0.0192288   0.0189522  0.0333144  0.00227205  0.0263416   0.0622346  0.0699552
min   0.235294   0.140351   0.05       0           0          0          0           0           0          0
max   0.981132   0.916667   0.76087    0.307692    0.59       0.290323   0.0212766   0.307692    0.3125     0.307692
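
Statistics of this kind can be recomputed from the label files. The sketch below is not the repository's explore.py: it uses only the Python standard library, assumes the CSV header uses the label names listed above, and runs on two made-up rows instead of the real labels/malware.csv.

```python
import csv
import io
import statistics

# Hypothetical rows for illustration only; the real rows live in labels/malware.csv.
SAMPLE_CSV = """type,hash,malice,trojan
0,aaaa,0.90,0.50
0,bbbb,0.80,0.30
"""


def summarize(csv_file, columns):
    """Return mean, sample standard deviation, minimum and maximum per column."""
    rows = list(csv.DictReader(csv_file))
    summary = {}
    for column in columns:
        values = [float(row[column]) for row in rows]
        summary[column] = {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),  # needs at least two rows
            "min": min(values),
            "max": max(values),
        }
    return summary


stats = summarize(io.StringIO(SAMPLE_CSV), ["malice", "trojan"])
print(stats["malice"])
```

To run it on the dataset, open labels/malware.csv with open(..., newline="") and pass the file object together with the list of numeric label names.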
Histograms

[Image: plot containing a histogram for each numeric label]

Methodology 👷

Observation: A Bash script can be used to replicate the downloading and renaming steps. The last two steps, on the other hand, rely on functionality that is available only in dike, namely in this Python script: https://github.com/iosifache/dike/blob/main/codebase/scripts/continuous_vt_scan.py

Downloading Step

  1. For PE files, a dataset created for a paper (see the Sources section) was downloaded. As the files were packed inside multiple folders (one for each malware family considered in the study), they were moved into two new folders according to their malice: one for benign files and one for malicious files.
  2. For malicious OLE files, 12 daily archives (one from the 15th of each of the previous 12 months) were downloaded from MalwareBazaar (see the Sources section). After unarchiving, the files were filtered by certain extensions (.doc, .docx, .docm, .xls, .xlsx, .xlsm, .ppt, .pptx, .pptm).
  3. For benign OLE files, 100 files were manually downloaded from the results of random DuckDuckGo searches.

Renaming Step

  1. All resulting files were renamed to their SHA-256 hash.
  2. For the OLE files, the Office-specific extensions listed in the downloading step were replaced with .ole.

Scanning Step

  1. The hashes of all malicious files were dumped into a file.
  2. The file containing hashes was uploaded into a bucket in Google Cloud Storage.
  3. A Google Cloud Function was created, containing a Python script (see the observation above) and triggered by Google Cloud Scheduler four times per minute (to respect the API quota). It consumed the hashes by scanning them with the VirusTotal API and dumping specific parts of the results (the antivirus engines' votes and tags) into a file.
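
As a rough illustration of the dumping part, the sketch below extracts votes and tags from one scan report. It is not the dike script. It assumes the JSON layout of a VirusTotal API v3 file report (data.attributes.last_analysis_results, with a category and a result per engine), assumes that a "tag" is the detection name returned by an engine, and assumes that both the harmless and the undetected categories count as benign votes. The sample report and its engine names are made up.

```python
def extract_votes_and_tags(report: dict) -> dict:
    """Count engine votes and collect detection names from one file report.

    Assumed layout (VirusTotal API v3):
    report["data"]["attributes"]["last_analysis_results"] maps each engine
    name to a dict with a "category" and a "result" (the detection name).
    """
    results = report["data"]["attributes"]["last_analysis_results"]
    malign_votes = 0
    benign_votes = 0
    tags = []
    for verdict in results.values():
        category = verdict.get("category")
        if category == "malicious":
            malign_votes += 1
            if verdict.get("result"):
                tags.append(verdict["result"])
        elif category in ("harmless", "undetected"):
            # Assumption: both categories count as a benign vote.
            benign_votes += 1
    return {"malign_votes": malign_votes, "benign_votes": benign_votes, "tags": tags}


# Hypothetical report with made-up engine names, for illustration only.
sample_report = {
    "data": {"attributes": {"last_analysis_results": {
        "EngineA": {"category": "malicious", "result": "Trj.Agent"},
        "EngineB": {"category": "undetected", "result": None},
        "EngineC": {"category": "harmless", "result": None},
    }}}
}
print(extract_votes_and_tags(sample_report))
```

Engines in any other category (for example, a timeout) are ignored by this sketch, so they influence neither the malign nor the benign count.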

Labeling Step

  1. The file containing the VirusTotal data, which resulted from the scanning step, was moved locally, where dike was already set.
  2. To compute the malice, the weighted formula below was used, where the MALIGN_BENIGN_RATIO constant was set to 2. This means that one antivirus engine considering the file malicious has the same weight (on a scale) as two engines considering it benign.

     malign_weight = MALIGN_BENIGN_RATIO * malign_votes
     benign_weight = benign_votes
     malice = malign_weight / (malign_weight + benign_weight)

  3. To compute the membership in each malware family, a transformer was developed (see the observation above) to "vote" for each available family. For example, if an antivirus engine tag was Trj, then one vote was cast for the trojan family. All tags were consumed in this way and the votes for all families were normalized.
  4. For the benign files, the process was straightforward, as the malice and the memberships were set to 0.
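
The malice formula and the family voting can be sketched together in Python. The malice function follows the formula above. The voting function is only an approximation of the dike transformer: the keyword table is hypothetical, and dividing each family's votes by the total number of votes is an assumed normalization (it is consistent with the family means in the table above summing to 1).

```python
from collections import Counter

MALIGN_BENIGN_RATIO = 2

# Hypothetical tag fragments per family; the real mapping lives in dike.
FAMILY_KEYWORDS = {
    "trojan": ("trj", "trojan"),
    "ransomware": ("ransom",),
    "worm": ("worm",),
}


def compute_malice(malign_votes: int, benign_votes: int) -> float:
    """Apply the weighted formula; return 0.0 when there are no votes at all."""
    malign_weight = MALIGN_BENIGN_RATIO * malign_votes
    benign_weight = benign_votes
    total = malign_weight + benign_weight
    return malign_weight / total if total else 0.0


def compute_memberships(tags):
    """Cast one vote per matching family for each tag, then normalize."""
    votes = Counter()
    for tag in tags:
        lowered = tag.lower()
        for family, keywords in FAMILY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                votes[family] += 1
    total = sum(votes.values())
    return {
        family: (votes[family] / total if total else 0.0)
        for family in FAMILY_KEYWORDS
    }


print(compute_malice(30, 40))  # 0.6
print(compute_memberships(["Trj.Agent", "Trojan.Generic", "Worm.X"]))
```

With 30 malign and 40 benign votes, the malign weight is 60 and the benign weight is 40, so the malice is 60 / 100 = 0.6. In the second call, two of the three votes go to the trojan family and one to the worm family.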

Sources ©️

  1. Malware Detection PE-Based Analysis Using Deep Learning Algorithm Dataset, containing malicious and benign PE files and released under the CC BY 4.0 license
  2. MalwareBazaar, containing (among others) malicious OLE files and released under the CC0 license
  3. DuckDuckGo, which was used to search for benign documents with patterns such as filetype:doc

Folders Structure 📂

DikeDataset                                 root folder
├── files                                   folder with all samples
│   ├── benign                              folder for benign samples
│   │   └── ...
│   └── malware                             folder for malicious samples
│       └── ...
├── labels                                  folder with all labels
│   ├── benign.csv                          labels for benign samples
│   └── malware.csv                         labels for malicious samples
├── others                                  folder with miscellaneous files
│   ├── images                              folder with generated images
│   │   ├── distribution.png                image with a plot of the distribution of samples
│   │   └── histograms.png                  image containing the histograms for each numeric label
│   ├── scripts                             folder with used scripts
│   │   ├── explore.py                      Python 3 script for labels exploration
│   │   ├── get_files.sh                    Shell script for downloading a large part of the samples
│   │   └── requirements.txt                Python 3 dependencies for the explore.py script
│   ├── tables                              folder with generated tables
│   │   ├── labels.md                       table in Markdown format containing the identification
│   │   │                                   of labels
│   │   └── univariate_analysis.md          table in Markdown format containing the results of a
│   │                                       univariate analysis
│   └── vt_data.csv                         raw VirusTotal scan results
└── README.md                               this file

Citations 📄

DikeDataset was proudly used in:

  • Academic studies with BibTeX references in others/citations.bib
    • "A Corpus of Encoded Malware Byte Information as Images for Efficient Classification"
    • "Adversarial Robustness of Learning-based Static Malware Classifiers"
    • "An ensemble deep learning classifier stacked with fuzzy ARTMAP for malware detection"
    • "AutoEncoder 기반 역난독화 사전학습 및 전이학습을 통한 악성코드 탐지 방법론" [Malware detection methodology through AutoEncoder-based deobfuscation pre-training and transfer learning]
    • "Comparison of Feature Extraction and Classification Techniques of PE Malware"
    • "Deep Learning based Residual Attention Network for Malware Detection in CyberSecurity"
    • "Detecting Malware Activities with MalpMiner: A Dynamic Analysis Approach"
    • "Effective Call Graph Fingerprinting for the Analysis and Classification of Windows Malware"
    • "Evaluation and survey of state of the art malware detection and classification techniques: Analysis and recommendation"
    • "Intelligent Endpoint-based Ransomware Detection Framework"
    • "Knowledge Graph creation on Windows malwares and completion using knowledge graph embedding"
    • "Machine Learning for malware characterization and identification"
    • "Malware Detection by Control-Flow Graph Level Representation Learning With Graph Isomorphism Network"
    • "Malware Detection in URL Using Machine Learning Approach"
    • "SoK: Use of Cryptography in Malware Obfuscation"
    • "Técnicas de aprendizaje máquina para análisis de malware" [Machine learning techniques for malware analysis]
    • "Toward a methodology for malware analysis and characterization for Machine Learning application"
    • "Toward identifying APT malware through API system calls"
    • "Uso de algoritmos de machine learning para la detección de archivos malware" [Use of machine learning algorithms for the detection of malware files]
  • Projects

Notice: If you're using DikeDataset in an academic study or project, please open an issue or submit a PR if you want to be cited in the above list and the citations file.

dikedataset's People

Contributors

iosifache


dikedataset's Issues

Malware classification issues

I looked at the malware.csv file in the labels directory and saw that you classify malware into 9 categories, each with an associated probability. We can assume that the category with the highest probability is the one a sample belongs to. But if we follow this idea, it seems that the last six categories never come up because their probabilities are too small. Is there something wrong with my understanding of the classification probabilities?

Usage of Dataset

Hi George-Andre Iosif,

I am a student currently pursuing a PGDMA in cybersecurity. I am working on a research paper about integrating AI-enabled malware detection into antivirus software, and I would like to use your dataset to train my model, which is why I am raising this request. I would be grateful to be granted permission to use the dataset. Thank you in advance.

Best Regards,
Aj-it22(Ajit Varma)

What should I do from labeling step 3?

To compute the membership on each malware family, a transformer was developed (see the observation above) to "vote" for each available family. For example, if an antivirus engine tag was Trj, then one vote for the trojan family was offered. All tags were consumed in this way and the votes for all families were normalized.

(see the observation above)

https://github.com/iosifache/dike/blob/main/codebase/scripts/continuous_vt_scan.py

I followed this link, but I still don't understand what to do from labeling step 3 onward.

What should I do from labeling step 3?

From: [image]

To: [image]
