
microsoft-office-macro-clustering's Introduction

This repository contains the data files and algorithms for clustering Microsoft Office documents by their macro content. For an in-depth explanation of what this project is about, check out our blog at https://inquest.net/blog/2020/12/16/Clustering-for-Classification. For access to the original documents, please see InQuest Labs or read more about us on the web at https://www.inquest.net.

Table of Contents

  • av_labels/: Directory of JSON files, one per sample, containing AntiVirus labels (if any).
  • macros/: Directory of raw VBA macro files, extracted from the document samples.
  • clustering.ipynb: Jupyter notebook demonstrating K-means clustering over the corpus.
  • classification.csv: CSV representation of hash, AV positive count and label (one of UNKNOWN, MALICIOUS, BENIGN).
  • vba_features.csv: CSV representation of VBA feature vectors extracted from the raw macros above.
  • requirements.txt: Python libraries required to run the notebook.

classification.csv

This file consists of three columns: hash, vt_score, and classification. The vt_score column is the number of engines within VirusTotal that detected the file as malicious. The total number of engines is variable, for a number of reasons, but it is reasonable to treat the total as 60. The number of VT positives required to consider a sample "malicious" is subjective. The third column, classification, is one of "UNKNOWN" (0), "BENIGN" (1), or "MALICIOUS" (2). A number of factors went into the application of these labels, the distribution of which is shown here:

      Key|Ct   (Pct)    Histogram
  UNKNOWN|8055 (80.55%) --------------------------------------------------------
MALICIOUS|1790 (17.90%) -------------
   BENIGN| 155  (1.55%) --
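
The distribution above can be recomputed from classification.csv with a few lines of Python. This is a minimal sketch, not code from the notebook: it assumes the file has a header row naming the columns hash, vt_score, and classification, and the function name label_distribution is made up for illustration. It counts whatever values appear in the classification column, so it works whether the labels are stored as names or as numbers.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file):
    """Return {label: (count, percent)} for the classification column.

    Assumes a header row with a column named "classification".
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row["classification"] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders labels from most to least frequent.
    return {label: (n, 100.0 * n / total) for label, n in counts.most_common()}


# Made-up rows for illustration only; not taken from the real corpus.
sample = io.StringIO(
    "hash,vt_score,classification\n"
    "aaa,0,UNKNOWN\n"
    "bbb,12,MALICIOUS\n"
    "ccc,1,UNKNOWN\n"
    "ddd,0,BENIGN\n"
)
dist = label_distribution(sample)
for label, (count, pct) in dist.items():
    print(f"{label:>9}|{count:4d} ({pct:5.2f}%)")
```

To run it against the real file, pass an open file handle instead of the in-memory sample, for example open("classification.csv", newline="").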

Generally, when you're looking to train a supervised model, you'll want 80% of your data to carry labels. Our ratio here is the opposite, with roughly 80% of the corpus unlabeled, but that's acceptable for an unsupervised model. In fact, the entire goal of this effort is to automatically expand our labels within some threshold of confidence. The labels in classification.csv were applied through a variety of checks and balances to ensure fidelity. Within av_labels/ you can find, for each document, a JSON dictionary containing its AV scan results. This data can be used to generate labels at varying thresholds of confidence. For example, rewriting classification.csv to label any sample with 4 or more AV positives as malicious, 0 positives as benign, and everything in between as unknown will result in about 85% labeling of the corpus.
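
The threshold-based relabeling described above might look like the following. This is a sketch under stated assumptions rather than the procedure used to build the repository's labels: it treats the vt_score column of classification.csv as the AV positive count, assumes a header row with the three column names, and writes labels as names rather than numbers. The function names relabel and rewrite_classification and the parameter malicious_min are hypothetical.

```python
import csv
import io


def relabel(vt_score, malicious_min=4):
    """Map an AV positive count to a label using the thresholds from the text."""
    if vt_score >= malicious_min:
        return "MALICIOUS"
    if vt_score == 0:
        return "BENIGN"
    return "UNKNOWN"


def rewrite_classification(src, dst, malicious_min=4):
    """Copy rows from src to dst, replacing the classification column.

    Returns the fraction of rows that end up with a non-UNKNOWN label.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["hash", "vt_score", "classification"])
    writer.writeheader()
    total = labeled = 0
    for row in reader:
        label = relabel(int(row["vt_score"]), malicious_min)
        writer.writerow(
            {"hash": row["hash"], "vt_score": row["vt_score"], "classification": label}
        )
        total += 1
        labeled += label != "UNKNOWN"
    return labeled / total if total else 0.0


# Made-up rows for illustration only; not taken from the real corpus.
src = io.StringIO(
    "hash,vt_score,classification\n"
    "aaa,0,UNKNOWN\n"
    "bbb,2,UNKNOWN\n"
    "ccc,4,UNKNOWN\n"
    "ddd,31,MALICIOUS\n"
)
dst = io.StringIO()
coverage = rewrite_classification(src, dst)
print(dst.getvalue())
print(f"labeled fraction: {coverage:.2f}")
```

Raising malicious_min trades coverage for confidence: fewer samples are labeled malicious, and more fall back to unknown. To apply this to the real data, open classification.csv for reading and a separate output file for writing, both with newline="", and pass the two handles in place of the in-memory buffers.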

When running the notebook, make sure to run each cell individually from top to bottom for best results. The visualizations are accompanied by sliders that let you adjust them as you see fit.

microsoft-office-macro-clustering's People

Contributors

ashton-sidhu, pedramamini
