Yali

License: GPL v3 | Linting: Pylint

Introduction

Let D be a deep learning model that classifies programs according to the problem they solve. This project evaluates how D behaves on obfuscated code, and in particular how much obfuscation affects the accuracy of D.

[Figure: examples of classifications]

The top of the image above shows the histogram produced by a specific strategy for program 292. This program belongs to class 11 of the POJ-104 dataset. The bottom of the image shows how each model classifies the variations of program 292.


Getting Started

This section describes the steps to reproduce our experiments.

Prerequisites

You need to install the following packages to run this project:

Setup

First, copy the .env.example file to .env at the project's root. You can then set the following environment variables in the .env file:

REPRESENTATION (required): Program embedding used to represent a program. Accepted values:
  • histogram
  • ir2vec
  • milepost
  • cfg
  • cfg_compact
  • cdfg
  • cdfg_compact
  • cdfg_plus
  • programl
MODEL (required): Machine learning model to use. If REPRESENTATION is `cfg`, `cfg_compact`, `cdfg`, `cdfg_compact`, `cdfg_plus` or `programl`, the model must be `dgcnn` or `gcn`. Accepted values:
  • "cnn" (Convolutional Neural Network by Lili Mou et al.)
  • "rf" (Random Forest)
  • "svm" (Support Vector Machine)
  • "knn" (K-Nearest Neighbors)
  • "lr" (Logistic Regression)
  • "mlp" (Multilayer Perceptron)
  • "dgcnn" (Deep Graph CNN)
TRAINDATASET / TESTDATASET: Dataset used in the training/testing phase. TRAINDATASET is required. Leave TESTDATASET empty to use the same dataset in both the training and testing phases.
OPTLEVELTRAIN / OPTLEVELTEST: Optimization level applied to the training/testing dataset. OPTLEVELTRAIN is required. OPTLEVELTEST must be empty if TESTDATASET is empty. Accepted values:
  • O0
  • O3
NUMCLASSES (required): Number of classes in the dataset.
ROUNDS (required): Number of rounds to run the model.
MEMORYPROF (required): Whether a memory profiler will be used. Accepted values:
  • yes
  • no
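
The constraints between these variables are easy to get wrong, so it can help to check a configuration before launching a long experiment. The sketch below is not part of the repository: `validate_config` and the constant names are hypothetical, the variable names are taken from the list above, and the dataset name in the usage example is a placeholder. It accepts both `dgcnn` and `gcn` for graph representations because the MODEL description names both.

```python
# Hypothetical helper, not part of the repository: checks a dict of .env
# values against the rules described in this README.
GRAPH_REPRESENTATIONS = {
    "cfg", "cfg_compact", "cdfg", "cdfg_compact", "cdfg_plus", "programl",
}
REPRESENTATIONS = GRAPH_REPRESENTATIONS | {"histogram", "ir2vec", "milepost"}
GRAPH_MODELS = {"dgcnn", "gcn"}
MODELS = {"cnn", "rf", "svm", "knn", "lr", "mlp"} | GRAPH_MODELS
OPT_LEVELS = {"O0", "O3"}
REQUIRED = (
    "REPRESENTATION", "MODEL", "TRAINDATASET", "OPTLEVELTRAIN",
    "NUMCLASSES", "ROUNDS", "MEMORYPROF",
)


def validate_config(env):
    """Return a list of problems found in env; an empty list means it is valid."""
    problems = []

    # Every required variable must be present and non-empty.
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is required")

    representation = env.get("REPRESENTATION", "")
    model = env.get("MODEL", "")
    if representation and representation not in REPRESENTATIONS:
        problems.append(f"unknown REPRESENTATION: {representation}")
    if model and model not in MODELS:
        problems.append(f"unknown MODEL: {model}")

    # Graph representations only work with the graph models.
    if representation in GRAPH_REPRESENTATIONS and model and model not in GRAPH_MODELS:
        problems.append(f"REPRESENTATION {representation} requires MODEL dgcnn or gcn")

    # OPTLEVELTEST is only meaningful when a separate test dataset is set.
    if not env.get("TESTDATASET") and env.get("OPTLEVELTEST"):
        problems.append("OPTLEVELTEST must be empty when TESTDATASET is empty")

    for name in ("OPTLEVELTRAIN", "OPTLEVELTEST"):
        level = env.get(name, "")
        if level and level not in OPT_LEVELS:
            problems.append(f"{name} must be O0 or O3")

    memory_prof = env.get("MEMORYPROF", "")
    if memory_prof and memory_prof not in {"yes", "no"}:
        problems.append("MEMORYPROF must be yes or no")

    return problems


if __name__ == "__main__":
    config = {
        "REPRESENTATION": "cfg",
        "MODEL": "rf",                 # invalid: cfg needs dgcnn or gcn
        "TRAINDATASET": "my_dataset",  # placeholder dataset name
        "OPTLEVELTRAIN": "O0",
        "NUMCLASSES": "104",
        "ROUNDS": "10",
        "MEMORYPROF": "no",
    }
    for problem in validate_config(config):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets you fix every issue in the .env file in a single pass.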

Next, prepare the environment for the experiments by running the following command:

$ ./setup.sh

This downloads the datasets, builds the Docker image, and creates the folders the project needs.

Running

Now you can run the experiments with the following command:

$ ./run.sh MODE

MODE accepts the following values:

  • all: Run all games, the resource analysis, and the embedding analysis
  • speedup: Run the speedup analysis with the benchmark game
  • embeddings: Run the embedding analysis
  • resources: Run only the resource analysis
  • malware: Run the experiment to detect classes of malware
  • game0: Run Game 0 (We will put the link later)
  • game1: Run Game 1 (We will put the link later)
  • game2: Run Game 2 (We will put the link later)
  • game3: Run Game 3 (We will put the link later)
  • discover: Run the Discover Game (We will put the link later)

This runs the Docker container with the configuration in the .env file.


Statistics

The Statistics folder contains Jupyter Notebooks that plot the data generated by the experiments. Each notebook describes its charts and the steps to produce them. The notebooks are:

  • EmbeddingResults: Presents information about the accuracy of the dgcnn and cnn models with different representations
  • GameResults: Presents information about the 4 games proposed in our work (We will put the link later).
  • ResourceResults: Presents information about resource consumption (memory and time) of each model
  • StrategiesResults: Presents the distance between the histograms of the original programs and the histograms generated by the obfuscators
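
To make the idea of a distance between histograms concrete, here is a minimal sketch. It assumes each histogram maps an instruction name to its count and uses the Euclidean distance. Both are assumptions for illustration: the StrategiesResults notebook may store histograms differently or use another metric, and `euclidean_distance` is a hypothetical name.

```python
import math


def euclidean_distance(original, obfuscated):
    """Euclidean distance between two histograms given as {name: count} dicts.

    A key missing from one histogram is treated as a count of zero.
    """
    keys = set(original) | set(obfuscated)
    squared = sum((original.get(k, 0) - obfuscated.get(k, 0)) ** 2 for k in keys)
    return math.sqrt(squared)


if __name__ == "__main__":
    # Toy histograms with made-up counts, not data from the experiments.
    original = {"add": 3, "load": 4}
    obfuscated = {"add": 6, "load": 8}
    print(euclidean_distance(original, obfuscated))
```

A distance of zero means the obfuscator left the histogram unchanged. Larger values mean the obfuscated program looks less like the original to a histogram-based model.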

Structure

The repository has the following organization:

|-- Classification: "Scripts for the classification process"
|-- Compilation: "Scripts for the compilation process"
|-- Docs: "Repository documentation"
|-- Entrypoint: "Container setup"
|-- Extraction: "Script to extract a program representation and convert CSV to Numpy"
|-- HistogramPass: "LLVM pass to get the histograms"
|-- MalwareDataset: "Malware dataset to support experiments in the project"
|-- Representations: "Scripts to extract different program representations"
|-- Statistics: "Jupyter notebooks"
|-- Volume: "Volume of the container"
    |-- Csv: "CSVs with the histograms"
    |-- Embeddings: "Different representations of programs in the Source folder"
    |-- Histograms: "Histograms in the Numpy format"
    |-- Irs: "LLVM IRs of the programs"
    |-- Results: "Results of the training/testing phase"
    |-- Source: "Source code of the programs"

To Do

We plan the following improvements to the repository:

  • Add the paper link to this README

Contributors

canesche, thais-damasio, vinicpac
