Vulnerability Data Library Identification

Replication package for "Automated Identification of Libraries from Vulnerability Data: Can We Do Better?"

Dataset Folder

This folder contains the original dataset used in our experiments. The zip file contains four files:

  • dataset.csv: the original CSV file of the dataset, which has not been cleaned and whose co-occurring labels have not yet been merged
  • dataset_merged_cleaned.csv: the processed version of dataset.csv, cleaned and with co-occurring labels merged
  • cve_labels.csv: the CSV file pairing each CVE ID with its labels, corresponding to the uncleaned, unmerged dataset.csv
  • cve_labels_merged_cleaned.csv: the CSV file pairing each CVE ID with its labels for the cleaned and merged dataset
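
As a quick sanity check, the CSV files can be inspected with pandas. This is a minimal sketch that assumes the files have been extracted from the zip into the working directory; it makes no assumption about the column names:

    import pandas as pd

    # Load the cleaned dataset and its CVE-to-label mapping.
    data = pd.read_csv("dataset_merged_cleaned.csv")
    labels = pd.read_csv("cve_labels_merged_cleaned.csv")

    # Inspect the available columns before relying on any specific column name.
    print(data.columns.tolist())
    print(labels.head())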

How to Use

Each folder in this repository contains the implementation of a different XML (extreme multi-label classification) approach that can be applied to the CVE data to identify the affected libraries. The Utility folder contains utility functions that ease the process of using this repository.

Utility Folder

data_preparation.py: contains functions that can be used to prepare the dataset for the different XML algorithms. Most of these functions take as input the pre-split dataset available in the dataset/splitted folder. Please make sure that this folder exists in the Utility/dataset/ directory before using the functions in data_preparation.py; alternatively, modify the functions to create the folder themselves (see the sketch below).
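
For example, the expected folder can be created up front with a few lines of Python (a minimal sketch; adjust the path if your checkout is laid out differently):

    import os

    # Create the pre-split dataset folder expected by data_preparation.py
    # if it does not exist yet (path as described above).
    os.makedirs("Utility/dataset/splitted", exist_ok=True)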

dataset folder: contains the CVE dataset, both in pre-split form and in the original CSV form. All results from the data_preparation.py functions will be written here.

LightXML

Environment setup

  • Please refer to https://github.com/kongds/LightXML/
  • For easier virtual environment management, I recommend using a conda environment
  • Then, install the requirements listed in requirements.txt
  • Keep in mind that the requirements include NVIDIA Apex (https://github.com/NVIDIA/apex). The package name listed in requirements.txt often resolves to the wrong library, so make sure Apex is installed from the NVIDIA repository (see the check below)
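
The apex name on PyPI points to an unrelated project, so a quick way to verify that NVIDIA Apex was installed correctly is to import it next to PyTorch (a minimal check, not part of the original scripts):

    import torch
    from apex import amp  # fails if the unrelated PyPI "apex" package was installed instead

    # NVIDIA Apex is only useful with a CUDA-enabled PyTorch build.
    print("CUDA available:", torch.cuda.is_available())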

Training and Evaluation

  • To run the training and evaluation script, use the run.sh script
  • ./run.sh cve_data
  • Refer to lines 76-86 of the run.sh script

FastXML

Environment setup

  • I use a Python 3.6 virtual environment for FastXML
  • After creating the virtual environment, install the libraries listed in requirements.txt

Data preparation

  • For FastXML, I use the JSON file structure for the train and test data suggested in the FastXML repository README
  • A data preparation utility is available as the prepare_fastxml_dataset() function in Utilities/data_preparation.py
  • This function makes use of splitted_train_x.npy and splitted_test_x.npy (the pre-split numpy datasets) and cve_labels_merged_cleaned.csv (the CSV file containing all the entries)
  • To keep the dataset consistent, it is best to use the dataset_train.csv and dataset_test.csv available in utilities/dataset/splitted/splitted_dataset_csv.zip and change the merged column to the text that we want to use as the feature
  • Then, to convert these two CSV files into numpy arrays, you can use the split_dataset() function in data_preparation.py
  • After you have created train.json and test.json for FastXML, copy the two files into the FastXML/dataset folder (a sketch of this flow is shown after the list)
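
The sketch below shows the intended flow with the two utility functions named above; the exact signatures are assumptions, so check them in Utilities/data_preparation.py before running:

    # A hypothetical call sequence for producing train.json / test.json; the real
    # argument lists of these functions may differ, so check Utilities/data_preparation.py.
    from data_preparation import split_dataset, prepare_fastxml_dataset

    # 1. Convert the adjusted CSV splits into numpy arrays (assumed arguments).
    split_dataset("dataset/splitted/dataset_train.csv", "dataset/splitted/dataset_test.csv")

    # 2. Build the FastXML-style JSON files from the pre-split arrays and the label file.
    prepare_fastxml_dataset(
        "dataset/splitted/splitted_train_x.npy",
        "dataset/splitted/splitted_test_x.npy",
        "dataset/cve_labels_merged_cleaned.csv",
    )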

Training Process

  • To start the training process, run FastXML/baseline.py. We need to define the run parameters; for a start, you can use the following parameters, which produce results similar to Veracode's implementation:

    model/model_name.model dataset/path_to_train.json --verbose train --iters 200
    --gamma 30 --trees 64 --min-label-count 1 --blend-factor 0.5 --re_split 0 --leaf-probs
  • After the training process is completed, the model will be saved in the FastXML/model folder
  • Then, we run FastXML/baseline.py again for model testing with the following run parameters:

    model/path_to_model_folder dataset/path_to_test.json inference --score

  • Running the above test command will produce FastXML/inference_result.json, which contains the inference results of the model
  • To calculate the precision, recall, and F1 metrics, run FastXML/util.py, which computes the metrics from the inference_result.json file (a generic sketch of these metrics is shown below)
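
For reference, micro-averaged precision, recall, and F1 over predicted and true label sets can be computed as in the following sketch; the actual input format is whatever FastXML/util.py reads from inference_result.json, so the lists here are illustrative:

    # Illustrative micro-averaged precision/recall/F1 over per-CVE label sets.
    def micro_prf(true_labels, predicted_labels):
        tp = fp = fn = 0
        for true, pred in zip(true_labels, predicted_labels):
            true, pred = set(true), set(pred)
            tp += len(true & pred)
            fp += len(pred - true)
            fn += len(true - pred)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: two CVEs with their true and predicted library labels.
    print(micro_prf([["libA", "libB"], ["libC"]], [["libA"], ["libC", "libD"]]))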

Omikuji

Omikuji is the name of the library that provides the implementations of both Bonsai and Parabel. It is fairly straightforward to set up Omikuji, as it is readily available as a library.

Data Preparation

  • Omikuji takes as input training and test data in the form of svmlight files containing the TF-IDF features of the data
  • A data preparation utility is available as the prepare_omikuji_dataset() function in Utilities/data_preparation.py
  • This utility function makes use of the pre-split numpy array dataset (a generic sketch of the conversion is shown after the list)
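
A minimal sketch of that conversion with scikit-learn (toy data, not the repository's actual pipeline); note that Omikuji's data format additionally expects a header line with the number of examples, features, and labels, as described in its README:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.datasets import dump_svmlight_file

    # Toy example: CVE descriptions and their integer-encoded library labels.
    texts = ["imagemagick denial of service", "openssl heap buffer overflow"]
    labels = [(0,), (1, 2)]

    # Build TF-IDF features and write them in svmlight multilabel format.
    features = TfidfVectorizer().fit_transform(texts)
    dump_svmlight_file(features, labels, "train.txt", multilabel=True)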

Environment Setup

  • Install the Python binding of Omikuji as specified in its repository README (https://github.com/tomtung/omikuji/).

    pip install omikuji

  • For the omikuji library, I use a Python 3.8 environment and omikuji version 0.3.2.
  • If there is an error with the omikuji installation, please consider installing Omikuji manually from the repository.
  • The Python binding installed above is used for model prediction.
  • Meanwhile, for model training with Omikuji, I use the Rust implementation that is available through Cargo (refer to the Build & Install section of the Omikuji repository)

Training Process

  • After Omikuji is successfully installed from Cargo, we can use the following commands to train a model:

Parabel Model

omikuji train --model_path model_output_path --min_branch_size 2 --n_trees 3 path_to_dataset

Bonsai Model

omikuji train --cluster.unbalanced --model_path model_output_path --n_trees 3 dataset/train.txt

  • Then, we can use the created models to predict on the test data by running Omikuji/omikuji_predict.py with the model_path and test_data_path run parameters (a sketch of the underlying Python binding calls is shown below)
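
A minimal sketch of what such a prediction step can look like with the Omikuji Python binding, following the usage shown in the Omikuji README (the model path and feature values are placeholders, not the repository's actual omikuji_predict.py):

    import omikuji

    # Load a model previously trained with the Omikuji CLI.
    model = omikuji.Model.load("path_to_model_folder")

    # Features are (feature_index, value) pairs, e.g. TF-IDF weights of one CVE description.
    feature_value_pairs = [(0, 0.23), (5, 0.41), (17, 0.12)]

    # Returns (label, score) pairs for the predicted labels.
    print(model.predict(feature_value_pairs))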

LightXML Deployment

Deploy the LightXML models using Django:

# don't forget to use absolute path
docker run --rm --name=xml --gpus '"device=0,1"' --shm-size 32G -it --mount type=bind,src=<absolute path to the folder>,dst=<folder name>/ -p 8000:8000 username/xml
cd <folder name>/Web/
python manage.py migrate
python manage.py runserver 0.0.0.0:8000

You can try opening 0.0.0.0:8000 in your browser to check the prediction. Alternatively, you can check the prediction from the command line with curl "http://0.0.0.0:8000/predict/?input_text=imagemagick+attackers+service+segmentation" (or from Python, as shown below).
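
The same endpoint can also be queried from Python with requests (a minimal sketch; the exact response format depends on the Django view):

    import requests

    # Query the running Django server with a vulnerability description.
    response = requests.get(
        "http://0.0.0.0:8000/predict/",
        params={"input_text": "imagemagick attackers service segmentation"},
    )
    print(response.status_code, response.text)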
