DFP - DockerFile Patcher

This artifact aims to improve the quality of Dockerfiles by analyzing a given file with the linter Hadolint 1.23.0, retrieving possible patches from a database, and applying them in order of their ranking.

The patching script will suggest patches for various lines in a given Dockerfile, but it won't change the original file.

This repository contains scripts to

  1. Generate patches based on Hadolint violations and a large collection of Dockerfile changes in open-source projects, the MSR18 database (the extended dataset can be found on Zenodo), and
  2. Retrieve and apply these patches to any given Dockerfile (a hypothetical patch record is sketched below)
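
The actual patch schema is defined by the patch database and the code in /dfp and /msr18model; purely as an illustration, a hypothetical patch record could look roughly like the sketch below (all field names here are assumptions, not the real schema):

# Hypothetical sketch of a patch record; NOT the actual schema used by dfp.
from dataclasses import dataclass

@dataclass
class Patch:
    rule: str      # violated Hadolint rule, e.g. "DL3009"
    before: str    # instruction that triggered the violation
    after: str     # replacement mined from Dockerfile changes in the MSR18 data
    rank: float    # ranking score; patches are tried in this order

example = Patch(
    rule="DL3009",
    before="RUN apt-get update",
    after="RUN apt-get update && rm -rf /var/lib/apt/lists/*",
    rank=0.9,
)
print(example)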

Table of contents

There are several ways to get the artifact up and running:

  1. Docker image (recommended)
  2. Docker build
  3. Local

Docker image (recommended)

Pre-requisites: a working Docker installation.

Download the image and start the container:

docker run --rm --name dfp -d mando9/dfp

You now have a running container with the name dfp. Because of the --rm option, the container will remove itself once it is stopped.
To access the container using Bash, the following command can be used:

docker exec -it dfp /bin/bash

Docker build

Pre-requisites: a working Docker installation and a local clone of this repository.

Build the Docker image using

docker build -t dfp .

This will create a Docker image on your local machine with the tag dfp.
Then create a container and run it detached:

docker run --rm --name dfp -d dfp

You now have a running container from your local image with the container name dfp. Because of the --rm option, the container will remove itself once it is stopped.
To access the container using Bash, the following command can be used:

docker exec -it dfp /bin/bash

Local

Pre-requisites: Python 3.9, a local PostgreSQL installation, and Hadolint 1.23.0.

Windows 10 was used for the local setup; if you use another OS, your results may vary. The following uses the default user postgres with password postgres (this can vary between installation methods). If you want to use a different user, change the option -U <user>. You will also need to change the login information in config.ini accordingly.

The patch database can be restored by running

psql -U postgres -e < patch_database.sql

in a terminal (on Windows, PowerShell won't work; use the command prompt). This will create a database dfp containing all patches.
Alternatively, you can create the database yourself and restore the data using

pg_restore -U postgres --dbname dfp patch_database
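
To verify that the restore worked and that the login information in config.ini is correct, a minimal connectivity check can be run from Python. The sketch below assumes the psycopg2 driver and a config.ini section named postgresql with host, dbname, user and password keys; the real section and key names in this repository may differ:

# Minimal connectivity check for the restored patch database (requires psycopg2).
import configparser
import psycopg2

cfg = configparser.ConfigParser()
cfg.read("config.ini")
# "postgresql" and the key names below are assumptions; use the section and
# keys that config.ini in this repository actually defines.
db = cfg["postgresql"]

conn = psycopg2.connect(
    host=db.get("host", "localhost"),
    dbname=db.get("dbname", "dfp"),
    user=db.get("user", "postgres"),
    password=db.get("password", "postgres"),
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print("connected to the patch database:", cur.fetchone())
conn.close()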

When you run the main script using the Dockerfile for this artifact,

# the Python 3.9 executable may be called either "python" or "python3"
python .\dfp_main.py .\Dockerfile

you should get an output like:

Number of violations: 3
Searching for patches for line (DL3009): RUN apt-get update
Trying patches for violation 0: : 82it [00:09,  8.59it/s]

You can then abort the execution using Ctrl+C.

This repository contains scripts for creating patches, running dfp to apply patches, and evaluating it on a test set.

  • dfp_main.py
    When supplied with a Dockerfile, analyzes it, retrieves fitting patches from the patch database, and applies them according to their ranking.
  • plotResults.py
    Used to create result plots from the evaluation.
  • evalTestSet.py
    Runs dfp for the test set.
  • patch_database.sql
    A database dump of the patch database.
  • /testSet
    A collection of 100 Dockerfiles and their linting violations for evaluation.
  • /results
    Contains evaluation results of the test set, once with all patches and once with no custom/manual patches. These results are included, since the evaluation can take several hours.
  • /dbHelper
    Code to connect to the Postgres DB.
  • /dfp
    Contains main code for dfp. Functions to extract patches from the source database, get violations of a Dockerfile and retrieve fitting patches.
  • /linter
    Code to use Hadolint from Python (see the sketch after this list).
  • /msr18model
    Model classes of the source database.
  • /utils
    Other utility code.
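
The authoritative linter wrapper lives in /linter; purely to illustrate the mechanism, invoking Hadolint from Python can look roughly like the sketch below (it assumes a hadolint binary on the PATH and uses Hadolint's JSON output format):

# Illustrative sketch of calling Hadolint from Python; the actual wrapper lives in /linter.
# Assumes the hadolint binary is available on the PATH.
import json
import subprocess
import sys

def lint(dockerfile: str):
    # hadolint exits with a non-zero code when violations are found,
    # so the return code is not treated as an error here.
    result = subprocess.run(
        ["hadolint", "--format", "json", dockerfile],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout or "[]")

if __name__ == "__main__":
    for violation in lint(sys.argv[1]):
        print(violation.get("line"), violation.get("code"), violation.get("message"))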

Main script

The main script analyzes a Dockerfile, queries the patch database and applies patches to find fixes.
Execution can last several minutes, depending on the number of violations in the Dockerfile.
Usage of the main script is as follows:

python dfp_main.py [OPTIONS] DOCKERFILE

with options:

  • -l <violation_file>
    Path to a CSV file containing the result of a linting run on this Dockerfile. These violations will be used for the query.
    Without this option, the script will run the linter before querying patches.
  • -q
    Quiet flag. The script will not output anything.
  • -pl <limit>
    Patch limit. The maximum number of patches to be queried and applied to the Dockerfile.
    Can reduce runtime. Default is 300.

All files with suffix *_dockerfile in /testSet are Dockerfiles to patch.
An example execution would be

python dfp_main.py ./testSet/pID201_dID3718_sID7015_dockerfile
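
The options can also be combined and driven from a script, for example to patch several test-set files quietly with a lowered patch limit. The project's own evaluation driver is evalTestSet.py (see below); the following is only a rough sketch of the idea:

# Rough sketch: run dfp quietly over a few test-set Dockerfiles with the patch
# limit lowered to 100; evalTestSet.py is the project's own evaluation driver.
import glob
import subprocess
import sys

for dockerfile in sorted(glob.glob("./testSet/*_dockerfile"))[:3]:
    print("Patching", dockerfile)
    subprocess.run([sys.executable, "dfp_main.py", "-q", "-pl", "100", dockerfile])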

Processing the test set

This process can take a long time (several hours), since many Dockerfiles are analyzed.
Therefore, pre-computed results are provided in folder /results.
All files in the test set can be processed using

python evalTestSet.py

The script will print some statistics about the evaluation and save the data to a file called evalStats_<current_time>.pkl in the project repository.
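
The structure of the pickled object is whatever evalTestSet.py stored; for a quick look without the plotting script, the file can simply be unpickled and inspected, shown here with one of the pre-computed result files:

# Generic inspection of a pickled result file; only the type and, if present,
# the keys are printed, since the exact structure is defined by evalTestSet.py.
import pickle

with open("./results/resultsWithAllPatches.pkl", "rb") as f:
    results = pickle.load(f)

print(type(results))
if hasattr(results, "keys"):
    print(list(results.keys()))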

Evaluate results

To display the results visually, use the script

python plotResults.py RESULT_FILE

This will show several plots of the result data and print statistical information and LaTeX tables to the console. Plots:

  1. Violation distribution (Figure 9)
    How often each rule violation is found.
  2. Execution times (Figure 10)
    How long the execution takes for one Dockerfile and for one violation.
  3. Fix rate (Figure 11)
    Found violations versus fixed violations.
  4. Impact of the patch limit on fixes
    How limiting the patch query affects the found fixes. Can be found in Table 19.

The plots are stored in the same directory as the result files and are prefixed with the result file name, i.e., for the pre-computed results in the folder /results.

To view pre-computed results containing generated and custom patches, use

python plotResults.py ./results/resultsWithAllPatches.pkl

To view pre-computed results containing only generated patches, use

python plotResults.py ./results/resultsWithOnlyGeneratedPatches.pkl

To copy the plots from the docker container use the following on the host machine (example files for resultsWithAllPatches.pkl):

docker cp dfp:/dfp/results/resultsWithAllPatches_ExecutionTimes.png .             
docker cp dfp:/dfp/results/resultsWithAllPatches_FixRate.png .       
docker cp dfp:/dfp/results/resultsWithAllPatches_RuleDistribution.png .
docker cp dfp:/dfp/results/resultsWithAllPatches_PatchLimitImpact.png .

The dataset used to mine the patches extends the dataset of Structured Information on State and Evolution of Dockerfiles.
A description of its data schema can be found in the linked GitHub repository.
The extended dataset can be downloaded on Zenodo.
Similar to the patch database, the dataset is also a compressed PostgreSQL dump and can be imported with:

pg_restore -U postgres --dbname msr18_extended msr18_extended

The command will restore the database dump as the user postgres into a database with the name msr18_extended.

Important tables of the dataset include (more detailed information on the original schema can be found in the linked repository):

  • Project: A unique GitHub project/repository with at least one Dockerfile (can have multiple)
  • Dockerfile: A unique Dockerfile contained in a GitHub repository
  • Snapshot: A specific version of a Dockerfile

Extensions include:

  • Snapshot violations (snap_violation): Each snapshot was analyzed and the resulting violations are stored in this table
  • Snapshot violation diffs (snap_viol_diff): Changes in violations from one snapshot to another
  • Snapshot vulnerabilities (snap_vuln): Security vulnerabilities based on the security analysis (not all Dockerfiles were analyzed due to time constraints)
  • Snapshot vulnerability diffs (snap_vuln_diff): Changes in vulnerabilities

A SQL script to create the DB schema and a complete Entity-Relationship-Diagram can be found in /dataset.
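
As a quick sanity check after restoring the extended dataset, the tables listed above can be queried directly. The sketch below assumes the psycopg2 driver and the default postgres/postgres credentials used in the commands above; unquoted table names are written in lower case:

# Sanity check for the restored msr18_extended dataset (requires psycopg2).
# Assumes the default postgres/postgres credentials used above; the table names
# follow the schema described in /dataset.
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="msr18_extended",
    user="postgres", password="postgres",
)
with conn.cursor() as cur:
    for table in ("project", "dockerfile", "snapshot", "snap_violation"):
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        print(table, cur.fetchone()[0])
conn.close()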
