
BugSwarm Overview

BugSwarm is a large-scale software defect dataset together with its mining infrastructure.

The dataset provides thousands of Java and Python software defect artifacts mined from real-world GitHub Actions and Travis CI builds. To facilitate the use of the dataset, it comes with the BugSwarm CLI and the BugSwarm REST API.
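For example, a minimal way to query an artifact with the CLI might look like the following. This is a sketch: it assumes the bugswarm-client package on PyPI and an API token obtained from the BugSwarm website, and the image tag shown is illustrative; see the client instructions for the authoritative usage.

pip3 install bugswarm-client
bugswarm show --image-tag square-okhttp-99350732 --token <your-api-token>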

The infrastructure consists of three major components: Miner, Reproducer, and Cacher.

Datasets play an important role in the advancement of software tools and facilitate their evaluation. BugSwarm is an infrastructure to automatically create a large dataset of real-world reproducible failures and fixes.

For more details, see the BugSwarm website (bugswarm.org).

If you use our infrastructure or dataset, please cite our paper as follows:

@inproceedings{BugSwarm-ICSE19,
  author    = {David A. Tomassi and
               Naji Dmeiri and
               Yichen Wang and
               Antara Bhowmick and
               Yen{-}Chuan Liu and
               Premkumar T. Devanbu and
               Bogdan Vasilescu and
               Cindy Rubio{-}Gonz{\'{a}}lez},
  title     = {BugSwarm: mining and continuously growing a dataset of reproducible
               failures and fixes},
  booktitle = {{ICSE}},
  pages     = {339--349},
  publisher = {{IEEE} / {ACM}},
  year      = {2019}
}

Our second paper, covering the GitHub Actions portion of the pipeline, can be cited as follows:

@inproceedings{ICSE:demo/actions-remaker,
  author    = {Zhu, Hao-Nan and
               Guan, Kevin Z. and
               Furth, Robert M. and
               Rubio-González, Cindy},
  title     = {{ActionsRemaker: Reproducing GitHub Actions}},
  booktitle = {{ICSE-Companion}}, 
  pages     = {11--15},
  publisher = {{IEEE}},
  year      = {2023},
  doi       = {{10.1109/ICSE-Companion58688.2023.00015}}
}

Setting up BugSwarm

Note

You only have to follow the steps below if you want to produce your own artifacts. If you only want to use the BugSwarm artifact dataset, follow the client instructions or our tutorial instead.

  1. System requirements:

    • A machine with x86-64 architecture. (BugSwarm does not support ARM architecture such as Apple silicon.)
    • A Unix-based operating system. (BugSwarm does not support Windows.)
    • The sudo command is installed on the system.
    • You have sudo privileges on the system.
    • The system uses apt-get to manage packages. (Otherwise, you may need to edit provision.sh to make it work correctly, or use the spawner; see below.)
  2. Install the prerequisites:

  3. Clone the repository:

    git clone https://github.com/BugSwarm/bugswarm.git
  4. Set up MongoDB:

    BugSwarm provides a prebuilt Docker image of MongoDB that works with the pipeline. Alternatively, you can build your own image from the provided Dockerfile.

    1. Pull the provided Docker image from the BugSwarm Docker Hub repo:

      docker pull bugswarm/containers:bugswarm-db
      docker tag bugswarm/containers:bugswarm-db bugswarm-db

    Alternatively, build your own BugSwarm MongoDB image from the source Dockerfile:

    1. Change to the database directory:

      cd bugswarm/database
    2. Build the Docker image with the tag as bugswarm-db from the Dockerfile:

      docker build . -t bugswarm-db

    Now that the Docker image is ready:

    1. Run the MongoDB container with its ports published:

      docker run -itd -p 27017:27017 -p 5000:5000 bugswarm-db

      NOTE: If multiple instances of MongoDB are running on the system, you must change the port accordingly. Please see the FAQ.

      On some operating systems, this command exposes the ports so that anyone from the outside world can connect. To prevent this, replace -p 27017:27017 -p 5000:5000 with -p 127.0.0.1:27017:27017 -p 127.0.0.1:5000:5000

    2. Return to the parent directory:

      cd ..
  5. (Recommended) Set up and run the spawner (to run BugSwarm directly on the host, skip to step 6):

    The spawner is a Docker image that contains all the packages required by provision.sh and can spawn pipeline jobs. When using the spawner, the host only needs Docker installed.

    To understand how the spawner works, please see the spawner README.

    1. Pull the spawner image and update the tag:

      docker pull bugswarm/containers:bugswarm-spawner
      docker image tag bugswarm/containers:bugswarm-spawner bugswarm-spawner

      Alternatively, build the spawner using Docker:

      cd spawner
      docker build -t bugswarm-spawner .
    2. Run the container with /var/run/docker.sock mounted and network set to host.

      docker run -v /var/run/docker.sock:/var/run/docker.sock \
          -v /var/lib/docker:/var/lib/docker --net=host -it bugswarm-spawner
    3. Add your user to the Docker group and log in again:

      # Find the GID of the group that owns the Docker socket
      DOCKER_GID=`stat -c %g /var/run/docker.sock`
      # Create a matching group inside the container and add the bugswarm user to it
      sudo groupadd -g $DOCKER_GID docker_host
      sudo usermod -aG $DOCKER_GID bugswarm
      # Switch to the bugswarm user so the group change takes effect
      sudo su bugswarm
    4. Pull the latest changes from the Git repository:

      git pull
  6. If you are using the spawner, run the remaining commands inside the spawner container. If you are using the host, continue on the host.

  7. MongoDB should now be up and running. Test the connection by running the following commands and checking that the output matches:

    $ curl localhost:27017  # MongoDB check -- you can also run `mongosh` if it's installed on the host
    It looks like you are trying to access MongoDB over HTTP on the native driver port.
    
    $ curl -H 'Authorization: token testDBPassword' localhost:5000/v1/artifacts  # Local API check
    {"_items": [], "_links": {"next": {"href": "artifacts?page=2", "title": "next page"}, "parent": {"href": "/", "title": "home"}, "self": {"href": "artifacts", "title": "artifacts"}}, "_meta": {"max_results": 250, "page": 1}}
  8. Step into the root of the BugSwarm repository and configure the necessary credentials:

    1. Make a copy of the credentials file:

      cp bugswarm/common/credentials.sample.py bugswarm/common/credentials.py
    2. Fill in credentials in bugswarm/common/credentials.py:

      DOCKER_HUB_REPO=<DOCKER_HUB_REPO>
      DOCKER_HUB_CACHED_REPO=<DOCKER_HUB_CACHED_REPO>
      DOCKER_HUB_USERNAME=<DOCKER_HUB_USERNAME>
      DOCKER_HUB_PASSWORD=<DOCKER_HUB_PASSWORD>
      GITHUB_TOKENS=<GITHUB_TOKENS>
      TRAVIS_TOKENS=<TRAVIS_CI_TOKEN>
      DATABASE_PIPELINE_TOKEN=<DATABASE_PIPELINE_TOKEN>  # ('testDBPassword' if using Docker image of Mongo)
      COMMON_HOSTNAME=<LOCAL-IPADDRESS>:5000

      These credentials are required for authentication and for accessing the components and APIs used within the BugSwarm pipeline. Please see the FAQ for details regarding the credentials.
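      For example, a hypothetical filled-in configuration for a local setup using the provided MongoDB image might look like the following. The Docker Hub repository names are placeholders, and the exact Python syntax in credentials.py may differ; consult the sample file.

      DOCKER_HUB_REPO=myuser/bugswarm-temp
      DOCKER_HUB_CACHED_REPO=myuser/bugswarm-cached
      DOCKER_HUB_USERNAME=myuser
      DOCKER_HUB_PASSWORD=<DOCKER_HUB_PASSWORD>
      GITHUB_TOKENS=<GITHUB_TOKENS>
      TRAVIS_TOKENS=<TRAVIS_CI_TOKEN>
      DATABASE_PIPELINE_TOKEN=testDBPassword
      COMMON_HOSTNAME=127.0.0.1:5000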

  9. Run the provision script:

    ./provision.sh

    This will provision the environment to use the BugSwarm pipeline. If you only want to produce GitHub Actions artifacts, you can use the --github-actions-only switch to skip installing the extra components needed to produce Travis CI artifacts.

    ./provision.sh --github-actions-only

Miner

BugSwarm mines builds from projects on GitHub that use the continuous integration (CI) services Travis CI and GitHub Actions. We mine fail-pass build pairs, in which the first build of the pair fails and the second, the next build chronologically in the Git history of that branch, passes.

The Miner component consists of the PairFinder, PairFilter, and PairClassifier. PairFinder has a separate version for each CI service we support, while PairFilter and PairClassifier are CI-service-agnostic.

Mine a Project

run_mine_project.sh: Mines job pairs from one or more projects.

This script finds fail-pass job pairs in the given projects, filters out unsuitable pairs, downloads each job's original log, and obtains classification statistics for each job/job pair. It outputs a set of JSON files (one for each repository mined) in pair-classifier/output, and the original logs for each job in pair-filter/original-logs. It also updates the minedBuildPairs database collection.

./run_mine_project.sh --ci <ci> (-r <repo-slug> | -f <repo-slug-file>) [OPTIONS]

Required Options:

  • --ci <CI>: The CI service to mine. Must be either travis or github.
  • -r, --repo <REPO>: The GitHub repository to mine. Cannot be used with -f.
  • -f, --repo-file <FILE>: A file containing a newline-separated list of GitHub repositories to mine. Cannot be used with -r.

Additional Options:

  • -t, --threads <THREADS>: The number of threads to use while mining. Only useful if mining more than one repository. Defaults to 1.
  • -c, --component-directory <DIR>: The directory where the PairFinder, PairFilter, and PairClassifier are located. Defaults to the directory where the script is located.

Example:

./run_mine_project.sh --ci github -r alibaba/spring-cloud-alibaba

The example will mine GitHub Actions job pairs from the "alibaba/spring-cloud-alibaba" project, running them through the Miner component of the BugSwarm pipeline. It will push data to the MongoDB instance you specified and write several .json files after each sub-step. The process should take less than 10 minutes.
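To mine several projects in one run, you can instead pass a repo file. A hypothetical example, where repos.txt is a file you create listing one repository slug per line:

printf 'alibaba/spring-cloud-alibaba\nsquare/okhttp\n' > repos.txt
./run_mine_project.sh --ci github -f repos.txt -t 2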

Reproducer

BugSwarm obtains the original build environment used by Travis CI or GitHub Actions via a Docker image, and generates scripts to run every command that was run in the original build. We match the reproduced build log, which is a transcript of everything that happens at the command line during build and testing, against the historical build log from the target CI service. We do this three times to account for reproducibility and flakiness. Reproducible pairs are then pushed as artifacts to the temporary repository specified as DOCKER_HUB_REPO in credentials.py. The Reproducer automatically runs the Cacher step afterwards, unless the --skip-cacher flag is passed to the script.

Note: Before running the Reproducer, log in to Docker using docker login. Otherwise, the Reproducer and Cacher will not be able to push artifact images to DockerHub. Alternatively, you can run the reproducer scripts with the --no-push flag to avoid pushing images altogether.

Reproduce a Project

run_reproduce_project.sh: Reproduces all job pairs mined from a project given its repo slug.

Required Options:

  • --ci <CI>: The CI service of the pairs to reproduce. Must be either travis or github.
  • -r, --repo <REPO>: The repository of the pairs to reproduce.

Additional Options:

  • -t, --threads <THREADS>: The number of worker threads to reproduce with.
  • -c, --component-directory <DIR>: The directory where the Reproducer is located. Defaults to the directory where the script is located.
  • --reproducer-runs <RUNS>: The number of times to run the reproducer. Use more runs to be more certain about whether a job is reproducible. Defaults to 5.
  • -s, --skip-disk-check: If set, do not verify whether there is adequate disk space (50 GiB by default) for reproducing before running. Possibly useful if you're low on disk space.
  • --skip-cacher: Do not cache the job pairs after reproducing them.
  • --no-push: Do not push any images to DockerHub.

Example:

./run_reproduce_project.sh --ci github -r alibaba/spring-cloud-alibaba -c ~/bugswarm -t 4

The example will attempt to reproduce all job pairs mined from the "alibaba/spring-cloud-alibaba" project. The "-c" argument specifies that the "~/bugswarm" directory contains the BugSwarm components required to run the pipeline, and "-t 4" runs the process with 4 worker threads.
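If you are not logged in to Docker Hub, a local-only variant of the same run, using the --no-push option documented above, might look like:

./run_reproduce_project.sh --ci github -r alibaba/spring-cloud-alibaba -t 4 --no-push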

Reproduce a Pair or Pairs

Generate Pair Input

generate_pair_input.py: Generates job pairs from the given repo slug or from a file containing a list of repos. The optional argument filters allow the user to be selective about which job pairs to reproduce. Each output line has the form: repo-slug, failing-job-id, passing-job-id.

Usage: python3 generate_pair_input.py (-r <repo-slug> | --repo-file <repo-file>) -o <output-path> [options]

Options:
     -r, --repo                         Repo slug for the mined project from which to choose pairs. Cannot be used with --repo-file.
         --repo-file                    Path to file containing a newline-separated list of repo slugs for the mined projects from which to choose pairs. Cannot be used with --repo.
     -o, --output-path                  Path to the file where chosen pairs will be written.
         --include-attempted            Include job pairs in the artifact database collection that we have already attempted to reproduce. Defaults to false.
         --include-archived-only        Include job pairs in the artifact database collection that are marked as archived by GitHub but not resettable. Defaults to false.
         --include-resettable           Include job pairs in the artifact database collection that are marked as resettable. Defaults to false.
         --include-test-failures-only   Include job pairs that have a test failure according to the Analyzer. Defaults to false.
         --include-different-base-image Include job pairs whose passed and failed jobs have different base images. Defaults to false.
         --classified-build             Restrict to job pairs that have been classified as build by the classifier. Defaults to false.
         --classified-code              Restrict to job pairs that have been classified as code by the classifier. Defaults to false.
         --classified-test              Restrict to job pairs that have been classified as test by the classifier. Defaults to false.
         --exclusive-classify           Restrict to job pairs that have been exclusively classified as build/code/test, as specified by their respective options. Defaults to false.
         --classified-exception         Restrict to job pairs that have been classified as containing a certain exception (e.g., IllegalAccessError).
         --build-system                 Restrict to a certain build system.
         --os-version                   Restrict to a certain OS version (e.g., 12.04, 14.04, 16.04).
         --diff-size                    Restrict to a certain diff size range MIN~MAX (e.g., 0~5).

Example:

python3 generate_pair_input.py --repo alibaba/spring-cloud-alibaba --include-resettable --include-test-failures-only --include-archived-only --classified-exception IllegalAccessError -o ./results_output.txt

The example above includes job pairs from the artifact database collection that are marked as resettable or archived-only, keeps only those with a test failure according to the Analyzer, and further restricts them to pairs classified as containing an "IllegalAccessError" exception.

The output file of this script can then be used to define the repo slug, failed job ID, and passed job ID arguments of the below step, Reproduce a Pair.
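For instance, each line of the output file is a CSV record of the form repo-slug,failed-job-id,passing-job-id (the contents below are illustrative; the exact format follows generate_pair_input.py's output), and the file can be passed directly to run_reproduce_pair.sh, described next:

# results_output.txt (illustrative contents):
alibaba/spring-cloud-alibaba,10571587467,10579006004

./run_reproduce_pair.sh --ci github --pair-file ./results_output.txt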

Reproduce the Pairs

run_reproduce_pair.sh: Reproduces a single job pair given the slug for the project from which the job pair was mined, the failed job ID, and the passed job ID.

./run_reproduce_pair.sh --ci <CI> (--pair-file <FILE> | -r <REPO> -f <FAILED_JOB_ID> -p <PASSED_JOB_ID>) [OPTIONS]

Required Options:

  • --ci <CI>: The CI service the pair comes from. Must be either travis or github.
  • --pair-file <FILE>: A CSV file containing fail-pass pairs, such as the ones generated by generate_pair_input.py. Cannot be used with -r, -f, or -p.
  • -r, --repo <REPO>: The repository the job pair comes from.
  • -f, --failed-job-id <JOB_ID>: The failed job ID of the pair to reproduce.
  • -p, --passed-job-id <JOB_ID>: The passed job ID of the pair to reproduce.

Additional Options:

  • -t, --threads <THREADS>: The number of worker threads to reproduce with.
  • -c, --component-directory <DIR>: The directory where the Reproducer is located. Defaults to the directory where the script is located.
  • --reproducer-runs <RUNS>: The number of times to run the reproducer. Use more runs to be more certain about whether a job is reproducible. Defaults to 5.
  • -s, --skip-disk-check: If set, do not verify whether there is adequate disk space (50 GiB by default) for reproducing before running. Possibly useful if you're low on disk space.
  • --skip-cacher: Do not cache the job pairs after reproducing them.
  • --no-push: Do not push any images to DockerHub.

Example:

./run_reproduce_pair.sh --ci github -r alibaba/spring-cloud-alibaba -f 10571587467 -p 10579006004 --reproducer-runs 3

This example will attempt to reproduce the GitHub Actions job pair with failed job ID 10571587467 and passed job ID 10579006004 from the "alibaba/spring-cloud-alibaba" project. We specify that the Reproducer should attempt the pair only 3 times, instead of the default 5. Since the number of threads is not specified, the Reproducer defaults to 1 worker thread.

Cacher

Artifacts with cached dependencies are more stable over time, and this is the form in which artifacts should be added to a dataset. Successfully cached artifacts are pushed to the final repository, specified as DOCKER_HUB_CACHED_REPO in credentials.py, with crucial metadata pushed to MongoDB. If you reproduced job pairs using run_reproduce_pair.sh, they have already been cached unless you provided the --skip-cacher parameter.

Cache Reproduced Pairs or Project

run_cacher.sh: Caches job-pair artifacts from a previous Reproducer run.

./run_cacher.sh --ci <ci> -i <file> [OPTIONS]

Required Options:

  • --ci <CI>: The CI service of the pairs to cache. Must be either travis or github.
  • -i, --input-json <FILE>: A JSON file generated by the Reproducer. These files are named <ci>-reproducer/output/result_json/<task-name>.json, according to the task name given to the reproducer run that generated them. If you used run_reproduce_{pair,project}.sh, then this file will be named one of the following:
    • run_reproduce_project.sh: <owner>-<repo>.json
    • run_reproduce_pair.sh -r ... -f ... -p ...: <owner>-<repo>_<failed-job-id>.json
    • run_reproduce_pair.sh --pair-file ...: <pair-file-without-extension>.json

Additional Options:

  • -t, --threads <THREADS>: The number of concurrent threads to run. Defaults to 1 thread.
  • -c, --component-directory <DIR>: The directory where the Reproducer is located. Defaults to the directory where the script is located.
  • --no-push: Do not update the database or push the image to DOCKER_HUB_CACHED_REPO.
  • -a, --caching-args <ARGS>: A string containing arguments to pass on to CacheMaven.py. See that script's README for a description of these arguments.

Example:

First, log in to a Docker registry with docker login. Then, run:

./run_cacher.sh --ci github -i github-reproducer/output/result_json/spring-cloud-alibaba.json -c ~/bugswarm -a '--separate-passed-failed --no-strict-offline-test'

The example will attempt to cache all reproducible job pairs from the "alibaba/spring-cloud-alibaba" project. The "-c" argument specifies that the "~/bugswarm" directory contains the required BugSwarm components, and the caching script runs with the --separate-passed-failed and --no-strict-offline-test flags. If successful, metadata is pushed to the specified MongoDB and the cached artifact is pushed to the Docker Hub repository specified by DOCKER_HUB_CACHED_REPO. The script tracks successfully cached artifacts, so only the remaining uncached ones are attempted; it is meant to be re-run as necessary with different caching flags to iteratively cache the candidate reproducible artifacts. Successfully cached artifacts then have their metadata inserted into the database and their failed and passed build logs uploaded to it.
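Because the script skips already-cached artifacts, a hypothetical follow-up attempt for the remaining pairs can reuse the same input file with a different combination of the caching flags documented above, for example:

./run_cacher.sh --ci github -i github-reproducer/output/result_json/spring-cloud-alibaba.json -c ~/bugswarm -a '--separate-passed-failed'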

Questions

Visit our FAQ docs page


bugswarm's Issues

Minor problems in README for Setting up BugSwarm

  1. In "Depending on where you cloned the repository, change to the database directory:", $ cd ~/bugswarm/database should be $ cd bugswarm/database; maybe add cd ../.. at the end.
  2. [optional] In “Install MongoDB - Why MongoDB? (Optional)”, maybe add a short explanation of why it is optional (i.e., it can be replaced by the Docker container).
  3. [optional] Can we base the database image on https://hub.docker.com/_/mongo instead of ubuntu:16.04?
  4. [optional] Can we push the MongoDB database image (tag bugswarm-db) to Docker Hub?

Missing Information About Access Token Permissions

In the README on the step for filling in credentials.py, the user is directed to the FAQ for detailed instructions. One of the bullet points is:

GITHUB_TOKENS - A GitHub Access Token to perform Git Operations over HTTPS via the Git API. (used for Mining projects)

What is missing here is which permissions the token needs for BugSwarm to function properly. Refer to step 7 here for more details.

Improve how we store credentials

If I understand it correctly, we currently store credentials locally in a Python file that is included in version control. I think this makes dev work a bit awkward because we need to make sure not to include that credentials file in commits...

The standard practice is to keep a placeholder config file in version control; after cloning the repo, the user copies and renames the config file to something ignored by git and fills in the credentials. For example:

  • env.example (included in version control)
  • env.dev (development credentials, not included in version control)
  • env.prod (production credentials, not included in version control)

Let me know if this makes more sense than our current approach.

Release ARM64 images

I have gone through the BugSwarm dataset images on Docker Hub and found that they do not have any ARM64 tags.

I have looked in the bugswarm repo for the Dockerfiles behind the tags available on Docker Hub, but I am unable to find the exact Dockerfiles for these images.

I have also checked bugswarm.org but didn't find any clue about locating the specific Dockerfiles for these images.

Do you have any plans for releasing ARM64 images?

May I know how the amd64 images are published on Docker Hub?

Please provide pointers on finding the Dockerfiles and adding ARM64 support.

General Questions about the Dataset

Hello, I have the following questions about the dataset. It would be great if you could answer them. Thanks.

  • For a specific pair, the number of tests executed in the passing job has increased relative to the failing job. Do these newly added tests correspond to the fixed version of the code? If so, why were they not executed on the buggy version? For instance, consider the pair with image tag square-okhttp-99350732: the failed job has num_tests_run as 1582, while for the passing job it has increased to 1877.
  • Currently, about 50% of the Java pairs do not have the name(s) of the failing tests. The documentation says that this attribute is not reliable at this time. Have there been any improvements in this area? Can we get the names of the failing tests that are currently missing?

Example Project In README Should Be Smaller

While following the README, the user will want to verify that their environment is set up correctly and to view sample output of the tool. As of now, the example project for this is Flipkart/foxtrot.

This repository is very large and takes a substantial amount of runtime on most systems. This is unnecessary, as the user is most likely looking for quick validation.

I recommend using a simpler/smaller example project.

Incompatible ruby version on Ubuntu 18.04.3

Update (2/11):

  • Cannot reproduce the original issue, but facing a new issue, possibly related to Ruby 2.5.0:
Encountered an error while generating the build script with travis-build: /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require': cannot load such file -- bundler/setup (LoadError)
	from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
	from /home/oacorona/.travis/travis-build/bin/travis:27:in `<main>'.

Currently unreproducible:
Our current implementation of provision.sh uses a command to install ruby-full.
The command, sudo apt-get --assume-yes install ruby-full, installs the correct version of Ruby (2.3.1) on Ubuntu 16.04, but installs 2.5 on Ubuntu 18.04.

The bundler installation that follows for Travis uses Ruby 2.3.1 and generates a file that looks for Ruby version 2.3.1. A user with Ruby 2.5 will see an error similar to: .travis/travis-build/bin/travis: /usr/bin/env: 'ruby2.3': No such file or directory
