
BOHR (Big Old Heuristic Repository)


BOHR is a repository of heuristics for the preparation (labeling, grouping, linking, filtering) of software engineering artifacts, e.g. commits, bug reports, code snippets. Researchers in Software Engineering (SE) and Mining Software Repositories (MSR) often need to prepare artifacts mined from software repositories such as GitHub or StackOverflow, converting them into datasets that can be used for empirical experiments and for training machine learning models. An example is classifying commits mined from GitHub into bug-fixing commits and others in order to create a dataset for training a machine-learning model.

Preparing each artifact manually is expensive and does not scale well. Therefore, BOHR offers a way to define heuristics (short functions) that do the job automatically. Even though using heuristics is usually less accurate than letting experts analyze each artifact, we claim that using a large number of diverse heuristics and combining them "smartly" can significantly reduce the noise compared to, for example, using a single heuristic. The way heuristics are combined depends on the type of the task, but one of the most common ways is to use the algorithm of Snorkel, the state-of-the-art weak supervision tool.
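As a minimal illustration (our sketch, not BOHR's internal code): suppose the heuristics have already been applied to a dataset, producing a label matrix with one row per artifact and one column per heuristic, where -1 means the heuristic abstained. Snorkel's label model can then be fit on this matrix to produce denoised probabilistic labels:

import numpy as np
from snorkel.labeling.model import LabelModel

# Made-up label matrix: rows are artifacts, columns are heuristics;
# 1 = BugFix, 0 = NonBugFix, -1 = the heuristic abstained.
L_train = np.array([
    [ 1, -1,  1],
    [ 0,  0, -1],
    [ 1,  1,  1],
    [-1,  0,  0],
])

label_model = LabelModel(cardinality=2, verbose=False)  # two output classes
label_model.fit(L_train, n_epochs=100, seed=0)
probs = label_model.predict_proba(L_train)  # probabilistic label per artifact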

Examples of tasks and heuristics

Commit classification

One example is classifying commits mined from GitHub into "bugfix" and "non-bugfix" in order to create a dataset for training a machine-learning model.

# ... other imports

from bohrapi.core import Heuristic
from bohrlabels.core import OneOrManyLabels

from bohrapi.artifacts import Commit
from bohrlabels.labels import CommitLabel


@Heuristic(Commit)
def bugless_if_many_files_changes(commit: Commit) -> OneOrManyLabels:
    # A commit that touches more than 15 files is most likely not a bug fix;
    # return None to abstain on all other commits.
    if len(commit.commit_files) > 15:
        return CommitLabel.NonBugFix
    else:
        return None
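For illustration, here is one more hedged sketch in the same style (the function name is ours, not from BOHR; commit.message.raw holds the raw commit message, as used by other heuristics in the repository):

@Heuristic(Commit)
def bugfix_if_fix_in_message(commit: Commit) -> OneOrManyLabels:
    # Naive keyword check, for illustration only: "fix" in the message
    # suggests a bug fix; abstain (return None) otherwise.
    if "fix" in commit.message.raw.lower():
        return CommitLabel.BugFix
    return None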

Important things to note:

  1. A heuristic is marked with the Heuristic decorator, and the artifact type to which it is applied is passed to the decorator as a parameter;
  2. The artifact instance is exposed to the heuristic as a function parameter; the properties of the artifact object can be used to implement the heuristic's logic;
  3. The label assigned to the artifact by the heuristic is the result of its execution on the passed artifact object; the heuristic must return one of the labels defined in the BOHR label hierarchy, or None if it abstains on the data point.

Grouping online identities

TBD.

TBD: Insert somewhere later?

BOHR:

  • simplifies the process of adding new heuristics and evaluating their effectiveness;
  • labels the datasets registered with BOHR and automatically updates the labels once heuristics are added;
  • keeps track of the heuristics used for each version of the generated datasets and models and, in general, makes sure they are reproducible and easily accessible by using DVC.

Main Concepts (maybe this is not needed in the README, but rather in the docs)

BOHR is a repository of heuristics, hence a heuristic is the primary concept in BOHR. A heuristic is a sub-program (a Python function) that accepts an artifact, or multiple artifacts of the same or different types.

An Artifact is BOHR's abstraction that represents a software engineering artifact, i.e. a product of SE activities, e.g. code, a commit, a software project, a software repository, an issue report. A Dataset is a collection of artifacts of the same type.

A Task is an abstraction that describes the problem a researcher is working on in terms of BOHR. The input and the output of a task are datasets. Task types are labeling, grouping, linking, and filtering. A task is defined by specifying the artifact type(s) heuristics are applied to, the possible outputs of the heuristics, the strategy for combining heuristics, and the test datasets and metrics used to evaluate the heuristics' effectiveness.

An Experiment is an attempt to solve a task using a specific set of heuristics (and training data if needed).

BOHR workflow

1. For the given task, pull the existing heuristics developed by the community

bohr clone https://github.com/giganticode/bohr-workdir-bugginess

This will clone the so-called BOHR working directory that corresponds to the given <task> into <path>.

2. Check whether the existing labeled datasets are suitable for your purposes.

Every task comes with a trained classifier and a default dataset labeled by this classifier. Check whether the default dataset suits your purposes.

cd bugginess-work-dir && bohr pull default

The path where the dataset is downloaded will be displayed.

3. Label your own dataset with the default classifier.

$ bohr dataset add ~/new_commit_dataset.csv
$ bohr task add-dataset bugginess new_commit_dataset --repro

4. Develop a new heuristic

$ vi heuristics/commit_files.py

5. Debug and tune the heuristic by checking its coverage and accuracy on a stand-alone test set

$ bohr repro

6. Submit a pull request with the new heuristic to the remote BOHR repository

$ bohr upload

The label model is trained and the metrics are calculated on a stand-alone test set as part of a CI pipeline. If the metrics improve, the new heuristic is added to BOHR and becomes available to other researchers.

7. Add a new task

$ bohr task add tangled-commits \
...     -l TangledCommit.NonTangled,TangledCommit.Tangled \
...     --repro

Installation

TODO: add links to other repos

Pre-prints and publications

If you use BOHR in your research, please cite us:

@inproceedings{babii2021mining,
  title={Mining Software Repositories with a Collaborative Heuristic Repository},
  author={Babii, Hlib and Prenner, Julian Aron and Stricker, Laurin and Karmakar, Anjan and Janes, Andrea and Robbes, Romain},
  booktitle={2021 IEEE/ACM 43rd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)},
  pages={106--110},
  year={2021},
  organization={IEEE}
}


bohr's Issues

Why necessarily a bug if there is a GitHub reference in the message?

Why necessarily a bug if there is a GitHub reference in the message? Can't it be a reference to a feature or an enhancement?

@Heuristic(Commit)
def github_ref_in_message(commit: Commit) -> Optional[Labels]:
    """
    >>> github_ref_in_message(Commit("x", "y", "12afbc4564ba", "gh-123: bug"))
    CommitLabel.BugFix
    >>> github_ref_in_message(Commit("x", "y", "12afbc4564ba", "gh 123"))
    CommitLabel.BugFix
    >>> github_ref_in_message(Commit("x", "y", "12afbc4564ba", "GH 123: bug2"))
    CommitLabel.BugFix
    >>> github_ref_in_message(Commit("x", "y", "12afbc4564ba", "GH123: wrong issue reference")) is None
    True
    """
    return CommitLabel.BugFix if GITHUB_REF_RE.search(commit.message.raw) else None
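The regex itself is not shown in the issue; a hypothetical reconstruction consistent with the doctests above might look like:

import re

# Matches "gh-123" and "gh 123" case-insensitively, but not "GH123"
# (a separator between "gh" and the issue number is required).
GITHUB_REF_RE = re.compile(r"\bgh[-\s]\d+", re.IGNORECASE)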

Provide a mechanism to link datasets

The link between these two datasets should be a separate external dataset, which can itself be the result of a task and can therefore be probabilistic.

E.g., in the case of commit and bug report datasets, it is a link between a commit and an issue. Currently we have this link inside the issue dataset. This link was mined independently from BOHR; this, however, should be done inside BOHR, since we can leverage Snorkel to combine existing heuristics and tools, e.g. ReLink, to recover the links.

When using a dataset, we want to be able to use the linked dataset. Currently each Commit artifact loads a related issue if one exists. The dependency should instead be injected by an external "injector" using an external link dataset.
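To make the idea concrete, here is a purely hypothetical sketch (all names, types, and attributes are ours, not BOHR's API) of an external link dataset plus injector:

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CommitIssueLink:
    commit_hash: str
    issue_id: str
    probability: float  # links recovered by heuristics/tools may be probabilistic

class IssueInjector:
    """Attaches linked issues to commits based on an external link dataset."""

    def __init__(self, links: List[CommitIssueLink], issues: Dict[str, object]):
        self._link_by_commit = {link.commit_hash: link for link in links}
        self._issues = issues  # issue_id -> issue artifact

    def inject(self, commit) -> None:
        # The Commit no longer loads its issue itself; the injector decides
        # based on the external link dataset. commit.hash is hypothetical.
        link = self._link_by_commit.get(commit.hash)
        commit.linked_issue = self._issues.get(link.issue_id) if link else None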

Dataset licensing considerations

This issue summarizes all the concerns/tasks related to licensing of datasets used by or produced by BOHR.

Current situation

Currently, the dataset/desc field in bohr.json is used to mention the license of the dataset, along with the attribution and the description of the dataset itself. There is also a dataset/author field.

Ideas what needs to be done/improved:

  • a separate license field for each dataset in bohr.json;
  • a field to specify where the dataset has been downloaded from; this might overlap, though, with the corresponding DVC file if the dataset is "imported";
  • since dataset descriptions are getting bigger, should we consider having a separate file, e.g. datasets.json, for them? Alternatively, should we have a file per dataset? Can we use the dataset stage files (.dvc files) for that (a .dvc file can have a description field)?

Separate repos for Data (datasets, heuristics, labels) and Infrastructure code

We might consider having two distinct repositories in order to have different lifecycles for Bohr data and Bohr infrastructure.

Data repository

  • roughly, should contain: data, pre-processed data, code to preprocess each dataset, mapper, heuristics, labels, intermediate artifacts, metrics, labeled datasets.
  • easy contribution by everyone
  • changes to this repo can change metrics
  • metrics should be rerun when changes are made
  • DVC-controlled
  • not released to PyPI
  • short release cycle

Infrastructure repository

  • easy contribution by organization members
  • changes to this repo should not change any metrics
  • metrics should not be rerun when changes are made
  • not dvc-controlled
  • released to PyPI
  • longer release cycle

TODO:

  • come up with names for both repos (one will remain Bohr, the other will be bohr- or bohr- and bohr-?)
  • think about the interface between the Bohr data repo and infrastructure library
  • think what goes to which repo


Write down steps for debugging heuristics

It might happen that proposed heuristics yield worse metrics even though they seem reasonable. We need a list of steps that can be taken to understand what's wrong, e.g.:

  • Check the individual data points on which the metrics degrade;
  • Check how the weights of the label model changed;
  • etc.

Agree to disagree: label patching

Label hierarchy: the definitions of the leaf labels should always be fixed and agreed upon by all collaborators. However, intermediate labels can have different definitions and different sub-children depending on the task variant.

For the approach to work well, 1) leaf labels should be defined as fine-grained as possible; 2) when defining heuristics, one should use labels that are as precise as possible, e.g. VariableRenaming instead of Refactoring, so that those who don't agree with a more general label can still reuse the heuristic.

There should be a default hierarchy, but a task variant can use a different hierarchy, which can be defined as the default hierarchy plus a patch (e.g., if the patch docfix: bugFix -> nonBugfix is specified, the default hierarchy is used, with the only change that the docfix subtree is moved from bugFix to nonBugfix).
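As a toy sketch (not BOHR's implementation), a hierarchy can be modeled as a child-to-parent mapping, and a patch as a reassignment of a subtree's parent:

# Default hierarchy as child -> parent, using the labels from the example above.
default_hierarchy = {
    "docfix": "bugFix",
    "bugFix": "commitLabel",
    "nonBugfix": "commitLabel",
}

def apply_patch(hierarchy: dict, patch: dict) -> dict:
    # `patch` maps labels to their new parents, e.g. {"docfix": "nonBugfix"}.
    patched = dict(hierarchy)
    patched.update(patch)
    return patched

# Task variant: the docfix subtree is moved from bugFix to nonBugfix.
docfix_variant = apply_patch(default_hierarchy, {"docfix": "nonBugfix"})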

If people don't agree on a leaf label, it shouldn't be a leaf label any more.

Another dimension: labels for different languages

Directories with DVC-tracked files: add a README on how to download the files?

Some directories contain only a .gitignore file because the other files are tracked by DVC and are git-ignored.

For someone browsing the repo on GitHub, it might not be clear how the DVC-tracked files can be obtained. Should we add a README file with a list of the DVC-tracked files in the current directory, along with instructions on how to download them without cloning the project (using dvc.api)?
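For reference, downloading a single DVC-tracked file without cloning might look like this with dvc.api (the file path here is a placeholder):

import dvc.api

# Streams a DVC-tracked file directly from the project's remote storage.
with dvc.api.open(
    "cached-datasets/commits.csv",  # hypothetical path to a dvc-tracked file
    repo="https://github.com/giganticode/bohr",
) as f:
    first_line = f.readline()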

Set up GitHub API response caching proxy-server

We often need to query the GitHub API to get data. However, there is a limit on how many requests per hour one can make with a single token; also, querying on the fly is expensive time-wise. The approach we took with our "bugginess" training set is to preload all the commits and save them into CSV files. Now I have started looking into tools that detect refactorings in a given commit. Unfortunately, one of the tools, RefactoringMiner, does not provide the possibility to pass a git diff to it out of the box; it asks for a GitHub URL or a path to a locally cloned repo. Given this, an alternative to pre-loading commits in a format that might not suit all use-cases is to query the API on the fly again. However, we can make this cheaper by setting up a proxy that caches the GitHub API responses, so that repeated queries do not use up the quota. Another benefit is speed. We can set up the proxy on the same machine we run the pipeline on (ironspeed), so this won't be different from just reading pre-loaded data from the disk.
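The proxy itself is a deployment question; as an illustration of the quota-saving effect (a client-side alternative to a shared proxy, not the proposed setup), the requests-cache library caches identical GitHub API requests transparently:

import requests_cache

# Responses are stored in a local cache; a repeated identical request is
# served from the cache and does not count against the GitHub API quota.
session = requests_cache.CachedSession("github_cache")
resp = session.get("https://api.github.com/repos/giganticode/bohr/commits")
resp_again = session.get("https://api.github.com/repos/giganticode/bohr/commits")
print(resp_again.from_cache)  # True: served from the cache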

Access classifiers for linked artifact from Heuristic

Let's consider an example:

@Heuristic(Commit)
def testfix_if_all_test_files(commit: Commit) -> Optional[Labels]:
    # Label the commit a test fix only if every changed file is a test file;
    # abstain (return None) as soon as a non-test file is encountered.
    for file in commit.files:
        if file.label != CommitFileLabel.TestFile:
            return None
    return CommitLabel.TestFix

TBD

This is a list of ideas for future heuristics:

  • A WIP keyword in the message most probably means work on a feature (see the sketch after this list); example: crdschurch/crds-signin-checkin@839a223
  • Removing a TODO: can it mean that we are dealing with a feature or refactoring rather than a bug fix? Example: crdschurch/crds-signin-checkin@839a223
  • Check if this is the first commit -> InitialCommit label (#153)
  • For change heuristics, untangle formatting changes (e.g. when only wrapping a piece of code with a context manager, only one line of code is effectively added; all the rest just get more indented)
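As a hedged sketch of the first idea (the function name and regex are ours; imports follow the commit-classification example earlier on this page):

import re

WIP_RE = re.compile(r"\bwip\b", re.IGNORECASE)

@Heuristic(Commit)
def wip_keyword_in_message(commit: Commit) -> OneOrManyLabels:
    # A standalone "WIP" in the message most probably means work on a
    # feature, so label the commit NonBugFix; abstain otherwise.
    if WIP_RE.search(commit.message.raw):
        return CommitLabel.NonBugFix
    return None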
