
A machine learning experiment for predicting Pull Requests acceptance rate

License: MIT License


pullreq-ml's Introduction

Github PR prediction (lots of code borrowed from pullreq-ml)

This Node/Python library creates ML features for Pull Requests by learning information about a GitHub project. Its aim is to help data scientists build ML models for predicting pull request behaviour.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

To install and run the software you will need the following:

Node 8 or newer
MongoDB 3.2 or newer
Git
A GitHub access token for using the GitHub API. This post explains how to get yours.
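
To illustrate how the access token is typically used, here is a minimal sketch that builds the options for an authenticated GitHub API request. The helper name `buildGithubRequestOptions` and the `User-Agent` value are assumptions made for this example; they are not taken from this repository, whose fetching code may work differently.

```javascript
// Hypothetical helper (not part of this repository): builds the options
// object that Node's https.request() expects for an authenticated call.
function buildGithubRequestOptions(token, apiPath) {
  if (!token) {
    throw new Error('A GitHub access token is required');
  }
  return {
    hostname: 'api.github.com',
    path: apiPath,
    method: 'GET',
    headers: {
      // The GitHub API rejects requests that have no User-Agent header.
      'User-Agent': 'pullreq-ml-example',
      Accept: 'application/vnd.github+json',
      Authorization: `token ${token}`,
    },
  };
}

// Example: options for listing all pull requests of Netflix/pygenie.
const options = buildGithubRequestOptions(
  '<your token here>',
  '/repos/Netflix/pygenie/pulls?state=all'
);
console.log(options.hostname, options.path);
```

Passing the returned object to `https.request` would perform the actual call; that step is left out so the sketch runs without network access.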

Installing & Running

Choose a project to predict. This document uses https://github.com/Netflix/pygenie because it is small, but any project will work, for example the Node project.

Clone this repository onto your machine:

git clone https://github.com/benny-hal/pullreq-ml.git

Install dependencies

cd pullreq-ml # or pullreq-ml-master
npm install

Run MongoDB, for example with Docker:

docker run -p 27017:27017 --name some-mongo -d mongo

Replace the contents of config.js with your actual repository details and database credentials. For example:

 module.exports = {
     // Local Mongo DB
     MONGO_DB_URL: 'mongodb://github:github@localhost:27017/github',
     // Token
     GITHUB_ACCESS_TOKEN: '<your token here>',
     // Repo Information for example for https://github.com/Netflix/pygenie you should put
 }
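
Before running the fetcher, it can help to check that the two settings shown above are filled in. The following is a minimal sketch; `findConfigProblems` is a hypothetical helper written for this example and is not part of the repository. It only covers plain `mongodb://` URLs and assumes that a token containing `<` is still the placeholder.

```javascript
const { URL } = require('url');

// Hypothetical check (not part of this repository): returns a list of
// human-readable problems found in a config object shaped like config.js.
function findConfigProblems(config) {
  const problems = [];

  try {
    const parsed = new URL(config.MONGO_DB_URL);
    if (parsed.protocol !== 'mongodb:') {
      problems.push('MONGO_DB_URL must start with mongodb://');
    }
    if (!parsed.hostname) {
      problems.push('MONGO_DB_URL has no host');
    }
  } catch (err) {
    // new URL() throws when the value is missing or cannot be parsed.
    problems.push('MONGO_DB_URL is not a valid URL');
  }

  const token = config.GITHUB_ACCESS_TOKEN;
  // Assumption: a value that still contains "<" is the unedited placeholder.
  if (!token || token.includes('<')) {
    problems.push('GITHUB_ACCESS_TOKEN is still a placeholder');
  }

  return problems;
}

// Example with the values from the sample config above.
console.log(findConfigProblems({
  MONGO_DB_URL: 'mongodb://github:github@localhost:27017/github',
  GITHUB_ACCESS_TOKEN: '<your token here>',
}));
```

An empty array means neither setting looks wrong; it does not prove that the database is reachable or that the token is valid.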

Clone the target repository inside the targetrepo folder:

cd targetrepo
git clone https://github.com/Netflix/pygenie.git

Start fetching the repository information:
```
node fetch.js
```
Create a features DataFrame (df) for the PRs:
```
node query.js
```
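
To give a rough idea of what one row of such a features table could contain, here is a minimal sketch. It is not the feature set that query.js actually produces: `toFeatureRow` and its output columns are hypothetical, and the sample pull request is made-up data. The input field names (`additions`, `deletions`, `changed_files`, `created_at`, `closed_at`, `merged_at`) follow the GitHub REST API pull request object.

```javascript
// Hypothetical feature extraction (not the repository's actual features).
function toFeatureRow(pr) {
  const created = new Date(pr.created_at).getTime();
  const closed = pr.closed_at ? new Date(pr.closed_at).getTime() : null;
  const msPerHour = 60 * 60 * 1000;

  return {
    number: pr.number,
    linesChanged: pr.additions + pr.deletions,
    filesChanged: pr.changed_files,
    // null while the PR is still open
    hoursOpen: closed === null ? null : (closed - created) / msPerHour,
    // merged_at is only set when the PR was merged (accepted)
    merged: Boolean(pr.merged_at),
  };
}

// Made-up sample pull request, for illustration only.
const samplePr = {
  number: 1,
  additions: 10,
  deletions: 5,
  changed_files: 2,
  created_at: '2020-01-01T00:00:00Z',
  closed_at: '2020-01-02T12:00:00Z',
  merged_at: '2020-01-02T12:00:00Z',
};
console.log(toFeatureRow(samplePr));
```

The `merged` column would play the role of the label in an acceptance-prediction model, with the other columns as inputs.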
The jira-fetcher is still primitive. It uses the Jira client to run a JQL query for tickets that were reopened and have PRs, on the assumption that the first PR in each of those tickets contained a bug. These PRs serve as the buggy examples used to train the algorithm and build a model. This heuristic is not ideal and should be improved. In pull_request.js, an HTTP call scrapes the Jira web interface to get the PR itself, because I couldn't find a way to do that with the client. This should be improved as well!
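
The labelling heuristic described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the jira-fetcher code: the ticket shape (`key`, `reopened`, `pullRequests` with `url` and `createdAt`), the function name `findSuspectPullRequests`, and the sample data are all hypothetical, and "first PR" is interpreted here as the earliest one by creation time.

```javascript
// Hypothetical ticket shape, assumed for this example only:
// { key, reopened, pullRequests: [{ url, createdAt }] }
function findSuspectPullRequests(tickets) {
  return tickets
    // Keep only reopened tickets that have at least one PR.
    .filter((ticket) => ticket.reopened && ticket.pullRequests.length > 0)
    .map((ticket) => {
      // Copy before sorting so the caller's array is not mutated.
      const first = ticket.pullRequests
        .slice()
        .sort((a, b) => new Date(a.createdAt) - new Date(b.createdAt))[0];
      return { ticket: ticket.key, pullRequest: first.url };
    });
}

// Made-up sample data, for illustration only.
const tickets = [
  {
    key: 'PROJ-1',
    reopened: true,
    pullRequests: [
      { url: 'https://example.org/pr/2', createdAt: '2020-02-01T00:00:00Z' },
      { url: 'https://example.org/pr/1', createdAt: '2020-01-01T00:00:00Z' },
    ],
  },
  { key: 'PROJ-2', reopened: false, pullRequests: [] },
];
console.log(findSuspectPullRequests(tickets));
```

A reopened ticket is only a weak signal that its first PR was buggy, so labels produced this way should be treated as noisy.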