github_repository_growth_forecast

Predicting the number of Python and R repositories that will be created on GitHub over the next 5 years

Introduction

On 31 January 2019 (yesterday as I write this), 59,958 new repositories were created on GitHub. Created just 10 years ago, GitHub is the most popular Git repository hosting service and has an ever-increasing growth rate. Growth at this scale brings many technical challenges, so it needs to be observed and predicted in advance to be handled properly.

In addition, the popularity of data science has skyrocketed in the past few years, and so has the number of projects in the field. Python and R account for a large share of data science project development. We therefore look at how the Python and R repository counts grew over the past decade and predict their growth for the next 5 years using time series forecasting.

Methodology

  1. We need historical data on repository counts, so we fetch it through the GitHub GraphQL API.
  2. With the data in hand, we use ARIMA and SARIMAX, two simple time series forecasting models, to forecast the next 5 years.
  3. We then build a Flask API that loads the prediction data and visualizes the forecast using Chart.js.

Implementation

  1. The following GraphQL query was used to fetch the Python and R monthly repository counts.
query {
  search(type: REPOSITORY, query: "language:$language created:$dates") {
    repositoryCount
  }
}

where language is Python or R and dates is a one-month range; for example, dates = 2010-04-01..2010-05-01 covers the month 2010-04.
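The monthly ranges and query strings described above can be generated along these lines. This is a minimal sketch, not the code from prepare_historical_data.py: the helper names `month_ranges` and `build_query` are hypothetical, and it relies on the `$language` and `$dates` placeholders happening to match the syntax of Python's `string.Template`.

```python
from datetime import date
from string import Template

# The query from the README; $language and $dates are filled in per request.
QUERY_TEMPLATE = Template("""
query {
  search(type: REPOSITORY, query: "language:$language created:$dates") {
    repositoryCount
  }
}
""")


def month_ranges(start_year, start_month, end_year, end_month):
    """Return one 'YYYY-MM-01..YYYY-MM-01' range per month.

    The range for a month runs from its first day to the first day of the
    following month, as in the README example. The end month is excluded.
    """
    ranges = []
    year, month = start_year, start_month
    while (year, month) < (end_year, end_month):
        if month == 12:
            next_year, next_month = year + 1, 1
        else:
            next_year, next_month = year, month + 1
        first = date(year, month, 1).isoformat()
        following = date(next_year, next_month, 1).isoformat()
        ranges.append(f"{first}..{following}")
        year, month = next_year, next_month
    return ranges


def build_query(language, dates):
    """Fill the placeholders for one language and one monthly range."""
    return QUERY_TEMPLATE.substitute(language=language, dates=dates)
```

For example, `month_ranges(2010, 4, 2010, 6)` yields the ranges for 2010-04 and 2010-05, and each one can be passed to `build_query("Python", ...)` or `build_query("R", ...)`.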

  1. The query request must also be authenticated to get the higher rate limit (5000 requests per hour). The OAuth token is saved in "token_file.txt" and is referenced in prepare_historical_data.py. Plotting the trends gives a basic intuition.

PYTHON - HISTORICAL DATA

R - HISTORICAL DATA
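An authenticated request for one monthly count might look like the sketch below. It uses only the standard library and assumes the token file holds nothing but the token; the function names are hypothetical and this is not necessarily how prepare_historical_data.py sends its requests.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"


def read_token(path="token_file.txt"):
    """Read the OAuth token, assuming the file contains only the token."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


def build_request(query, token):
    """Build a POST request carrying the GraphQL query and the token."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_repository_count(query, token):
    """Send the query and return the repositoryCount field of the response."""
    with urllib.request.urlopen(build_request(query, token)) as response:
        payload = json.load(response)
    return payload["data"]["search"]["repositoryCount"]
```

Calling `fetch_repository_count(query, read_token())` once per month and language would give the historical series; error handling and rate-limit back-off are left out of the sketch.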

  1. We now use ARIMA, a simple yet powerful time series forecasting model, to predict future trends. We hold out the latest 12 months as test data and observe the plots:

PYTHON - ARIMA PREDICTIONS

R - ARIMA PREDICTIONS

  1. There is clear room for improvement, so we now try a more complex model, SARIMAX, which adds seasonality, and plot the results.

PYTHON - SARIMAX PREDICTIONS

R - SARIMAX PREDICTIONS
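The hold-out split and the two model fits can be sketched as follows, assuming the third-party statsmodels package is installed. The model orders `(1, 1, 1)` and the seasonal order `(1, 1, 1, 12)` are placeholder assumptions for illustration; the README does not state which orders were actually used.

```python
def split_train_test(series, test_size=12):
    """Hold out the latest `test_size` observations as test data."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def forecast_arima(train, steps, order=(1, 1, 1)):
    """Fit a non-seasonal ARIMA model and forecast `steps` months ahead.

    The order is an assumed placeholder, not the project's setting.
    """
    from statsmodels.tsa.arima.model import ARIMA  # third-party dependency

    fitted = ARIMA(train, order=order).fit()
    return list(fitted.forecast(steps=steps))


def forecast_sarimax(train, steps, order=(1, 1, 1),
                     seasonal_order=(1, 1, 1, 12)):
    """Fit a SARIMAX model with a 12-month season and forecast ahead.

    Both orders are assumed placeholders, not the project's settings.
    """
    from statsmodels.tsa.statespace.sarimax import SARIMAX  # third-party

    fitted = SARIMAX(train, order=order,
                     seasonal_order=seasonal_order).fit(disp=False)
    return list(fitted.forecast(steps=steps))
```

With `train, test = split_train_test(monthly_counts)`, each model can forecast `len(test)` steps so that its output lines up with the held-out months.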

To quantify and compare the quality of ARIMA and SARIMAX, we calculate the root-mean-square error (RMSE) on the test set:

Language          Python       R
RMSE - ARIMA      20632.553    1339.829
RMSE - SARIMAX    7295.071     975.583

The table above clearly illustrates that the latter model predicts much better than the former, cutting the test RMSE by roughly 65% for Python and 27% for R, which was also visible in the plots.
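For reference, RMSE is the square root of the mean squared difference between the held-out values and the forecasts. A minimal sketch in plain Python (the function name `rmse` is an assumption, and the project may well compute this with a library instead):

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few badly missed months weigh heavily on the score, and the result is in the same unit as the data (repositories per month).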

  1. We now create a Flask API with /python and /r endpoints to show the best model's predictions using Chart.js. The service can be started by running the following command:
python services.py

Once the service is running, the charts can be viewed at:

http://127.0.0.1:5001/python
http://127.0.0.1:5001/r
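A service of this shape could be sketched as below. This is a simplified assumption, not the contents of services.py: the endpoints here return a Chart.js line-chart configuration as JSON rather than a rendered page, and `chart_config`, `create_app`, and the `forecasts` mapping are hypothetical names. Flask is a third-party dependency.

```python
def chart_config(labels, values, series_label):
    """Build a Chart.js line-chart configuration as a plain dict."""
    return {
        "type": "line",
        "data": {
            "labels": list(labels),
            "datasets": [{"label": series_label, "data": list(values)}],
        },
    }


def create_app(forecasts):
    """Create a Flask app exposing /python and /r.

    `forecasts` is assumed to map "python" and "r" to a
    (labels, values) pair holding the predicted monthly counts.
    """
    from flask import Flask, jsonify  # third-party dependency

    app = Flask(__name__)

    @app.route("/python")
    def python_forecast():
        labels, values = forecasts["python"]
        return jsonify(chart_config(labels, values, "Python repositories"))

    @app.route("/r")
    def r_forecast():
        labels, values = forecasts["r"]
        return jsonify(chart_config(labels, values, "R repositories"))

    return app
```

Running `create_app(forecasts).run(port=5001)` would serve the two endpoints on the port used in the URLs above; a page script could then pass the returned JSON to Chart.js.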
  1. A Dockerfile is also included to simplify running the project in a Docker container.

  2. All dependencies and their versions are listed in requirements.txt.

  3. All the results are recorded for observation.

Observations

  1. There is a general growth in the number of repositories over time, which is the expected trend.
  2. Besides the general trend, there is a clear seasonal component in both the Python and R repository counts, with a peak every March and a trough every December, and the model carried this into its predictions well, as shown below.

Python Predictions
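One rough way to check a seasonal pattern like the March peak and December trough is to compare each month with the average of its own year, which removes most of the growth trend, and then average those ratios per calendar month. The sketch below assumes the counts are held in a dict keyed by 'YYYY-MM'; the function names are hypothetical and this is not part of the original analysis.

```python
from collections import defaultdict


def seasonal_indices(monthly_counts):
    """Average ratio of each calendar month to its year's mean.

    `monthly_counts` maps 'YYYY-MM' strings to counts. Only years with
    all 12 months present are used. An index above 1 means the month
    is typically busier than its year's average.
    """
    by_year = defaultdict(dict)
    for key, count in monthly_counts.items():
        year, month = int(key[:4]), int(key[5:7])
        by_year[year][month] = count

    ratios = defaultdict(list)
    for months in by_year.values():
        if len(months) != 12:
            continue  # skip incomplete years
        year_mean = sum(months.values()) / 12
        if year_mean == 0:
            continue  # avoid dividing by zero for empty years
        for month, count in months.items():
            ratios[month].append(count / year_mean)

    return {month: sum(v) / len(v) for month, v in sorted(ratios.items())}


def peak_and_trough(monthly_counts):
    """Return the calendar months with the highest and lowest index."""
    indices = seasonal_indices(monthly_counts)
    return max(indices, key=indices.get), min(indices, key=indices.get)
```

Applied to the historical counts, `peak_and_trough` returns a pair of month numbers that can be compared against the March and December pattern described above.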
