Github Miner

A tool to search GitHub repositories and process the information found in each of them

Pre-requisites

  • Python 3.8+
  • Multi-node CloudLab Ubuntu cluster

1. Access the master node of your cluster and configure scripts

Install Python 3.8

sudo apt-get update && sudo apt-get install -y software-properties-common && sudo add-apt-repository -y ppa:deadsnakes/ppa && sudo apt-get update && sudo apt-get install -y python3.8 && python3.8 --version

Install pip for Python 3.8

sudo apt-get install -y python3.8-distutils && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && sudo python3.8 get-pip.py && pip3 --version

Clone the miner scripts

Run the clone command:

git clone https://github.com/ssmtariq/github_miner.git

Then change into the repository directory:

cd github_miner

Configure the Python Git client with your Git username, email and token.

Open the file analyzer/pygitclient.py and set values for the variables username, token and email
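If you would rather not hard-code credentials in the file, one option is to read them from environment variables. The sketch below is only an illustration of that idea: the actual layout of analyzer/pygitclient.py may differ, and the environment variable names GITHUB_USERNAME, GITHUB_TOKEN and GITHUB_EMAIL are hypothetical, not something the scripts define.

```python
import os


def load_git_credentials(env=os.environ):
    """Return (username, token, email), failing loudly if any value is missing."""
    # Hypothetical variable names, chosen for this example only.
    names = ("GITHUB_USERNAME", "GITHUB_TOKEN", "GITHUB_EMAIL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# Possible use inside pygitclient.py (assumption about how the file is laid out):
# username, token, email = load_git_credentials()
```

Failing early on a missing value avoids starting a long analysis run that later stops on an authentication error.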

2. Fetch GitHub repositories and export them to a CSV file

Use the following command to fetch repositories from GitHub and export them to the file github_repositories.csv

python repo_fetcher.py --language <LANGUAGE> --stars <MINIMUM_STARS> --forks <MINIMUM_FORKS> --last_commit <LAST_UPDATE_DATE> --result_limit <NUMBER_OF_REPO>

For example:

python repo_fetcher.py --language Python --stars 20 --forks 0 --last_commit 2010-01-01 --result_limit 2000

You can also run the script without any filters, as in the command below. In that case it applies the filter values shown above by default, with no result limit.

python repo_fetcher.py
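To make the filters more concrete, here is a minimal sketch of how such options could be turned into a GitHub repository search request. It is not taken from repo_fetcher.py: the function name build_search_url is hypothetical, and mapping --last_commit to the pushed: qualifier is an assumption. The language:, stars:, forks: and pushed: qualifiers and the /search/repositories endpoint are part of GitHub's public Search API.

```python
from urllib.parse import urlencode


def build_search_url(language="Python", stars=20, forks=0,
                     last_commit="2010-01-01", per_page=100, page=1):
    """Build a GitHub repository search URL from the command-line style filters."""
    qualifiers = [
        f"language:{language}",
        f"stars:>={stars}",
        f"forks:>={forks}",
        f"pushed:>={last_commit}",  # assumption: last_commit maps to the last push date
    ]
    params = {
        "q": " ".join(qualifiers),
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # the API allows at most 100 results per page
        "page": page,
    }
    return "https://api.github.com/search/repositories?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url())
```

Keep in mind that GitHub's Search API returns at most 1,000 results per query, so a larger result limit requires splitting the search into several narrower queries, for example by star range.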

3. Run the commit analyzer on a multi-node cluster using parallel-ssh

Create the hosts file

Add the IP address of every node to the sshhosts file, one address per line
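A typo in the hosts file only shows up later as a failed connection, so it can help to validate the entries first. The following sketch assumes the file contains plain IP addresses only, one per line, as described above; the helper name parse_hosts is hypothetical and not part of the repository.

```python
import ipaddress


def parse_hosts(text):
    """Return the IP addresses found in hosts-file text, one per line."""
    hosts = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        ipaddress.ip_address(entry)  # raises ValueError if the entry is not a valid IP
        hosts.append(entry)
    return hosts


# Possible use (assumes the sshhosts file is in the current directory):
# with open("sshhosts") as handle:
#     print(parse_hosts(handle.read()))
```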

Ensure inter-node ssh permissions

First, run the authenticator script on each node:

sh authenticator.sh

Check if all worker nodes are accessible from the master node

Run the following command, which uses parallel-ssh to print the hostname of each node. Do not run any further parallel commands until this one executes successfully.

parallel-ssh -i -h sshhosts -O StrictHostKeyChecking=no hostname
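If you script the setup, this check can act as a gate before the later steps. The sketch below wraps the same command in Python and assumes that parallel-ssh exits with status 0 only when every host succeeded; the function name all_nodes_reachable and the injectable run parameter are hypothetical conveniences, not part of the repository.

```python
import subprocess


def all_nodes_reachable(hosts_file="sshhosts", run=subprocess.run):
    """Run `hostname` on every node and report whether parallel-ssh exited with 0."""
    cmd = [
        "parallel-ssh", "-i", "-h", hosts_file,
        "-O", "StrictHostKeyChecking=no", "hostname",
    ]
    # `run` is injectable so the check can be exercised without a real cluster.
    result = run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout)  # show the per-host output to help find the failing node
    return result.returncode == 0


# Possible use before any further parallel commands:
# if not all_nodes_reachable():
#     raise SystemExit("Fix ssh access to the worker nodes first")
```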

Install Python 3.8 on all worker nodes

Run the following command only if Python 3.8 is not already installed

parallel-ssh -A -i -h sshhosts 'sudo apt-get update && sudo apt-get install -y software-properties-common && sudo add-apt-repository -y ppa:deadsnakes/ppa && sudo apt-get update && sudo apt-get install -y python3.8'

Check and confirm the Python 3.8 version on all worker nodes

parallel-ssh -A -i -h sshhosts 'python3.8 --version'

Install pip for Python 3.8 on all worker nodes

Run the following command only if pip for Python 3.8 is not already installed

parallel-ssh -A -i -h sshhosts 'sudo apt-get install -y python3.8-distutils && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && sudo python3.8 get-pip.py'

Check and confirm the pip version for Python 3.8 on all worker nodes

parallel-ssh -A -i -h sshhosts 'pip3 --version'

Install the dependencies required by the scripts

parallel-ssh -A -i -h sshhosts 'pip3 install pydriller pygit2'

Confirm the installations of required dependencies

parallel-ssh -A -i -h sshhosts 'pip3 show pydriller && pip3 show pygit2'

Clone the github_miner repository on all nodes to upload the commit analysis results

parallel-ssh -i -h sshhosts 'git clone https://github.com/ssmtariq/github_miner.git'

Run the script to split and distribute the input files among the nodes

python3.8 task_parallelizer.py your_repo_list your_username

For example

python3.8 task_parallelizer.py repository_lists/github_repositories_Python_10302023.csv ssmtariq
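As a rough illustration of the splitting step, the sketch below divides a repository CSV into one part per node in round-robin order, repeating the header row in every part. It is an assumption about the general approach, not the code in task_parallelizer.py; the function name split_csv_text is hypothetical, and copying the parts to the nodes is left out.

```python
import csv
import io


def split_csv_text(csv_text, n_nodes):
    """Split CSV text into n_nodes parts, each starting with the original header."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    buckets = [[] for _ in range(n_nodes)]
    for index, row in enumerate(reader):
        buckets[index % n_nodes].append(row)  # round-robin assignment
    parts = []
    for bucket in buckets:
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(bucket)
        parts.append(out.getvalue())
    return parts
```

Round-robin assignment keeps the parts within one row of each other in size, which helps when the input is sorted by stars and the largest repositories would otherwise all land on the first node.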

Execute the analyzer on multiple nodes in parallel

Run the following command to execute the analyzer using parallel-ssh. If you are prompted for a password, press Enter to skip it.

parallel-ssh -A -i -h sshhosts 'python3.8 github_miner/analyzer/repo_analyzer.py --username your_github_username --token your_github_token --email your_github_email'
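For readers unfamiliar with pydriller, which is installed as a dependency above, here is a minimal sketch of the kind of per-repository commit summary it makes possible. This is not the logic of repo_analyzer.py, whose output is not described here; the function names summarize_commits and analyze_repo are hypothetical. Repository(...).traverse_commits() and the commit attributes author.name, insertions and deletions come from pydriller's public API.

```python
def summarize_commits(commits):
    """Aggregate commit count, distinct authors and changed lines from commit objects."""
    authors = set()
    total = insertions = deletions = 0
    for commit in commits:
        total += 1
        authors.add(commit.author.name)
        insertions += commit.insertions
        deletions += commit.deletions
    return {
        "commits": total,
        "authors": len(authors),
        "insertions": insertions,
        "deletions": deletions,
    }


def analyze_repo(path_or_url):
    """Summarize one repository; needs pydriller and, for URLs, network access."""
    from pydriller import Repository  # imported lazily so the summary logic stands alone
    return summarize_commits(Repository(path_or_url).traverse_commits())
```

Keeping the aggregation separate from the pydriller call makes it possible to try the summary on a few hand-made commit objects before running it against real repositories on the cluster.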
