A tool to search GitHub repositories and process the information found in each of them
- Python 3.8+
- Multinode CloudLab Ubuntu Cluster
sudo apt-get update && sudo apt-get install -y software-properties-common && sudo add-apt-repository -y ppa:deadsnakes/ppa && sudo apt-get update && sudo apt-get install -y python3.8 && python3.8 --version
sudo apt-get install -y python3.8-distutils && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && sudo python3.8 get-pip.py && pip3 --version
Clone the repository:
git clone https://github.com/ssmtariq/github_miner.git
Change into the repository directory: cd github_miner
Open the file analyzer/pygitclient.py and set values for the variables username, token, and email
Use the following command to fetch repositories from GitHub and export them to the file github_repositories.csv:
python repo_fetcher.py --language <LANGUAGE> --stars <MINIMUM_STARS> --forks <MINIMUM_FORKS> --last_commit <LAST_UPDATE_DATE> --result_limit <NUMBER_OF_REPO>
For example:
python repo_fetcher.py --language Python --stars 20 --forks 0 --last_commit 2010-01-01 --result_limit 2000
You can also run the script without any filters, as in the command below. By default it then applies the filter values from the example above, with no result limit:
python repo_fetcher.py
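As a rough illustration of how these options and defaults could fit together, here is a minimal sketch using argparse. It is not taken from repo_fetcher.py: the function names build_parser and build_search_query are hypothetical, the defaults simply mirror the example command above, and mapping --last_commit to GitHub's pushed: search qualifier is an assumption about how the filter might be applied.

```python
import argparse


def build_parser():
    """Hypothetical parser; defaults mirror the example command, not the real script."""
    parser = argparse.ArgumentParser(
        description="Fetch GitHub repositories matching the given filters"
    )
    parser.add_argument("--language", default="Python")
    parser.add_argument("--stars", type=int, default=20)
    parser.add_argument("--forks", type=int, default=0)
    parser.add_argument("--last_commit", default="2010-01-01")
    # None stands for "no limit" in this sketch.
    parser.add_argument("--result_limit", type=int, default=None)
    return parser


def build_search_query(args):
    """Combine the filters into a GitHub search query string.

    Assumption: --last_commit is expressed with the `pushed:` qualifier.
    """
    return (
        f"language:{args.language} "
        f"stars:>={args.stars} "
        f"forks:>={args.forks} "
        f"pushed:>={args.last_commit}"
    )


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--stars", "50"])
print(build_search_query(args))
# prints: language:Python stars:>=50 forks:>=0 pushed:>=2010-01-01
```

Any option that is omitted falls back to its default, which is the behavior described for running the script with no arguments.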
Add the IP addresses of all nodes to the sshhosts file, one per line
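A malformed entry in sshhosts tends to surface only later as a failed parallel-ssh call, so it can help to check the file first. The sketch below assumes the file contains bare IP addresses, one per line, as described above; the helper name parse_hosts and the sample addresses are hypothetical.

```python
import ipaddress


def parse_hosts(text):
    """Return the IP addresses found in an sshhosts-style listing (one per line)."""
    hosts = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        try:
            ipaddress.ip_address(entry)  # raises ValueError on a malformed address
        except ValueError:
            raise ValueError(
                f"line {line_number}: {entry!r} is not a valid IP address"
            ) from None
        hosts.append(entry)
    return hosts


# Sample content standing in for the sshhosts file.
sample = "10.10.1.1\n10.10.1.2\n\n10.10.1.3\n"
print(parse_hosts(sample))
# prints: ['10.10.1.1', '10.10.1.2', '10.10.1.3']
```

To check the real file, read it with open("sshhosts").read() and pass the result to parse_hosts.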
First, execute the authenticator script on each node
Run command: sh authenticator.sh
Run the following command, which uses parallel-ssh to print the node names. Do not run any further parallel-ssh commands until this one executes successfully.
parallel-ssh -i -h sshhosts -O StrictHostKeyChecking=no hostname
Run the following command only if Python 3.8 was not installed earlier:
parallel-ssh -A -i -h sshhosts 'sudo apt-get update && sudo apt-get install -y software-properties-common && sudo add-apt-repository -y ppa:deadsnakes/ppa && sudo apt-get update && sudo apt-get install -y python3.8'
Check and confirm the Python 3.8 version on all worker nodes:
parallel-ssh -A -i -h sshhosts 'python3.8 --version'
Run the following command only if pip for Python 3.8 was not installed earlier:
parallel-ssh -A -i -h sshhosts 'sudo apt-get install -y python3.8-distutils && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && sudo python3.8 get-pip.py'
Check and confirm the pip version for Python 3.8 on all worker nodes:
parallel-ssh -A -i -h sshhosts 'pip3 --version'
parallel-ssh -A -i -h sshhosts 'pip3 install pydriller pygit2'
Confirm that the required dependencies are installed:
parallel-ssh -A -i -h sshhosts 'pip3 show pydriller && pip3 show pygit2'
parallel-ssh -i -h sshhosts 'git clone https://github.com/ssmtariq/github_miner.git'
python3.8 task_parallelizer.py your_repo_list your_username
For example
python3.8 task_parallelizer.py repository_lists/github_repositories_Python_10302023.csv ssmtariq
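How task_parallelizer.py divides the work is not described here. As a rough sketch, assuming it splits the repository list evenly among the nodes listed in sshhosts, a round-robin partition could look like the following. The function name split_round_robin and the sample data are hypothetical and do not come from the repository.

```python
def split_round_robin(items, n_parts):
    """Distribute items over n_parts lists so that sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Part i takes items i, i + n_parts, i + 2 * n_parts, ...
    return [items[i::n_parts] for i in range(n_parts)]


# Hypothetical repository names standing in for rows of the CSV file.
repos = ["repo_a", "repo_b", "repo_c", "repo_d", "repo_e"]
for node_index, chunk in enumerate(split_round_robin(repos, 2)):
    print(node_index, chunk)
# prints:
# 0 ['repo_a', 'repo_c', 'repo_e']
# 1 ['repo_b', 'repo_d']
```

A round-robin split keeps the per-node workload balanced even when the number of repositories is not a multiple of the number of nodes.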
Run the following command to execute the analyzer using parallel-ssh. If you are asked for a password, skip the prompt by pressing the Enter key.
parallel-ssh -A -i -h sshhosts 'python3.8 github_miner/analyzer/repo_analyzer.py --username your_github_username --token your_github_token --email your_github_email'