mazen160 / githubcloner
A script that clones Github repositories of users and organizations.
License: MIT License
Hi,
I get "[!] Error: Github API rate limit exceeded". Is there an option to prevent it?
The command was: githubcloner.py --org DevExpress-Examples -o E:\DevExpress
Regards
Martin
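For context, GitHub's REST API reports the remaining quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, and authenticated requests get a much higher hourly quota than anonymous ones. The following is a minimal sketch, not code from the script: `seconds_until_reset` is a hypothetical helper that works out how long a client would need to pause once the quota is used up.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next API call.

    `headers` is a mapping of HTTP response headers (hypothetical input;
    a plain dict is case-sensitive, so the keys must match exactly).
    Returns 0 when requests remain or the rate-limit headers are absent.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    return max(0, int(reset_at - now))


# Canned headers, so no network access is needed to try this out.
exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
wait = seconds_until_reset(exhausted, now=1700000000)
```

A caller could sleep for `wait` seconds and retry instead of exiting with the error above.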
Hello,
the script downloads only 100 repos then stops.
Regards
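A likely explanation is that GitHub's REST API returns at most 100 items per page regardless of the `per_page` value requested, so a client has to keep asking for the next page until an empty one comes back. The sketch below illustrates that loop under stated assumptions: `collect_all_pages` and `fake_fetch` are hypothetical names, and the fetch function stands in for the real HTTP request.

```python
def collect_all_pages(fetch_page, per_page=100):
    """Call fetch_page(page, per_page) until it returns an empty list."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items


# Stand-in for an HTTP call: pretends the account has 250 repositories.
def fake_fetch(page, per_page):
    all_repos = ["repo-%d" % i for i in range(250)]
    start = (page - 1) * per_page
    return all_repos[start:start + per_page]


repos = collect_all_pages(fake_fetch)
```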
I'm wondering if this tool will pull all files from git-lfs servers
I have seen several instances of inconsistent behavior when cloning the same user or org. I observed this while iterating on the API rate limit exceeded and the incomplete repository issues.
Here is the cycle I am using to test iterations:
cd
rm -rf "${HOME}/gitclone-test-"* "${HOME}/.mrconfig"
"${HOME}/code/GithubCloner/githubcloner.py" --user mazen160 -o "${HOME}/gitclone-test-$(date +%s)/"
find "${HOME}/gitclone-test"* -maxdepth 1 -mindepth 1 -type d | xargs -n1 mr register
mr status
I chose mazen160 for this example because it is a public data set within the author's control, but I have observed it with both organizations and other users.
When iterating with the above 7 times, I observed these results:
5 https://github.com/mazen160/Firefox-Security-Toolkit.git
5 https://github.com/mazen160/SecLists.git
5 https://github.com/mazen160/bfac.git
5 https://github.com/mazen160/ct-monitor.git
5 https://github.com/mazen160/struts-pwn_CVE-2017-9805.git
6 https://github.com/mazen160/Ubuntu-Desktop-Malware-Vector-Demo.git
6 https://github.com/mazen160/dirsearch.git
6 https://github.com/mazen160/dnsrecon.git
6 https://github.com/mazen160/public.git
6 https://github.com/mazen160/server-status_PWN.git
7 https://github.com/mazen160/GithubCloner.git
7 https://github.com/mazen160/ptf.git
7 https://github.com/mazen160/struts-pwn.git
Hello @mazen160 !
Thanks a lot for the script, I use it daily to archive my work.
I'm a teacher and I use Github Classroom for my students. It creates a copy of an assignment repo for each of them where
they push their work. It works fine and I can use some CI to validate tests, receive notifications etc.
Since I'm the owner of those repos, they are archived by your script too... That's not what I want and I couldn't
find a way to exclude those repos.
So I forked your script and added the option. It's here.
The usage is quite simple:
python githubcloner.py ... --exclude_repos repo1,repo2,repo3...
If any string from this list is present in the url, it will be excluded.
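The matching rule described above can be sketched as a plain substring filter. This is an illustration only, not the code from the fork: `exclude_urls` is a hypothetical helper and the URLs are made up.

```python
def exclude_urls(urls, exclude_repos):
    """Drop every URL that contains any of the comma-separated substrings."""
    needles = [s for s in exclude_repos.split(",") if s]
    return [u for u in urls if not any(n in u for n in needles)]


# Made-up URLs standing in for a teacher's account.
urls = [
    "https://github.com/teacher/course-notes.git",
    "https://github.com/teacher/assignment1-student-a.git",
    "https://github.com/teacher/assignment1-student-b.git",
]
kept = exclude_urls(urls, "assignment1,assignment2")
```

Because the test is a substring match on the whole URL, a short pattern can also match the owner name, so patterns are best kept specific.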
I also formatted the code to be a little bit more Pythonic (already told you I'm a teacher and it's second nature...).
If you want I can make a PR.
$ ${HOME}/code/GithubCloner/githubcloner.py --user mazen160 -o "${HOME}/gitclone-test-$(date +%s)/"
...snip...
$ cd ${HOME}/gitclone-test-1526924358/mazen160_SecLists
$ git status --porcelain | awk '{print $1}' | sort | uniq -c
5 ??
438 D
$ git reset --hard origin/master
Checking out files: 100% (438/438), done.
HEAD is now at 7bbc06c Added @mazen160 wordlist for common web API endpoints.
/Users/daniel.hoherd/gitclone-test-1526924358/mazen160_SecLists $ git status --porcelain | awk '{print $1}' | sort | uniq -c
/Users/daniel.hoherd/gitclone-test-1526924358/mazen160_SecLists $
I was previously investigating this bug when I hit the "API rate limit exceeded" issue, so this error is not related to those recent changes.
I'm getting this on Windows. PyCharm works.
githubcloner.py --help
Error: The output path is not specified.
Exiting...
Any hints?
Whether or not --include-gists is used, gists are always pulled.
I added debugging of userGists in a branch of my fork and that function is never being called, so gists are being populated by something else.
I have noticed some repositories with truncated names even though they have fully functioning remotes. Often this split involves the letter 'i'.
For instance:
githubcloner.py --user mazen160
gives dir mazen160_Firefox-Security-Toolk
with origin https://github.com/mazen160/Firefox-Security-Toolkit.git
githubcloner.py --user danielhoherd
gives dir danielhoherd_pre-commit-circlec
with origin https://github.com/danielhoherd/pre-commit-circleci.git
githubcloner.py --org github
gives dir github_puppet-ca_cer
with origin https://github.com/github/puppet-ca_cert.git
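All three truncations are consistent with the directory name being derived with `str.rstrip(".git")`, which removes any trailing run of the characters `.`, `g`, `i` and `t` rather than the literal suffix. Whether the script actually does this is an assumption; the sketch below only reproduces the symptom and shows a suffix-safe alternative, using hypothetical helper names.

```python
def name_with_rstrip(url):
    # Suspected bug: rstrip treats ".git" as a set of characters to strip.
    return url.split("/")[-1].rstrip(".git")


def name_with_suffix_removal(url):
    # Removes the literal ".git" suffix only, leaving the name intact.
    name = url.split("/")[-1]
    return name[:-len(".git")] if name.endswith(".git") else name


url = "https://github.com/mazen160/Firefox-Security-Toolkit.git"
buggy = name_with_rstrip(url)          # "Firefox-Security-Toolk"
fixed = name_with_suffix_removal(url)  # "Firefox-Security-Toolkit"
```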
So I ran this command: python3 githubcloner.py --org XXXX -o /output, however the output folder is empty on my computer. I think I misunderstood the tool, but I can't find any help online. Where am I supposed to find the output results?
After cloning the repo and attempting to run it, I ran into a few dependencies that needed to be installed. After installing argparse and PythonGit I had to change the import queue line, on line 21, to import Queue as queue.
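The module was renamed from `Queue` to `queue` in Python 3, so needing that change suggests the script was run under Python 2. A common way to support both, shown here as a sketch rather than as the script's actual code, is a guarded import:

```python
try:
    import queue  # Python 3 module name
except ImportError:
    import Queue as queue  # Python 2 fallback, same API

# The rest of the code can use `queue` regardless of interpreter version.
q = queue.Queue()
q.put("https://github.com/mazen160/GithubCloner.git")
item = q.get()
```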
Firstly, thanks for this tool, Mazen! :)
Running on macOS:
Omars-MacBook-Air:GithubCloner omarkurt$ sudo python githubcloner.py --user omarkurt -o /omarkurt
Traceback (most recent call last):
File "githubcloner.py", line 250, in <module>
main()
File "githubcloner.py", line 245, in main
cloneBulkRepos(URLs, output_path, threads_limit=threads_limit)
File "githubcloner.py", line 166, in cloneBulkRepos
threading.Thread(target=cloneRepo, args=(URL, cloningPath,), daemon=True).start()
TypeError: __init__() got an unexpected keyword argument 'daemon'
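The `daemon` keyword argument of `threading.Thread` was added in Python 3.3, so this traceback is what an older interpreter (for example the system `python` on macOS at the time) would produce. Setting the attribute after construction works on both old and new versions. The following is a sketch with a stand-in `clone_repo` and a made-up URL, not the script's real function:

```python
import threading

results = []


def clone_repo(url, path):
    # Stand-in for the real clone; it only records its arguments.
    results.append((url, path))


t = threading.Thread(
    target=clone_repo,
    args=("https://github.com/omarkurt/example.git", "/tmp/example"),
)
t.daemon = True  # attribute assignment instead of the daemon= keyword
t.start()
t.join()
```

Running the script with `python3` instead of `python` should also avoid the error without any code change.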
pythongit has been causing a lot of issues: on multiple occasions it has failed without printing any indication that the cloning failed. This has been reported multiple times by users.
The best way to solve it is by having our own wrapper that directly uses Git.
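One possible shape for such a wrapper, offered as a sketch under stated assumptions rather than the eventual implementation: it shells out to the `git` binary and reports a non-zero exit code instead of failing silently. `build_clone_command`, `clone_repo` and the `runner` parameter are hypothetical names; `runner` exists only so the function can be exercised without network access.

```python
import subprocess


def build_clone_command(url, dest):
    """Return the argument list for a quiet `git clone`."""
    return ["git", "clone", "--quiet", url, dest]


def clone_repo(url, dest, runner=subprocess.run):
    """Clone `url` into `dest`; return True on success, False on failure."""
    proc = runner(
        build_clone_command(url, dest),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if proc.returncode != 0:
        # Surface git's own error message so failures are never silent.
        message = proc.stderr.decode(errors="replace").strip()
        print("[!] Error: cloning %s failed: %s" % (url, message))
        return False
    return True
```

Passing the command as a list avoids shell quoting problems with unusual repository names or output paths.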
It would be nice to be able to ignore forks and only download source repos. Perhaps we could have a --only_type source option? Not sure if this is intuitive with Github's API or not.
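Repository objects returned by GitHub's REST API carry a boolean `fork` field, so this can be done client-side after listing. The sketch below assumes the listing has already been parsed from JSON into dictionaries; `source_clone_urls` is a hypothetical helper and the sample data is made up.

```python
def source_clone_urls(repos):
    """Return clone URLs for repositories that are not forks."""
    return [r["clone_url"] for r in repos if not r.get("fork", False)]


# Made-up sample shaped like the API's repository objects.
repos = [
    {"clone_url": "https://github.com/example/original.git", "fork": False},
    {"clone_url": "https://github.com/example/forked.git", "fork": True},
]
sources = source_clone_urls(repos)
```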
Hi, sorry for being new to this. Trying to run your script, get this error:
Traceback (most recent call last):
File "./githubcloner.py", line 18, in <module>
import git
ModuleNotFoundError: No module named 'git'
Hi,
Thanks for this tool. Pretty handy. I think it'd be useful to have an option to download to user/repo. I think the prefix default doesn't make much sense, at least in my opinion. This would make it easier to mirror github and have the output what you'd expect. Otherwise, would have to do -o user/ instead of just -o . and specify all the users you want to clone.
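The two layouts being compared can be sketched as follows. This is an illustration only: `target_dir` and its `layout` parameter are hypothetical, with `"prefix"` mirroring the current `user_repo` directory names and `"nested"` giving the requested `user/repo` structure.

```python
import os
from urllib.parse import urlparse


def target_dir(url, output_root, layout="prefix"):
    """Map a clone URL to a directory below output_root."""
    parts = urlparse(url).path.strip("/").split("/")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-len(".git")]
    if layout == "nested":
        # Requested layout: <output_root>/<owner>/<repo>
        return os.path.join(output_root, owner, repo)
    # Current behaviour as described: <output_root>/<owner>_<repo>
    return os.path.join(output_root, owner + "_" + repo)


url = "https://github.com/mazen160/bfac.git"
flat = target_dir(url, "out")
nested = target_dir(url, "out", layout="nested")
```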
Orgs no longer appear to be downloaded. This worked a few weeks ago. I have added logging about it into a branch in my fork: https://github.com/danielhoherd/GithubCloner/tree/logging
"renovo" is an org I am a part of that has many private repositories, yet none of them are listed when I use this tool as such:
$ LOGLEVEL=DEBUG /Users/dho/code/GithubCloner/githubcloner.py --org renovo -o "$HOME/github-clone-renovo-$(date +%s)" --include-authenticated-repos --authentication "danielhoherd:$GITHUB_API_TOKEN"
...snip...
githubcloner.py:150 DEBUG: fromOrg is beginning
connectionpool.py:824 DEBUG: Starting new HTTPS connection (1): api.github.com
connectionpool.py:396 DEBUG: https://api.github.com:443 "GET /orgs/renovo/repos?per_page=40000000&page=1 HTTP/1.1" 200 None
githubcloner.py:241 DEBUG: Response type is: <class 'list'>
connectionpool.py:824 DEBUG: Starting new HTTPS connection (1): api.github.com
connectionpool.py:396 DEBUG: https://api.github.com:443 "GET /orgs/renovo/repos?per_page=40000000&page=2 HTTP/1.1" 200 2
githubcloner.py:241 DEBUG: Response type is: <class 'list'>
githubcloner.py:171 DEBUG: fromOrg is returning URLs with length: 1
githubcloner.py:172 DEBUG: fromOrg returned URLs contain: ['git://github.com/renovo/hello-world-ci.git']
...snip...