src-d / datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
License: Other
The PGA tool used to download dataset repositories is currently incompatible with Windows. One problem is #100, which is related to the Windows path separator. After fixing this problem, all of its functionality should be tested on a Windows machine.
It's quite annoying to run PGA, cancel it with Control+C and get a lot of corrupted files (potentially one per thread). Ideally, recovering from this should be easier.
As a starting point, files could be downloaded to a temporary path and then renamed to the final one only if the download succeeded. Cleaning up these temporary files on Control+C would be a plus.
Not everyone has Go installed on their machine, and we should not force them to do so just to download the dataset.
I'm working on providing a web UI for pga, but in the meantime it'd be nice to provide binary releases for the most common platforms.
I'm sending this to you, @rporres, but feel free to redirect it to whoever seems appropriate.
The sum of LANGS_FILES_COUNT is less than FILE_COUNT when a repository contains files with no language detected. In such cases, there are no stats about their size, number of lines, or the amount of empty lines inside them, as reported by LANGS_BYTE_COUNT, LANGS_LINES_COUNT, LANGS_FILES_COUNT, EMPTY_LINES_COUNT.
If a language unknown were added to the LANGS list (for example), those cases would be fully covered.
We've been getting consistent reports about confusion around the PGA documentation:
We should merge gitbook documentation and make it the default.
As far as I know, we already made up our minds about splitting PGA into its own repo. That would also be a good opportunity to consolidate the documentation on gitbook and redirect pga.sourced.tech to the gitbook documentation (not the /csv/ and /siva/ routes, though).
Hi,
thanks for providing such a large GitHub corpus, but I am having real problems working with it in my scientific research.
First I used your pga tool to download all Java projects (for my research) with:
pga list -l java
I got some folders named 00, 0a, 0b, etc., with files in them like "00a0def4ec907f3f722f452380bd6c2dc614e8b8.siva". I guess every siva file represents one GitHub project?
Now I wanted to unpack these projects using your "siva-java" project. When I unpacked the files of one project, I got the files in the following structure:
It seems there are again newly packed files in the objects folder and links in the other text files. I don't know what to do with them. On your homepage you say it is easy to work with this corpus, but I have no clue how to work with your file formats or file structure.
Could you provide a simple tutorial on how to get the (master) GitHub repos from your corpus as file folders or as zip/tar archives?
Feedback obtained from this doc.
PGA wouldn't let me grab the top-100 most-starred repositories in language X, so I went to multitool to generate that list.
This seems like an easy thing to add if we kept the number of stars in pga.
I imagine this working in conjunction with the already existing filters, so downloading the top 100 Java projects would be something like pga get -l java --top 100.
As per @eiso's comment in src-d/guide#204 (comment) we should change the current license (Apache v2) to ODBL.
This should fix the problems with interrupted downloads and corrupted files too.
└> pga list -v
DEBU[0000] syncing http://pga.sourced.tech/csv/latest.csv.gz to /Users/erizocosmico/.pga/latest.csv.gz
DEBU[0000] local copy is up to date
Error: bad header, expected URL,SIVA_FILENAMES,FILE_COUNT,LANGS,LANGS_BYTE_COUNT,LANGS_LINES_COUNT,LANGS_FILES_COUNT,COMMITS_COUNT,BRANCHES_COUNT,FORK_COUNT,EMPTY_LINES_COUNT,CODE_LINES_COUNT,COMMENT_LINES_COUNT,LICENSE
Apparently the index does not have a LICENSE header.
There are four different doc pages explaining almost the same thing; imho they could be merged.
Unable to execute the following command:
pga list -u /src-d/ -f json | pga get -i
The error I am getting is:
The filename, directory name, or volume label syntax is incorrect
Proposal (see docs/examples.md):
- move borges-indexer, pga, and multitool into a new folder src
- merge the borges-indexer & multitool docs into docs/reproducing-pga.md
@smola @campoy @vmarkovtsev please review this proposal
This is an umbrella issue to download a new version of PGA.
While developing pga, I noticed that many files in the index appear to be missing from /siva/latest.
For instance:
could not get e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva: could not get http://pga.sourced.tech/siva/latest/e9/e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva: 404 Not Found
could not get ebd6182d6d724d79fba1bf0cf28c7ae6706382f3.siva: could not get http://pga.sourced.tech/siva/latest/eb/ebd6182d6d724d79fba1bf0cf28c7ae6706382f3.siva: 404 Not Found
could not get e76a3aaa800670389d19bd2b84b9489455d883d8.siva: could not get http://pga.sourced.tech/siva/latest/e7/e76a3aaa800670389d19bd2b84b9489455d883d8.siva: 404 Not Found
could not get e7a4ab4093dcf8423a14fd20ee58f073e90f2ae4.siva: could not get http://pga.sourced.tech/siva/latest/e7/e7a4ab4093dcf8423a14fd20ee58f073e90f2ae4.siva: 404 Not Found
The way I created these URLs was by using the prefix http://pga.sourced.tech/siva/latest/ plus the first two characters of the siva filename as a directory, plus the siva filename at the end.
Is that wrong, or do we have another issue?
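For reference, that URL scheme can be sketched as (hypothetical helper mirroring the description above):

```go
package main

import "fmt"

// sivaURL builds the download URL following the layout described above:
// <base>/<first two characters of the filename>/<filename>.
func sivaURL(base, filename string) string {
	return fmt.Sprintf("%s/%s/%s", base, filename[:2], filename)
}

func main() {
	fmt.Println(sivaURL("http://pga.sourced.tech/siva/latest",
		"e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva"))
	// http://pga.sourced.tech/siva/latest/e9/e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva
}
```

The result matches the URLs in the 404 errors quoted above, so the construction itself appears correct; the question is why the files are absent from the bucket.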
Just a proposal to consider:
Since pga get -i downloads the repositories that were passed through stdin, imho it would be great if pga list could produce that list directly (without external tools like jq to print only the siva file names).
It could be, for example:
$ pga list --siva | pga get -i
or even using a more general syntax:
$ pga list --columns=sivaFilenames | pga get -i
The documentation at https://docs.sourced.tech/datasets/publicgitarchive/web/md/index says that:
Column name | Column description |
---|---|
EMPTY_LINES_COUNT | Number of empty lines on files on commit pointed by default HEAD. |
CODE_LINES_COUNT | Number of lines with code on files on commit pointed by default HEAD. |
COMMENT_LINES_COUNT | Number of commented lines on files on commit pointed by default HEAD. |
But these columns actually contain a list of values, one per language in that repository, ordered the same way as the language counters above.
Right now the PGA tool is hardcoded to download the latest version of the index and siva files. Add an option to change the version we want to use.
I have downloaded the full PGA to HDFS using pga get -v ... and have full logs of the process.
It took a day and eventually finished, and I can see that it occupies:
$ hdfs dfs -du -s -h hdfs://hdfs-namenode/pga
2.4 T hdfs://hdfs-namenode/pga
Question: how do I make sure that nothing is missing?
What would be the simplest way to verify consistency and completeness of the results of pga get? Options off the top of my head include:
hdfs dfs -ls -R hdfs://hdfs-namenode/pga/ | grep "\.siva$" | wc -l = 239807
pga list | wc -l = 181481, but that counts rooted repositories vs. actual repositories :/
Would appreciate any recommendations.
@campoy I guess this might be something worth documenting eventually, as other users might have the same question. Would be happy to submit a PR.
When I tried to run the pga list --lang "Jupyter Notebook" and pga get --lang "Jupyter Notebook" commands, I received the following before the proper response:
WARN[0000] could not check md5 hashes for http://pga.sourced.tech/csv/latest.csv.gz: could not fetch hash at latest.csv.gz.md5: 404 Not Found
I have no idea what this means. Should I have done something to prevent it? Is this caused by something I did or didn't do, or is it a bug in PGA?
Using
cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | pga get -i --output /media/k/data/PGA/
to download the PGA dataset, I get unexpected EOF for just a few files:
➜ sourced cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | pga get -i --output /media/k/data/PGA/
downloading siva files by name from stdin
filter flags will be ignored
67503 / 257391 [====================>--------------------------------------------------------] 26.23% 40m24s
could not get siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: could not check mod time in http://pga.sourced.tech//siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: Head http://pga.sourced.tech//siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: dial tcp 147.135.10.8:80: i/o timeout
91710 / 257391 [===========================>-------------------------------------------------] 35.63% 44m47s
could not get siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva: could not copy http://pga.sourced.tech//siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva to /media/k/data/PGA/siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva: unexpected EOF
205637 / 257391 [===========================================================>--------------] 79.89% 2h27m11s
could not get siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva: could not copy http://pga.sourced.tech//siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva to /media/k/data/PGA/siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva: unexpected EOF
206423 / 257391 [===========================================================>--------------] 80.20% 2h26m33s
could not get siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva: could not copy http://pga.sourced.tech//siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva to /media/k/data/PGA/siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva: unexpected EOF
257391 / 257391 [========================================================================================================================================================] 100.00%
This may be due to network problems or something else.
But at the end these files were present on disk. When I manually downloaded them and put them into the corresponding folders, I found that the sizes were really different.
So it would be better to delete such files, or to retry the download several times.
Currently, the csv file we can download from pga.sourced.tech is not the final one we had, maybe we should update it.
Right now, the one from pga.sourced.tech has 181,482 rows.
And 2 months ago, Data Retrieval gave me through Vadim a final csv file with 182,014 rows (I have it locally).
When I run pga get --stdin file.txt, it freezes after the following messages:
downloading siva files by name from stdin
filter flags will be ignored
even if there is only 1 siva file name in file.txt.
pga get -l python works fine up to some point (last time it downloaded 10k repos), then freezes.
I tried decreasing the number of workers to 1, which seemed to stabilize the process to some extent, but I still get this freeze.
Any ideas?
I created a pretty straightforward one as a placeholder.
@ricardobaeta could you provide something better or approve this one?
MariaDB/server
tensorflow/tensorflow
python/cpython
rails/rails
django/django
Create a new release with pga binaries for Linux and macOS (Darwin).
I run borges consumer and it writes several siva files and records to the DB successfully.
Then I run borges-indexer and get:
FATA[0004] unable to get result set err="pq: column __repository._references does not exist"
Requested on https://github.com/src-d/issues-infrastructure/issues/177
Created here for easier tracking with the milestone
Currently we provide md5 files for all siva files, but as far as I can tell, not for the index file itself.
Could we provide one at http://pga.sourced.tech/csv/latest.csv.gz.md5?
Our official example from the documentation https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga#downloading-siva-files results in a panic.
Steps to reproduce:
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 50) | .sivaFilenames[]' | pga get -i -o repositories
Expected: downloading files to ./repositories starts.
Actual: panic.
May be relevant to the -i option, as noted in #33 (comment).
Update:
echo "cce947b98a050c6d356bc6ba95030254914027b1.siva" | pga get -i
results in the same
downloading siva files by name from stdin
filter flags will be ignored
panic: runtime error: slice bounds out of range
The structure of the docs at https://docs.sourced.tech/datasets/publicgitarchive is a bit weird, and there seem to be too many README files without clear context.
Hey, I thought that one repo should have exactly one siva file, until I found a repo, https://github.com/pegasus-isi/pegasus, that contains 5 different siva files. So, when would one repository contain more than one siva file?
It would be very useful (at least for me) to have the size of the repository in bytes in the CSV file.
Hi all,
Is that the correct issue tracker for questions & issues about PGA?
--Martin
pga get --siva 00035a4b79ce6b9a4cd7e7006bd78aa153cb9389.siva
Rationale: when debugging I realized I wanted to download several specific siva files locally to understand what was going on with them.
Hello,
it could be a useful feature to provide language statistics per repository, so people would have more advanced options for filtering repositories.
Example:
Downloading a repository with 1 line of JS code and thousands of lines in other languages is probably not so useful for researchers who are focusing on JS.
The page currently shows only multitool, not pga.
Could we point pga.sourced.tech to show the content of README.md?
cc: @vmarkovtsev @warenlg
As documented in https://docs.sourced.tech/datasets/publicgitarchive/web/md/index:
Column name | Column description |
---|---|
COMMITS_HEAD_COUNT | Number of commits in the history obtained from the commit pointed to by the default HEAD. The output is the same as git rev-list --count HEAD. |
COMMITS_HEAD_COUNT does not appear in pga list.
Testing with a build done using a binary from #32 (I could not connect to HDFS otherwise), the pga tool is failing to get files to HDFS:
# cat /siva.txt | ./pga-rafa get --verbose -i -o hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
downloading siva files by name from stdin
filter flags will be ignored
DEBU[0000] syncing http://pga.sourced.tech//siva/latest/4a/4a14cc02da0a9280538cd3f3242365601d72f241.siva to hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/siva/latest/4a/4a14cc02da0a9280538cd3f3242365601d72f241.siva
panic: runtime error: slice bounds out of range
goroutine 1 [running]:
github.com/src-d/datasets/PublicGitArchive/pga/cmd.downloadFilenames(0x86d4e0, 0xc4201ca080, 0x86d560, 0xc4201b8030, 0xc4201dc000, 0x1f, 0x1f, 0xa, 0x8000105, 0x0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/cmd/get.go:91 +0x263
github.com/src-d/datasets/PublicGitArchive/pga/cmd.glob..func1(0xa67920, 0xc4200b6340, 0x0, 0x4, 0x0, 0x0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/cmd/get.go:79 +0x3a8
github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra.(*Command).execute(0xa67920, 0xc4200b6300, 0x4, 0x4, 0xa67920, 0xc4200b6300)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra/command.go:698 +0x46d
github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xa67d60, 0xc4200abf58, 0x73fc95, 0xc4201763c0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra/command.go:783 +0x2e4
github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra.(*Command).Execute(0xa67d60, 0xc42002a0b8, 0x0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra/command.go:736 +0x2b
github.com/src-d/datasets/PublicGitArchive/pga/cmd.Execute()
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/cmd/root.go:34 +0x2d
main.main()
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/main.go:8 +0x20
Find attached the contents of siva.txt
For the moment I'm using multitool to download to HDFS, as it is not giving me issues.
cc @vmarkovtsev
pga version: eb71a82
Go version: 1.11.5 linux/amd64
Problem: I am seeing corrupted files when the pga tool attempts to rename temporary files. The tool also appears to hang: after ~10 hours, it's only 1.87% complete (467/24928; 75.5GiB) on an 8-core VM in GCP with a 16Gbps NIC.
$ pga get -l go
467 / 24928 [==>-----------------------------------------------------------------------------------------------------------------] 1.87% 2h34m48s
could not get siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva: could not copy to temporary file siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva to siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva.tmp: unexpected EOF
467 / 24928 [==>-----------------------------------------------------------------------------------------------------------------] 1.87% 2h28m29s
could not get siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva: could not copy to temporary file siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva to siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva.tmp: unexpected EOF
467 / 24928 [==>------------------------------------------------------------------------------------------------------------------] 1.87% 2h4m12s
could not get siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva: rename siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva.tmp to siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva failed: rename siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva.tmp siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva: no such file or directory
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva: could not copy to temporary file siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva to siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva: could not copy to temporary file siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva to siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva: rename siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva.tmp to siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva failed: rename siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva.tmp siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva: no such file or directory
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/d6/d6026c727ad9767fe6ccd4b14597453cd9bbac4c.siva: could not copy to temporary file siva/latest/d6/d6026c727ad9767fe6ccd4b14597453cd9bbac4c.siva.tmp: close siva/latest/d6/d6026c727ad9767fe6ccd4b14597453cd9bbac4c.siva.tmp: input/output error
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva: could not copy to temporary file siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva to siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva: could not copy to temporary file siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva to siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
I'm wondering if this is a subtle race condition caused by the way temporary files are named (name + ".tmp") when there are multiple workers.
Update the go-git and core-retrieval dependencies in pga to make it work with the latest borges version (database models).
It would be super useful for experiments to be able to filter and download only repositories with more than N stars.
When I try to download the PGA dataset via
./multitool get-index > index.csv
cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | ./multitool get-dataset -o /media/k/data/PGA/
I get an error
90225 / 257391 [=====================>----------------------------------------]
Error: failed to create /media/k/data/PGA/siva/latest/87/870918f74af1095098b84eaf57765bb4e2a2f2d2.siva: open /media/k/data/PGA/siva/latest/87/870918f74af1095098b84eaf57765bb4e2a2f2d2.siva: too many open files
I know that I can raise the open-files limit, and I did so, but maybe there is a bug.
borges-indexer is a cool program and needs to be documented so that other people can use it.
I want to download all siva files with "Jupyter Notebook" on PGA.
To know how many they are, I ran:
$ pga list --lang "Jupyter Notebook" -f csv
After examining the CSV file, I found that there were 2,606 repos and 3,767 siva files corresponding to them.
To download the siva files, I ran
$ pga get --lang "Jupyter Notebook" -v
And the response that I got was:
DEBU[0004] local copy is outdated or non existent
1 / 6349 [>----------------------------------------------------------] 0.02% 40m59s
Meaning that it was downloading 6,349 files, and I have no idea why. Could somebody help me with this?
This is the list of repositories currently missing from the latest 181k-repo index. This is due to corrupted siva files (or any other issue that may have arisen), and we should reprocess (and redownload) them.
Here's the list: https://gist.github.com/erizocosmico/1dfa3bc5e1d04ba2f15266ab4493420e
Borges is processing the last 4k repos, so I'm posting this here so we don't forget to do it after it's done.
Although I'm aware there's an intention to remove pga.sourced.tech (per this issue), we currently have a broken link there for multitool, in:
You can download and filter by csv data using multitool.
I intended to fix it myself and push a PR, but I could not find a new home for multitool, nor a better solution apart from removing this sentence. Guide me if I should fix it in the HTML.