src-d / datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
License: Other
The PGA tool used to download dataset repositories is currently incompatible with Windows. One problem is #100, which is related to the Windows path separator. After fixing this problem, all of its functionality should be tested on a Windows machine.
It's quite annoying to run PGA, cancel it with Control+C and get a lot of corrupted files (potentially one per thread). Ideally, recovering from this should be easier.
As a starting point, files could be downloaded to a temporary path and then renamed to the final one only if the download succeeded. Cleaning up these temporary files on Control+C would be a plus.
Not everyone has Go installed on their machine, and we should not force them to do so just to download the dataset.
I'm working on providing a web UI for pga, but in the meantime it'd be nice to provide binary releases for the most common platforms.
I'm sending this to you, @rporres, but feel free to redirect it to whoever seems appropriate.
The sum of LANGS_FILES_COUNT is less than FILE_COUNT when a repository contains files with no language detected. In such cases, there are no stats about their size, number of lines, or the amount of empty lines inside them, as reported by LANGS_BYTE_COUNT, LANGS_LINES_COUNT, LANGS_FILES_COUNT, EMPTY_LINES_COUNT.
If a language unknown were added to the LANGS list (for example), those cases would be fully covered.
We've been getting consistent reports about confusion around the PGA documentation:
We should merge gitbook documentation and make it the default.
As far as I know, we already made up our minds about splitting PGA into its own repo. That would also be a good opportunity to consolidate the documentation on gitbook and redirect pga.sourced.tech to the gitbook documentation (not the /csv/ and /siva/ routes, though).
Hi,
thanks for providing such a large GitHub corpus, but I am having real problems working with it in my scientific research.
First I used your pga tool to download all Java projects (for my research) with:
pga list -l java
I got some folders named 00, 0a, 0b, etc., with files in them like "00a0def4ec907f3f722f452380bd6c2dc614e8b8.siva". I guess every siva file represents one GitHub project?
Now I wanted to unpack these projects using your "siva-java" project. When I unpacked the files of one project, I got the files in the following structure:
It seems there are again newly packed files in the objects folder and links in the other text files. I don't know what to do with them. On your homepage you say it is easy to work with this corpus, but I have no clue how to work with your file formats or file structure.
Could you provide a simple tutorial on how to get the (master) GitHub repos from your corpus as file folders or as zip/tar archives?
Feedback obtained from this doc.
PGA wouldn't let me grab the top-100 most-starred repositories in language X, so I went to multitool to generate that list.
This seems like an easy thing to add if we kept the number of stars in pga.
I imagine this working in conjunction with the already existing filters, so downloading the top 100 Java projects would be something like pga get -l java --top 100.
As per @eiso's comment in src-d/guide#204 (comment) we should change the current license (Apache v2) to ODBL.
This should fix the problems with interrupted downloads and corrupted files too.
└> pga list -v
DEBU[0000] syncing http://pga.sourced.tech/csv/latest.csv.gz to /Users/erizocosmico/.pga/latest.csv.gz
DEBU[0000] local copy is up to date
Error: bad header, expected URL,SIVA_FILENAMES,FILE_COUNT,LANGS,LANGS_BYTE_COUNT,LANGS_LINES_COUNT,LANGS_FILES_COUNT,COMMITS_COUNT,BRANCHES_COUNT,FORK_COUNT,EMPTY_LINES_COUNT,CODE_LINES_COUNT,COMMENT_LINES_COUNT,LICENSE
Apparently the index does not have a LICENSE header.
There are four different doc pages explaining almost the same thing; imho they could be merged.
Unable to execute the following command:
pga list -u /src-d/ -f json | pga get -i
The error I am getting is:
The filename, directory name, or volume label syntax is incorrect
Proposal (see docs/examples.md):
- move borges-indexer, pga, and multitool into a new folder src
- merge the borges-indexer & multitool docs into docs/reproducing-pga.md
@smola @campoy @vmarkovtsev please review this proposal
This is an umbrella issue to download a new version of PGA.
While developing pga, I noticed that many files in the index appear to be missing from /siva/latest.
For instance:
could not get e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva: could not get http://pga.sourced.tech/siva/latest/e9/e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva: 404 Not Found
could not get ebd6182d6d724d79fba1bf0cf28c7ae6706382f3.siva: could not get http://pga.sourced.tech/siva/latest/eb/ebd6182d6d724d79fba1bf0cf28c7ae6706382f3.siva: 404 Not Found
could not get e76a3aaa800670389d19bd2b84b9489455d883d8.siva: could not get http://pga.sourced.tech/siva/latest/e7/e76a3aaa800670389d19bd2b84b9489455d883d8.siva: 404 Not Found
could not get e7a4ab4093dcf8423a14fd20ee58f073e90f2ae4.siva: could not get http://pga.sourced.tech/siva/latest/e7/e7a4ab4093dcf8423a14fd20ee58f073e90f2ae4.siva: 404 Not Found
The way I created these URLs was by using the prefix http://pga.sourced.tech/siva/latest/ plus the first two characters of the siva filename as a directory, plus the siva filename at the end.
Is that wrong, or do we have another issue?
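For reference, that URL scheme can be sketched as (hypothetical helper mirroring the description above):

```go
package main

import "fmt"

// sivaURL builds the download URL following the layout described above:
// <base>/<first two characters of the filename>/<filename>.
func sivaURL(base, filename string) string {
	return fmt.Sprintf("%s/%s/%s", base, filename[:2], filename)
}

func main() {
	fmt.Println(sivaURL("http://pga.sourced.tech/siva/latest",
		"e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva"))
	// http://pga.sourced.tech/siva/latest/e9/e9477061b9e1b80c2ac5c96abcb8d236c11549a1.siva
}
```

The result matches the URLs in the 404 errors quoted above, so the construction itself appears correct; the question is why the files are absent from the bucket.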
Just a proposal to consider:
Since pga get -i downloads the repositories that were passed through stdin, imho it would be great if pga list could produce that list directly (without external tools like jq to print only the siva file names).
It could be, for example:
$ pga list --siva | pga get -i
or even using a more general syntax:
$ pga list --columns=sivaFilenames | pga get -i
The documentation at https://docs.sourced.tech/datasets/publicgitarchive/web/md/index says that:
Column name | Column description |
---|---|
EMPTY_LINES_COUNT | Number of empty lines on files on commit pointed by default HEAD. |
CODE_LINES_COUNT | Number of lines with code on files on commit pointed by default HEAD. |
COMMENT_LINES_COUNT | Number of commented lines on files on commit pointed by default HEAD. |
But these columns actually contain a list of values, one per language in that repository, ordered the same way as the language counters above.
Right now the PGA tool is hardcoded to download the latest version of the index and siva files. Add an option to change the version we want to use.
I have downloaded the full PGA to HDFS using pga get -v ... and have full logs of the process.
It took a day and eventually finished, and I can see that it occupies:
$ hdfs dfs -du -s -h hdfs://hdfs-namenode/pga
2.4 T hdfs://hdfs-namenode/pga
Question: how do I make sure that nothing is missing?
What would be the simplest way to verify consistency and completeness of the results of pga get? Options off the top of my head include:
hdfs dfs -ls -R hdfs://hdfs-namenode/pga/ | grep "\.siva$" | wc -l = 239807
pga list | wc -l = 181481, but that counts rooted repositories vs. actual repositories :/
Would appreciate any recommendations.
@campoy I guess this might be something worth documenting eventually, as other users might have the same question. Would be happy to submit a PR.
When I tried to run the pga list --lang "Jupyter Notebook" and pga get --lang "Jupyter Notebook" commands, I received the following before the proper response:
WARN[0000] could not check md5 hashes for http://pga.sourced.tech/csv/latest.csv.gz: could not fetch hash at latest.csv.gz.md5: 404 Not Found
I have no idea what this means. Should I have done something to prevent it? Is this caused by something I did or didn't do, or is it a bug in PGA?
Using
cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | pga get -i --output /media/k/data/PGA/
to download the PGA dataset, I get unexpected EOF for just a few files:
➜ sourced cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | pga get -i --output /media/k/data/PGA/
downloading siva files by name from stdin
filter flags will be ignored
67503 / 257391 [====================>--------------------------------------------------------] 26.23% 40m24s
could not get siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: could not check mod time in http://pga.sourced.tech//siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: Head http://pga.sourced.tech//siva/latest/d9/d9363d1f63b2bee2c69c2a11a5f7b0fafc838f0f.siva: dial tcp 147.135.10.8:80: i/o timeout
91710 / 257391 [===========================>-------------------------------------------------] 35.63% 44m47s
could not get siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva: could not copy http://pga.sourced.tech//siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva to /media/k/data/PGA/siva/latest/de/de879ba477d94f28d561b3cd55079a737ec57a85.siva: unexpected EOF
205637 / 257391 [===========================================================>--------------] 79.89% 2h27m11s
could not get siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva: could not copy http://pga.sourced.tech//siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva to /media/k/data/PGA/siva/latest/c3/c33c209a937af7468bba45e9406a7e5834655541.siva: unexpected EOF
206423 / 257391 [===========================================================>--------------] 80.20% 2h26m33s
could not get siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva: could not copy http://pga.sourced.tech//siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva to /media/k/data/PGA/siva/latest/f1/f1f0797a2604519e41be05d81e16cad9969145e7.siva: unexpected EOF
257391 / 257391 [========================================================================================================================================================] 100.00%
This may be due to network problems or something else.
But at the end these files were present on disk. When I manually downloaded them and put them into the corresponding folders, I found that the sizes were really different.
So it would be better to delete such files, or to retry the download several times.
Currently, the csv file we can download from pga.sourced.tech is not the final one we had, maybe we should update it.
Right now, the one from pga.sourced.tech has 181,482 rows.
And 2 months ago, Data Retrieval gave me through Vadim a final csv file with 182,014 rows (I have it locally).
When I run pga get --stdin file.txt, it freezes after the following messages:
downloading siva files by name from stdin
filter flags will be ignored
even if there is only 1 siva file name in file.txt.
pga get -l python works fine up to some point (last time it downloaded 10k repos), then freezes.
I tried decreasing the number of workers to 1, which seemed to stabilize the process to some extent, but I still get this freeze.
Any ideas?
I created a pretty straightforward one as a placeholder.
@ricardobaeta could you provide something better or approve this one?
MariaDB/server
tensorflow/tensorflow
python/cpython
rails/rails
django/django
Create a new release with pga binaries for Linux and macOS (Darwin).
I run borges consumer and it writes several siva files and records to the DB successfully.
Then I run borges-indexer and get:
FATA[0004] unable to get result set err="pq: column __repository._references does not exist"
Requested on https://github.com/src-d/issues-infrastructure/issues/177
Created here for easier tracking with the milestone
Currently we provide md5 files for all siva files, but as far as I can tell, not for the index file itself.
Could we provide one at http://pga.sourced.tech/csv/latest.csv.gz.md5?
Our official example from the documentation https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga#downloading-siva-files results in a panic.
Steps to reproduce:
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 50) | .sivaFilenames[]' | pga get -i -o repositories
Expected: downloading files to ./repositories starts.
Actual: panic.
May be relevant to the -i option, as noted in #33 (comment).
Update:
echo "cce947b98a050c6d356bc6ba95030254914027b1.siva" | pga get -i
results in the same
downloading siva files by name from stdin
filter flags will be ignored
panic: runtime error: slice bounds out of range
The structure of the docs at https://docs.sourced.tech/datasets/publicgitarchive is a bit weird, and there seem to be too many README files without clear context.
Hey, I thought that one repo should have exactly one siva file, until I found a repo, https://github.com/pegasus-isi/pegasus, that contains 5 different siva files. So, when would one repository contain more than one siva file?
It would be very useful (at least for me) to have the size of the repository in bytes in the CSV file.
Hi all,
Is that the correct issue tracker for questions & issues about PGA?
--Martin
pga get --siva 00035a4b79ce6b9a4cd7e7006bd78aa153cb9389.siva
Rationale: when debugging I realized I wanted to download several specific siva files locally to understand what was going on with them.
Hello,
it could be a useful feature to provide language statistics per repository, so people would have more advanced options for filtering repositories.
Example:
Downloading a repository with 1 line of JS code and thousands of lines in other languages is probably not so useful for researchers who are focusing on JS.
The page currently shows only multitool, not pga.
Could we point pga.sourced.tech to show the content of README.md?
cc: @vmarkovtsev @warenlg
As documented in https://docs.sourced.tech/datasets/publicgitarchive/web/md/index:
Column name | Column description |
---|---|
COMMITS_HEAD_COUNT | Number of commits in the history obtained from the commit pointed to by the default HEAD. The output is the same as git rev-list --count HEAD. |
COMMITS_HEAD_COUNT does not appear in pga list.
Testing with a build done using a binary from #32 (I could not connect to HDFS otherwise), the pga tool is failing to get files to HDFS:
# cat /siva.txt | ./pga-rafa get --verbose -i -o hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
downloading siva files by name from stdin
filter flags will be ignored
DEBU[0000] syncing http://pga.sourced.tech//siva/latest/4a/4a14cc02da0a9280538cd3f3242365601d72f241.siva to hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/siva/latest/4a/4a14cc02da0a9280538cd3f3242365601d72f241.siva
panic: runtime error: slice bounds out of range
goroutine 1 [running]:
github.com/src-d/datasets/PublicGitArchive/pga/cmd.downloadFilenames(0x86d4e0, 0xc4201ca080, 0x86d560, 0xc4201b8030, 0xc4201dc000, 0x1f, 0x1f, 0xa, 0x8000105, 0x0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/cmd/get.go:91 +0x263
github.com/src-d/datasets/PublicGitArchive/pga/cmd.glob..func1(0xa67920, 0xc4200b6340, 0x0, 0x4, 0x0, 0x0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/cmd/get.go:79 +0x3a8
github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra.(*Command).execute(0xa67920, 0xc4200b6300, 0x4, 0x4, 0xa67920, 0xc4200b6300)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra/command.go:698 +0x46d
github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xa67d60, 0xc4200abf58, 0x73fc95, 0xc4201763c0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra/command.go:783 +0x2e4
github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra.(*Command).Execute(0xa67d60, 0xc42002a0b8, 0x0)
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/vendor/github.com/spf13/cobra/command.go:736 +0x2b
github.com/src-d/datasets/PublicGitArchive/pga/cmd.Execute()
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/cmd/root.go:34 +0x2d
main.main()
/root/go/src/github.com/src-d/datasets/PublicGitArchive/pga/main.go:8 +0x20
Find attached the contents of siva.txt
For the moment I'm using multitool to download to HDFS, as it is not giving me issues.
cc @vmarkovtsev
pga version: eb71a82
Go version: 1.11.5 linux/amd64
Problem: I am seeing corrupted files when the pga tool attempts to rename temporary files. The tool also appears to hang: after ~10 hours, it's only 1.87% complete (467/24928; 75.5GiB) on an 8-core VM in GCP with a 16Gbps NIC.
$ pga get -l go
467 / 24928 [==>-----------------------------------------------------------------------------------------------------------------] 1.87% 2h34m48s
could not get siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva: could not copy to temporary file siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva to siva/latest/98/9822bb0f781b94b1c7610b2df2ae4817e257c9bb.siva.tmp: unexpected EOF
467 / 24928 [==>-----------------------------------------------------------------------------------------------------------------] 1.87% 2h28m29s
could not get siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva: could not copy to temporary file siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva to siva/latest/57/5708afc613c3a27489ed4be2560d49ef1752eaeb.siva.tmp: unexpected EOF
467 / 24928 [==>------------------------------------------------------------------------------------------------------------------] 1.87% 2h4m12s
could not get siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva: rename siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva.tmp to siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva failed: rename siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva.tmp siva/latest/5b/5b8009800d6e8459453463db5a12f76ed146c7d7.siva: no such file or directory
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva: could not copy to temporary file siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva to siva/latest/b4/b4d0dc444d6c1088fe6c15743f7764f39c57f501.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva: could not copy to temporary file siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva to siva/latest/50/50a4667b3b8dda5a16a78f0dcc6f7b1eab8924f8.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva: rename siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva.tmp to siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva failed: rename siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva.tmp siva/latest/78/78ffcbdd4f5de3ed41516c7b74ddc1d3c657df39.siva: no such file or directory
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/d6/d6026c727ad9767fe6ccd4b14597453cd9bbac4c.siva: could not copy to temporary file siva/latest/d6/d6026c727ad9767fe6ccd4b14597453cd9bbac4c.siva.tmp: close siva/latest/d6/d6026c727ad9767fe6ccd4b14597453cd9bbac4c.siva.tmp: input/output error
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva: could not copy to temporary file siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva to siva/latest/c1/c14d891d44f0afff64e56ed7c9702df1d807b1ee.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
could not get siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva: could not copy to temporary file siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva.tmp: could not copy http://pga.sourced.tech//siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva to siva/latest/05/0527d29da443886d92e9a418180c5b25a5f8d270.siva.tmp: unexpected EOF
467 / 24928 [==>--------------------------------------------------------------------------------------------------------------------------] 1.87%
I'm wondering if this is a subtle race condition caused by the way temporary files are named (name + ".tmp") when there are multiple workers.
Update the go-git and core-retrieval dependencies in pga to make it work with the latest borges version (database models).
It would be super useful for experiments to be able to filter and download only repositories with more than N stars.
When I try to download the PGA dataset via
./multitool get-index > index.csv
cat index.csv | grep -oE '[0-9a-f]{40}\.siva' | ./multitool get-dataset -o /media/k/data/PGA/
I get an error
90225 / 257391 [=====================>----------------------------------------]
Error: failed to create /media/k/data/PGA/siva/latest/87/870918f74af1095098b84eaf57765bb4e2a2f2d2.siva: open /media/k/data/PGA/siva/latest/87/870918f74af1095098b84eaf57765bb4e2a2f2d2.siva: too many open files
I know that I can raise the open-files limit, and I did so, but maybe there is a bug.
borges-indexer is a cool program and needs to be documented so that other people can use it.
I want to download all siva files with "Jupyter Notebook" on PGA.
To know how many they are, I ran:
$ pga list --lang "Jupyter Notebook" -f csv
After examining the CSV file, I found that there were 2,606 repos and 3,767 siva files corresponding to them.
To download the siva files, I ran
$ pga get --lang "Jupyter Notebook" -v
And the response that I got was:
DEBU[0004] local copy is outdated or non existent
1 / 6349 [>----------------------------------------------------------] 0.02% 40m59s
Meaning that it was downloading 6,349 files, and I have no idea why. Could somebody help me with this?
This is the list of repositories currently missing from the latest 181k-repo index. This is due to corrupted siva files (or any other issue that may have arisen), and we should reprocess (and redownload) them.
Here's the list: https://gist.github.com/erizocosmico/1dfa3bc5e1d04ba2f15266ab4493420e
Borges is processing the last 4k repos, so I'm posting this here so we don't forget to do it after it's done.
Although I'm aware there's an intention to remove pga.sourced.tech (per this issue), we currently have a broken link there for multitool, in:
You can download and filter by csv data using multitool.
I intended to fix it myself and push a PR, but I could not find a new home for multitool, nor a better solution apart from removing this sentence. Guide me if I should fix it in the HTML.