Small CLI command to download a bucket directory to a local path.
It aims to mimic gsutil cp -r when copying from a bucket to a local path.
# download whole bucket to a local dir
bucket-pull gs://mybucketname ./mybucketname
# download a directory and all its contents to a local dir
bucket-pull gs://mybucketname/mydir ./
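To make the mapping in the examples above concrete, here is a minimal sketch of how a gs:// URL could be split into bucket and prefix, and how an object name could be turned into a local path. The helper names split_gs_url and local_path_for are hypothetical and are not taken from bucket-pull itself. Placing whole-bucket downloads directly under the destination is also an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_gs_url(url):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {url!r}")
    return parsed.netloc, parsed.path.strip("/")


def local_path_for(object_name, prefix, dest):
    """Map an object name to a local path, keeping the last prefix component.

    With prefix 'mydir' and dest '/tmp/', the object 'mydir/a/1.txt'
    maps to '/tmp/mydir/a/1.txt'.
    """
    # Everything before the last component of the prefix is dropped.
    head, _, _ = prefix.rpartition("/")
    relative = object_name[len(head):].lstrip("/")
    return Path(dest).joinpath(*relative.split("/"))
```

For instance, split_gs_url("gs://mybucketname/mydir") gives the bucket name and the prefix "mydir", and local_path_for then places every object under ./mydir.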
pip install bucket-pull
https://pypi.org/project/bucket-pull/
# as cli
$ bucket-pull gs://mybucketname ./mybucketname
# as a module
$ python -m bucket_pull gs://mybucketname ./mybucketname
The utility makes use of the Google SDK and uses Client-Provided Authentication.
To run with an exported service account (SA) key, set the GOOGLE_APPLICATION_CREDENTIALS environment variable:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-credentials.json bucket-pull gs://bucket/mydir /tmp/some/path
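Before starting a long download it can help to confirm which key file the environment variable points at. The following is a rough sketch using only the standard library. The describe_credentials helper is hypothetical and not part of bucket-pull. It relies on exported service account keys being JSON files with "type" and "client_email" fields.

```python
import json
import os


def describe_credentials(env=None):
    """Describe the key file named by GOOGLE_APPLICATION_CREDENTIALS, if any."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    with open(key_path) as handle:
        info = json.load(handle)
    # Exported service account keys carry this marker in the "type" field.
    if info.get("type") != "service_account":
        raise ValueError(f"{key_path} does not look like a service account key")
    return f"service account {info.get('client_email')}"


if __name__ == "__main__":
    print(describe_credentials())
```

This only inspects the file locally; it does not contact Google or verify that the key is still valid.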
The account you are connecting with needs at least the storage.buckets.get permission on the bucket, which can be granted with the roles/storage.legacyBucketReader role:
gsutil iam ch serviceAccount:[email protected]:legacyBucketReader gs://bucket
gsutil seems to have somewhat weird behaviour when the destination path doesn't exist:
$ gsutil cp -r gs://smoss-tech-test-bucket/mydir/ ./doesnotexist/actually/
$ echo $?
0
$ ls ./doesnotexist
ls: cannot access './doesnotexist': No such file or directory
It does not create the destination path (not that weird), but it doesn't complain either and exits with 0.
Here bucket-pull diverges and throws an error instead.
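A fail-fast check of this kind could look roughly like the sketch below. The function name require_destination, the exception type, and the message are assumptions for illustration. The exact error bucket-pull raises is not shown here.

```python
import os


def require_destination(dest):
    """Return dest unchanged, or raise if it is not an existing directory."""
    if not os.path.isdir(dest):
        # Refuse to continue instead of silently exiting with status 0.
        raise FileNotFoundError(f"destination path does not exist: {dest}")
    return dest
```

Running the check before any download starts means a mistyped destination surfaces immediately as a non-zero exit rather than as missing files later.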
With gsutil, the -m flag enables multi-processing.
bucket-pull also accepts -m, but has opted for using the threading module instead.
So it is not true parallelism, and we only ever make use of one CPU.
However, since we are mostly IO bound (disk and network), there is
still some gain to be had from multiple threads waiting for IO.
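The effect can be illustrated with a small sketch in which time.sleep stands in for waiting on the network and disk. This is not bucket-pull's code: fake_download, pull_all and timed are hypothetical names, and the use of concurrent.futures.ThreadPoolExecutor (a thread pool built on the threading module) is an assumption about how such a tool might be structured.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.1):
    """Stand-in for one download: sleeping mimics waiting on network and disk."""
    time.sleep(delay)
    return name


def pull_all(names, workers):
    """Run fake_download for every name on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order they finish in.
        return list(pool.map(fake_download, names))


def timed(names, workers):
    """Return the wall-clock seconds pull_all takes for the given pool size."""
    start = time.perf_counter()
    pull_all(names, workers)
    return time.perf_counter() - start


if __name__ == "__main__":
    names = ["32mb.file", "64mb.file", "128mb.file", "a/1.txt"]
    print(f"1 thread:  {timed(names, 1):.2f}s")
    print(f"4 threads: {timed(names, 4):.2f}s")
```

Because the threads spend their time waiting rather than computing, they overlap even though only one of them runs Python code at any moment.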
"very" scientific comparison:
# with multithreading
time ./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/ -m
...
Downloading to /tmp/mydir/32mb.file
Downloading to /tmp/mydir/128mb.file
Downloading to /tmp/mydir/64mb.file
Downloading to /tmp/mydir/a/1.txt
Downloading to /tmp/mydir/a/b/2.txt
./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/ -m 5.86s user 5.24s system 22% cpu 50.149 total
# single thread
time ./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/
...
Downloading to /tmp/mydir/128mb.file
Downloading to /tmp/mydir/32mb.file
Downloading to /tmp/mydir/64mb.file
Downloading to /tmp/mydir/a/1.txt
Downloading to /tmp/mydir/a/b/2.txt
./bucket-pull.py gs://smoss-tech-test-bucket/mydir /tmp/ 4.80s user 4.38s system 12% cpu 1:13.83 total
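Working out the ratio from the two totals above (50.149 s with -m, 1:13.83 without) takes only a few lines. This is plain arithmetic on the single pair of runs shown, so it is an indication rather than a benchmark. The to_seconds helper is introduced here for illustration.

```python
def to_seconds(total):
    """Convert a `time` total such as '1:13.83' or '50.149' to seconds."""
    seconds = 0.0
    for part in total.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


threaded = to_seconds("50.149")   # run with -m
single = to_seconds("1:13.83")    # run without -m

speedup = single / threaded
saved = (single - threaded) / single

print(f"speedup: {speedup:.2f}x")   # speedup: 1.47x
print(f"time saved: {saved:.0%}")   # time saved: 32%
```

So in this one comparison the threaded run finished in roughly two thirds of the single-threaded wall-clock time.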