Easily download, verify, and extract archives
If you have datasets or other archives that you want to make available to your users, and you want to ensure they always have the latest versions and that they are downloaded correctly, fastdownload can help.
Using pip:
pip install fastdownload
...or using conda:
conda install -c fastai fastdownload
You might want to use fastdownload when you have one or more URLs pointing at archives you want to make available, and you want to ensure that your users download those archives correctly, have the latest version, and can access the information in those archives as easily as possible.
Your user just calls a single method, FastDownload.get, passing the URL required, and the URL will be downloaded and extracted to the directories you choose. The path to the extracted file is returned. If that URL has already been downloaded, then the cached archive or contents will be used automatically. However, if the size or hash of the archive is different from what it should be, then the user will be informed, and a new version will be downloaded.
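The size-and-hash comparison described above can be sketched as follows. This is a minimal illustration rather than fastdownload's own code: the helper names `size_and_hash` and `is_current` are hypothetical, and SHA-256 is an assumption (the library may use a different hash algorithm).

```python
import hashlib
from pathlib import Path

def size_and_hash(path, chunk_size=1 << 20):
    "Return (size in bytes, hex digest) for the file at `path`."
    path = Path(path)
    h = hashlib.sha256()  # assumption: the real library may use another algorithm
    with path.open('rb') as f:
        # Read in chunks so large archives are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return path.stat().st_size, h.hexdigest()

def is_current(path, expected_size, expected_hash):
    "True if a cached file exists and matches the published size and hash."
    path = Path(path)
    if not path.exists(): return False
    return size_and_hash(path) == (expected_size, expected_hash)
```

A cached archive for which `is_current` returns False would be the case where a fresh download is needed.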
In the future, you may want to update one or more of your archives. When you do so, fastdownload will ensure your users have the latest version, by checking their downloaded archives against your updated file size and hash information.
For instance, fastai uses fastdownload to provide access to datasets for deep learning. fastai users can download and extract them with a single command, using the return value to access the files. The files are automatically placed in appropriate subdirectories of a .fastai folder in the user's home directory. If a dataset is updated, users are informed the next time they use the dataset, and the latest version is automatically downloaded and extracted for them.
When your users download an archive, fastdownload will automatically save it to a directory, check that the size and hash match, and extract the contents. Minimal usage for downloading and extracting is:
from fastdownload import FastDownload
d = FastDownload()
path = d.get('https://...')
After this, path will contain the path where the extracted files are located. By default, archives are saved to {base}/archive, and extracted to {base}/data. {base} defaults to ~/.fastdownload. If there is more than one file or folder in the root of the downloaded archive, then a new folder is created in data for the contents.
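To make the default layout concrete, here is a small sketch of how the two directories relate to {base}. Both helper names are hypothetical, and naming the saved archive after the last segment of the URL path is an assumption about the behaviour, not something the text above states.

```python
from pathlib import Path
from urllib.parse import urlparse

def archive_and_data_dirs(base='~/.fastdownload', archive='archive', data='data'):
    "Return the (archive, data) directories under `base`."
    base = Path(base).expanduser()
    return base/archive, base/data

def archive_file_for(url, archive_dir):
    "Hypothetical helper: name the saved archive after the last URL path segment."
    return Path(archive_dir)/Path(urlparse(url).path).name
```

With the defaults, an archive would be saved under ~/.fastdownload/archive and its contents extracted under ~/.fastdownload/data.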
Instead of get, use download to download the URL without extracting it, or extract to extract the URL without downloading it (assuming it's already been downloaded to the archive directory). All of these methods accept a force parameter which will download/extract the archive even if it's already present.
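As a sketch of how the two steps might be combined, the function below forces a fresh download and then a fresh extraction. The wrapper name `refresh` is hypothetical; it relies only on the download, extract, and force names described above, and it needs the fastdownload package and network access when called.

```python
def refresh(url, base='~/.fastdownload'):
    "Re-download and re-extract `url`, returning the extracted path."
    # Imported lazily so this sketch can be loaded without the package installed
    from fastdownload import FastDownload
    d = FastDownload(base=base)
    d.download(url, force=True)        # fetch the archive even if a cached copy exists
    return d.extract(url, force=True)  # unpack again, replacing earlier contents
```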
You can change any or all of the base, archive, and data paths by passing them to FastDownload:
d = FastDownload(base='~/.mypath', archive='downloaded', data='extracted')
You can remove the cached archive file and/or the extracted contents with rm:
d.rm('https://...')
fastdownload will add a file download_checks.py to your Python module which contains file sizes and hashes for your archives. The file is located in the same directory as a module you choose, e.g.:
d = FastDownload(module=fastai.some_module)
Then use update to create or update the size and hash for a URL:
d.update('https://...')
You will now find there is a file called download_checks.py in the same directory where fastai.some_module is located, which contains a Python dict with the URL, size, and hash for this file. If you've downloaded this file before to your archive path then it will be used, instead of downloading a new copy. Use get(force=True) first to download a new copy even if you have it in your archive.
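The idea of a checks file holding a dict keyed by URL can be illustrated with a short sketch. The function name `update_checks`, the `[size, hash]` value layout, and the use of SHA-256 are all assumptions made for illustration; the actual contents of download_checks.py may differ.

```python
import ast, hashlib, pprint
from pathlib import Path

def update_checks(checks_file, url, archive_path):
    "Record the size and hash of `archive_path` under `url` in `checks_file`."
    checks_file, archive_path = Path(checks_file), Path(archive_path)
    # Load the existing dict literal, or start empty if the file is absent
    checks = ast.literal_eval(checks_file.read_text()) if checks_file.exists() else {}
    data = archive_path.read_bytes()  # fine for a sketch; chunked reads suit large files
    checks[url] = [len(data), hashlib.sha256(data).hexdigest()]
    # Write the dict back as a plain Python literal
    checks_file.write_text(pprint.pformat(checks))
    return checks
```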
If there is a file called config.ini in your base directory, then keys archive and data will be used as the default values for FastDownload. The file should be in configparser format. Here's a sample config.ini:
[DEFAULT]
archive = downloaded
data = extracted
If there is no ini file present, one will be automatically created for you using the details you pass to FastDownload.
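The read-or-create behaviour described above can be sketched with the standard configparser module. The function name `load_or_create_config` is hypothetical, and this is an illustration of the idea rather than the library's implementation.

```python
import configparser
from pathlib import Path

def load_or_create_config(base, archive='archive', data='data'):
    "Read {base}/config.ini, creating it from the given values if it is missing."
    base = Path(base).expanduser()
    ini = base/'config.ini'
    cfg = configparser.ConfigParser()
    if ini.exists():
        cfg.read(ini)  # values in the existing file take precedence
    else:
        base.mkdir(parents=True, exist_ok=True)
        cfg['DEFAULT'] = {'archive': archive, 'data': data}
        with ini.open('w') as f: cfg.write(f)
    return cfg['DEFAULT']
```

Calling it a second time with different arguments returns the values already stored in the file.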
You can add any additional key/value pairs to the config file that you want. When you call FastDownload.get, pass extract_key to use a key other than data for choosing a location to extract to.
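As an illustration of how an extra key could select the extraction location, the sketch below looks up a key in a config and joins it to the base directory. The `models` key, the `SAMPLE` text, and the helper `extract_dir` are hypothetical names chosen for this example.

```python
import configparser
from pathlib import Path

# Sample config with one additional, hypothetical key ("models")
SAMPLE = """
[DEFAULT]
archive = downloaded
data = extracted
models = trained_models
"""

def extract_dir(base, config_text, extract_key='data'):
    "Resolve the directory that `extract_key` points at, relative to `base`."
    cfg = configparser.ConfigParser()
    cfg.read_string(config_text)
    return Path(base).expanduser()/cfg['DEFAULT'][extract_key]
```

Here `extract_dir(base, SAMPLE, 'models')` would resolve to the trained_models folder under the base directory, while the default key resolves to extracted.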
fastdownload's Issues
Keep-Alive-Actions
skip archive download if extracted data exists
download_url(show_progress=True) in lesson 1 incorrect progress percentage
add `extract_key` param
add `extract_key`
add timeout and show_progress params
License confusion
Hi fastai folk,
I noticed the LICENSE file for the project is Apache-2.0, but the setup.py lists 5 different licenses. Is the setup.py in error? It feels like maybe it's a template and still needs work?
SSL: CERTIFICATE_VERIFY_FAILED, Want to bypass SSL check
I was using a simple urllib request GET method to download files, and now I want to switch to fastdownload. The URL was being downloaded before without an SSL check. I want to bypass this SSL check. How can I do this?
Error:
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastdownload/core.py", line 117, in get
    self.download(url, force=force)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastdownload/core.py", line 92, in download
    return download_and_check(url, urldest(url, self.arch_path()), self.module, force)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastdownload/core.py", line 61, in download_and_check
    res = download_url(url, fpath)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastdownload/core.py", line 19, in download_url
    return urlsave(url, dest, reporthook=progress if show_progress else None, timeout=timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastcore/net.py", line 184, in urlsave
    nm,msg = urlretrieve(url, dest, reporthook, headers=headers, timeout=timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastcore/net.py", line 149, in urlretrieve
    with contextlib.closing(urlopen(url, data, headers=headers, timeout=timeout)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fastcore/net.py", line 108, in urlopen
    try: return urlopener().open(urlwrap(url, data=data, headers=headers), timeout=timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)>
the function read_checks has a bug
The checks_module function returns fmod as either a Path or a dictionary, so one solution would be to replace
if not fmod.exists(): return {}
with:
if not type(fmod)==PosixPath: return {}
or
try: fmod.exists()
or something better.
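One way to express the "something better" option is an isinstance check against Path, which also covers Windows paths, where a comparison with PosixPath would not match. The sketch below is a hypothetical stand-in, not the library's read_checks: the name `read_checks_safe` and the assumption that the checks file holds a Python dict literal are both illustrative.

```python
import ast
from pathlib import Path

def read_checks_safe(fmod):
    "Return the stored checks, or {} if `fmod` is not an existing Path."
    # Anything that is not a Path (for example a dict) has no file to read
    if not isinstance(fmod, Path) or not fmod.exists(): return {}
    # Assumption: the file contains a dict literal
    return ast.literal_eval(fmod.read_text())
```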
get(force=True) fails for a file
I was trying to use fastdownload (I am using fastdownload-0.0.5) to download some compressed files, and I was testing the force=True argument. The code looked something like this:
import fastdownload
from fastdownload import FastDownload
d = FastDownload(module=fastdownload)
path = d.get('https://silkdb.bioinfotoolkits.net/__resource/Bombyx_mori/download/cds.fa.tar.gz')
path = d.get('https://silkdb.bioinfotoolkits.net/__resource/Bombyx_mori/download/cds.fa.tar.gz', force=True)
However, removing the old extracted file before extracting again does not work: only an extracted directory is expected, not an extracted file.
---------------------------------------------------------------------------
NotADirectoryError Traceback (most recent call last)
fastdownload.ipynb Cell 12' in <cell line: 1>()
----> [1] path = d.get('https://silkdb.bioinfotoolkits.net/__resource/Bombyx_mori/download/cds.fa.tar.gz', force=True)
[2] path
File ~/my-conda-envs/tenv/lib/python3.8/site-packages/fastdownload/core.py:122, in FastDownload.get(self, url, extract_key, force)
[120] if data.exists(): return data
[121] self.download(url, force=force)
--> [122] return self.extract(url, extract_key=extract_key, force=force)
File ~/my-conda-envs/tenv/lib/python3.8/site-packages/fastdownload/core.py:114, in FastDownload.extract(self, url, extract_key, force)
[112] dest = self.data_path(extract_key)
[113] dest.mkdir(exist_ok=True, parents=True)
--> [114] return untar_dir(arch, dest, rename=True, overwrite=force)
File ~/my-conda-envs/tenv/lib/python3.8/site-packages/fastcore/xtras.py:226, in untar_dir(fname, dest, rename, overwrite)
[224] dest = dest/src.name
[225] if dest.exists():
--> [226] if overwrite: shutil.rmtree(dest)
[227] else: return dest
[228] if rename: src = _unpack(fname, out)
File ~/my-conda-envs/tenv/lib/python3.8/shutil.py:718, in rmtree(path, ignore_errors, onerror)
[716] try:
[717] if os.path.samestat(orig_st, os.fstat(fd)):
--> [718] _rmtree_safe_fd(fd, path, onerror)
[719] try:
[720] os.rmdir(path)
File ~/my-conda-envs/tenv/lib/python3.8/shutil.py:631, in _rmtree_safe_fd(topfd, path, onerror)
[629] except OSError as err:
[630] err.filename = path
--> [631] onerror(os.scandir, path, sys.exc_info())
[632] return
[633] for entry in entries:
File ~/my-conda-envs/tenv/lib/python3.8/shutil.py:627, in _rmtree_safe_fd(topfd, path, onerror)
[625] def _rmtree_safe_fd(topfd, path, onerror):
[626] try:
--> [627] with os.scandir(topfd) as scandir_it:
[628] entries = list(scandir_it)
[629] except OSError as err:
NotADirectoryError: [Errno 20] Not a directory: Path('/home/jovyan/.fastdownload/data/cds.fa')
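The traceback above shows shutil.rmtree being called on a path that is a regular file. A minimal sketch of a removal step that handles both files and directories follows; the helper name `remove_existing` is hypothetical and is not part of fastcore or fastdownload.

```python
import shutil
from pathlib import Path

def remove_existing(dest):
    "Remove `dest` whether it is a directory tree, a file, or a symlink."
    dest = Path(dest)
    if dest.is_dir() and not dest.is_symlink():
        shutil.rmtree(dest)   # real directory: remove it recursively
    elif dest.exists() or dest.is_symlink():
        dest.unlink()         # regular file or (possibly broken) symlink
    # If nothing exists at `dest`, there is nothing to do
```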
Code does not work with Dropbox links
What is the reason behind this? Is there a way to get around this?
Unable to extract URLs with gzip compression.
When I want to use fastdownload for other URLs, I get an unknown archive format error.
Here is the code
from fastdownload import FastDownload
d = FastDownload()
d.get('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz')
Here is the TraceBack
ReadError Traceback (most recent call last)
Cell In[2], line 5
1 from fastdownload import FastDownload
3 d = FastDownload()
----> 5 d.get('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz')
File /usr/local/lib/python3.9/site-packages/fastdownload/core.py:118, in FastDownload.get(self, url, extract_key, force)
116 if data.exists(): return data
117 self.download(url, force=force)
--> 118 return self.extract(url, extract_key=extract_key, force=force)
File /usr/local/lib/python3.9/site-packages/fastdownload/core.py:110, in FastDownload.extract(self, url, extract_key, force)
108 dest = self.data_path(extract_key)
109 dest.mkdir(exist_ok=True, parents=True)
--> 110 return untar_dir(arch, dest, rename=True, overwrite=force)
File /usr/local/lib/python3.9/site-packages/fastcore/xtras.py:175, in untar_dir(fname, dest, rename, overwrite)
173 if overwrite: shutil.rmtree(dest) if dest.is_dir() else dest.unlink()
174 else: return dest
--> 175 if rename: src = _unpack(fname, out)
176 shutil.move(str(src), dest)
177 return dest
File /usr/local/lib/python3.9/site-packages/fastcore/xtras.py:157, in _unpack(fname, out)
...
-> 1267 raise ReadError("Unknown archive format '{0}'".format(filename))
1269 func = _UNPACK_FORMATS[format][1]
1270 kwargs = dict(_UNPACK_FORMATS[format][2])
ReadError: Unknown archive format '/root/.fastdownload/archive/train-images-idx3-ubyte.gz'
It's a shutil error, but a simple fix like this would work: converting the gzip file to zip.
import gzip, zipfile
import magic  # python-magic, used to sniff the file type

# Convert to zip file if gzip
file_type = magic.from_file(str(arch))
if 'gzip' in file_type:
    # Decompress the gzip payload
    with gzip.open(arch, 'rb') as f_in:
        file_content = f_in.read()
    # Store the payload in a new zip archive with a .zip suffix
    new_arch = arch.with_suffix('.zip')
    with zipfile.ZipFile(new_arch, 'w') as myzip:
        myzip.writestr(arch.stem, file_content)
    arch = new_arch
But the main problem is with the way fastdownload handles files. My analysis shows that the problem is in fastcore: for example, urldest takes the file extension from the URL, so changing the actual file format and extension doesn't help here. I do not know what to do, short of forking and changing the whole code base.
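For a single gzip-compressed file such as the one in this report, one possible workaround is to decompress it directly with the standard gzip module instead of treating it as an archive. This is a sketch under that assumption; the helper name `gunzip_file` is hypothetical and is not part of fastdownload or fastcore.

```python
import gzip, shutil
from pathlib import Path

def gunzip_file(src, dest_dir):
    "Decompress a single-file .gz at `src` into `dest_dir`; return the new path."
    src, dest_dir = Path(src), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir/src.stem  # drops the trailing '.gz' from the file name
    # Stream the data so large files are not held in memory
    with gzip.open(src, 'rb') as f_in, dest.open('wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest
```

The archive could first be fetched with the download method described earlier, and the saved file then passed to this helper.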