
pooch's Introduction

Pooch: A friend to fetch your data files

Documentation (latest) | Documentation (main branch) | Contributing | Contact

Part of the Fatiando a Terra project


About

Just want to download a file without messing with requests and urllib? Trying to add sample datasets to your Python package? Pooch is here to help!

Pooch is a Python library that can manage data by downloading files from a server (only when needed) and storing them locally in a data cache (a folder on your computer).

  • Pure Python and minimal dependencies.
  • Download files over HTTP, FTP, and from data repositories like Zenodo and figshare.
  • Built-in post-processors to unzip/decompress the data after download.
  • Designed to be extended: create custom downloaders and post-processors.

Are you a scientist or researcher? Pooch can help you too!

  • Host your data on a repository and download using the DOI.
  • Automatically download data using code instead of telling colleagues to do it themselves.
  • Make sure everyone running the code has the same version of the data files.

Projects using Pooch

SciPy, scikit-image, xarray, Ensaio, GemPy, MetPy, napari, Satpy, yt, PyVista, icepack, histolab, seaborn-image, Open AR-Sandbox, climlab, mne-python, GemGIS, SHTOOLS, MOABB, GeoViews, ScopeSim, Brainrender, pyxem, cellfinder, PVGeo, geosnap, BioCypher, cf-xarray, Scirpy, rembg, DASCore, scikit-mobility, Py-ART, HyperSpy, RosettaSciIO, eXSpy

If you're using Pooch, send us a pull request adding your project to the list.

Example

For a scientist downloading a data file for analysis:

import pooch
import pandas as pd

# Download a file and save it locally, returning the path to it.
# Running this again will not cause a download. Pooch will check the hash
# (checksum) of the downloaded file against the given value to make sure
# it's the right file (not corrupted or outdated).
fname_bathymetry = pooch.retrieve(
    url="https://github.com/fatiando-data/caribbean-bathymetry/releases/download/v1/caribbean-bathymetry.csv.xz",
    known_hash="md5:a7332aa6e69c77d49d7fb54b764caa82",
)

# Pooch can also download based on a DOI from certain providers.
fname_gravity = pooch.retrieve(
    url="doi:10.5281/zenodo.5882430/southern-africa-gravity.csv.xz",
    known_hash="md5:1dee324a14e647855366d6eb01a1ef35",
)

# Load the data with Pandas
data_bathymetry = pd.read_csv(fname_bathymetry)
data_gravity = pd.read_csv(fname_gravity)

For package developers including sample data in their projects:

"""
Module mypackage/datasets.py
"""
import pkg_resources
import pandas
import pooch

# Get the version string from your project. You have one of these, right?
from . import version

# Create a new friend to manage your sample data storage
GOODBOY = pooch.create(
    # Folder where the data will be stored. For a sensible default, use the
    # default cache folder for your OS.
    path=pooch.os_cache("mypackage"),
    # Base URL of the remote data store. Will call .format on this string
    # to insert the version (see below).
    base_url="https://github.com/myproject/mypackage/raw/{version}/data/",
    # Pooches are versioned so that you can use multiple versions of a
    # package simultaneously. Use PEP440 compliant version number. The
    # version will be appended to the path.
    version=version,
    # If the version has a "+XX.XXXXX" suffix, we'll assume that this is a
    # dev version and replace the version with this string.
    version_dev="main",
    # An environment variable that overwrites the path.
    env="MYPACKAGE_DATA_DIR",
    # The cache file registry. A dictionary with all files managed by this
    # pooch. Keys are the file names (relative to *base_url*) and values
    # are their respective SHA256 hashes. Files will be downloaded
    # automatically when needed (see fetch_gravity_data).
    registry={"gravity-data.csv": "89y10phsdwhs09whljwc09whcowsdhcwodcydw"}
)
# You can also load the registry from a file. Each line contains a file
# name and its SHA256 hash separated by a space. This makes it easier to
# manage large numbers of data files. The registry file should be packaged
# and distributed with your software.
GOODBOY.load_registry(
    pkg_resources.resource_stream("mypackage", "registry.txt")
)

# Define functions that your users can call to get back the data in memory
def fetch_gravity_data():
    """
    Load some sample gravity data to use in your docs.
    """
    # Fetch the path to a file in the local storage. If it's not there,
    # we'll download it.
    fname = GOODBOY.fetch("gravity-data.csv")
    # Load it with numpy/pandas/etc
    data = pandas.read_csv(fname)
    return data

Getting involved

🗨️ Contact us: Find out more about how to reach us at fatiando.org/contact.

👩🏾‍💻 Contributing to project development: Please read our Contributing Guide to see how you can help and give feedback.

🧑🏾‍🤝‍🧑🏼 Code of conduct: This project is released with a Code of Conduct. By participating in this project you agree to abide by its terms.

Imposter syndrome disclaimer: We want your help. No, really. There may be a little voice inside your head that is telling you that you're not ready, that you aren't skilled enough to contribute. We assure you that the little voice in your head is wrong. Most importantly, there are many valuable ways to contribute besides writing code.

This disclaimer was adapted from the MetPy project.

License

This is free software: you can redistribute it and/or modify it under the terms of the BSD 3-clause License. A copy of this license is provided in LICENSE.txt.

pooch's People

Contributors

andersy005, avalentino, bjoernludwigptb, danshapero, dependabot[bot], dokempf, dopplershift, fatiando-bot, genevievebuckley, hmaarrfk, hugovk, jlaehne, jni, jrleeman, kephale, leouieda, lmartinking, mathause, matthewturk, myd7349, pdurbin, rabernat, remram44, rob-luke, rowanc1, santisoler, sarthakjariwala, shoyer, xarthisius, zflamig


pooch's Issues

Add decompression processors

Description of the desired feature

Many times, downloaded files will be compressed to save space in storage and during download. But it's sometimes a hassle to decompress them every time before reading the file. In the case of netCDF grids, decompressing into a temporary file makes it impossible to have xarray lazily load the grid.

#59 introduced processor hooks and #72 introduced the Unzip processor. Following this logic, it would be good to have a Decompress processor that decompresses the downloaded file. It can guess the compression method from the file name or be given it explicitly as an __init__ argument.

The class would look something like this:

class Decompress:
    def __init__(self, method="auto"):
        self.method = method
    def __call__(self, fname, action, pooch):
        ...
        return decompressed
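
For concreteness, here's a rough sketch of how such a processor could be built on the standard-library compression modules. The ".decomp" suffix, the extension-to-module mapping, and the method names are illustrative choices, not a final API:

import bz2
import gzip
import lzma
import os
import shutil


class Decompress:
    """Sketch of a processor that decompresses a single downloaded file."""

    # Map file extensions to the standard-library modules that can open them.
    modules = {".gz": gzip, ".bz2": bz2, ".xz": lzma}

    def __init__(self, method="auto"):
        self.method = method

    def __call__(self, fname, action, pooch):
        decompressed = fname + ".decomp"
        # Only decompress on a fresh download/update or if the output is missing.
        if action in ("download", "update") or not os.path.exists(decompressed):
            if self.method == "auto":
                module = self.modules[os.path.splitext(fname)[-1]]
            else:
                module = {"gzip": gzip, "bzip2": bz2, "xz": lzma, "lzma": lzma}[self.method]
            with module.open(fname, "rb") as compressed, open(decompressed, "wb") as output:
                shutil.copyfileobj(compressed, output)
        return decompressed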

Re-enable Python 3.5 builds

Description of the desired feature

Scikit-image is looking into using Pooch (scikit-image/scikit-image#3945) but they need Python 3.5 still. The code works under 3.5 so there aren't any changes needed to get this working. It's mostly a matter of setting python_requires in setup.py and enabling the CI builds of 3.5.

The problem is that we won't be able to build packages for conda-forge since Pooch isn't a noarch package there (because of extra dependencies required for 2.7).

Pinging @stefanv @hmaarrfk. Is this something worth pursuing?

Add progress bar for downloads

Description of the desired feature

Add a progress bar for downloads using, say, tqdm or python-progressbar.
Getting this level of feedback is nice especially if you're on a slow connection.

Currently pooch issues a warning when something is being downloaded.
This is useful feedback because the dataset will only be downloaded the first time you fetch it.
If there's more than one way to provide feedback, some changes to the API might be necessary to specify which one.
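
As a rough sketch (not pooch's built-in API), a custom downloader with a tqdm progress bar could look like the following, using the (url, output_file, pooch) signature that custom downloaders receive:

import requests
from tqdm import tqdm


def download_with_progress(url, output_file, pooch):
    """Stream the download and report progress with a tqdm bar."""
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    total = int(response.headers.get("Content-Length", 0))
    with open(output_file, "wb") as outfile, tqdm(
        total=total, unit="B", unit_scale=True, desc=url.split("/")[-1]
    ) as progress:
        for chunk in response.iter_content(chunk_size=1024):
            outfile.write(chunk)
            progress.update(len(chunk))


# It would be used through the downloader argument, e.g.:
# GOODBOY.fetch("gravity-data.csv", downloader=download_with_progress)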

Are you willing to help implement and maintain this feature? Yes

Add an Untar processor

Description of the desired feature

The pooch.Unzip processor unpacks a zip archive to a folder in the local file system after it's downloaded. We need a pooch.Untar processor that does the same thing but for .tar archives. The tarfile module can be used with the bonus that it supports compressed archives out of the box.
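
A minimal sketch of what the processor could look like, assuming the same (fname, action, pooch) call signature as the existing processors; the ".untar" folder suffix is an illustrative choice:

import os
import tarfile


class Untar:
    """Sketch of a processor that unpacks a tar archive next to the download."""

    def __call__(self, fname, action, pooch):
        extract_dir = fname + ".untar"
        if action in ("download", "update") or not os.path.exists(extract_dir):
            # "r:*" lets tarfile handle .tar, .tar.gz, .tar.bz2, and .tar.xz alike.
            with tarfile.open(fname, "r:*") as archive:
                archive.extractall(path=extract_dir)
        # Return the unpacked file paths instead of the archive itself.
        return [
            os.path.join(path, name)
            for path, _, names in os.walk(extract_dir)
            for name in names
        ]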

Drop support for Python 2.7

Description of the desired feature

With Unidata/MetPy#1163 on its way and other projects that depend on us not using 2.7 any more, I think we can aim to ditch 2.7 on our next release. We probably don't need many code changes. A few things that come to mind:

  • Remove 2.7 builds from CIs
  • Remove 2.7 dependencies and edit setup.py
  • Remove import code for backports of some modules
  • Edit the conda-forge recipe to build noarch since we no longer need different dependencies depending on Python version.

Add mechanism for post download hooks

Feedback or description of feature requested

It would be great to provide a mechanism to specify functions that are run after a file is downloaded (further checks, unpacking into a cache folder, etc). This could be implemented as a callback function that receives the Pooch object (self) and the file name that was downloaded. It would be called from Pooch._download_file after the download is finished.
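
A minimal sketch of what such a callback could look like; the name and signature here are only a guess at the eventual hook API:

def post_download_hook(fname, pup):
    """
    Hypothetical hook called by Pooch._download_file after the download
    finishes, with the path to the downloaded file and the Pooch instance.
    """
    print("Finished downloading {} into {}".format(fname, pup.path))
    # Further checks, unpacking into the cache folder, etc. would go here.
    return fname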

Create a function to fetch a single file

Description of the desired feature

Consider the following use case:

I'm a scientist/educator writing a notebook and I need to get data from this one file on the internet. I'm on Jupyter Lab, possibly on a remote server. I don't know the hash of the file and don't need to download many files, just this one. I want the file to be cached locally and not downloaded multiple times. After the first download, I want to set the hash so that I can make sure future runs of the notebook use the same version of the data.

Right now, if you want to use Pooch for that you need to write this code:

from unittest import mock

url = "https://www.ngdc.noaa.gov/mgg/ocean_age/data/2008/grids/age/age.3.6.nc.bz2"
remote_file = url.split("/")[-1]

fname = pooch.create(
    path=pooch.os_cache("meh"), 
    base_url="", 
    registry={remote_file: mock.ANY}, 
    urls={remote_file: url}
).fetch(remote_file)

print(pooch.file_hash(fname))
# Copy and paste the hash in the registry above for future runs.

There is a lot of repetition and it feels like a hack. The same thing in fsspec would look like:

import fsspec

# fsspec.open returns a not-yet-open OpenFile object. The "with" context actually opens it.
with fsspec.open(url, mode="r") as f:
    data = f.read()  # Read in the data

which is much more readable but it doesn't seem to cache automatically (requiring further setup) and doesn't have a hash checking mechanism (at least not that I could find in the docs).

Proposal

Add a function pooch.retrieve that would make the code above:

fname = pooch.retrieve(url, hash=None)
# hash of downloaded file logged to screen and copy+pasted above for future runs
  • Automatically cache to a pooch.os_cache("pooch") location without further setup.
  • File names would be set to a short hash of the file content to make sure no overwriting happens.
  • The return is the file name only, keeping with the pooch way of not trying to do anything fancy.
  • Automatically printing out the hash makes it easier to download first then make sure future downloads are the same.
  • Cache location and file name can be set by user if desired.
  • Function takes downloader and processor arguments like pooch.Pooch.fetch.

This makes it easy to have all the features of Pooch for single file downloads in very little code.

The name pooch.fetch is nicer but it may cause confusion with pooch.Pooch.fetch so retrieve is a good alternative.

Are you willing to help implement and maintain this feature? Yes

Fix wrong import on "Training your Pooch"

Description of the problem

In the "Post-processing hooks" section there is an import of UnzipSingle which doesn't exist anymore and needs to be replaced with Unzip. The next import of Unzip isn't needed in that case.

Drop support for Python 3.5

Conda-forge is no longer building 3.5 packages since October 2018. We should probably remove it from our CIs and officially drop support for new releases. This will mean updating:

  • doc/install.rst
  • setup.py
  • .travis.yml and .appveyor.yml

Release schedule for dropping 2.7 and v1.0.0

The fall of 2019 is here and we can finally drop Python 2.7 (as we had promised). We already have #100 ready to be merged to do just that. We probably shouldn't drop 2.7 without a big warning and we have #97 that works on 2.7 but still hasn't been released. Here is a tentative schedule for the next 2 releases:

  • Make a 0.6.0 release now (before merging #100). It will include #97 and update the warning message to say that this is the last release supporting 2.7 (a PR adding this message is welcome). This version is pretty stable and we haven't broken backwards compatibility in a while.
  • After the release, merge #100 (maybe update warnings).
  • Release 1.0.0 🎊 between now and the end of 2019 with support for Python 3.5+ (and any new features/fixes implemented until then). Our API is pretty stable and we've been good about retaining backwards compatibility. The downloaders and processors mean that Pooch is easily extensible.

Of course, all of this is up for debate and I'm open to suggestions/critique.

JOSS Paper Review

xref: openjournals/joss-reviews#1943

Looks good.

  • Scientific software is also used to "Capture" or "Acquire" data
  • "The usual" -> "One common approach" when talking about smaller datasets.
  • The sentence that describes the challenge with including datasets in the github repo should come before the "Larger datasets [...]" sentence.
  • HTTP -> HTTPS
  • You state that people are recreating the same code, but maybe give a few examples?

Paragraph 2:

  • It isn't clear what the registry holds: the local file name, hash, and remote file URL should be clearly listed.
  • I think "minimal dependencies" is the selling point to packagers. The selling point to users is that it is easily installable with the standard Python distributions PyPI and conda, and a wide array of Python versions (2.7, 3.5->3.8).

Paragraph 3:

  • the functionality to unzip files is understated and should be made more explicit. It probably deserves a paragraph of its own.

Paragraph 4:

  • You should elaborate on what exactly makes intake harder to set up. It seems like Pooch is "files only", while intake is "files + metadata" which is a larger hurdle to overcome.

Paragraph 5:

  • I think you can show off a little bit by explaining the motivation in terms of the install size this would have on scikit-image. The numbers are listed in the PR.

I guess I'll go through the checklist when the review officially starts.

Add a section to Training your Pooch about Decompress

Description of the desired feature

Need an extra paragraph to the "Post-processing hooks" section explaining the difference between Decompress and Unzip/Untar. Mainly that it just decompresses a single file, while the others unpack the entire archive (or select members).

Unwanted ~ directory creation in current directory


When a Pooch object is created, an unwanted ~ directory is created in the working directory when using a pooch.os_cache() path.

Are you willing to help implement and maintain this feature? Yes

Full code that generated the error

When running the following simple code:

import pooch

GOODBOY = pooch.create(
    path=pooch.os_cache("pooch_test"),
    base_url="https://github.com/fatiando/pooch/raw/master/data/",
    registry={
        "tiny-data.txt":
            "baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d"
    }
)

fname = GOODBOY.fetch("tiny-data.txt")

I get the following output:

/home/santi/.anaconda3/envs/pooch_test/lib/python3.7/site-packages/pooch/core.py:259: UserWarning: Downloading data file 'tiny-data.txt' from remote data store 'https://github.com/fatiando/pooch/raw/master/data/' to '~/.cache/pooch_test'.
  action, fname, self.base_url, str(self.path)

And the file was correctly downloaded:

$ tree ~/.cache/pooch_test/
/home/santi/.cache/pooch_test/
└── tiny-data.txt

0 directories, 1 file

It seems that the function works as expected, but an unwanted ~ directory has been created in the current directory:

$ tree -a
.
└── ~
    └── .cache
        └── pooch_test

3 directories, 0 files

System information

  • Operating system: GNU/Linux (Manjaro XFCE)
  • Python installation (Anaconda, system, ETS): Anaconda
  • Version of Python: 3.7.0
  • Version of this package: v0.2.0
  • If using conda, paste the output of conda list below:
# packages in environment at /home/santi/.anaconda3/envs/pooch_test:
#
# Name                    Version                   Build  Channel
asn1crypto                0.24.0                py37_1003    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.10.15           ha4d7672_0    conda-forge
certifi                   2018.10.15            py37_1000    conda-forge
cffi                      1.11.5           py37h5e8e0c9_1    conda-forge
chardet                   3.0.4                 py37_1003    conda-forge
cryptography              2.3.1            py37hdffb7b8_0    conda-forge
idna                      2.7                   py37_1002    conda-forge
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
ncurses                   6.1                  hfc679d8_1    conda-forge
openssl                   1.0.2p               h470a237_1    conda-forge
packaging                 18.0                       py_0    conda-forge
pip                       18.1                  py37_1000    conda-forge
pooch                     0.2.0                 py37_1000    conda-forge
pycparser                 2.19                       py_0    conda-forge
pyopenssl                 18.0.0                py37_1000    conda-forge
pyparsing                 2.3.0                      py_0    conda-forge
pysocks                   1.6.8                 py37_1002    conda-forge
python                    3.7.0                h5001a0f_4    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
requests                  2.20.0                py37_1000    conda-forge
setuptools                40.5.0                   py37_0    conda-forge
six                       1.11.0                py37_1001    conda-forge
sqlite                    3.25.3               hb1c47c0_0    conda-forge
tk                        8.6.8                ha92aebf_0    conda-forge
urllib3                   1.23                  py37_1001    conda-forge
wheel                     0.32.2                   py37_0    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
zlib                      1.2.11               h470a237_3    conda-forge

Check availability of remote files

Feedback or description of feature requested

It would be nice to have a method inside Pooch to check if any file from the registry is available at the remote URL. The idea is not to fetch the entire file, but only to check whether it's available for download (like wget --spider does).

This feature could help to build test functions involving Pooch objects in the case of big files or huge registries. In some cases it's only necessary to test if the data is available online instead of downloading the entire set of files, which could take a lot of time.
Also, this feature could be a fast way to test if the host is up.
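
A rough sketch of how such a check could work for HTTP-hosted files, assuming the Pooch instance can resolve a registry entry to its download URL (get_url here):

import requests


def is_available(pup, fname):
    """Check that a registry entry can be downloaded without fetching it."""
    url = pup.get_url(fname)  # resolve the entry's full download URL
    # A HEAD request asks the server about the file without transferring it,
    # in the spirit of `wget --spider`.
    response = requests.head(url, allow_redirects=True, timeout=30)
    return response.status_code == 200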

Are you willing to help implement and maintain this feature? Yes

Unused pooch argument on Downloaders.__call__ methods

Description of the problem

Both the HTTPDownloader and FTPDownloader classes have a __call__ method used for downloading the desired file through the corresponding protocol. The method asks for the URL of the file, the output file it should be saved to, and a pooch argument.
According to the documentation, this pooch argument should be the Pooch instance that is calling the method, although it is not used inside it.
Is it an unused argument that should be removed or is there a reason for it?

Crash on import when running in parallel

Description of the problem

We recommend a workflow of keeping a Pooch as a global variable by calling pooch.create at import time. This is problematic when running in parallel and the cache folder has not been created yet. Each parallel process/thread will try to create it at the same time leading to a crash.

For our users, this is particularly bad since their packages crash on import even if there is no use of the sample datasets.

Not sure exactly what we can do on our end except:

  1. Add a warning in the docs about this issue.
  2. Change our recommended workflow to wrapping pooch.create in a function. This way the crash wouldn't happen at import but only when trying to fetch a dataset in parallel. This gives packages a chance to warn users not to load sample data for the first time in parallel. A minimal sketch of this pattern is shown below.
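
Reusing the illustrative names from the README example above (mypackage, gravity-data.csv), the wrapped workflow could look like this:

import pooch

from . import version


def _create_goodboy():
    """Create the sample-data Pooch only when it is actually needed."""
    return pooch.create(
        path=pooch.os_cache("mypackage"),
        base_url="https://github.com/myproject/mypackage/raw/{version}/data/",
        version=version,
        registry={"gravity-data.csv": "89y10phsdwhs09whljwc09whcowsdhcwodcydw"},
    )


def fetch_gravity_data():
    # The cache folder is only touched here, at fetch time, so importing the
    # package in parallel no longer races to create it.
    return _create_goodboy().fetch("gravity-data.csv")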

Perhaps an ideal solution would be for Pooch to create the cache folder at install time. But that would mean hooking into setup.py (which is complicated) and I don't know if we could even do this with conda.

Full code that generated the error

See scikit-image/scikit-image#4660

Create registry file without local copies of the files

Description of the desired feature

When trying to fetch a large number of data files from a website, it's generally faster to download them programmatically rather than doing it manually. Pooch seems like a nice tool to make this process easier without having to write code to download, store, and/or decompress the data before it can be loaded as we desire. Nevertheless, we need to have the hashes for the wanted files before Pooch is able to download them programmatically, which can be a problem if we don't actually have local copies of those files. So, the solution so far is to write a script to download the files, run pooch.make_registry() in order to compute the hashes, and create the registry file. After these steps Pooch is ready to be used for fetching any file we want.

It would be nice if we could spare ourselves from writing the downloader script and let Pooch manage the entire process by simply passing the files we would like to fetch and the base URL from which they can be downloaded. This feature would not be intended for the normal usage of Pooch, but for one time only: when we need to download the files in order to compute the hashes and create the registry file.

I'm thinking we could write a function that takes a pooch.Pooch instance without a registry, a list of the files to be downloaded, and the path where the registry file will be saved. It should download every file in the list from the base_url, compute the hashes, and then create the registry.
Maybe it could be a method of the pooch.Pooch class instead of an independent function.
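
A rough sketch of such a helper, built on pooch.retrieve (proposed in an issue above) and pooch.file_hash; the function name and arguments are illustrative only:

import pooch


def make_registry_from_remote(filenames, base_url, registry_file):
    """Download each file, hash it, and write the registry entries."""
    with open(registry_file, "w") as output:
        for filename in sorted(filenames):
            # known_hash=None skips the check since we don't know the hash yet.
            path = pooch.retrieve(url=base_url + filename, known_hash=None)
            output.write("{} {}\n".format(filename, pooch.file_hash(path)))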

Any idea or comment is welcome.

Are you willing to help implement and maintain this feature? Yes

Make error when hash doesn't match more informative

    def _check_download_hash(self, fname, downloaded):
        """
        Check the hash of the downloaded file against the one in the registry.
    
        Parameters
        ----------
        fname : str
            The file name in the registry.
        downloaded : str
            The full path to the downloaded file.
    
        Raises
        ------
        :class:`ValueError`
            If the hashes don't match.
    
        """
        tmphash = file_hash(downloaded)
        if tmphash != self.registry[fname]:
            raise ValueError(
                "Hash of downloaded file '{}' doesn't match the entry in the registry:"
                " Expected '{}' and got '{}'.".format(
>                   downloaded, self.registry[fname], tmphash
                )
            )
E           ValueError: Hash of downloaded file '/home/mark/.cache/scikit-image/tmpydrsiovh' doesn't match the entry in the registry: Expected '8cd81c5fccdbcca6b623a5f157e71b27e91907e667626a0e07da279745e12d19' and got '03097789a3dcbb0e40d20b9ef82537dbc3b670b6a7f2268d735470f22e003a91'.

I think the fname should be printed as well, since that is probably more informative than a temporary file name.

pooch doesn't install on Ubuntu 16.04.6 LTS

Description of the problem

pip install pooch or pip install . (from source) fails with an error.

Full code that generated the error

pip install pooch

Full error message

pip install pooch
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting pooch
  Using cached https://files.pythonhosted.org/packages/61/59/73a2b4788e1cde829e83277039ac70f2b8f1a4d73dd56c9070ce5c5ccc2c/pooch-0.5.1.tar.gz
Requirement already satisfied: requests in /usr/local/lib/python2.7/dist-packages (from pooch) (2.22.0)
Requirement already satisfied: packaging in /usr/local/lib/python2.7/dist-packages (from pooch) (19.0)
Requirement already satisfied: appdirs in /usr/local/lib/python2.7/dist-packages (from pooch) (1.4.3)
Requirement already satisfied: pathlib; python_version < "3.5" in /usr/local/lib/python2.7/dist-packages (from pooch) (1.0.1)
Requirement already satisfied: backports.tempfile; python_version < "3.5" in /usr/local/lib/python2.7/dist-packages (from pooch) (1.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python2.7/dist-packages (from requests->pooch) (1.25.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python2.7/dist-packages (from requests->pooch) (2019.3.9)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python2.7/dist-packages (from requests->pooch) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python2.7/dist-packages (from requests->pooch) (2.8)
Requirement already satisfied: six in /usr/local/lib/python2.7/dist-packages (from packaging->pooch) (1.12.0)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python2.7/dist-packages (from packaging->pooch) (2.4.0)
Requirement already satisfied: backports.weakref in /usr/local/lib/python2.7/dist-packages (from backports.tempfile; python_version < "3.5"->pooch) (1.0.post1)
Building wheels for collected packages: pooch
  Building wheel for pooch (setup.py) ... error
  ERROR: Complete output from command /usr/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-ww2Yi8/pooch/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-E84pYU --python-tag cp27:
  ERROR: /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
    warnings.warn(msg)
  /usr/lib/python2.7/dist-packages/setuptools/dist.py:285: UserWarning: Normalizing 'v0.5.1' to '0.5.1'
    normalized_version,
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-2.7
  creating build/lib.linux-x86_64-2.7/pooch
  copying pooch/processors.py -> build/lib.linux-x86_64-2.7/pooch
  copying pooch/downloaders.py -> build/lib.linux-x86_64-2.7/pooch
  copying pooch/core.py -> build/lib.linux-x86_64-2.7/pooch
  copying pooch/utils.py -> build/lib.linux-x86_64-2.7/pooch
  copying pooch/_version.py -> build/lib.linux-x86_64-2.7/pooch
  copying pooch/__init__.py -> build/lib.linux-x86_64-2.7/pooch
  copying pooch/version.py -> build/lib.linux-x86_64-2.7/pooch
  creating build/lib.linux-x86_64-2.7/pooch/tests
  copying pooch/tests/test_utils.py -> build/lib.linux-x86_64-2.7/pooch/tests
  copying pooch/tests/utils.py -> build/lib.linux-x86_64-2.7/pooch/tests
  copying pooch/tests/test_processors.py -> build/lib.linux-x86_64-2.7/pooch/tests
  copying pooch/tests/test_core.py -> build/lib.linux-x86_64-2.7/pooch/tests
  copying pooch/tests/test_integration.py -> build/lib.linux-x86_64-2.7/pooch/tests
  copying pooch/tests/__init__.py -> build/lib.linux-x86_64-2.7/pooch/tests
  creating build/lib.linux-x86_64-2.7/pooch/tests/data
  copying pooch/tests/data/tiny-data.zip -> build/lib.linux-x86_64-2.7/pooch/tests/data
  copying pooch/tests/data/tiny-data.txt -> build/lib.linux-x86_64-2.7/pooch/tests/data
  copying pooch/tests/data/store.tar.gz -> build/lib.linux-x86_64-2.7/pooch/tests/data
  copying pooch/tests/data/tiny-data.tar.gz -> build/lib.linux-x86_64-2.7/pooch/tests/data
  error: can't copy 'pooch/tests/data/store': doesn't exist or not a regular file
  ----------------------------------------
  ERROR: Failed building wheel for pooch
  Running setup.py clean for pooch
Failed to build pooch
Installing collected packages: pooch
  Running setup.py install for pooch ... error
    ERROR: Complete output from command /usr/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-ww2Yi8/pooch/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-XDe0Vd/install-record.txt --single-version-externally-managed --compile:
    ERROR: /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
      warnings.warn(msg)
    /usr/lib/python2.7/dist-packages/setuptools/dist.py:285: UserWarning: Normalizing 'v0.5.1' to '0.5.1'
      normalized_version,
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-2.7
    creating build/lib.linux-x86_64-2.7/pooch
    copying pooch/processors.py -> build/lib.linux-x86_64-2.7/pooch
    copying pooch/downloaders.py -> build/lib.linux-x86_64-2.7/pooch
    copying pooch/core.py -> build/lib.linux-x86_64-2.7/pooch
    copying pooch/utils.py -> build/lib.linux-x86_64-2.7/pooch
    copying pooch/_version.py -> build/lib.linux-x86_64-2.7/pooch
    copying pooch/__init__.py -> build/lib.linux-x86_64-2.7/pooch
    copying pooch/version.py -> build/lib.linux-x86_64-2.7/pooch
    creating build/lib.linux-x86_64-2.7/pooch/tests
    copying pooch/tests/test_utils.py -> build/lib.linux-x86_64-2.7/pooch/tests
    copying pooch/tests/utils.py -> build/lib.linux-x86_64-2.7/pooch/tests
    copying pooch/tests/test_processors.py -> build/lib.linux-x86_64-2.7/pooch/tests
    copying pooch/tests/test_core.py -> build/lib.linux-x86_64-2.7/pooch/tests
    copying pooch/tests/test_integration.py -> build/lib.linux-x86_64-2.7/pooch/tests
    copying pooch/tests/__init__.py -> build/lib.linux-x86_64-2.7/pooch/tests
    creating build/lib.linux-x86_64-2.7/pooch/tests/data
    copying pooch/tests/data/tiny-data.zip -> build/lib.linux-x86_64-2.7/pooch/tests/data
    copying pooch/tests/data/tiny-data.txt -> build/lib.linux-x86_64-2.7/pooch/tests/data
    copying pooch/tests/data/store.tar.gz -> build/lib.linux-x86_64-2.7/pooch/tests/data
    copying pooch/tests/data/tiny-data.tar.gz -> build/lib.linux-x86_64-2.7/pooch/tests/data
    error: can't copy 'pooch/tests/data/store': doesn't exist or not a regular file
    ----------------------------------------
ERROR: Command "/usr/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-ww2Yi8/pooch/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-XDe0Vd/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-ww2Yi8/pooch/

System information

  • Ubuntu 16.04.6 LTS
  • Python 2.7.2 (sudo apt-get install python)

JOSS paper

I hadn't been considering getting a JOSS paper for Pooch because I thought it wouldn't qualify as "research software". But I just noticed that tqdm has a JOSS paper, so I think we can make a case for Pooch. After all, other research software would cite Pooch in their papers (ideally). What do others think? Is this worth pursuing?

The criteria for authorship that seems reasonable to me would be:

  1. Have contributed code or documentation to Pooch beyond typo fixes
  2. Read/edit the JOSS paper and give their 👍

This would basically mean anyone listed on GitHub who at least reads and OKs the paper.

As for the order, we could go with the GitHub contributor order (I think it's by number of commits?), and go alphabetical for a tie break. But I'm not entirely sure how to proceed with this one. Any input would be welcome.

Right now, potential authors would be me @santisoler @remram44 @hugovk @matthewturk @jrleeman @matthewturk @danshapero. If any of you are interested, please reply to this thread with your consent, affiliation, and ORCID. Also, whether you think it's worth submitting (the paper itself will take ~1h to write copying from our docs).

Delete archives, retain the extracted files, and don't re-download

Description of the desired feature

For big archives, it is sometimes desirable to delete the initial archive file (against which the hash is made), retain the extracted files, and not require a full re-download the next time it's used.

Granted, this opens up a vector of corruption, where the uncompressed files might be modified, but I think this is unlikely to be a problem.

It might be possible to do this by having a zero-size stamp or something that says, "verified," but I don't really know what would fit best.

A common pattern in yt is:

ds = yt.load_sample("IsolatedGalaxy")

and this extracts a .tar.gz file. But, it's a couple hundred megs. So sometimes we might want to kill the intermediate archive once it's there, so that it doesn't double-up on storage requirements. (One could even imagine a remove_intermediate option!) We'd have to record that the file doesn't need to be re-obtained, though.

Does this seem like a possibility?

Are you willing to help implement and maintain this feature? Yes

Project logo

Pooch currently has no logo. I'm not the best designer and would really welcome any contributions or ideas on this front. A few requirements of the logo:

  • Simple design with the minimum number of colors
  • Look good in large and small sizes
  • Work with dark or light backgrounds
  • Square proportions (so it can work in a banner, profile picture, favicon, sticker, etc)

Raise exception instead of warning when cache isn't writable

Description of the desired feature

As pointed out by @danshapero in #149, when the cache folder isn't writable we warn the user through an error message but don't re-raise the PermissionError that is generated. This might be OK for interactive environments like Jupyter but it would cause an error for a script and the actual warning would be lost in the traceback.

A more sensible thing to do would be to add the message we create to the exception and re-raise it instead of logging.

The code for this is in pooch/core.py in the make_local_storage function (after #149 is merged).

How do you pull data from master?

I wanted to use this package to just pull down data from a separate repository, so having it just pull from master is fine. When I set version='master', I get an error.

Any ideas how to get around this?

Fetching data that requires a username + password

Some data sets, such as those hosted by the Global Hydrology Resource Center or the National Snow and Ice Data Center, require a username and password for access. It would be really convenient if pooch had a way to ask the user for a password when the example data require it. The repo owner would probably have to specify something in the registry to let pooch know that authentication needs to happen.

The requests library includes support for authentication, which should be useful. I can help implement this but I don't have a great idea of what the API should be.
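
A sketch of one possible approach, assuming pooch.HTTPDownloader forwards extra keyword arguments such as auth to requests; the GOODBOY pooch and the protected-data.csv entry are placeholders borrowed from the README example:

import getpass

import pooch

username = input("Username: ")
password = getpass.getpass("Password: ")
# The credentials are handed to the downloader as a requests-style auth tuple.
download_auth = pooch.HTTPDownloader(auth=(username, password))

fname = GOODBOY.fetch("protected-data.csv", downloader=download_auth)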

Wrap all docstrings to 79 characters

Description of the desired feature

After fatiando/verde#177 raised by @prisae, fatiando/community#9 and fatiando/community#10 it has been decided that all docstrings must be wrapped to 79 characters per line.
We should wrap all the existing docstrings in Pooch to 79 characters and also configure flake8 to check for this when running the style checks.

Sadly, there's no way (yet) to automatically change all docstrings, but at least we can use flake8 to flag lines that fail. This can be done by setting max-doc-length to 79 characters in the flake8 configuration under setup.cfg.

Typo in Pooch docstring

Typo in docstring

There's a typo in the load_registry method of the Pooch class on this line: it should say from instead of form.

Are you willing to help implement and maintain this feature? Yes

Try using Pooch in MetPy

Use Pooch to replace the MetPy data caching system. This will help test the usefulness of loading a registry from a file since they have a ton of data.

Refactor the Pooch class to make it easily subclassed

From @remram44's comment on #4, it would be great to expose more of the inner workings of the Pooch class as methods. This would make it easy to subclass it and overwrite those methods. The _download_file method should also be made public (remove the leading _).

Any ideas and suggestions on what should be refactored are welcome.

Delay cache folder creation to Pooch.fetch

Description of the desired feature

pooch.create tries to ensure that the cache folder exists by creating it and trying to write to it through pooch.utils.make_local_storage. This can cause problems if users call create to make a module global variable (since it will run at import time). If the import is run in parallel, this can lead to a race condition. #171 tried to fix this but it's not ideal since we shouldn't be writing to disk at import time anyway. A better approach would be to delay the folder creation to Pooch.fetch (which already does this https://github.com/fatiando/pooch/blob/master/pooch/core.py#L548).

In this case, make_local_storage needs to be modified to only create the folder name (it would probably need to be renamed). This would also make #150 and comments in there unnecessary since we don't need to check for permission error if we're not writing to disk until fetch time. retrieve will also need to be modified to create the cache folder as well (like fetch does). We might want to wrap the folder creation so that we can check for permission error and add a message about the environment variable to the exception.

There might be unintended consequences to not checking if the cache folder is writable at import time. #22 is when we decided to do this in the first place.

See scikit-image/scikit-image#4660

Are you willing to help implement and maintain this feature? Yes

SFTP Downloader

Description of the desired feature

A while ago, we added an FTP Downloader to pooch. I've been wondering whether it would be reasonable to add an SFTP Downloader as well. For SSH Authentication, I am thinking that we could use paramiko.

Are you willing to help implement and maintain this feature? Yes

In case there's no demand/interest for this feature, I am happy to wait until there is a need.

Use real code in the tutorials

The tutorials (Training your Pooch) don't have code that is actually run. This is not ideal because the examples can't be copied and pasted. We should convert the tutorial to set up a real Pooch using our test data (in the data folder of the repository). Probably best to use doctests since we're not making any plots and then the output can be tested.

Check for empty lines in registry files

Description of the desired feature

Pooch reads in the files and their hashes from registry files using Pooch.load_registry. Right now, it fails if there is a blank line anywhere in the file (see https://github.com/fatiando/pooch/blob/master/pooch/core.py#L444).

Ideally, the code should skip any lines for which line.strip() is empty. While we're at it, the line number reported in the error message is off by one because enumerate starts from 0 so we should add one to it.
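
A sketch of how the parsing loop could look with both fixes applied (this is not the exact code from pooch/core.py):

def load_registry(self, fname):
    """Read file names and hashes, skipping blank lines."""
    with open(fname) as registry_file:
        # Start counting at 1 so error messages report human-friendly line numbers.
        for linenum, line in enumerate(registry_file, start=1):
            if not line.strip():
                # Skip blank lines instead of raising an error on them.
                continue
            elements = line.split()
            if len(elements) != 2:
                raise OSError(
                    "Expected 'file_name hash' in line {} of '{}'.".format(linenum, fname)
                )
            file_name, file_sha256 = elements
            self.registry[file_name] = file_sha256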

Use logging instead of warnings for user feedback

Description of the desired feature

Pooch currently uses warnings to give feedback to users when a file is being downloaded, an archive unpacked, etc. The logging module from the Python standard library is a better fit for this purpose (a rough sketch follows the list below).

  • More easily adjust how important the feedback is -- downloading a file is information, not having permission to create a destination directory is a warning or error
  • Users can choose to silence feedback from pooch without silencing warnings from other libraries
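
A rough sketch of the logging setup, using only the standard library; the logger name and levels are illustrative:

import logging

# A dedicated logger that users can tune or silence independently of warnings
# from other libraries, e.g. logging.getLogger("pooch").setLevel(logging.WARNING)
logger = logging.getLogger("pooch")
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

# Instead of warnings.warn(...) when a download starts:
logger.info(
    "Downloading data file '%s' from remote data store '%s' to '%s'.",
    "tiny-data.txt",
    "https://github.com/fatiando/pooch/raw/master/data/",
    "~/.cache/pooch_test",
)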

Are you willing to help implement and maintain this feature? Yes

Comments in registry file?

It could be worthwhile to support comments in registry files. For simplicity, one could require that comments be on their own line:

# this resource
c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
# that resource
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w

However, it's probably not really possible to use this in a backward-compatible way...

Are you willing to help implement and maintain this feature? Yes

Use the XDG_CACHE_HOME environment variable if it exists

Following the discussion in #26, we need to make some changes to the os_cache function in pooch/utils.py to be compliant with the FreeDesktop standards (a sketch is included after the list):

  • Add an option check_environment to the function that defaults to False
  • If check_environment == True, check if the XDG_CACHE_HOME environment variable is defined.
  • If it is, use it to define the default cache location instead of ~/.cache on Linux.
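
A sketch of what the modified function could look like, assuming appdirs (already a pooch dependency) keeps providing the platform default:

import os
from pathlib import Path

import appdirs


def os_cache(project, check_environment=False):
    """Default cache location, optionally honoring XDG_CACHE_HOME."""
    if check_environment and os.environ.get("XDG_CACHE_HOME"):
        return Path(os.environ["XDG_CACHE_HOME"]) / project
    # Fall back to the platform default (~/.cache/<project> on Linux).
    return Path(appdirs.user_cache_dir(project))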

Convenience functions to list downloaded datasets, to delete them, and to show their location

Many datasets that will be downloaded by pooch will be very large and might only be used once by the user. It would be useful to have some convenience functions to list those datasets that have been downloaded, to delete all (or some) of them, to determine where they are stored, and to give the size of the downloaded files.

Convenience functions that one might want to add to a datasets module might include the following (a rough sketch follows the list):

  • datasets_list()
  • datasets_clear()
  • datasets_path()
  • datasets_size()
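
A rough sketch of what these could look like for a package that already defines a GOODBOY pooch like the README example, using its abspath (the resolved cache folder):

import os
import shutil


def datasets_path():
    """Folder where the sample data are stored on this machine."""
    return GOODBOY.abspath


def datasets_list():
    """Names of registered files that have already been downloaded."""
    return [name for name in GOODBOY.registry if (GOODBOY.abspath / name).exists()]


def datasets_size():
    """Total size (in bytes) of the downloaded files."""
    return sum(os.path.getsize(GOODBOY.abspath / name) for name in datasets_list())


def datasets_clear():
    """Delete the local cache folder and everything in it."""
    shutil.rmtree(GOODBOY.abspath, ignore_errors=True)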

Also, for the maintainer of a datasets module, it would be useful to have a simple function that would verify that all the datasets are indeed downloadable (i.e., the file exists). Oftentimes, files get moved around on a server over 5 year time spans.

Add an automodule for the top level package

The API docs only use autodoc on the functions and classes. So a link to the top level pooch package like :mod:`pooch` won't work. Need to add an .. automodule:: pooch to docs/api/index.rst.

Data in more than one remote source

Right now, we assume that all the sample data is in the same remote source. This might not be the case for many projects if they are using publicly available data. It would be great to be able to specify a remote URL for individual entries in the registry (which should probably not be versioned).

One way of doing this that might not require much refactoring is to include a new attribute, Pooch.registry_urls, that is a dictionary mapping file names to remote URLs where they can be downloaded. We should not append the file name to this URL. The Pooch._download_file method can then check if fname is in self.registry_urls and use self.registry_urls[fname] as the download source instead.
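
A sketch of how the override could look from the user's side; the keyword name (urls) and the placeholder entries are assumptions about the eventual API rather than existing behavior:

import pooch

GOODBOY = pooch.create(
    path=pooch.os_cache("mypackage"),
    base_url="https://github.com/myproject/mypackage/raw/{version}/data/",
    version="0.1.0",
    registry={
        "gravity-data.csv": "89y10phsdwhs09whljwc09whcowsdhcwodcydw",
        "sea-surface-temperature.nc": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
    },
    # Full download URLs for entries that don't live under base_url.
    # The file name is *not* appended to these.
    urls={
        "sea-surface-temperature.nc": "https://www.example.com/data/sst.nc",
    },
)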

Finer control over the extraction paths

Description of the desired feature

Currently, when the user downloads and unpacks an archive, e.g. with this code:

data_0 = pooch.create(
    path=SOME_PATH,
    base_url=SOME_URL,
    registry={"d1.tar": None, },
)

def fetch_data_0():
    download = pooch.HTTPDownloader()
    unpack = pooch.Untar()

    for item in ("d1.tar", ):
        _ = data_0.fetch(
            item,
            processor=unpack,
            downloader=download)

the archive is extracted into SOME_PATH/d1.tar.untar/ARCHIVE_FILES.

I think it would be great to have a possibility:

  1. (preferred) to do flat unpacking, such that the files are put into SOME_PATH/ARCHIVE_FILES
  2. to change the suffix in the preprocessor, e.g. to have SOME_PATH/d1.tar/ARCHIVE_FILES. I understand that this option may be challenging when multiple preprocessors are chained, since they, probably, look for the intermediate directories named in a certain way.

Are you willing to help implement and maintain this feature? Yes

pooch.create should test for write access to directories and have a failback for failures

See the real-world implementation problem in Unidata/MetPy#933.

pooch.create can fail to write to whatever cache directory it is presented with. A fallback mechanism should exist, like what matplotlib does when it attempts to find writable locations. There is also an XDG Base Directory Specification for this.

POOCH = pooch.create(
    path=pooch.os_cache('metpy'),
    base_url='https://github.com/Unidata/MetPy/raw/{version}/staticdata/',
    version='v' + __version__,
    version_dev='master',
    env='TEST_DATA_DIR')

Full error message

File "/opt/miniconda3/envs/prod/lib/python3.6/site-packages/pooch/core.py", line 143, in create
   os.makedirs(versioned_path)
 File "/opt/miniconda3/envs/prod/lib/python3.6/os.py", line 210, in makedirs
     makedirs(head, mode, exist_ok)
  File "/opt/miniconda3/envs/prod/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/opt/miniconda3/envs/prod/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/usr/share/httpd/.cache'

thank you
