Comments (9)
For the reference: The original bug that let to using shlex
is #313.
from pooch.
Hi @bmcfee, thanks for opening this issue.
I'm sorry that the last release broke backward compatibility. This is something we always try to avoid through exhaustive testing, but as you could tell we weren't considering the case where the filenames contain apostrophes.
I think the issue you are facing is part of a larger discussion: what type of standard do we want to use for filenames within Pooch. After we merged #315 I think we inadvertently established that we would follow the POSIX standard (see Parsing rules in shlex when POSIX is True).
Therefore, if a filename contains a special character, like a space or an apostrophe, we can always use the backslash to escape it. In your particular case, you can modify your registry file to:
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.hq.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.txt ...
and it should work. We use the same strategy for spaces, as was introduced in #315.
If we try to modify the load_registry()
method after every time we encounter a corner case that isn't supported, I think we'll be modifying its code after every release, which isn't sustainable and it's prone to introduce more backward compatibility breaks.
So, in my opinion, special characters like spaces, double quotes or apostrophes in filenames should be handled by inserting a backslash before them.
What do you think? Does it sounds like a good compromise?
from pooch.
@bmcfee thanks for being very understanding with this. I think it's best to assume the POSIX syntax is intended (though it was originally an oversight). make_registry
wasn't updated to work with the new syntax, though, as you pointed out. That still needs to be done.
from pooch.
Confirmed that this is due to this line:
Line 660 in a965902
For example, the 1.6.0 behavior would have been:
In [2]: line = "Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"
In [3]: line.split()
Out[3]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]
In 1.7, it becomes:
In [11]: shlex.split(line)
Out[11]: ['Karissa_Hobbs_-_Lets_Go_Fishin.ogg']
(note the missing single-quotes)
A quick fix here is to use shlex.quote
:
In [12]: shlex.split(shlex.quote(line))
Out[12]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]
I don't know if this will be safe across platforms, but it would at least fix the bug I'm seeing.
from pooch.
I think the use of shlex.quote
would still fix #313:
In [2]: shlex.split(shlex.quote("foo' bar baz'.txt"))
Out[2]: ["foo' bar baz'.txt"]
from pooch.
Ok, I started prepping a PR for this and realized that it's a bit more subtle than just sticking shlex.quote in there, since it will apply to the entire line. If a filename ends with a quote, it will muck up the space parsing, and the whole parser falls apart.
The problem as I see it boils down to some ambiguity in the format of the registry file, in particular due to the option of custom urls as a third token. We don't know in advance how many spaces there will be in a line, and that makes splitting tricky.
If we didn't have custom urls, it would be enough to do:
filename, hash = line.rsplit(' ', maxsplit=2)
where rsplit will group all but the last space separator into filename
.
If we always have a custom URL, it would be equally simple:
filename, hash, url = line.rsplit(' ', maxsplit=3)
but this will be incorrect if the url is absent and the filename contains a space.
What we really need is a way to first check if the final token is a hash or a url, and then act accordingly:
# Assuming URLs in registry files are properly URL-encoded, i.e. ' ' → '%20'
prefix, last = line.rsplit(' ', maxsplit=2)
if is_url(last):
url = last
filename, hash = prefix.split(' ', maxsplit=2)
else:
filename, hash, url = prefix, last, None
Now, there's a problem that URLs (per RFC 3986) if not fully formed can look an awful lot like hashes. We can't rely on detecting just a :
because this is also valid in hashes (md5:abcdef...
).
We also can't get away with detecting :/
, or even ://
because this would rule out DOI urls.
I suspect the simplest solution would be to explicitly match URLs against a subset of supported schemes (http:, https:, ftp:, sftp:, doi:) and treat anything else as if it's a hash. If it's not actually a hash, this will eventually fail later on.
Does this sound reasonable? If so, I'd be happy to prep a PR implementing the above strategy.
from pooch.
Thanks @santisoler - I agree with that assessment. I'm totally happy if pooch wants to explicitly adopt POSIX-style syntax for filenames, and I think the confusion only arose because the syntax was implicitly defined (1.7) or under-specified (<1.6). I opened the PR because it wasn't clear if this was intended behavior or a bug to be fixed, but if it's intended behavior we can close it out.
I think what this means for my project in particular is that we need to put out a post-release to upper-bound the supported pooch versions, and then revise our registry file with an updated lower bound.
One question though: how does this formatting work when creating registry files from a pooch object? I don't see any logic for escaping filenames on write (make_registry
) in #315 , and from a glance at
Lines 218 to 222 in a965902
from pooch.
No worries - sounds like a good plan to me!
from pooch.
Alright, closing this one since we have #374 and #369 to handle the issues identified here. Thanks again for reporting @bmcfee!
from pooch.
Related Issues (20)
- DataVerse file DOIs in pooch.retrieve HOT 1
- Download from URL with expired SSL certificate HOT 4
- Problem accessing public files on Dropbox HOT 4
- Add support for downloading from AWS S3 butckets HOT 1
- If Unzip/Untar fails because of a non-existent member, calling without members won't work HOT 2
- Feature: allow null hash in registry files HOT 3
- Make sure that any test that uses the network is properly marked HOT 1
- pooch.make_registry does not handle spaces in filenames correctly HOT 1
- `KeyError: 'key'` in ZenodoRepository.download_url() after Zenodo migration HOT 8
- Mention in docs that registry files must use POSIX compatible file names
- Release v1.8.0 HOT 1
- Sorry wrong repo for the issue HOT 1
- fatiando.org is down HOT 2
- Add "git blob hash" as available hash HOT 3
- Add or document support for Dryad via DOIDownloader HOT 1
- Add support for downloading from Azure cloud storage HOT 2
- Release v1.8.1 HOT 2
- Add support for downloading from Google Cloud Storage HOT 5
- Increase timeout on HTTP requests and use a variable HOT 12
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pooch.