Code Monkey home page Code Monkey logo

Comments (9)

dokempf avatar dokempf commented on June 17, 2024 1

For the reference: The original bug that let to using shlex is #313.

from pooch.

santisoler avatar santisoler commented on June 17, 2024 1

Hi @bmcfee, thanks for opening this issue.

I'm sorry that the last release broke backward compatibility. This is something we always try to avoid through exhaustive testing, but as you could tell we weren't considering the case where the filenames contain apostrophes.

I think the issue you are facing is part of a larger discussion: what type of standard do we want to use for filenames within Pooch. After we merged #315 I think we inadvertently established that we would follow the POSIX standard (see Parsing rules in shlex when POSIX is True).

Therefore, if a filename contains a special character, like a space or an apostrophe, we can always use the backslash to escape it. In your particular case, you can modify your registry file to:

Karissa_Hobbs_-_Let\'s_Go_Fishin\'.hq.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.txt ...

and it should work. We use the same strategy for spaces, as was introduced in #315.

If we try to modify the load_registry() method after every time we encounter a corner case that isn't supported, I think we'll be modifying its code after every release, which isn't sustainable and it's prone to introduce more backward compatibility breaks.

So, in my opinion, special characters like spaces, double quotes or apostrophes in filenames should be handled by inserting a backslash before them.

What do you think? Does it sounds like a good compromise?

from pooch.

leouieda avatar leouieda commented on June 17, 2024 1

@bmcfee thanks for being very understanding with this. I think it's best to assume the POSIX syntax is intended (though it was originally an oversight). make_registry wasn't updated to work with the new syntax, though, as you pointed out. That still needs to be done.

from pooch.

bmcfee avatar bmcfee commented on June 17, 2024

Confirmed that this is due to this line:

elements = shlex.split(line)

For example, the 1.6.0 behavior would have been:

In [2]: line = "Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"

In [3]: line.split()
Out[3]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]

In 1.7, it becomes:

In [11]: shlex.split(line)
Out[11]: ['Karissa_Hobbs_-_Lets_Go_Fishin.ogg']

(note the missing single-quotes)

A quick fix here is to use shlex.quote:

In [12]: shlex.split(shlex.quote(line))
Out[12]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]

I don't know if this will be safe across platforms, but it would at least fix the bug I'm seeing.

from pooch.

bmcfee avatar bmcfee commented on June 17, 2024

I think the use of shlex.quote would still fix #313:

In [2]: shlex.split(shlex.quote("foo' bar baz'.txt"))
Out[2]: ["foo' bar baz'.txt"]

from pooch.

bmcfee avatar bmcfee commented on June 17, 2024

Ok, I started prepping a PR for this and realized that it's a bit more subtle than just sticking shlex.quote in there, since it will apply to the entire line. If a filename ends with a quote, it will muck up the space parsing, and the whole parser falls apart.

The problem as I see it boils down to some ambiguity in the format of the registry file, in particular due to the option of custom urls as a third token. We don't know in advance how many spaces there will be in a line, and that makes splitting tricky.

If we didn't have custom urls, it would be enough to do:

filename, hash = line.rsplit(' ', maxsplit=2)

where rsplit will group all but the last space separator into filename.

If we always have a custom URL, it would be equally simple:

filename, hash, url = line.rsplit(' ', maxsplit=3)

but this will be incorrect if the url is absent and the filename contains a space.

What we really need is a way to first check if the final token is a hash or a url, and then act accordingly:

# Assuming URLs in registry files are properly URL-encoded, i.e. ' ' → '%20'
prefix, last = line.rsplit(' ', maxsplit=2)
if is_url(last):
    url = last
    filename, hash = prefix.split(' ', maxsplit=2)
else:
    filename, hash, url = prefix, last, None

Now, there's a problem that URLs (per RFC 3986) if not fully formed can look an awful lot like hashes. We can't rely on detecting just a : because this is also valid in hashes (md5:abcdef...).

We also can't get away with detecting :/, or even :// because this would rule out DOI urls.

I suspect the simplest solution would be to explicitly match URLs against a subset of supported schemes (http:, https:, ftp:, sftp:, doi:) and treat anything else as if it's a hash. If it's not actually a hash, this will eventually fail later on.

Does this sound reasonable? If so, I'd be happy to prep a PR implementing the above strategy.

from pooch.

bmcfee avatar bmcfee commented on June 17, 2024

Thanks @santisoler - I agree with that assessment. I'm totally happy if pooch wants to explicitly adopt POSIX-style syntax for filenames, and I think the confusion only arose because the syntax was implicitly defined (1.7) or under-specified (<1.6). I opened the PR because it wasn't clear if this was intended behavior or a bug to be fixed, but if it's intended behavior we can close it out.

I think what this means for my project in particular is that we need to put out a post-release to upper-bound the supported pooch versions, and then revise our registry file with an updated lower bound.

One question though: how does this formatting work when creating registry files from a pooch object? I don't see any logic for escaping filenames on write (make_registry) in #315 , and from a glance at

pooch/pooch/hashes.py

Lines 218 to 222 in a965902

with open(output, "w") as outfile:
for fname, fhash in zip(files, hashes):
# Only use Unix separators for the registry so that we don't go
# insane dealing with file paths.
outfile.write("{} {}\n".format(fname.replace("\\", "/"), fhash))
I expect it will not work as expected currently.

from pooch.

bmcfee avatar bmcfee commented on June 17, 2024

No worries - sounds like a good plan to me!

from pooch.

leouieda avatar leouieda commented on June 17, 2024

Alright, closing this one since we have #374 and #369 to handle the issues identified here. Thanks again for reporting @bmcfee!

from pooch.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.