Deion of the problem: This just popped up in

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Confirmed that this is due to this line: <div class="Box Box--co

I think the use of shlex.quote would still fix <a cla

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Alright, closing this one since we have <a class="issue-link js-issue-link" data-error

Pooch 1.7.0 fails when filenames have apostrophes about pooch HOT 9 CLOSED

bmcfee commented on June 26, 2024 1

Pooch 1.7.0 fails when filenames have apostrophes

from pooch.

Comments (9)

dokempf commented on June 26, 2024 1

For the reference: The original bug that let to using shlex is #313.

from pooch.

santisoler commented on June 26, 2024 1

Hi @bmcfee, thanks for opening this issue.

I'm sorry that the last release broke backward compatibility. This is something we always try to avoid through exhaustive testing, but as you could tell we weren't considering the case where the filenames contain apostrophes.

I think the issue you are facing is part of a larger discussion: what type of standard do we want to use for filenames within Pooch. After we merged #315 I think we inadvertently established that we would follow the POSIX standard (see Parsing rules in shlex when POSIX is True).

Therefore, if a filename contains a special character, like a space or an apostrophe, we can always use the backslash to escape it. In your particular case, you can modify your registry file to:

Karissa_Hobbs_-_Let\'s_Go_Fishin\'.hq.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.ogg ...
Karissa_Hobbs_-_Let\'s_Go_Fishin\'.txt ...

and it should work. We use the same strategy for spaces, as was introduced in #315.

If we try to modify the load_registry() method after every time we encounter a corner case that isn't supported, I think we'll be modifying its code after every release, which isn't sustainable and it's prone to introduce more backward compatibility breaks.

So, in my opinion, special characters like spaces, double quotes or apostrophes in filenames should be handled by inserting a backslash before them.

What do you think? Does it sounds like a good compromise?

from pooch.

leouieda commented on June 26, 2024 1

@bmcfee thanks for being very understanding with this. I think it's best to assume the POSIX syntax is intended (though it was originally an oversight). make_registry wasn't updated to work with the new syntax, though, as you pointed out. That still needs to be done.

from pooch.

bmcfee commented on June 26, 2024

Confirmed that this is due to this line:

pooch/pooch/core.py

Line 660 in a965902

elements = shlex.split(line)

For example, the 1.6.0 behavior would have been:

In [2]: line = "Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"

In [3]: line.split()
Out[3]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]

In 1.7, it becomes:

In [11]: shlex.split(line)
Out[11]: ['Karissa_Hobbs_-_Lets_Go_Fishin.ogg']

(note the missing single-quotes)

A quick fix here is to use shlex.quote:

In [12]: shlex.split(shlex.quote(line))
Out[12]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"]

I don't know if this will be safe across platforms, but it would at least fix the bug I'm seeing.

from pooch.

bmcfee commented on June 26, 2024

I think the use of shlex.quote would still fix #313:

In [2]: shlex.split(shlex.quote("foo' bar baz'.txt"))
Out[2]: ["foo' bar baz'.txt"]

from pooch.

bmcfee commented on June 26, 2024

Ok, I started prepping a PR for this and realized that it's a bit more subtle than just sticking shlex.quote in there, since it will apply to the entire line. If a filename ends with a quote, it will muck up the space parsing, and the whole parser falls apart.

The problem as I see it boils down to some ambiguity in the format of the registry file, in particular due to the option of custom urls as a third token. We don't know in advance how many spaces there will be in a line, and that makes splitting tricky.

If we didn't have custom urls, it would be enough to do:

filename, hash = line.rsplit(' ', maxsplit=2)

where rsplit will group all but the last space separator into filename.

If we always have a custom URL, it would be equally simple:

filename, hash, url = line.rsplit(' ', maxsplit=3)

but this will be incorrect if the url is absent and the filename contains a space.

What we really need is a way to first check if the final token is a hash or a url, and then act accordingly:

# Assuming URLs in registry files are properly URL-encoded, i.e. ' ' → '%20'
prefix, last = line.rsplit(' ', maxsplit=2)
if is_url(last):
    url = last
    filename, hash = prefix.split(' ', maxsplit=2)
else:
    filename, hash, url = prefix, last, None

Now, there's a problem that URLs (per RFC 3986) if not fully formed can look an awful lot like hashes. We can't rely on detecting just a : because this is also valid in hashes (md5:abcdef...).

We also can't get away with detecting :/, or even :// because this would rule out DOI urls.

I suspect the simplest solution would be to explicitly match URLs against a subset of supported schemes (http:, https:, ftp:, sftp:, doi:) and treat anything else as if it's a hash. If it's not actually a hash, this will eventually fail later on.

Does this sound reasonable? If so, I'd be happy to prep a PR implementing the above strategy.

from pooch.

bmcfee commented on June 26, 2024

Thanks @santisoler - I agree with that assessment. I'm totally happy if pooch wants to explicitly adopt POSIX-style syntax for filenames, and I think the confusion only arose because the syntax was implicitly defined (1.7) or under-specified (<1.6). I opened the PR because it wasn't clear if this was intended behavior or a bug to be fixed, but if it's intended behavior we can close it out.

I think what this means for my project in particular is that we need to put out a post-release to upper-bound the supported pooch versions, and then revise our registry file with an updated lower bound.

One question though: how does this formatting work when creating registry files from a pooch object? I don't see any logic for escaping filenames on write (make_registry) in #315 , and from a glance at

pooch/pooch/hashes.py

Lines 218 to 222 in a965902

    
           with open(output, "w") as outfile: 
        
               for fname, fhash in zip(files, hashes): 
        
                   # Only use Unix separators for the registry so that we don't go 
        
                   # insane dealing with file paths. 
        
                   outfile.write("{} {}\n".format(fname.replace("\\", "/"), fhash))

I expect it will not work as expected currently.

from pooch.

bmcfee commented on June 26, 2024

No worries - sounds like a good plan to me!

from pooch.

leouieda commented on June 26, 2024

Alright, closing this one since we have #374 and #369 to handle the issues identified here. Thanks again for reporting @bmcfee!

from pooch.

Pooch 1.7.0 fails when filenames have apostrophes about pooch HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	with open(output, "w") as outfile:
	for fname, fhash in zip(files, hashes):
	# Only use Unix separators for the registry so that we don't go
	# insane dealing with file paths.
	outfile.write("{} {}\n".format(fname.replace("\\", "/"), fhash))