Code Monkey home page Code Monkey logo

Comments (4)

AlirezaSohofi avatar AlirezaSohofi commented on June 7, 2024

Thank you @kai-tub for the encouragement and your interest in Squirrel :)

Here I would like to note that given the extension, fsspec (default) could also infer the compression by inspecting the filename suffix.

Seems like a good solution to support different compressions without adding complexity to the deserialization logic, a PR for this is very welcome 👍 . In the SquirrelStore, the filename suffix can then be part of the key. It can indicate no compression if there is no filename suffix.

In my internal benchmarks, I was able to greatly speed up the data loading by simply using no compression at all (None)

I am very interested in the benchmark, would be great if you can share it.

from squirrel-core.

kai-tub avatar kai-tub commented on June 7, 2024

There is a small issue with the fsspec variant. It can infer the compression if the data is compressed, but would have to be set to None if no compression is used.
Plus, I don't know if some users would replace fsspec with something different.
Also, the .keys method would have to be updated either way to find "relevant" paths.

Currently, we would have to check if the files inside the SquirrelStore have an fsspec compression extension, if not then we would set compression=None, otherwise, we could use compression='infer' from fsspec.

At this point, I think it would be more straightforward to have a more explicit mapping.
I would like to use a simple dictionary with the 'supported' compression algorithms + extension name and then use that to select the compression for fsspec during the compression/decompression.
fsspec doesn't do it differently internally:

https://github.com/fsspec/filesystem_spec/blob/9ecef92101194ab4eece8013ce0e3f419b0ccd86/fsspec/utils.py#L114-L126

So the only difference would be that we also have an explicit name for compression=None and wouldn't rely on fsspec's infer function.
The compression=None extension name might be .tar, .none, or simply no extension "".

We should enforce that the SquirrelStore folder only contains the encoded data in a single compression format and infer the used compression automatically.
Then get_keys would simply return all elements in the path with a compression extension name (with one extra name for None).
I would like to fail loud and hard if there are multiple compression formats inside of the SquirrelStore, as this would produce undesirable results.
Another reason to only allow one compression format inside a store is that otherwise the .get_keys and therefore the .get_iter method would also have to know which compression format was used.
Which would have a negative impact on the user experience in my opinion.

With a single compression format, the deserializer can always infer it given the key name (remember key name includes extension name now) and continue to work with it and a user never has to "know" which compression algorithm was used.
So the SquirrelStore.get function would have additional logic that would infer the compression from the key name and pass that down to the deserialize function.

I know this is a lot to read and reason about.
Maybe it would be easier if I start the PR draft and you can review my "suggested" solution directly as code?

from squirrel-core.

AlirezaSohofi avatar AlirezaSohofi commented on June 7, 2024

Plus, I don't know if some users would replace fsspec with something different.

the filesystem is an internal responsibility of SquirrelStore, this is signaled by accepting url rather than an instance of a filesystem, so if there is a need for something other than fsspec, either a new kind of Store should be implemented or it should be internally handled by SquirrelStore. This is not an issue for now.

Currently, we would have to check if the files inside the SquirrelStore have an fsspec compression extension, if not then we would set compression=None, otherwise, we could use compression='infer' from fsspec.

this seems reasonable to me. This logic can be entirely contained inside SquirrelStore

At this point, I think it would be more straightforward to have a more explicit mapping.

It's good that it's more explicit, but on the other hand if fsspec adds a new compression, we have to update this mapping to support it. I have a slight preference for letting fsspec handle it (except the empty "" filename suffix as you mentioned). Also, remember that instantiating a filesystem with different protocols (e.g. s3://, gs://, etc) with create different objects which might have different support for compressions.

We should enforce that the SquirrelStore folder only contains the encoded data in a single compression format and infer the used compression automatically.

For enforcing that, we can't just fail at the read time, we need to fail when write is happening. This is tricky, because store must be parallelizable, and we need a synchronization strategy to achieve this which is hard and unreliable.

Another reason to only allow one compression format inside a store is that otherwise the .get_keys and therefore the .get_iter method would also have to know which compression format was used.

Not necessarily. If compression is expressed as filename suffix, SquirrelStore can automatically handle it while reading, and can accept optional kwargs when writing (default to .gz).

With a single compression format, the deserializer can always infer it given the key name (remember key name includes extension name now) and continue to work with it and a user never has to "know" which compression algorithm was used.
So the SquirrelStore.get function would have additional logic that would infer the compression from the key name and pass that down to the deserialize function.

Exactly, the very same can happen if there are multiple filename suffix in the store.

Maybe it would be easier if I start the PR draft

Yes it would be great :)

from squirrel-core.

github-actions avatar github-actions commented on June 7, 2024

This is issue is marked as stale as it has been inactive for 30 days. It will be closed in 7 days.

from squirrel-core.

Related Issues (11)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.