Comments (4)
Thank you @kai-tub for the encouragement and your interest in Squirrel :)
> Here I would like to note that, given the extension, `fsspec` (the default) could also infer the compression by inspecting the filename suffix.
Seems like a good solution to support different compressions without adding complexity to the deserialization logic; a PR for this is very welcome 👍. In the `SquirrelStore`, the filename suffix can then be part of the key; no filename suffix indicates no compression.
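For illustration, a minimal sketch of this suffix-based dispatch (hypothetical helper names; gzip stands in for `fsspec`'s codecs, and `fsspec`'s `compression="infer"` mode works along these lines):

```python
import gzip
import pathlib

# Hypothetical suffix -> opener registry; a missing or unknown suffix
# means the key is read as-is, i.e. compression=None.
SUFFIX_TO_OPENER = {".gz": gzip.open}

def open_inferred(path, mode="rb"):
    """Open `path`, picking the codec from the filename suffix."""
    opener = SUFFIX_TO_OPENER.get(pathlib.Path(path).suffix, open)
    return opener(path, mode)
```

A key like `shard-000.gz` would then be read through gzip, while `shard-000` is read as a plain uncompressed file.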
> In my internal benchmarks, I was able to greatly speed up the data loading by simply using no compression at all (`None`).
I am very interested in the benchmark; it would be great if you could share it.
from squirrel-core.
There is a small issue with the `fsspec` variant: it can infer the compression if the data is compressed, but `compression` would have to be set to `None` if no compression is used.
Plus, I don't know if some users would replace `fsspec` with something different.
Also, the `.keys` method would have to be updated either way to find "relevant" paths.
Currently, we would have to check whether the files inside the `SquirrelStore` have an `fsspec` compression extension; if not, we would set `compression=None`, otherwise we could use `compression='infer'` from `fsspec`.
At this point, I think it would be more straightforward to have a more explicit mapping.
I would like to use a simple dictionary mapping the 'supported' compression algorithms to their extension names, and then use that to select the compression for `fsspec` during compression/decompression.
`fsspec` doesn't do it differently internally. So the only difference would be that we also have an explicit name for `compression=None` and wouldn't rely on `fsspec`'s infer function.
The `compression=None` extension name might be `.tar`, `.none`, or simply no extension (`""`).
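Roughly something like this (a sketch only; the exact dictionary, its entries, and the compression names would be decided in the PR):

```python
# Hypothetical explicit registry: filename suffix -> compression name as
# accepted by fsspec's `compression=` argument; "" makes "no compression"
# an explicit entry instead of a fallback.
SUPPORTED_COMPRESSIONS = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".xz": "lzma",
    "": None,  # no suffix -> no compression
}

def compression_for_key(key: str):
    """Select the compression for a store key from its filename suffix."""
    for suffix, name in SUPPORTED_COMPRESSIONS.items():
        if suffix and key.endswith(suffix):
            return name
    return None  # explicit, instead of relying on fsspec's infer function
```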
We should enforce that the `SquirrelStore` folder only contains the encoded data in a single compression format and infer the used compression automatically.
Then `get_keys` would simply return all elements in the path with a compression extension name (with one extra name for `None`).
I would like to fail loud and hard if there are multiple compression formats inside the `SquirrelStore`, as this would produce undesirable results.
Another reason to only allow one compression format inside a store is that otherwise the `.get_keys`, and therefore the `.get_iter`, method would also have to know which compression format was used, which in my opinion would hurt the user experience.
With a single compression format, the deserializer can always infer it from the key name (remember, the key name now includes the extension name) and a user never has to "know" which compression algorithm was used.
So the `SquirrelStore.get` function would have additional logic that infers the compression from the key name and passes it down to the deserialize function.
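As a rough sketch (hypothetical and heavily simplified; the real store goes through `fsspec` and a serializer, which are elided here):

```python
import os

# Assumed suffix registry for the sketch; "" stands for "no compression".
SUFFIXES = {".gz": "gzip", "": None}

class StoreSketch:
    """Hypothetical simplification of the proposed SquirrelStore behaviour."""

    def __init__(self, url: str):
        self.url = url  # a plain local directory in this sketch

    def get_keys(self):
        # Only paths whose suffix names a supported compression are "relevant".
        for name in os.listdir(self.url):
            if os.path.splitext(name)[1] in SUFFIXES:
                yield name

    def get(self, key: str, deserialize):
        # Infer the compression from the key name and pass it down.
        compression = SUFFIXES[os.path.splitext(key)[1]]
        with open(os.path.join(self.url, key), "rb") as f:
            return deserialize(f.read(), compression=compression)
```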
I know this is a lot to read and reason about.
Maybe it would be easier if I start the PR draft and you can review my "suggested" solution directly as code?
> Plus, I don't know if some users would replace `fsspec` with something different.
The filesystem is an internal responsibility of `SquirrelStore`; this is signaled by accepting a `url` rather than an instance of a filesystem. So if there is a need for something other than `fsspec`, either a new kind of `Store` should be implemented or it should be handled internally by `SquirrelStore`. This is not an issue for now.
> Currently, we would have to check if the files inside the `SquirrelStore` have an `fsspec` compression extension; if not, we would set `compression=None`, otherwise we could use `compression='infer'` from `fsspec`.
This seems reasonable to me. This logic can be entirely contained inside `SquirrelStore`.
> At this point, I think it would be more straightforward to have a more explicit mapping.
It's good that it's more explicit, but on the other hand, if `fsspec` adds a new compression, we have to update this mapping to support it. I have a slight preference for letting `fsspec` handle it (except for the empty `""` filename suffix, as you mentioned). Also, remember that instantiating a filesystem with different protocols (e.g. `s3://`, `gs://`, etc.) will create different objects, which might have different support for compressions.
> We should enforce that the `SquirrelStore` folder only contains the encoded data in a single compression format and infer the used compression automatically.
To enforce that, we can't just fail at read time; we need to fail when the write happens. This is tricky, because the store must be parallelizable, and we would need a synchronization strategy to achieve this, which is hard and unreliable.
> Another reason to only allow one compression format inside a store is that otherwise the `.get_keys` and therefore the `.get_iter` method would also have to know which compression format was used.
Not necessarily. If the compression is expressed as a filename suffix, `SquirrelStore` can automatically handle it while reading, and it can accept optional kwargs when writing (defaulting to `.gz`).
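For example, a minimal sketch of that read/write contract (hypothetical function names; gzip stands in for the default codec):

```python
import gzip
import os

def write_item(root: str, key: str, payload: bytes, suffix: str = ".gz"):
    """Write `payload` under `key`, appending the compression suffix (default .gz)."""
    opener = gzip.open if suffix == ".gz" else open
    with opener(os.path.join(root, key + suffix), "wb") as f:
        f.write(payload)

def read_item(root: str, name: str) -> bytes:
    """Read an item back, dispatching on whatever suffix the stored name has."""
    opener = gzip.open if name.endswith(".gz") else open
    with opener(os.path.join(root, name), "rb") as f:
        return f.read()
```

Because the reader dispatches per name, a store that mixes `.gz` and uncompressed items still reads back correctly.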
> With a single compression format, the deserializer can always infer it given the key name (remember key name includes extension name now) and continue to work with it and a user never has to "know" which compression algorithm was used.
> So the `SquirrelStore.get` function would have additional logic that would infer the compression from the key name and pass that down to the deserialize function.
Exactly, and the very same can happen even if there are multiple filename suffixes in the store.
> Maybe it would be easier if I start the PR draft
Yes it would be great :)
This issue is marked as stale as it has been inactive for 30 days. It will be closed in 7 days.