Comments (4)
Thank you @kai-tub for the encouragement and your interest in Squirrel :)
> Here I would like to note that, given the extension, `fsspec` (the default) could also infer the compression by inspecting the filename suffix.
Seems like a good solution to support different compressions without adding complexity to the deserialization logic; a PR for this is very welcome 👍. In the `SquirrelStore`, the filename suffix can then be part of the key; no filename suffix indicates no compression.
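For illustration, a minimal sketch of this suffix-based dispatch (hypothetical helper names; gzip stands in for `fsspec`'s codecs, and `fsspec`'s `compression="infer"` mode works along these lines):

```python
import gzip
import pathlib

# Hypothetical suffix -> opener registry; a missing or unknown suffix
# means the key is read as-is, i.e. compression=None.
SUFFIX_TO_OPENER = {".gz": gzip.open}

def open_inferred(path, mode="rb"):
    """Open `path`, picking the codec from the filename suffix."""
    opener = SUFFIX_TO_OPENER.get(pathlib.Path(path).suffix, open)
    return opener(path, mode)
```

A key like `shard-000.gz` would then be read through gzip, while `shard-000` is read as a plain uncompressed file.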
> In my internal benchmarks, I was able to greatly speed up the data loading by simply using no compression at all (`None`).
I am very interested in the benchmark; it would be great if you could share it.
from squirrel-core.
There is a small issue with the `fsspec` variant: it can infer the compression if the data is compressed, but `compression` would have to be set to `None` if no compression is used.
Plus, I don't know if some users would replace `fsspec` with something different.
Also, the `.keys` method would have to be updated either way to find "relevant" paths.
Currently, we would have to check whether the files inside the `SquirrelStore` have an `fsspec` compression extension; if not, we would set `compression=None`, otherwise we could use `compression='infer'` from `fsspec`.
At this point, I think it would be more straightforward to have a more explicit mapping.
I would like to use a simple dictionary mapping the 'supported' compression algorithms to their extension names, and then use that to select the compression for `fsspec` during compression/decompression.
`fsspec` doesn't do it differently internally. So the only difference would be that we also have an explicit name for `compression=None` and wouldn't rely on `fsspec`'s infer function.
The `compression=None` extension name might be `.tar`, `.none`, or simply no extension (`""`).
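Roughly something like this (a sketch only; the exact dictionary, its entries, and the compression names would be decided in the PR):

```python
# Hypothetical explicit registry: filename suffix -> compression name as
# accepted by fsspec's `compression=` argument; "" makes "no compression"
# an explicit entry instead of a fallback.
SUPPORTED_COMPRESSIONS = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".xz": "lzma",
    "": None,  # no suffix -> no compression
}

def compression_for_key(key: str):
    """Select the compression for a store key from its filename suffix."""
    for suffix, name in SUPPORTED_COMPRESSIONS.items():
        if suffix and key.endswith(suffix):
            return name
    return None  # explicit, instead of relying on fsspec's infer function
```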
We should enforce that the `SquirrelStore` folder only contains the encoded data in a single compression format and infer the used compression automatically.
Then `get_keys` would simply return all elements in the path with a compression extension name (with one extra name for `None`).
I would like to fail loud and hard if there are multiple compression formats inside the `SquirrelStore`, as this would produce undesirable results.
Another reason to only allow one compression format inside a store is that otherwise the `.get_keys`, and therefore the `.get_iter`, method would also have to know which compression format was used, which in my opinion would hurt the user experience.
With a single compression format, the deserializer can always infer it from the key name (remember, the key name now includes the extension name) and a user never has to "know" which compression algorithm was used.
So the `SquirrelStore.get` function would have additional logic that infers the compression from the key name and passes it down to the deserialize function.
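As a rough sketch (hypothetical and heavily simplified; the real store goes through `fsspec` and a serializer, which are elided here):

```python
import os

# Assumed suffix registry for the sketch; "" stands for "no compression".
SUFFIXES = {".gz": "gzip", "": None}

class StoreSketch:
    """Hypothetical simplification of the proposed SquirrelStore behaviour."""

    def __init__(self, url: str):
        self.url = url  # a plain local directory in this sketch

    def get_keys(self):
        # Only paths whose suffix names a supported compression are "relevant".
        for name in os.listdir(self.url):
            if os.path.splitext(name)[1] in SUFFIXES:
                yield name

    def get(self, key: str, deserialize):
        # Infer the compression from the key name and pass it down.
        compression = SUFFIXES[os.path.splitext(key)[1]]
        with open(os.path.join(self.url, key), "rb") as f:
            return deserialize(f.read(), compression=compression)
```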
I know this is a lot to read and reason about.
Maybe it would be easier if I start the PR draft and you can review my "suggested" solution directly as code?
> Plus, I don't know if some users would replace `fsspec` with something different.
The filesystem is an internal responsibility of `SquirrelStore`; this is signaled by accepting a `url` rather than an instance of a filesystem. So if there is a need for something other than `fsspec`, either a new kind of `Store` should be implemented or it should be handled internally by `SquirrelStore`. This is not an issue for now.
> Currently, we would have to check if the files inside the `SquirrelStore` have an `fsspec` compression extension; if not, we would set `compression=None`, otherwise we could use `compression='infer'` from `fsspec`.
This seems reasonable to me. This logic can be entirely contained inside `SquirrelStore`.
> At this point, I think it would be more straightforward to have a more explicit mapping.
It's good that it's more explicit, but on the other hand, if `fsspec` adds a new compression, we have to update this mapping to support it. I have a slight preference for letting `fsspec` handle it (except for the empty `""` filename suffix, as you mentioned). Also, remember that instantiating a filesystem with different protocols (e.g. `s3://`, `gs://`, etc.) will create different objects, which might have different support for compressions.
> We should enforce that the `SquirrelStore` folder only contains the encoded data in a single compression format and infer the used compression automatically.
To enforce that, we can't just fail at read time; we need to fail when the write happens. This is tricky, because the store must be parallelizable, and we would need a synchronization strategy to achieve this, which is hard and unreliable.
> Another reason to only allow one compression format inside a store is that otherwise the `.get_keys` and therefore the `.get_iter` method would also have to know which compression format was used.
Not necessarily. If the compression is expressed as a filename suffix, `SquirrelStore` can automatically handle it while reading, and it can accept optional kwargs when writing (defaulting to `.gz`).
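For example, a minimal sketch of that read/write contract (hypothetical function names; gzip stands in for the default codec):

```python
import gzip
import os

def write_item(root: str, key: str, payload: bytes, suffix: str = ".gz"):
    """Write `payload` under `key`, appending the compression suffix (default .gz)."""
    opener = gzip.open if suffix == ".gz" else open
    with opener(os.path.join(root, key + suffix), "wb") as f:
        f.write(payload)

def read_item(root: str, name: str) -> bytes:
    """Read an item back, dispatching on whatever suffix the stored name has."""
    opener = gzip.open if name.endswith(".gz") else open
    with opener(os.path.join(root, name), "rb") as f:
        return f.read()
```

Because the reader dispatches per name, a store that mixes `.gz` and uncompressed items still reads back correctly.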
> With a single compression format, the deserializer can always infer it given the key name (remember key name includes extension name now) and continue to work with it and a user never has to "know" which compression algorithm was used.
> So the `SquirrelStore.get` function would have additional logic that would infer the compression from the key name and pass that down to the deserialize function.
Exactly, and the very same can happen even if there are multiple filename suffixes in the store.
> Maybe it would be easier if I start the PR draft
Yes it would be great :)
This issue is marked as stale as it has been inactive for 30 days. It will be closed in 7 days.