

zarr3-rs

Prototype implementation of zarr v3 in rust.

Based heavily on an earlier prototype at sci-rs/zarr.

Usage

See examples/roundtrip.rs for an example (and cargo run --example roundtrip to run it).


zarr3-rs's Issues

Error handling

Currently a mess of unwrap and Result<_, &'static str>.

This will mean changing a number of signatures, and probably revisiting the endian codec's handling: using the fallible write_whatever::<BigEndian> rather than the panicking BigEndian::write_whatever, which wasn't working before.
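For reference, a minimal sketch of the two APIs in question (these are byteorder's, with write_u32 standing in for write_whatever):

```rust
use byteorder::{BigEndian, ByteOrder, WriteBytesExt};

fn main() -> std::io::Result<()> {
    let mut out: Vec<u8> = vec![];
    // fallible: WriteBytesExt returns io::Result, so errors propagate with `?`
    out.write_u32::<BigEndian>(0xDEADBEEF)?;

    let mut buf = [0u8; 4];
    // panicking: ByteOrder::write_u32 panics if the slice is too short
    BigEndian::write_u32(&mut buf, 0xDEADBEEF);
    assert_eq!(out, buf);
    Ok(())
}
```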

AA codecs data type handling

  • data type changing within the codec, because some codecs might want that (?!)
    • this effectively means dynamic data types
  • not requiring a rust type annotation to decode (ideally use a DataType arg; see the sketch below)
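A minimal sketch of what DataType-driven decoding could look like; every name here (DataType, DynArray, decode) is hypothetical:

```rust
use ndarray::ArrayD;

// runtime tag for the array's data type
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Bool,
    UInt8,
    Float32,
    // ...
}

// type-erased decode result, so callers don't need a Rust type annotation
enum DynArray {
    Bool(ArrayD<bool>),
    UInt8(ArrayD<u8>),
    Float32(ArrayD<f32>),
}

fn decode(bytes: &[u8], dtype: DataType, shape: &[usize]) -> DynArray {
    match dtype {
        DataType::UInt8 => DynArray::UInt8(
            ArrayD::from_shape_vec(shape.to_vec(), bytes.to_vec()).unwrap(),
        ),
        // other dtypes parse `bytes` according to their element size/endianness
        _ => unimplemented!(),
    }
}
```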

Core data types

  • bool
  • uints
  • ints
  • floats
    • f16
  • complex
  • raw
    • up to 128-bit; trivial to add more (type mapping sketched below)
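A sketch of the Rust-side mapping for these, using the ReflectedType/ZARR_TYPE names from this README (details assumed; f16 would come from the half crate, complex from num-complex):

```rust
// hypothetical DataType tag, as in the previous sketch
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType { Bool, UInt8, Int64, Float32, Complex64, Raw128 /* ... */ }

trait ReflectedType: Sized {
    const ZARR_TYPE: DataType;
}

impl ReflectedType for bool { const ZARR_TYPE: DataType = DataType::Bool; }
impl ReflectedType for u8   { const ZARR_TYPE: DataType = DataType::UInt8; }
impl ReflectedType for i64  { const ZARR_TYPE: DataType = DataType::Int64; }
impl ReflectedType for f32  { const ZARR_TYPE: DataType = DataType::Float32; }
// f16: impl for half::f16; complex: impl for num_complex::Complex32/64
// raw bits map naturally onto fixed-size byte arrays:
impl ReflectedType for [u8; 16] { const ZARR_TYPE: DataType = DataType::Raw128; }
```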

Improve generic/ReflectedType handling

Currently, ArrayMetadata cannot be generic over the data type because it needs to deserialise the fill_value based on the data_type. The ArrayMetadataBuilder and the Array are both generic, so the concrete metadata is a weird in-between. This means the array knows what kind of values it needs but can't communicate that to the type system: the best we can do is be generic over T: ReflectedType and then do a runtime check like if T::ZARR_TYPE != self.data_type { panic!("oops") }.
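To illustrate why the concrete metadata can't be generic: fill_value can only be interpreted once data_type has been read, so it has to be deserialised as a raw JSON value first. A sketch (field names follow the zarr v3 metadata document; the helper is hypothetical):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct ArrayMetadata {
    data_type: String,
    // can't be parsed into a concrete type until data_type is known
    fill_value: serde_json::Value,
    // ...other fields elided
}

impl ArrayMetadata {
    // interpret fill_value once the caller has resolved a concrete T
    fn typed_fill_value<T: serde::de::DeserializeOwned>(&self) -> serde_json::Result<T> {
        serde_json::from_value(self.fill_value.clone())
    }
}
```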

Allocations, checksums, and Readers/Writers

Preferably, codecs involving bytes (BB, AB) would use Readers/Writers instead of allocating new byte buffers at every stage; this is how they were originally implemented (a2757db). However, with the introduction of the CRC32C codec (and, conceptually, other hashsums), decoders may need to consume the whole read stream in order to strip the hash. This is not too much of a problem: the decoder can read the whole stream, then present a boxed Reader wrapping the read bytes. Encoders, however, may need to know when the write stream is finished in order to append the hash. We could:

  1. abuse Drop to write the hash, but drops aren't guaranteed to run, and writing the hash is fallible while Drop is not
  2. abuse flush() by assuming it will only be called at the end of the write, but that's a big assumption
  3. wrap the writers into something with a .finalize() method which is a no-op for most writers: this could require different wrappers for different codecs, and we'd still need to pass around something like Box<CustomWriter<Box<dyn Write>>>
  4. do what every other implementation does: skip lazy IO and just allocate new byte buffers at every codec. Pass around Bytes objects; for decoding we might get lucky and be able to use zero-copy slices of them (sketched below).

Option 4 is the simplest and brings us closest in line with other implementations. Any allocation savings would probably be dwarfed by IO and encode/decode overhead anyway.
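A rough sketch of option 4 for the checksum case, using the bytes and crc32c crates (the BBCodec shape here is a guess; the zarr v3 crc32c codec appends the checksum as a little-endian u32):

```rust
use bytes::{BufMut, Bytes, BytesMut};

trait BBCodec {
    fn encode(&self, input: Bytes) -> Bytes;
    fn decode(&self, input: Bytes) -> Bytes;
}

struct Crc32cCodec;

impl BBCodec for Crc32cCodec {
    fn encode(&self, input: Bytes) -> Bytes {
        let sum = crc32c::crc32c(&input);
        let mut out = BytesMut::with_capacity(input.len() + 4);
        out.extend_from_slice(&input);
        out.put_u32_le(sum); // checksum appended little-endian
        out.freeze()
    }

    fn decode(&self, input: Bytes) -> Bytes {
        let (payload, tail) = input.split_at(input.len() - 4);
        let expected = u32::from_le_bytes(tail.try_into().unwrap());
        // error handling elided; see the CodecError issue below
        assert_eq!(crc32c::crc32c(payload), expected, "checksum mismatch");
        // decoding "gets lucky": slicing a Bytes is zero-copy
        input.slice(..input.len() - 4)
    }
}
```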

Codec refactor: partial enc/de, reduce allocations, assoc. types

Associated types

Codecs which currently return boxed readers and writers could use an associated type on the trait instead. Boxed dyn trait objects would probably still be needed for e.g. impl ABCodec for CodecChain and impl BBCodec for &[BBCodec], but it could reduce the number of required boxes. Boxed objects may also still be necessary for partial IO.
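A sketch of the associated-type approach using a generic associated type (stable since Rust 1.65), with flate2's GzDecoder standing in for a real BB codec. Note that with a GAT the trait is no longer object-safe, which is where the remaining boxes come in:

```rust
use std::io::Read;

trait BBCodec {
    // generic associated type: the concrete decoder for any inner reader
    type Decoder<R: Read>: Read;

    fn decoder<R: Read>(&self, inner: R) -> Self::Decoder<R>;
}

struct GzipCodec;

impl BBCodec for GzipCodec {
    type Decoder<R: Read> = flate2::read::GzDecoder<R>;

    fn decoder<R: Read>(&self, inner: R) -> Self::Decoder<R> {
        flate2::read::GzDecoder::new(inner)
    }
}
```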

(en|de)code_into

Where arrays need to be passed around, the output array could also be given (as a mutable view). This view would itself be the decoded representation, reducing allocations/clones.
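A sketch of the decode_into shape for an AA codec, using ndarray (the trait and the toy codec are illustrative only):

```rust
use ndarray::{ArrayD, ArrayViewMutD};

trait AACodec {
    // caller allocates (or slices) the output; the codec fills it in place
    fn decode_into<T: Clone>(&self, encoded: ArrayD<T>, out: ArrayViewMutD<'_, T>);
}

// toy AA codec: reverse the axis order (a transpose-like operation)
struct ReverseAxes;

impl AACodec for ReverseAxes {
    fn decode_into<T: Clone>(&self, encoded: ArrayD<T>, mut out: ArrayViewMutD<'_, T>) {
        // shapes must already agree; `assign` clones elements into the view
        out.assign(&encoded.t());
    }
}
```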

Partial (en|de)code

We have partial IO at the store level; partial IO at the codec level is necessary to make sharding worthwhile.
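One possible shape for a codec-level partial decode, purely illustrative:

```rust
use std::io::{Read, Result, Seek};
use std::ops::Range;

trait ABCodecPartial {
    // decode only the elements in `region`, given seekable access to the
    // encoded chunk; most codecs would fall back to a full decode
    fn partial_decode<R: Read + Seek>(
        &self,
        encoded: R,
        region: &[Range<u64>], // one element range per dimension
    ) -> Result<Vec<u8>>;
}
```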

Concurrent IO

A lot necessarily happens at the same time in this context: many chunks may be fetched, decoded, encoded, or written at once. I'd lean towards async-std if using async/await, as it seems more library-friendly, but the lifetime-wrangling involved makes me nervous.

The minimal version would probably be to make the chunk IO methods on Array async but leave the codecs synchronous, then allow region reads/writes to simply join over the internal futures (possibly blocking on that join so the function itself doesn't need to be async).
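A sketch of that minimal version; Array, read_chunk, and CodecError are stand-ins for this crate's types:

```rust
use futures::future::try_join_all;

struct Array;

#[derive(Debug)]
struct CodecError;

impl Array {
    async fn read_chunk(&self, idx: &[u64]) -> Result<Vec<u8>, CodecError> {
        // fetch bytes from the store, then run the (synchronous) codec chain
        todo!()
    }

    async fn read_region(&self, chunk_idxs: &[Vec<u64>]) -> Result<Vec<Vec<u8>>, CodecError> {
        // issue all chunk reads concurrently and await them together
        try_join_all(chunk_idxs.iter().map(|idx| self.read_chunk(idx))).await
    }
}
```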

Stupid sharding implementation

Partial reads are going to be a pain, but it would be good to have complete reads/writes for the current sharding spec available with well-documented tradeoffs.
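For the complete-read case, a sketch of slicing inner chunks out of a shard, assuming the spec's layout of an index of little-endian (offset, nbytes) u64 pairs at the end of the shard; the index's own codecs/checksum and the empty-chunk sentinel (u64::MAX) are ignored here:

```rust
fn read_shard(shard: &[u8], n_chunks: usize) -> Vec<&[u8]> {
    // index sits at the end: two u64s per inner chunk
    let index = &shard[shard.len() - n_chunks * 16..];
    index
        .chunks_exact(16)
        .map(|pair| {
            let offset = u64::from_le_bytes(pair[..8].try_into().unwrap()) as usize;
            let nbytes = u64::from_le_bytes(pair[8..].try_into().unwrap()) as usize;
            &shard[offset..offset + nbytes]
        })
        .collect()
}
```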

Error propagation through codecs

Probably needs a CodecError enum which contains an "Other" variant for potential future codec-specific errors. This could roll io::Error into it.
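A sketch of such an enum, here using the thiserror crate:

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum CodecError {
    // roll io::Error in, converting automatically via `?`
    #[error(transparent)]
    Io(#[from] std::io::Error),
    #[error("checksum mismatch: expected {expected:#010x}, got {actual:#010x}")]
    ChecksumMismatch { expected: u32, actual: u32 },
    // escape hatch for codec-specific errors defined later
    #[error("codec error: {0}")]
    Other(Box<dyn std::error::Error + Send + Sync>),
}
```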

Rebase on object_store

https://crates.io/crates/object_store

Pros

  • Offload complexity of different backends, gaining any new implementations for free
  • More consistent (and probably better thought out) interface
  • Async-first, which we should be moving towards anyway

Cons

  • Probably still need wrapper types so that we can implement backends they don't have
  • May not fit perfectly with our data model
  • WASM doesn't support cloud stores due to reqwest: apache/arrow-rs#4776
  • We use traits a lot, and async methods on traits aren't stable yet, though they may be as of Rust 1.75 (2023-12-28)
  • get returns an async stream while get_range{s} return Bytes; both would need wrapping over
  • doesn't allow suffix ranges yet: apache/arrow-rs#4611

With any move to async, BB codecs in particular would need major rewrites, and would make use of https://crates.io/crates/async-compression for compression.
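Roughly what chunk IO looks like on top of object_store (a minimal sketch; exact signatures have shifted between object_store versions, e.g. put's payload type):

```rust
use object_store::{memory::InMemory, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store = InMemory::new();
    let key = Path::from("group/array/c/0/0");

    // whole-chunk write
    store.put(&key, vec![0u8; 64].into()).await?;

    // whole-chunk read: a GetResult, collected here into one buffer
    let whole = store.get(&key).await?.bytes().await?;
    // partial read: a byte range, returned directly as Bytes
    let part = store.get_range(&key, 0..16).await?;
    assert_eq!(&whole[..16], &part[..]);
    Ok(())
}
```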
