

zarr3-rs

Prototype implementation of zarr v3 in rust.

Based heavily on an earlier prototype at sci-rs/zarr.

Usage

See examples/roundtrip.rs for an example (and cargo run --example roundtrip to run it).


zarr3-rs's Issues

Error handling

Currently a mess of unwrap and Result<_, &'static str>.

This will mean changing a number of signatures, and probably revisiting the endian codec's handling: using the fallible write_whatever::<BigEndian> rather than the panicking BigEndian::write_whatever, which wasn't working before.
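For reference, a minimal sketch of the two APIs in question (these are byteorder's, with write_u32 standing in for write_whatever):

```rust
use byteorder::{BigEndian, ByteOrder, WriteBytesExt};

fn main() -> std::io::Result<()> {
    let mut out: Vec<u8> = vec![];
    // fallible: WriteBytesExt returns io::Result, so errors propagate with `?`
    out.write_u32::<BigEndian>(0xDEADBEEF)?;

    let mut buf = [0u8; 4];
    // panicking: ByteOrder::write_u32 panics if the slice is too short
    BigEndian::write_u32(&mut buf, 0xDEADBEEF);
    assert_eq!(out, buf);
    Ok(())
}
```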

AA codecs data type handling

  • data type changing within the codec, because some codecs might want that (?!)
    • this effectively means dynamic data types
  • not requiring a rust type annotation to decode (ideally use a DataType arg; see the sketch below)
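A minimal sketch of what DataType-driven decoding could look like; every name here (DataType, DynArray, decode) is hypothetical:

```rust
use ndarray::ArrayD;

// runtime tag for the array's data type
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Bool,
    UInt8,
    Float32,
    // ...
}

// type-erased decode result, so callers don't need a Rust type annotation
enum DynArray {
    Bool(ArrayD<bool>),
    UInt8(ArrayD<u8>),
    Float32(ArrayD<f32>),
}

fn decode(bytes: &[u8], dtype: DataType, shape: &[usize]) -> DynArray {
    match dtype {
        DataType::UInt8 => DynArray::UInt8(
            ArrayD::from_shape_vec(shape.to_vec(), bytes.to_vec()).unwrap(),
        ),
        // other dtypes parse `bytes` according to their element size/endianness
        _ => unimplemented!(),
    }
}
```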

Core data types

  • bool
  • uints
  • ints
  • floats
    • f16
  • complex
  • raw
    • up to 128-bit; trivial to add more (type mapping sketched below)
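A sketch of the Rust-side mapping for these, using the ReflectedType/ZARR_TYPE names from this README (details assumed; f16 would come from the half crate, complex from num-complex):

```rust
// hypothetical DataType tag, as in the previous sketch
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType { Bool, UInt8, Int64, Float32, Complex64, Raw128 /* ... */ }

trait ReflectedType: Sized {
    const ZARR_TYPE: DataType;
}

impl ReflectedType for bool { const ZARR_TYPE: DataType = DataType::Bool; }
impl ReflectedType for u8   { const ZARR_TYPE: DataType = DataType::UInt8; }
impl ReflectedType for i64  { const ZARR_TYPE: DataType = DataType::Int64; }
impl ReflectedType for f32  { const ZARR_TYPE: DataType = DataType::Float32; }
// f16: impl for half::f16; complex: impl for num_complex::Complex32/64
// raw bits map naturally onto fixed-size byte arrays:
impl ReflectedType for [u8; 16] { const ZARR_TYPE: DataType = DataType::Raw128; }
```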

Improve generic/ReflectedType handling

Currently, ArrayMetadata cannot be generic over the data type because it needs to deserialise the fill_value based on the data_type. The ArrayMetadataBuilder and the Array are both generic, so the concrete metadata is a weird in-between. This means the array knows what kind of values it needs but can't communicate that to the type system: the best we can do is be generic over T: ReflectedType and then do a runtime check like if T::ZARR_TYPE != self.data_type { panic!("oops") }.
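To illustrate why the concrete metadata can't be generic: fill_value can only be interpreted once data_type has been read, so it has to be deserialised as a raw JSON value first. A sketch (field names follow the zarr v3 metadata document; the helper is hypothetical):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct ArrayMetadata {
    data_type: String,
    // can't be parsed into a concrete type until data_type is known
    fill_value: serde_json::Value,
    // ...other fields elided
}

impl ArrayMetadata {
    // interpret fill_value once the caller has resolved a concrete T
    fn typed_fill_value<T: serde::de::DeserializeOwned>(&self) -> serde_json::Result<T> {
        serde_json::from_value(self.fill_value.clone())
    }
}
```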

Allocations, checksums, and Readers/Writers

Preferably, codecs involving bytes (BB, AB) would use Readers/Writers instead of allocating new byte buffers at every stage; this is how they were originally implemented (a2757db). However, with the introduction of the CRC32C codec (and, conceptually, other hashsums), decoders may need to consume the whole read stream in order to strip the hash. This is not too much of a problem: the decoder can read the whole stream, then present a boxed Reader wrapping the read bytes. Encoders, however, may need to know when the write stream is finished in order to append the hash. We could:

  1. abuse Drop to write the hash, but drops aren't guaranteed to run, and writing the hash is fallible while Drop is not
  2. abuse flush() by assuming it will only be called at the end of the write, but that's a big assumption
  3. wrap the writers into something with a .finalize() method which is a no-op for most writers: this could require different wrappers for different codecs, and we'd still need to pass around something like Box<CustomWriter<Box<dyn Write>>>
  4. do what every other implementation does: skip lazy IO and just allocate new byte buffers at every codec. Pass around Bytes objects; for decoding we might get lucky and be able to use zero-copy slices of them (sketched below).

Option 4 is the simplest and brings us closest in line with other implementations. Any allocation savings would probably be dwarfed by IO and encode/decode overhead anyway.
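A rough sketch of option 4 for the checksum case, using the bytes and crc32c crates (the BBCodec shape here is a guess; the zarr v3 crc32c codec appends the checksum as a little-endian u32):

```rust
use bytes::{BufMut, Bytes, BytesMut};

trait BBCodec {
    fn encode(&self, input: Bytes) -> Bytes;
    fn decode(&self, input: Bytes) -> Bytes;
}

struct Crc32cCodec;

impl BBCodec for Crc32cCodec {
    fn encode(&self, input: Bytes) -> Bytes {
        let sum = crc32c::crc32c(&input);
        let mut out = BytesMut::with_capacity(input.len() + 4);
        out.extend_from_slice(&input);
        out.put_u32_le(sum); // checksum appended little-endian
        out.freeze()
    }

    fn decode(&self, input: Bytes) -> Bytes {
        let (payload, tail) = input.split_at(input.len() - 4);
        let expected = u32::from_le_bytes(tail.try_into().unwrap());
        // error handling elided; see the CodecError issue below
        assert_eq!(crc32c::crc32c(payload), expected, "checksum mismatch");
        // decoding "gets lucky": slicing a Bytes is zero-copy
        input.slice(..input.len() - 4)
    }
}
```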

Codec refactor: partial enc/de, reduce allocations, assoc. types

Associated types

Codecs which currently return boxed readers and writers could use an associated type on the trait instead. Boxed dyn trait objects would probably still be needed for e.g. impl ABCodec for CodecChain and impl BBCodec for &[BBCodec], but it could reduce the number of required boxes. Boxed objects may also still be necessary for partial IO.
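A sketch of the associated-type approach using a generic associated type (stable since Rust 1.65), with flate2's GzDecoder standing in for a real BB codec. Note that with a GAT the trait is no longer object-safe, which is where the remaining boxes come in:

```rust
use std::io::Read;

trait BBCodec {
    // generic associated type: the concrete decoder for any inner reader
    type Decoder<R: Read>: Read;

    fn decoder<R: Read>(&self, inner: R) -> Self::Decoder<R>;
}

struct GzipCodec;

impl BBCodec for GzipCodec {
    type Decoder<R: Read> = flate2::read::GzDecoder<R>;

    fn decoder<R: Read>(&self, inner: R) -> Self::Decoder<R> {
        flate2::read::GzDecoder::new(inner)
    }
}
```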

(en|de)code_into

Where arrays need to be passed around, the output array could also be given (as a mutable view). This view would itself be the decoded representation, reducing allocations/clones.
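A sketch of the decode_into shape for an AA codec, using ndarray (the trait and the toy codec are illustrative only):

```rust
use ndarray::{ArrayD, ArrayViewMutD};

trait AACodec {
    // caller allocates (or slices) the output; the codec fills it in place
    fn decode_into<T: Clone>(&self, encoded: ArrayD<T>, out: ArrayViewMutD<'_, T>);
}

// toy AA codec: reverse the axis order (a transpose-like operation)
struct ReverseAxes;

impl AACodec for ReverseAxes {
    fn decode_into<T: Clone>(&self, encoded: ArrayD<T>, mut out: ArrayViewMutD<'_, T>) {
        // shapes must already agree; `assign` clones elements into the view
        out.assign(&encoded.t());
    }
}
```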

Partial (en|de)code

We have partial IO at the store level; partial IO at the codec level is necessary to make sharding worthwhile.
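One possible shape for a codec-level partial decode, purely illustrative:

```rust
use std::io::{Read, Result, Seek};
use std::ops::Range;

trait ABCodecPartial {
    // decode only the elements in `region`, given seekable access to the
    // encoded chunk; most codecs would fall back to a full decode
    fn partial_decode<R: Read + Seek>(
        &self,
        encoded: R,
        region: &[Range<u64>], // one element range per dimension
    ) -> Result<Vec<u8>>;
}
```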

Concurrent IO

A lot necessarily happens at the same time in this context: many chunks may be fetched, decoded, encoded, or written at once. I'd lean towards async-std if using async/await, as it seems more library-friendly, but the lifetime-wrangling involved makes me nervous.

The minimal version would probably be to make the chunk IO methods on Array async but leave the codecs synchronous, then allow region reads/writes to simply join over the internal futures (possibly blocking on that join so the function itself doesn't need to be async).
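A sketch of that minimal version; Array, read_chunk, and CodecError are stand-ins for this crate's types:

```rust
use futures::future::try_join_all;

struct Array;

#[derive(Debug)]
struct CodecError;

impl Array {
    async fn read_chunk(&self, idx: &[u64]) -> Result<Vec<u8>, CodecError> {
        // fetch bytes from the store, then run the (synchronous) codec chain
        todo!()
    }

    async fn read_region(&self, chunk_idxs: &[Vec<u64>]) -> Result<Vec<Vec<u8>>, CodecError> {
        // issue all chunk reads concurrently and await them together
        try_join_all(chunk_idxs.iter().map(|idx| self.read_chunk(idx))).await
    }
}
```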

Stupid sharding implementation

Partial reads are going to be a pain, but it would be good to have complete reads/writes for the current sharding spec available with well-documented tradeoffs.
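For the complete-read case, a sketch of slicing inner chunks out of a shard, assuming the spec's layout of an index of little-endian (offset, nbytes) u64 pairs at the end of the shard; the index's own codecs/checksum and the empty-chunk sentinel (u64::MAX) are ignored here:

```rust
fn read_shard(shard: &[u8], n_chunks: usize) -> Vec<&[u8]> {
    // index sits at the end: two u64s per inner chunk
    let index = &shard[shard.len() - n_chunks * 16..];
    index
        .chunks_exact(16)
        .map(|pair| {
            let offset = u64::from_le_bytes(pair[..8].try_into().unwrap()) as usize;
            let nbytes = u64::from_le_bytes(pair[8..].try_into().unwrap()) as usize;
            &shard[offset..offset + nbytes]
        })
        .collect()
}
```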

Error propagation through codecs

Probably needs a CodecError enum which contains an "Other" variant for potential future codec-specific errors. This could roll io::Error into it.
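A sketch of such an enum, here using the thiserror crate:

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum CodecError {
    // roll io::Error in, converting automatically via `?`
    #[error(transparent)]
    Io(#[from] std::io::Error),
    #[error("checksum mismatch: expected {expected:#010x}, got {actual:#010x}")]
    ChecksumMismatch { expected: u32, actual: u32 },
    // escape hatch for codec-specific errors defined later
    #[error("codec error: {0}")]
    Other(Box<dyn std::error::Error + Send + Sync>),
}
```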

Rebase on object_store

https://crates.io/crates/object_store

Pros

  • Offload complexity of different backends, gaining any new implementations for free
  • More consistent (and probably better thought out) interface
  • Async-first, which we should be moving towards anyway

Cons

  • Probably still need wrapper types so that we can implement backends they don't have
  • May not fit perfectly with our data model
  • WASM doesn't support cloud stores due to reqwest: apache/arrow-rs#4776
  • We use traits a lot, and async methods on traits aren't stable yet, though they may be as of Rust 1.75 (2023-12-28)
  • get returns an async stream while get_range{s} return Bytes; both would need wrapping over
  • doesn't allow suffix ranges yet: apache/arrow-rs#4611

With any move to async, BB codecs in particular would need major rewrites, and would make use of https://crates.io/crates/async-compression for compression.
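Roughly what chunk IO looks like on top of object_store (a minimal sketch; exact signatures have shifted between object_store versions, e.g. put's payload type):

```rust
use object_store::{memory::InMemory, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store = InMemory::new();
    let key = Path::from("group/array/c/0/0");

    // whole-chunk write
    store.put(&key, vec![0u8; 64].into()).await?;

    // whole-chunk read: a GetResult, collected here into one buffer
    let whole = store.get(&key).await?.bytes().await?;
    // partial read: a byte range, returned directly as Bytes
    let part = store.get_range(&key, 0..16).await?;
    assert_eq!(&whole[..16], &part[..]);
    Ok(())
}
```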
