
serde_arrow's Introduction

serde_arrow - convert sequences of Rust objects to Arrow arrays and back again

Crate info | API docs | Example | Related packages & performance | Status | License | Changes | Development

The arrow in-memory format is a powerful way to work with data-frame-like structures. The surrounding ecosystem includes a rich set of libraries, ranging from data frames such as Polars to query engines such as DataFusion. However, the APIs of the underlying Rust crates can at times be cumbersome to use due to the statically typed nature of Rust.

serde_arrow offers a simple way to convert Rust objects into Arrow arrays and back. serde_arrow relies on the Serde package to interpret Rust objects. Therefore, adding serde_arrow support to custom types is as easy as using Serde's derive macros.

In the Rust ecosystem there are two competing implementations of the arrow in-memory format. serde_arrow supports both arrow and arrow2 for schema tracing, serialization from Rust structs to arrays, and deserialization from arrays to Rust structs.

Example

The following examples assume that serde_arrow has been added to the Cargo.toml file and its features are configured. serde_arrow supports different arrow and arrow2 versions; the relevant one is selected by enabling the corresponding feature (e.g., arrow-51 to support arrow=51). See the crate documentation for more details.
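For example, a Cargo.toml entry might look like this (the version numbers are illustrative; pick the feature matching the arrow version you depend on):

```toml
[dependencies]
# The "arrow-51" feature pins serde_arrow's support to arrow = 51.
serde_arrow = { version = "0.11", features = ["arrow-51"] }
arrow = "51"
serde = { version = "1", features = ["derive"] }
```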

The examples below use the following Rust structure and example records:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Record {
    a: f32,
    b: i32,
}

let records = vec![
    Record { a: 1.0, b: 1 },
    Record { a: 2.0, b: 2 },
    Record { a: 3.0, b: 3 },
];

Serialize to arrow RecordBatch

use arrow::datatypes::FieldRef;
use serde_arrow::schema::{SchemaLike, TracingOptions};

// Determine Arrow schema
let fields = Vec::<FieldRef>::from_type::<Record>(TracingOptions::default())?;

// Build a record batch
let batch = serde_arrow::to_record_batch(&fields, &records)?;

This RecordBatch can now be written to disk using ArrowWriter from the parquet crate.

use std::fs::File;
use parquet::arrow::ArrowWriter;

let file = File::create("example.pq")?;
let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
writer.write(&batch)?;
writer.close()?;

Serialize to arrow2 arrays

use arrow2::datatypes::Field;
use serde_arrow::schema::{SchemaLike, TracingOptions};

let fields = Vec::<Field>::from_type::<Record>(TracingOptions::default())?;
let arrays = serde_arrow::to_arrow2(&fields, &records)?;

These arrays can now be written to disk using the helper method defined in the arrow2 guide. For parquet:

use arrow2::{chunk::Chunk, datatypes::Schema};

// see https://jorgecarleitao.github.io/arrow2/io/parquet_write.html
write_chunk(
    "example.pq",
    Schema::from(fields),
    Chunk::new(arrays),
)?;

Usage from Python

The written files can be read in Python via

# using polars
>>> import polars as pl
>>> pl.read_parquet("example.pq")
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f32 ┆ i32 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 2.0 ┆ 2   │
│ 3.0 ┆ 3   │
└─────┴─────┘

# using pandas
>>> import pandas as pd
>>> pd.read_parquet("example.pq")
     a  b
0  1.0  1
1  2.0  2
2  3.0  3

Related packages & Performance

  • arrow: the JSON component of the official Arrow crate can serialize objects that implement Serialize via its Decoder object. It supports primitive types, structs, and lists
  • arrow2-convert: adds derive macros to convert objects from and to arrow2 arrays. It supports primitive types, structs, lists, and chrono's date time types. Enum support is experimental according to the Readme. If performance is the main objective, arrow2-convert is a good choice as it has no or minimal overhead over building the arrays manually.

The different implementations have the following performance differences when compared to arrow2-convert:

Time

The detailed runtimes of the benchmarks are listed below.

complex_common_serialize(100000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 54.01 1.00 0.31 0.30 0.14
serde_arrow::to_arrow2 173.84 3.22 1.00 0.98 0.46
serde_arrow::to_arrow 177.92 3.29 1.02 1.00 0.47
arrow_json::ReaderBuilder 378.48 7.01 2.18 2.13 1.00

complex_common_serialize(1000000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 576.81 1.00 0.34 0.33 0.16
serde_arrow::to_arrow2 1701.46 2.95 1.00 0.97 0.46
serde_arrow::to_arrow 1748.89 3.03 1.03 1.00 0.48
arrow_json::ReaderBuilder 3676.51 6.37 2.16 2.10 1.00

primitives_serialize(100000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 15.83 1.00 0.51 0.36 0.12
serde_arrow::to_arrow2 30.90 1.95 1.00 0.70 0.23
serde_arrow::to_arrow 43.96 2.78 1.42 1.00 0.33
arrow_json::ReaderBuilder 133.97 8.46 4.34 3.05 1.00

primitives_serialize(1000000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 153.07 1.00 0.47 0.35 0.11
serde_arrow::to_arrow2 327.32 2.14 1.00 0.74 0.23
serde_arrow::to_arrow 440.39 2.88 1.35 1.00 0.31
arrow_json::ReaderBuilder 1429.31 9.34 4.37 3.25 1.00

License

Copyright (c) 2021 - 2024 Christopher Prohm and contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

serde_arrow's People

Contributors

alamb, chmp, elbaro, gstvg, gz, jmfiaschi-veepee, michael-f-bryan, progval, ryan-jacobs1, ten0


serde_arrow's Issues

Bring back the support for arrow-rs

While implementing a data source for DataFusion, I found this great library.
Unfortunately v0.5.0 does not support the list type; v0.6.0 added support for lists and more, but arrow2 is incompatible with DataFusion.

Until arrow-rs and arrow2 are merged, can you consider supporting arrow-rs or provide some compat utility to convert between arrow's RecordBatch and arrow2's Chunk?

https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.RecordBatchStream.html

Support dict / struct as the outer element?

At the moment the outer structure has to be Seq > Map. However, in principle it could also make sense to support Map > Seq (and also to trace arrays from a sequence).

Include union type as dictionaries

Update: serialize unions as is, but include an additional {name}_type field that encodes the union type in human readable form.

At the moment field-less enums are serialized as a union of null fields. This is probably not expected (or desired). It would probably be better to serialize them as a string dictionary.

test_round_trip!(
    test_name = enums_union,
    tracing_options = TracingOptions::default().allow_null_fields(true),
    field = Field::new(
        "value",
        DataType::Union(
            vec![
                Field::new("A", DataType::Null, true),
                Field::new("B", DataType::Null, true),
            ],
            None,
            UnionMode::Dense
        ),
        false,
    ),
    ty = Item,
    values = [Item::A, Item::B,],
    define = {
        #[derive(Debug, PartialEq, Deserialize, Serialize)]
        enum Item {
            A,
            B,
        }
    },
);

Current idea:

  • add fields with strategy "enum_type" that only serialize the type of the enum
  • Q: allow both data and type to be stored in different fields? How to deal with inconsistencies?
    • on serialization no inconsistency can happen
    • on deserialization ignore the type if both are present?

Check allocation when building arrays

In my tests using arrow2 it seems like some large allocations may be happening, either when building arrays or when writing the IPC format. Steps:

  • Convert dataset
  • Build arrays
  • Write IPC format

Improve error messages

  • Unify error message format
    • lowercase / uppercase initial letter?
  • Indicate the relevant component for each error (e.g., the relevant builder)
  • Include the field path in the error message (#202)
  • Document the conventions

Known Issues:

  • mismatched data types, in particular for float coercion: F64 can accept U64, but F32 cannot. Give a proper error message in this case and describe how coerce_numbers may help with tracing
  • Add a pointer to coerce_numbers for different number types (e.g., u64 accepting an f64)
  • #97

Include field path in error messages

It can be hard to diagnose errors when they occur deep within an object tree. It would be better to include the field path in the error message.

Add support for missing keys

Keys may be missing for maps. At the moment this is not supported. serde_arrow should implement a "fill any missing fields with null" strategy.

Notes:

  • Idea: keep a boolean mask of fields written and, at the end of each record, fill any missing fields with nulls.
  • This also needs to be supported in schema tracing
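The boolean-mask idea could be sketched with plain standard-library types (the helper and its types here are hypothetical, not serde_arrow API):

```rust
/// Hypothetical sketch: given the schema's field names and the
/// (name, value) pairs actually written for one record, fill any
/// missing fields with nulls (None) at the end of the record.
fn fill_missing_fields(
    schema_fields: &[&str],
    written: &[(&str, i64)],
) -> Vec<(String, Option<i64>)> {
    schema_fields
        .iter()
        .map(|name| {
            // Look up whether this field was written for the record.
            let value = written
                .iter()
                .find(|(written_name, _)| written_name == name)
                .map(|(_, v)| *v);
            (name.to_string(), value)
        })
        .collect()
}
```

For example, `fill_missing_fields(&["a", "b"], &[("a", 1)])` yields `a = Some(1)` and `b = None`.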

Add support for fixed point numbers

Q: What are the relevant fixed point / decimal crates for Rust? Do they support serialization?

  • crates.io keywords:
  • Top crates (with more than 1 million recent downloads):
    • rust_decimal: 2.272.352 recent downloads
      • arbitrary precision uses internal serde_json tags, figure out how this works
      • str / float serialization can be directly implemented
    • bigdecimal: 2.143.209 recent downloads
      • serde feature serializes to string (source)
    • fraction: 1.170.025 recent downloads
      • custom serialization format, neither string nor float based. skip for now

Fix nested options with bytecode serializer

Option<Option<T>> is currently not working with the bytecode serializer

The logic of the bytecode is too strict: it expects exactly Item, <Value> or Null, but Option<Option<T>> serializes as:

  • Item Item <Value>,
  • Item Null, or
  • Null
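The three shapes above can be illustrated with a toy event model (the `Event` enum here is illustrative, not serde_arrow's actual bytecode):

```rust
/// Illustrative nullability events for Option<Option<i32>>.
#[derive(Debug, PartialEq)]
enum Event {
    Item,
    Value(i32),
    Null,
}

fn events_for(v: Option<Option<i32>>) -> Vec<Event> {
    match v {
        // Outer None: a single Null event.
        None => vec![Event::Null],
        // Some(None): the outer Some emits Item, the inner None emits Null.
        Some(None) => vec![Event::Item, Event::Null],
        // Some(Some(x)): two Item events, then the value itself.
        Some(Some(x)) => vec![Event::Item, Event::Item, Event::Value(x)],
    }
}
```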

Full support for dictionary arrays

Currently serde_arrow supports dictionaries of string values (categories), as this is the most common use case. arrow2 allows all hashable types to be used in a dictionary array; it actually uses only the hash and does not test for equality.

Q: How can serde_arrow support more general dictionary types?

  • Only use hashes? (as arrow2 does)
  • Use the new Value object to fully collect serialized objects in memory?

Support arrow2

Since polars switched to arrow2, serde_arrow no longer works with polars.

Questions to sort out:

  • Support arrow / arrow2 in the same dependency?
    • Alternative: require users to include multiple versions of serde_arrow

Initial plan of attack:

  1. Isolate all arrow interaction into separate modules
  2. Investigate how to support different arrow "backends" without code duplication

Branch: https://github.com/chmp/serde_arrow/tree/feature/arrow2

Add support for Timestamp

datafusion supports Timestamp columns better than date64 (e.g., date_trunc requires a Timestamp argument).

Plan: support Timestamp(Second, None) and Timestamp(Second, UTC) columns by piggybacking on the Date64 support

Add codegen for serialization?

Add the option to generate code for serialization based on fields?

Rework arrow2::format_fields:

  • rework it to use GenericField
  • move it into a new codegen package

Crossovers

Hi,

Firstly, thank you for building this in the open & sharing!

I see that this can be used to convert serde-derivable structures to the arrow layout.

There are a number of ways to parse binary content into Rust data types.
Additionally, there is https://github.com/simd-lite/simd-json-derive for deriving JSON from Rust data types.

Would I be able to convert a bunch of structs "created" by them, and then use serde_arrow's derive on top of that, to convert it finally to the arrow layout?

Re-consider date handling strategy

At the moment the date handling strategy is directly tied to the internals of chrono. It probably would be better to allow more customization and therefore decouple serde_arrow from chrono.

Issue with nested structure

Hey

Nice project!

I'm just attempting to deserialize a nested structure (2 levels). I'm not sure whether this is supported(?), but I'm doing something similar with the csv serde, and it just flattens the nested structure. Is something similar possible here, or are nested structures supported by arrow/parquet?

Currently I'm getting the following error: Error: Cannot add_field with event StartMap

Thanks

Release `0.6.0`

Steps to take before releasing version 0.6.0

PR #46

  • Review docs
    • Use the same example for Readme and lib docs
    • Add Intro to lib docs
    • Unify docs between arrow / arrow2 (document the individual fields)
    • Add doc tests to arrow impls
  • Remove serde_arrow::arrow2::experimental::format_fields
  • Check defaults in tracing (LargeUtf8, LargeList, dictionary keys ...)
    • Note: use u32 for dictionary keys, as polars does not support u64 keys
  • Add missing tests for sinks:
    • union
    • chrono
    • maps (traced as struct, traced as map, with string and non-string keys)
  • Remove base module?
  • Warn for null arrays when tracing (or maybe even error?)
  • Sort map fields? Otherwise the field order for HashMaps is runtime dependent

Make features additive

Currently the features are not additive. E.g., enabling both arrow-50 and arrow-49 will select arrow=50. In large dependency trees this may cause issues, as multiple crates might select different versions of the underlying crates.

User A builds a crate on top of serde_arrow using, say, the arrow-48 feature, and user B builds a crate on top of serde_arrow using the arrow-49 feature. As soon as someone combines both of these crates, feature unification will select both the arrow-48 and arrow-49 features. In this case, serde_arrow will select the arrow=49 dependency and user A's crate will break, probably in hard-to-diagnose ways. Library authors should be able to select a fixed arrow version and stick with it, regardless of the larger context of the program.

Potential interface

  • stick with one fixed arrow version for each major version of serde_arrow
  • track only this fixed version using the shorthand functions
  • export a versioned module for each supported version
  • e.g., serde_arrow::to_arrow => fixed version, serde_arrow::arrow_49::to_arrow => arrow=49

Option 1: Generate wrappers

  • Generate the wrappers for each arrow version, e.g., using macros or templates.
  • Generate code for each activated version
  • Re-export the maximum version as the "default" (current interface)
  • Instruct library authors to use the specific versions
  • Add a @generated marker to the generated files (see capnproto/capnproto-rust#153)
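Option 1 could be sketched with a declarative macro (the macro and module names here are hypothetical, not serde_arrow's actual layout):

```rust
// Hypothetical sketch of Option 1: generate one versioned module per
// supported arrow feature via a declarative macro.
macro_rules! arrow_version_module {
    ($mod_name:ident) => {
        mod $mod_name {
            // Version-pinned to_arrow / from_arrow wrappers would live here.
            pub fn version() -> &'static str {
                stringify!($mod_name)
            }
        }
    };
}

arrow_version_module!(arrow_48);
arrow_version_module!(arrow_49);
```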

Option 2: Use FFI

Only maintain support for the latest arrow version and use the FFI interface to convert between versions?

Alternatives

  1. Give up supporting multiple different arrow versions and simply default to latest.

Allow to serialize / deserialize the schema to disk?

It could be helpful to specify the schema in JSON format and then use it instead of tracing.

[
  {"name": "a", "data_type": "U32", "nullable": false},
  {
     "name": "complex", "data_type": "Struct", "nullable": false,
     "children": [
       "..."
     ]
  }
]

Use a similar concept as with other structs: allow serializing a struct into the fields (e.g., from a serde_json::Value)

Support serialization of maps as structs

Required to support #[serde(flatten)] again

Implementation idea:

  • add a strategy to a struct type that says serialize to map / deserialize from map

Open question:

  • support schema tracing? add a general option, that converts maps to structs?

Release of serde_arrow with arrow2 support

The arrow2 branch is looking pretty solid at the moment. However, there are some open issues that should be resolved before releasing it:

(Note: I plan to close the linked issues with the release)

Tests:

  • can tuples also be serialized as a sequence, or is TupleAsStruct always used?

PR: #16

[Roadmap] Next (major) release

Planned features:

Skip for now:

Misc:

  • Document the meaning of the schema: the schema describes the arrays not the rust structs
  • Re-implement maps for the outer struct (tests: test_maps_with_missing_items, into_outer_maps_simple)
  • Test Lists with i32 offsets
  • Remove special cased dictionaries in interpreter and use raw buffers directly
  • Test different orders between arrays and fields (cannot happen: the order of arrays and fields must be consistent)

Improve performance

Using the accept_{ev} methods in serialization gives a huge performance boost. This should be reimplemented; see 5e27233 for details.

Fieldless unions broken on main

Original report:

Not sure whether this should be its own issue or under this one, but I am currently unable to serialize field-less enums - they all panic with a backtrace like this:

thread '<unnamed>' panicked at 'index out of bounds: the len is 2 but the index is 3', /Users/davidroher/.cargo/registry/src/github.com-1ecc6299db9ec823/serde_arrow-0.7.0/src/internal/generic_sinks/union.rs:85:45
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:159:5
   3: <serde_arrow::internal::generic_sinks::union::UnionArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   4: <serde_arrow::internal::generic_sinks::struct::StructArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   5: <serde_arrow::internal::generic_sinks::struct::StructArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   6: <serde_arrow::internal::sink::EventSerializer<S> as serde::ser::Serializer>::serialize_unit_variant
   7: boxball_rs::event_file::game_state::_::<impl serde::ser::Serialize for boxball_rs::event_file::game_state::GameMetadata>::serialize

As soon as it gets to any struct that has a fieldless enum as one of its fields, the panic occurs (when I remove all enum fields from the struct, I no longer get the error).
