
serde_arrow's Introduction

serde_arrow - convert sequences of Rust objects to Arrow arrays and back again

Crate info | API docs | Example | Related packages & performance | Status | License | Changes | Development

The arrow in-memory format is a powerful way to work with data-frame-like structures. The surrounding ecosystem includes a rich set of libraries, ranging from data frames such as Polars to query engines such as DataFusion. However, the APIs of the underlying Rust crates can at times be cumbersome to use due to the statically typed nature of Rust.

serde_arrow offers a simple way to convert Rust objects into Arrow arrays and back. serde_arrow relies on the Serde package to interpret Rust objects. Therefore, adding serde_arrow support to custom types is as easy as using Serde's derive macros.

In the Rust ecosystem there are two competing implementations of the arrow in-memory format. serde_arrow supports both arrow and arrow2 for schema tracing, serialization from Rust structs to arrays, and deserialization from arrays to Rust structs.

Example

The following examples assume that serde_arrow has been added to the Cargo.toml file and its features are configured. serde_arrow supports different arrow and arrow2 versions; the relevant one is selected by enabling the corresponding feature (e.g., arrow-51 to support arrow=51). See the crate documentation for more details.
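For example, a Cargo.toml entry might look like this (the version numbers are illustrative; pick the feature matching the arrow version you depend on):

```toml
[dependencies]
# The "arrow-51" feature pins serde_arrow's support to arrow = 51.
serde_arrow = { version = "0.11", features = ["arrow-51"] }
arrow = "51"
serde = { version = "1", features = ["derive"] }
```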

The examples below use the following Rust structure and example records:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Record {
    a: f32,
    b: i32,
}

let records = vec![
    Record { a: 1.0, b: 1 },
    Record { a: 2.0, b: 2 },
    Record { a: 3.0, b: 3 },
];

Serialize to arrow RecordBatch

use arrow::datatypes::FieldRef;
use serde_arrow::schema::{SchemaLike, TracingOptions};

// Determine Arrow schema
let fields = Vec::<FieldRef>::from_type::<Record>(TracingOptions::default())?;

// Build a record batch
let batch = serde_arrow::to_record_batch(&fields, &records)?;

This RecordBatch can now be written to disk using ArrowWriter from the parquet crate.

use std::fs::File;
use parquet::arrow::ArrowWriter;

let file = File::create("example.pq")?;
let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
writer.write(&batch)?;
writer.close()?;

Serialize to arrow2 arrays

use arrow2::datatypes::Field;
use serde_arrow::schema::{SchemaLike, TracingOptions};

let fields = Vec::<Field>::from_type::<Record>(TracingOptions::default())?;
let arrays = serde_arrow::to_arrow2(&fields, &records)?;

These arrays can now be written to disk using the helper method defined in the arrow2 guide. For parquet:

use arrow2::{chunk::Chunk, datatypes::Schema};

// see https://jorgecarleitao.github.io/arrow2/io/parquet_write.html
write_chunk(
    "example.pq",
    Schema::from(fields),
    Chunk::new(arrays),
)?;

Usage from Python

The written files can be read in Python via

# using polars
>>> import polars as pl
>>> pl.read_parquet("example.pq")
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f32 ┆ i32 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 2.0 ┆ 2   │
│ 3.0 ┆ 3   │
└─────┴─────┘

# using pandas
>>> import pandas as pd
>>> pd.read_parquet("example.pq")
     a  b
0  1.0  1
1  2.0  2
2  3.0  3

Related packages & Performance

  • arrow: the JSON component of the official Arrow crate can serialize objects that implement Serialize via its Decoder object. It supports primitive types, structs, and lists
  • arrow2-convert: adds derive macros to convert objects from and to arrow2 arrays. It supports primitive types, structs, lists, and chrono's date time types. Enum support is experimental according to the Readme. If performance is the main objective, arrow2-convert is a good choice as it has no or minimal overhead over building the arrays manually.

The different implementations have the following performance differences when compared to arrow2-convert:

Time

The detailed runtimes of the benchmarks are listed below.

complex_common_serialize(100000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 54.01 1.00 0.31 0.30 0.14
serde_arrow::to_arrow2 173.84 3.22 1.00 0.98 0.46
serde_arrow::to_arrow 177.92 3.29 1.02 1.00 0.47
arrow_json::ReaderBuilder 378.48 7.01 2.18 2.13 1.00

complex_common_serialize(1000000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 576.81 1.00 0.34 0.33 0.16
serde_arrow::to_arrow2 1701.46 2.95 1.00 0.97 0.46
serde_arrow::to_arrow 1748.89 3.03 1.03 1.00 0.48
arrow_json::ReaderBuilder 3676.51 6.37 2.16 2.10 1.00

primitives_serialize(100000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 15.83 1.00 0.51 0.36 0.12
serde_arrow::to_arrow2 30.90 1.95 1.00 0.70 0.23
serde_arrow::to_arrow 43.96 2.78 1.42 1.00 0.33
arrow_json::ReaderBuilder 133.97 8.46 4.34 3.05 1.00

primitives_serialize(1000000)

label                         time [ms]  vs. TryIntoArrow  vs. to_arrow2  vs. to_arrow  vs. ReaderBuilder
arrow2_convert::TryIntoArrow 153.07 1.00 0.47 0.35 0.11
serde_arrow::to_arrow2 327.32 2.14 1.00 0.74 0.23
serde_arrow::to_arrow 440.39 2.88 1.35 1.00 0.31
arrow_json::ReaderBuilder 1429.31 9.34 4.37 3.25 1.00

License

Copyright (c) 2021 - 2024 Christopher Prohm and contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

serde_arrow's People

Contributors

alamb, chmp, elbaro, gstvg, gz, jmfiaschi-veepee, michael-f-bryan, progval, ryan-jacobs1, ten0


serde_arrow's Issues

Bring back the support for arrow-rs

While implementing a data source for DataFusion, I found this great library.
Unfortunately v0.5.0 does not support the list type; v0.6.0 added support for lists and more, but arrow2 is incompatible with DataFusion.

Until arrow-rs and arrow2 are merged, can you consider supporting arrow-rs or provide some compat utility to convert between arrow's RecordBatch and arrow2's Chunk?

https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.RecordBatchStream.html

Support dict / struct as the outer element?

At the moment the outer structure has to be Seq > Map. However, in principle it could also make sense to support Map > Seq (and also to trace arrays from a sequence).

Include union type as dictionaries

Update: serialize unions as is, but include an additional {name}_type field that encodes the union type in human readable form.

At the moment field-less enums are serialized as a union of null fields. This is probably not expected (or desired). It would probably be better to serialize them as a string dictionary.

test_round_trip!(
    test_name = enums_union,
    tracing_options = TracingOptions::default().allow_null_fields(true),
    field = Field::new(
        "value",
        DataType::Union(
            vec![
                Field::new("A", DataType::Null, true),
                Field::new("B", DataType::Null, true),
            ],
            None,
            UnionMode::Dense
        ),
        false,
    ),
    ty = Item,
    values = [Item::A, Item::B,],
    define = {
        #[derive(Debug, PartialEq, Deserialize, Serialize)]
        enum Item {
            A,
            B,
        }
    },
);

Current idea:

  • add fields with strategy "enum_type" that only serialize the type of the enum
  • Q: allow both data and type to be stored in different fields? How to deal with inconsistencies?
    • on serialization no inconsistency can happen
    • on deserialization ignore the type if both are present?

Check allocation when building arrays

In my tests using arrow2 it seems like some large allocations may be happening, either when building arrays or when writing the IPC format. Steps:

  • Convert dataset
  • Build arrays
  • Write IPC format

Improve error messages

  • Unify error message format
    • lowercase / uppercase initial letter?
  • Indicate the relevant component for each error (e.g., the relevant builder)
  • Include the field path in the error message (#202)
  • Document the conventions

Known Issues:

  • mismatched data types, in particular for float coercion: F64 can accept U64, but F32 cannot. Give a proper error message in this case and describe how coerce_numbers may help with tracing
  • Add a pointer to coerce_numbers for different number types (e.g., u64 accepting an f64)
  • #97

Include field path in error messages

It can be hard to diagnose errors when they occur deep within an object tree. It would be better to include the field path in the error message.

Add support for missing keys

Keys may be missing for maps. At the moment this is not supported. serde_arrow should implement a "fill any missing fields with null" strategy.

Notes:

  • Idea: keep a boolean mask of fields written and, at the end of each record, fill any missing fields with nulls.
  • This also needs to be supported in schema tracing
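The boolean-mask idea could be sketched with plain standard-library types (the helper and its types here are hypothetical, not serde_arrow API):

```rust
/// Hypothetical sketch: given the schema's field names and the
/// (name, value) pairs actually written for one record, fill any
/// missing fields with nulls (None) at the end of the record.
fn fill_missing_fields(
    schema_fields: &[&str],
    written: &[(&str, i64)],
) -> Vec<(String, Option<i64>)> {
    schema_fields
        .iter()
        .map(|name| {
            // Look up whether this field was written for the record.
            let value = written
                .iter()
                .find(|(written_name, _)| written_name == name)
                .map(|(_, v)| *v);
            (name.to_string(), value)
        })
        .collect()
}
```

For example, `fill_missing_fields(&["a", "b"], &[("a", 1)])` yields `a = Some(1)` and `b = None`.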

Add support for fixed point numbers

Q: What are the relevant fixed point / decimal crates for Rust? Do they support serialization?

  • crates.io keywords:
  • Top crates (with more than 1 million recent downloads):
    • rust_decimal: 2.272.352 recent downloads
      • arbitrary precision uses internal serde_json tags, figure out how this works
      • str / float serialization can be directly implemented
    • bigdecimal: 2.143.209 recent downloads
      • serde feature serializes to string (source)
    • fraction: 1.170.025 recent downloads
      • custom serialization format, neither string nor float based. skip for now

Fix nested options with bytecode serializer

Option<Option<T>> is currently not working with the bytecode serializer

The logic of the bytecode is too strict: it expects exactly Item, <Value> or Null, but Option<Option<T>> serializes as:

  • Item Item <Value>,
  • Item Null, or
  • Null
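The three shapes above can be illustrated with a toy event model (the `Event` enum here is illustrative, not serde_arrow's actual bytecode):

```rust
/// Illustrative nullability events for Option<Option<i32>>.
#[derive(Debug, PartialEq)]
enum Event {
    Item,
    Value(i32),
    Null,
}

fn events_for(v: Option<Option<i32>>) -> Vec<Event> {
    match v {
        // Outer None: a single Null event.
        None => vec![Event::Null],
        // Some(None): the outer Some emits Item, the inner None emits Null.
        Some(None) => vec![Event::Item, Event::Null],
        // Some(Some(x)): two Item events, then the value itself.
        Some(Some(x)) => vec![Event::Item, Event::Item, Event::Value(x)],
    }
}
```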

Full support for dictionary arrays

Currently serde_arrow supports dictionaries of string values (categories), as this is the most common use case. arrow2 allows all hashable types to be used in a dictionary array; it actually uses only the hash and does not test for equality.

Q: How can serde_arrow support more general dictionary types?

  • Only use hashes? (as arrow2 does)
  • Use the new Value object to fully collect serialized objects in memory?

Support arrow2

Since polars switched to arrow2, serde_arrow no longer works with polars.

Questions to sort out:

  • Support arrow / arrow2 in the same dependency?
    • Alternative: require users to include multiple versions of serde_arrow

Initial plan of attack:

  1. Isolate all arrow interaction into separate modules
  2. Investigate how to support different arrow "backends" without code duplication

Branch: https://github.com/chmp/serde_arrow/tree/feature/arrow2

Add support for Timestamp

datafusion supports Timestamp columns better than date64 (e.g., date_trunc requires a Timestamp argument).

Plan: support Timestamp(Second, None) and Timestamp(Second, UTC) columns by piggybacking on the Date64 support

Add codegen for serialization?

Add the option to generate code for serialization based on fields?

Rework arrow2::format_fields:

  • rework it to use GenericField
  • move it into a new codegen package

Crossovers

Hi,

Firstly, thank you for building this in the open & sharing!

I see that this can be used to convert serde-derivable structures to the arrow layout.

There are a number of ways to parse binary content into Rust data types.
Additionally, there is https://github.com/simd-lite/simd-json-derive for deriving JSON from Rust data types.

Would I be able to convert a bunch of structs "created" by them, and then use serde_arrow's derive on top of that, to convert it finally to the arrow layout?

Re-consider date handling strategy

At the moment the date handling strategy is directly tied to the internals of chrono. It probably would be better to allow more customization and therefore decouple serde_arrow from chrono.

Issue with nested structure

Hey

Nice project!

I'm just attempting to deserialize a nested structure (2 levels). I'm not sure whether this is supported(?), but I'm doing something similar with the csv serde, and it just flattens the nested structure. Is something similar possible here, or are nested structures supported by arrow/parquet?

Currently I'm getting the following error: Error: Cannot add_field with event StartMap

Thanks

Release `0.6.0`

Steps to take before releasing version 0.6.0

PR #46

  • Review docs
    • Use the same example for Readme and lib docs
    • Add Intro to lib docs
    • Unify docs between arrow / arrow2 (document the individual fields)
    • Add doc tests to arrow impls
  • Remove serde_arrow::arrow2::experimental::format_fields
  • Check defaults in tracing (LargeUtf8, LargeList, dictionary keys ...)
    • Note: use u32 for dictionary keys, as polars does not support u64 keys
  • Add missing tests for sinks:
    • union
    • chrono
    • maps (traced as struct, traced as map, with string and non-string keys)
  • Remove base module?
  • Warn for null arrays when tracing (or maybe even error?)
  • Sort map fields? Otherwise the field order for HashMaps is runtime dependent

Make features additive

Currently the features are not additive. E.g., enabling both arrow-50 and arrow-49 will select arrow=50. In large dependency trees this may cause issues, as multiple crates might select different versions of the underlying crates.

User A builds a crate on top of serde_arrow using, say, the arrow-48 feature, and user B builds a crate on top of serde_arrow using the arrow-49 feature. As soon as someone combines both of these crates, feature unification will select both the arrow-48 and arrow-49 features. In this case, serde_arrow will select the arrow=49 dependency and user A's crate will break, probably in hard-to-diagnose ways. Library authors should be able to select a fixed arrow version and stick with it, regardless of the larger context of the program.

Potential interface

  • stick with one fixed arrow version for each major version of serde_arrow
  • track only this fixed version using the shorthand functions
  • export a versioned module for each supported version
  • e.g., serde_arrow::to_arrow => fixed version, serde_arrow::arrow_49::to_arrow => arrow=49

Option 1: Generate wrappers

  • Generate the wrappers for each arrow version, e.g., using macros or templates.
  • Generate code for each activated version
  • Re-export the maximum version as the "default" (current interface)
  • Instruct library authors to use the specific versions
  • Add a @generated marker to the generated files (see capnproto/capnproto-rust#153)
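Option 1 could be sketched with a declarative macro (the macro and module names here are hypothetical, not serde_arrow's actual layout):

```rust
// Hypothetical sketch of Option 1: generate one versioned module per
// supported arrow feature via a declarative macro.
macro_rules! arrow_version_module {
    ($mod_name:ident) => {
        mod $mod_name {
            // Version-pinned to_arrow / from_arrow wrappers would live here.
            pub fn version() -> &'static str {
                stringify!($mod_name)
            }
        }
    };
}

arrow_version_module!(arrow_48);
arrow_version_module!(arrow_49);
```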

Option 2: Use FFI

Only maintain support for the latest arrow version and use the FFI interface to convert between versions?

Alternatives

  1. Give up supporting multiple different arrow versions and simply default to latest.

Allow to serialize / deserialize the schema to disk?

It could be helpful to specify the schema in JSON format and then use it instead of tracing.

[
  {"name": "a", "data_type": "U32", "nullable": false},
  {
     "name": "complex", "data_type": "Struct", "nullable": false,
     "children": [
       "..."
     ]
  }
]

Use a similar concept as with other structs: allow serializing a struct into the fields (e.g., from a serde_json::Value)

Support serialization of maps as structs

Required to support #[serde(flatten)] again

Implementation idea:

  • add a strategy to a struct type that says serialize to map / deserialize from map

Open question:

  • support schema tracing? add a general option, that converts maps to structs?

Release of serde_arrow with arrow2 support

The arrow2 branch is looking pretty solid at the moment. However, there are some open issues that should be resolved before releasing it:

(Note: I plan to close the linked issues with the release)

Tests:

  • can tuples also be serialized as a sequence, or is TupleAsStruct always used?

PR: #16

[Roadmap] Next (major) release

Planned features:

Skip for now:

Misc:

  • Document the meaning of the schema: the schema describes the arrays not the rust structs
  • Re-implement maps for the outer struct (tests: test_maps_with_missing_items, into_outer_maps_simple)
  • Test Lists with i32 offsets
  • Remove special cased dictionaries in interpreter and use raw buffers directly
  • Test different orders between arrays and fields (cannot happen: the order of arrays and fields must be consistent)

Improve performance

Using the accept_{ev} methods in serialization gives a huge performance boost. This should be reimplemented; see 5e27233 for details.

Fieldless unions broken on main

Original report:

Not sure whether this should be its own issue or under this one, but I am currently unable to serialize field-less enums - they all panic with a backtrace like this:

thread '<unnamed>' panicked at 'index out of bounds: the len is 2 but the index is 3', /Users/davidroher/.cargo/registry/src/github.com-1ecc6299db9ec823/serde_arrow-0.7.0/src/internal/generic_sinks/union.rs:85:45
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:159:5
   3: <serde_arrow::internal::generic_sinks::union::UnionArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   4: <serde_arrow::internal::generic_sinks::struct::StructArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   5: <serde_arrow::internal::generic_sinks::struct::StructArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   6: <serde_arrow::internal::sink::EventSerializer<S> as serde::ser::Serializer>::serialize_unit_variant
   7: boxball_rs::event_file::game_state::_::<impl serde::ser::Serialize for boxball_rs::event_file::game_state::GameMetadata>::serialize

As soon as it gets to any struct that has a fieldless enum as one of its fields, the panic occurs (when I remove all enum fields from the struct, I no longer get the error).
