
datafusion-orc's Introduction


orc-rust

A native Rust implementation of the Apache ORC file format, providing APIs to read data into Apache Arrow in-memory arrays.

See the documentation for examples on how to use this crate.
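
A minimal sketch of reading an ORC file into Arrow record batches, based on the ArrowReaderBuilder API that appears in the issues later on this page (the file name is hypothetical, and anyhow is used only for brevity):

use std::fs::File;

use anyhow::Result;
use orc_rust::arrow_reader::ArrowReaderBuilder;

fn main() -> Result<()> {
    let file = File::open("example.orc")?;
    let reader = ArrowReaderBuilder::try_new(file)?
        .with_batch_size(1024)
        .build();
    for batch in reader {
        // Each item is a Result wrapping an arrow RecordBatch.
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}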

Supported features

This crate currently supports only reading ORC files into Arrow arrays; write support is planned (see Roadmap). The features listed below relate only to reading ORC files. At this time, we aim to support only the ORCv1 specification.

  • Read synchronously & asynchronously (using Tokio)
  • All compression types (Zlib, Snappy, Lzo, Lz4, Zstd)
  • All ORC data types
  • All encodings
  • Rudimentary support for retrieving statistics
  • Retrieving user metadata into Arrow schema metadata

Roadmap

The long term vision for this crate is to be feature complete enough to be donated to the arrow-rs project.

The following lists the rough roadmap for features to be implemented, from highest to lowest priority.

  • Performance enhancements
  • DataFusion integration
  • Predicate pushdown
  • Row indices
  • Bloom filters
  • Write from Arrow arrays
  • Encryption

A non-Arrow API interface is not planned at the moment. Feel free to raise an issue if there is such a use case.

Version compatibility

No guarantees are provided about stability across versions. We will endeavour to keep the top-level APIs (ArrowReader and ArrowStreamReader) as stable as we can, but other APIs may change as we explore the interface we want the library to expose.

Versions will be released on an ad-hoc basis (with no fixed schedule).

Mapping ORC types to Arrow types

The following table lists how ORC data types are read into Arrow data types:

ORC Data Type      Arrow Data Type              Notes
Boolean            Boolean
TinyInt            Int8
SmallInt           Int16
Int                Int32
BigInt             Int64
Float              Float32
Double             Float64
String             Utf8
Char               Utf8
VarChar            Utf8
Binary             Binary
Decimal            Decimal128
Date               Date32
Timestamp          Timestamp(Nanosecond, None)  ¹
Timestamp instant  Timestamp(Nanosecond, UTC)   ¹
Struct             Struct
List               List
Map                Map
Union              Union(_, Sparse)             ²

¹: ArrowReaderBuilder::with_schema allows configuring a different time unit or decoding to Decimal128(38, 9) (an i128 of non-leap nanoseconds since the UNIX epoch). Overflow may occur when decoding to a non-Seconds time unit and results in OrcError. Loss of precision may occur when decoding to a non-Nanosecond time unit and also results in OrcError. Decimal128(38, 9) avoids both overflow and loss of precision.

²: Currently only supports a maximum of 127 variants
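
As a sketch of footnote ¹, the schema passed to ArrowReaderBuilder::with_schema can be derived from the builder's inferred schema with the timestamp columns rewritten to the desired type. This mirrors the transform_schema helper in a repro later on this page; the function name here is illustrative:

use std::sync::Arc;

use arrow::datatypes::{DataType, Schema, TimeUnit};

fn with_microsecond_timestamps(schema: &Schema) -> Arc<Schema> {
    Arc::new(Schema::new(
        schema
            .fields()
            .iter()
            .cloned()
            .map(|field| match field.data_type() {
                // Could instead be DataType::Decimal128(38, 9) to avoid both
                // overflow and loss of precision.
                DataType::Timestamp(_, tz) => (*field)
                    .clone()
                    .with_data_type(DataType::Timestamp(TimeUnit::Microsecond, tz.clone())),
                _ => (*field).clone(),
            })
            .collect::<Vec<_>>(),
    ))
}

The resulting schema is then applied with reader_builder.with_schema(with_microsecond_timestamps(&reader_builder.schema())) before calling build().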

Contributing

All contributions are welcome! Feel free to raise an issue if you have a feature request, bug report, or a question. Feel free to raise a Pull Request without raising an issue first, as long as the Pull Request is descriptive enough.

Some tools we use in addition to the standard cargo commands, and which require installation, are:

cargo install typos-cli
cargo install taplo-cli
# Building the crate
cargo build

# Running the test suite
cargo test

# Simple benchmarks
cargo bench

# Formatting TOML files
taplo format

# Detect any typos in the codebase
typos

To regenerate/update the proto.rs file, execute the regen.sh script.

./regen.sh

datafusion-orc's People

Contributors

alamb, harveyyue, jefffrey, klangner, progval, v0y4g3r, waynexia, wenyxu, xuanwo, youngsofun


datafusion-orc's Issues

Convert date/timestamp data direct to arrow without chrono

Currently in the date and timestamp arrow readers, we read the values and convert them to NaiveDate/NaiveDateTime values, which are then used to construct the relevant Arrow arrays.

We can skip this intermediate conversion to chrono types and instead convert the ORC date/timestamp values directly into the underlying Arrow representation for dates/timestamps (with any necessary bit/byte manipulation).
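
As a rough sketch of the idea for dates, assuming the ORC date representation of days since the Unix epoch (which matches Arrow's Date32 unit); the function name is hypothetical:

/// ORC date values are days since 1970-01-01, the same unit Arrow's Date32
/// uses, so no chrono round-trip is needed (a range check could be added).
fn orc_date_to_date32(days: i64) -> i32 {
    days as i32
}

Timestamps can similarly be combined into the Arrow epoch-based representation, though they carry extra subtleties (see the timestamp issue later on this page).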

Integer RLE v1 decoding support

I have some existing code for Integer RLE v1 decoding, so I'll work on porting it over.

This should also enable support for the DIRECT (v1) and DICTIONARY (v1) encodings. A rough sketch of the decoding follows.
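
For reference, a rough sketch of unsigned RLE v1 decoding as described by the ORCv1 spec: a control byte of 0..=127 introduces a run of control + 3 values generated from a base varint and a signed delta byte, while a negative control byte introduces -control literal varints. Signed columns additionally zigzag-decode each value; the helper names here are hypothetical:

use std::io::Read;

/// Unsigned base-128 varint, as used by ORC for integer values.
fn read_varint<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn decode_rle_v1<R: Read>(reader: &mut R, out: &mut Vec<u64>) -> std::io::Result<()> {
    let mut header = [0u8; 1];
    while reader.read_exact(&mut header).is_ok() {
        let control = header[0] as i8;
        if control >= 0 {
            // Run: `control + 3` values starting at `base`, stepping by `delta`.
            let length = i64::from(control) + 3;
            let mut delta = [0u8; 1];
            reader.read_exact(&mut delta)?;
            let delta = i64::from(delta[0] as i8);
            let base = read_varint(reader)? as i64;
            for i in 0..length {
                out.push((base + i * delta) as u64);
            }
        } else {
            // Literals: `-control` varint-encoded values.
            for _ in 0..-i64::from(control) {
                out.push(read_varint(reader)?);
            }
        }
    }
    Ok(())
}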

More Integer RLE tests

With the refactoring introduced by #57, I'd like to add a lot of tests for RLE behaviour. We will need to use the Apache ORC repo implementation to generate the test cases (using it as our oracle).

I want to get to a state where we are 100% certain that the integer decoding behaviour is aligned with the specification.

Write ORC from arrow recordbatches

Not a focus right now; just raising this issue here for tracking.

Currently in progress.

Initial support

Tracked by initial-write-support branch

Checklist:

  • High level ArrowWriter synchronous interface (accepts RecordBatches to write)
  • Basic configuration via builder
  • Stripe writer
  • Metadata writer
  • Value encoding
    • Integer RLEv2
      • Short repeat
      • Direct
      • Delta
      • Patched base
    • Base 128 varint
    • Byte RLE
  • Encode nullability
  • Float/Double array
  • Short/Int/Long array
  • String/Binary array
  • Boolean array
  • Byte array
  • Basic struct array support (for root)

Once complete, a PR will be raised for all of the above to provide a complete and usable writer (though lacking in features; see below).

Subsequent features

The following items will be added in smaller PRs once the base writer code is merged to main.

  • Asynchronous interface
  • Compression
    • Zlib
    • Snappy
    • Lzo
    • Lz4
    • Zstd
  • Statistics
    • Int
    • Double
    • String
    • Bucket
    • Decimal
    • Date
    • Binary
    • Timestamp
  • Dictionary array
  • Run length array
  • Decimal array
  • Date array
  • Timestamp array
  • Compound array
    • Union array
    • Map array
    • List array
    • Struct array
  • Index streams
    • Row group index
    • Bloom filters
  • Extension configuration (see Java config for examples)
  • User metadata
  • Arrow type hint (when writing with this Arrow -> ORC writer, encode the original Arrow type in metadata so when reading, we can recreate original Arrow array)
  • TODO: other Arrow types

Support selection pruning

Make use of file statistics, stripe statistics, column statistics, row group indexes, and bloom filters

We need a way to expose this functionality so users (like DataFusion) can utilize it to efficiently query large ORC files, e.g. via predicate pushdown. A small illustration follows.
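
As a small illustration of what statistics-based pruning enables (the types and fields here are hypothetical, not the crate's API):

/// Hypothetical min/max statistics for an integer column within one stripe.
struct IntColumnStats {
    min: i64,
    max: i64,
}

/// Returns false when the stripe can be skipped for a predicate `col > value`,
/// because even its maximum cannot satisfy the comparison.
fn stripe_may_match_gt(stats: &IntColumnStats, value: i64) -> bool {
    stats.max > value
}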

Benchmarks

To make performance improvements measurable, include benchmarks that can be run.

Requires benchmark programs (see https://github.com/apache/arrow-rs/tree/master/parquet/benches)

We also need large data files, ideally covering all supported data types.

Note that for the data files, completely random data may not be sufficient, as some encodings take advantage of patterns in the data (e.g. int v2 RLE), so keep that in mind if generating data for the benchmarks.

We could also use something like TPC-H or TPC-DS data, or the NYC taxi dataset, for more variety in the data. A sketch of a possible benchmark follows.
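
A hedged sketch of what such a benchmark could look like, assuming criterion is used (as in the linked parquet benches); the data file path is hypothetical:

use std::fs::File;

use criterion::{criterion_group, criterion_main, Criterion};
use orc_rust::arrow_reader::ArrowReaderBuilder;

fn bench_full_read(c: &mut Criterion) {
    c.bench_function("read alltypes", |b| {
        b.iter(|| {
            // Read every record batch from the benchmark file.
            let file = File::open("benches/data/alltypes.orc").unwrap();
            let reader = ArrowReaderBuilder::try_new(file).unwrap().build();
            for batch in reader {
                batch.unwrap();
            }
        })
    });
}

criterion_group!(benches, bench_full_read);
criterion_main!(benches);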

Out of spec, message: combined patch width and value width cannot exceed the size of the integer type being decoded

this check:

if (patch_bit_width + value_bit_width) > (N::BYTE_SIZE * 8) {
    return OutOfSpecSnafu {
        msg: "combined patch width and value width cannot exceed the size of the integer type being decoded",
    }
    .fail();
}

can be tripped while reading files created with pyorc. To reproduce:

wget https://softwareheritage.s3.amazonaws.com/graph/2024-05-16/orc/revision/revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc

(sorry, it's 4GB. I don't have a smaller example on hand)

then check out 28e911b (a commit from #96, as it's the only way not to hit an overflow crash before this bug) and apply this patch:

diff --git a/Cargo.toml b/Cargo.toml
index 2af67e1..ecec249 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -69,6 +70,10 @@ required-features = ["datafusion"]
 # Some issue when publishing and path isn't specified, so adding here
 path = "./examples/datafusion_integration.rs"
 
+[[example]]
+name = "repro"
+required-features = ["cli"]
+
 [[bin]]
 name = "orc-metadata"
 required-features = ["cli"]
diff --git a/src/reader/decode/rle_v2/patched_base.rs b/src/reader/decode/rle_v2/patched_base.rs
index c33815b..c149ea4 100644
--- a/src/reader/decode/rle_v2/patched_base.rs
+++ b/src/reader/decode/rle_v2/patched_base.rs
@@ -53,6 +53,7 @@ impl<N: NInt, R: Read> RleReaderV2<N, R> {
             .fail();
         }
         if (patch_bit_width + value_bit_width) > (N::BYTE_SIZE * 8) {
+            eprintln!("patch_bit_width= {} value_bit_width= {} N::BYTE_SIZE= {}", patch_bit_width, value_bit_width, N::BYTE_SIZE);
             return OutOfSpecSnafu {
                 msg: "combined patch width and value width cannot exceed the size of the integer type being decoded",
             }

then run this code with ./revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc as parameter:

use std::fs::File;
use std::path::PathBuf;
use std::sync::Arc;

use anyhow::{Context, Result};
use arrow::datatypes::{DataType, Decimal128Type, DecimalType, Schema};
use orc_rust::arrow_reader::ArrowReaderBuilder;
use orc_rust::projection::ProjectionMask;
//use rayon::prelude::*;

fn transform_schema(schema: &Schema) -> Arc<Schema> {
    Arc::new(Schema::new(
        schema
            .fields()
            .iter()
            .cloned()
            .map(|field| match field.data_type() {
                DataType::Timestamp(_, _) => (*field)
                    .clone()
                    //.with_data_type(DataType::Timestamp(TimeUnit::Microsecond, tz.clone())),
                    .with_data_type(DataType::Decimal128(Decimal128Type::MAX_SCALE as _, 9)),
                _ => (*field).clone(),
            })
            .collect::<Vec<_>>(),
    ))
}

pub fn main() -> Result<()> {
    std::env::args()
        .skip(1)
        .collect::<Vec<_>>()
        //.into_par_iter()
        .into_iter()
        .try_for_each(|arg| {
            let file_path = PathBuf::from(arg);
            println!("reading {}", file_path.display());
            let file = File::open(&file_path)?;
            let reader_builder = ArrowReaderBuilder::try_new(file)?;
            let projection = ProjectionMask::named_roots(
                reader_builder.file_metadata().root_data_type(),
                ["date"].as_slice(),
            );
            let reader_builder = reader_builder
                .with_projection(projection)
                .with_batch_size(1024);
            let schema = transform_schema(&reader_builder.schema());
            let reader = reader_builder.with_schema(schema).build();
            for (i, chunk) in reader.enumerate() {
                let chunk = chunk.with_context(|| {
                    format!("Could not read chunk {} of {}", i, file_path.display())
                })?;
                //println!("{:?}", chunk);
            }

            Ok(())
        })
}

which prints:

reading /srv/softwareheritage/ssd/data/vlorentz/datasets/2024-05-16/orc/revision/revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc
patch_bit_width= 40 value_bit_width= 30 N::BYTE_SIZE= 8
Error: Could not read chunk 29525 of /srv/softwareheritage/ssd/data/vlorentz/datasets/2024-05-16/orc/revision/revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc

Caused by:
    0: External error: Out of spec, message: combined patch width and value width cannot exceed the size of the integer type being decoded
    1: Out of spec, message: combined patch width and value width cannot exceed the size of the integer type being decoded

Stack backtrace:
   0: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
   1: repro::main
   2: std::sys_common::backtrace::__rust_begin_short_backtrace
   3: std::rt::lang_start::{{closure}}
   4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:284:13
   5: std::panicking::try::do_call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:552:40
   6: std::panicking::try
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:516:19
   7: std::panic::catch_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panic.rs:142:14
   8: std::rt::lang_start_internal::{{closure}}
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/rt.rs:148:48
   9: std::panicking::try::do_call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:552:40
  10: std::panicking::try
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:516:19
  11: std::panic::catch_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panic.rs:142:14
  12: std::rt::lang_start_internal
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/rt.rs:148:20
  13: main
  14: __libc_start_main
             at ./csu/../csu/libc-start.c:308:16
  15: _start

Tracking issue for error handling

Minimize the use of unwraps in the codebase.

Errors should make visible to the user what failed and at what stage:

  • e.g. instead of saying "EOF reached", say "EOF reached when reading data for X column in Y stripe"
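
A hedged sketch of an error variant carrying that context, using snafu (which appears in code elsewhere on this page); the variant and field names are hypothetical:

use snafu::Snafu;

#[derive(Debug, Snafu)]
pub enum OrcError {
    #[snafu(display("EOF reached when reading data for column {column} in stripe {stripe}"))]
    UnexpectedEof { column: String, stripe: usize },
}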

Implement RLEv2 reader using generics

Currently the RLEv2 reader implements unsigned/signed reading using the same base code, but with a boolean to indicate whether the values are signed.

https://github.com/datafusion-contrib/datafusion-orc/blob/ebe96e070eafcb797cb33cb02aa2c935767a4825/src/reader/decode/rle_v2.rs

I want to explore using generics to enable this behaviour without that runtime check (when we construct the RLE reader we already know whether we want signed values, so a runtime check during decoding shouldn't be needed). A sketch follows.

Also, I'm unsure how well the current code handles overflows/saturating adds and subtracts.
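
A minimal sketch of the idea (the NInt name echoes later code on this page, but the trait method here is hypothetical):

/// Integer element type for the RLE reader; signedness is a property of the
/// type, so no runtime boolean is needed.
trait NInt: Copy {
    /// Convert the raw unsigned value read from the stream into the final value.
    fn from_raw(raw: u64) -> Self;
}

impl NInt for u64 {
    fn from_raw(raw: u64) -> Self {
        raw
    }
}

impl NInt for i64 {
    fn from_raw(raw: u64) -> Self {
        // Zigzag decode for signed integers.
        ((raw >> 1) as i64) ^ -((raw & 1) as i64)
    }
}

struct RleReaderV2<N: NInt> {
    decoded: Vec<N>,
}

impl<N: NInt> RleReaderV2<N> {
    fn push_raw(&mut self, raw: u64) {
        // Monomorphised per element type; the signed/unsigned branch disappears.
        self.decoded.push(N::from_raw(raw));
    }
}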

`cargo check --all-targets` fails because `stripe::Stripe` is private

:) cargo c
    Checking orc-rust v0.3.0 (/home/xuanwo/Code/datafusion-contrib/datafusion-orc)
error[E0603]: struct `Stripe` is private
  --> src/bin/orc-metadata.rs:4:57
   |
4  | use orc_rust::{reader::metadata::read_metadata, stripe::Stripe};
   |                                                         ^^^^^^ private struct
   |
note: the struct `Stripe` is defined here
  --> /home/xuanwo/Code/datafusion-contrib/datafusion-orc/src/stripe.rs:91:1
   |
91 | pub(crate) struct Stripe {
   | ^^^^^^^^^^^^^^^^^^^^^^^^

For more information about this error, try `rustc --explain E0603`.
error: could not compile `orc-rust` (bin "orc-metadata" test) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `orc-rust` (bin "orc-metadata") due to 1 previous error

Should we export it? The orc-metadata binary fails to build because of this.

`unreachable!()` in rle_v2_decode_bit_width is reachable

the unreachable!() statement in this function:

pub fn rle_v2_decode_bit_width(encoded: u8) -> usize {
    debug_assert!(encoded < 32, "encoded bit width cannot exceed 5 bits");
    match encoded {
        0..=23 => encoded as usize + 1,
        27 => 32,
        28 => 40,
        29 => 48,
        30 => 56,
        31 => 64,
        _ => unreachable!(),
    }
}

can be reached while reading files created with pyorc. To reproduce:

wget https://softwareheritage.s3.amazonaws.com/graph/2023-09-06/orc/release/release-00d4739e-c723-4843-863e-e4a895c58005.orc

then run this code with ./release-00d4739e-c723-4843-863e-e4a895c58005.orc as parameter:

use std::fs::File;
use std::path::PathBuf;

use anyhow::Result;
use orc_rust::arrow_reader::ArrowReaderBuilder;
use orc_rust::projection::ProjectionMask;


pub fn main() -> Result<()> {
    let file_path = PathBuf::from(std::env::args().skip(1).next().unwrap());
    println!("reading {}", file_path.display());
    let file = File::open(&file_path)?;
    let reader_builder = ArrowReaderBuilder::try_new(file)?;
    let projection = ProjectionMask::named_roots(
        reader_builder.file_metadata().root_data_type(),
        ["date"].as_slice(),
    );
    let reader = reader_builder
        .with_projection(projection)
        .with_batch_size(10)
        .build();
    for (i, _) in reader.enumerate() {
        println!("chunk {}", i);
    }

    Ok(())
}

and this small patch to datafusion-orc:

diff --git a/src/reader/decode/util.rs b/src/reader/decode/util.rs
index 468bd7d..2d5bad6 100644
--- a/src/reader/decode/util.rs
+++ b/src/reader/decode/util.rs
@@ -231,7 +231,7 @@ pub fn rle_v2_decode_bit_width(encoded: u8) -> usize {
         29 => 48,
         30 => 56,
         31 => 64,
-        _ => unreachable!(),
+        _ => unreachable!("rle_v2_decode_bit_width({})", encoded),
     }
 }
 

which prints:

[...]
chunk 1715
chunk 1716
chunk 1717
chunk 1718
chunk 1719
chunk 1720
chunk 1721
chunk 1722
chunk 1723
chunk 1724
thread 'main' panicked at /home/vlorentz/datafusion-orc/src/reader/decode/util.rs:234:14:
internal error: entered unreachable code: rle_v2_decode_bit_width(26)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: orc_rust::reader::decode::util::rle_v2_decode_bit_width
             at /home/vlorentz/datafusion-orc/src/reader/decode/util.rs:234:14
   3: orc_rust::reader::decode::rle_v2::patched_base::<impl orc_rust::reader::decode::rle_v2::RleReaderV2<N,R>>::read_patched_base
             at /home/vlorentz/datafusion-orc/src/reader/decode/rle_v2/patched_base.rs:32:31
   4: orc_rust::reader::decode::rle_v2::RleReaderV2<N,R>::decode_batch
             at /home/vlorentz/datafusion-orc/src/reader/decode/rle_v2/mod.rs:38:42
   5: <orc_rust::reader::decode::rle_v2::RleReaderV2<N,R> as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/reader/decode/rle_v2/mod.rs:53:19
   6: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:1949:9
   7: <&mut I as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/traits/iterator.rs:4169:9
   8: <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/zip.rs:166:21
   9: <core::iter::adapters::zip::Zip<A,B> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/zip.rs:85:9
  10: <orc_rust::arrow_reader::column::timestamp::TimestampIterator as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/arrow_reader/column/timestamp.rs:31:13
  11: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:1949:9
  12: orc_rust::arrow_reader::decoder::PrimitiveArrayDecoder<T>::next_primitive_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:65:35
  13: <orc_rust::arrow_reader::decoder::timestamp::TimestampOffsetArrayDecoder as orc_rust::arrow_reader::decoder::ArrayBatchDecoder>::next_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/timestamp.rs:109:21
  14: orc_rust::arrow_reader::decoder::NaiveStripeDecoder::inner_decode_next_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:408:25
  15: orc_rust::arrow_reader::decoder::NaiveStripeDecoder::decode_next_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:420:22
  16: <orc_rust::arrow_reader::decoder::NaiveStripeDecoder as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:288:26
  17: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:1949:9
  18: <orc_rust::arrow_reader::ArrowReader<R> as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/arrow_reader/mod.rs:159:23
  19: <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/enumerate.rs:47:17
  20: repro::main
             at ./rust/src/bin/repro.rs:22:19
  21: core::ops::function::FnOnce::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
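
For reference, the ORCv1 spec's bit-width encoding table (mirrored by the Java implementation's FixedBitSizes) appears to also assign encoded values 24, 25 and 26 to widths 26, 28 and 30, which the match above omits; the rle_v2_decode_bit_width(26) panic would correspond to a pyorc file using a 30-bit width. A hedged sketch of the complete mapping under that assumption:

pub fn rle_v2_decode_bit_width(encoded: u8) -> usize {
    match encoded {
        0..=23 => encoded as usize + 1,
        24 => 26,
        25 => 28,
        26 => 30,
        27 => 32,
        28 => 40,
        29 => 48,
        30 => 56,
        31 => 64,
        // The encoded width occupies 5 bits, so values >= 32 cannot occur in a
        // well-formed header.
        _ => unreachable!("encoded bit width cannot exceed 5 bits"),
    }
}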

Consider switching to trunk based workflow?

First of all, sorry for the slow progress, I do still intend to work on this but have just been juggling other items at the moment.

I was considering making the workflow of this repo more trunk based, where those with write access can just push direct to main without requiring a PR and approval.

Benefits:

  • small set of contributors at the moment, so we avoid the overhead of requiring a PR for every change
  • this project doesn't need to worry about stability at the moment, so we can move faster

Thoughts? @waynexia @WenyXu

DataFusion integration

I implemented a very shallow example integration with DataFusion:

8c68f47

Will want to flesh this out and move this code into src, so we can provide support for things like predicate pushdown from DataFusion's point of view.

Could split this repo into two crates: one focused on reading to Arrow (arrow-orc, akin to parquet in arrow-rs) and another for DataFusion integration code (datafusion-orc), such as a trait which implements ExecutionPlan, etc.

Support nested projection

Be able to specifically project child columns of struct columns, instead of only being able to project the entire struct column at the root level.

Refactor top-level interface

(Where top-level interface refers to how DataFusion will use this library to read ORC files as that is the main intention of the crate)

Since we want this library to integrate with DataFusion, we should try to provide a cleaner interface for it to read ORC files as record batches.

The current way:

fn new_arrow_reader(path: &str, fields: &[&str]) -> ArrowReader<File> {
    let f = File::open(path).expect("no file found");
    let reader = Reader::new(f).unwrap();
    let cursor = Cursor::new(reader, fields).unwrap();
    ArrowReader::new(cursor, None)
}

  • Create reader (our struct) from file
  • Create cursor (our struct) from reader
  • Create arrow reader (our struct) from cursor

Similar can be said for async version.

We can take inspiration from how parquet does it:

User Story: Databend uses datafusion-orc for its async support

Hello, everyone! Thank you for this fantastic crate. I'm writing to share how databend is utilizing datafusion-orc for its async support.

With datafusion-orc, Databend now supports importing data from ORC files. Our colleague @youngsofun has been diligently working on this project. We plan to contribute our improvements upstream in the near future. Stay tuned!

Refactor metadata into our own classes

Currently the file metadata:

pub struct FileMetadata {
    pub postscript: PostScript,
    pub footer: Footer,
    pub metadata: Metadata,
    pub stripe_footers: Vec<StripeFooter>,
}

And stripe metadata:

pub struct Stripe {
pub(crate) footer: Arc<StripeFooter>,
pub(crate) columns: Vec<Column>,
pub(crate) stripe_offset: usize,
pub(crate) info: StripeInformation,
/// <(ColumnId, Kind), Bytes>
pub(crate) stream_map: Arc<StreamMap>,
}

These rely directly on the proto structs/types.

I will work on refactoring to add our own versions of these structs/types as a decoupling layer, and potentially to provide a nicer interface for what we need from the metadata.
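
A hedged sketch of the shape this could take (the field and type names are hypothetical, not the crate's eventual API):

use std::collections::HashMap;

/// Decoupled per-stripe description, copied out of the proto messages once.
struct StripeMetadata {
    offset: u64,
    number_of_rows: u64,
}

/// Decoupled file metadata, exposing only what readers actually need.
struct FileMetadata {
    number_of_rows: u64,
    stripes: Vec<StripeMetadata>,
    user_metadata: HashMap<String, Vec<u8>>,
}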

Requires nightly to compile

It seems we currently rely on some unstable features. Should we continue to do so and require nightly, or change to stable to be similar to arrow-rs?

Support decoding Union columns with up to and including 256 variants

According to https://orc.apache.org/specification/ORCv1/

Currently ORC union types are limited to 256 variants, which matches the Hive type model.

However in Arrow, UnionArrays are limited to 127 variants: https://arrow.apache.org/docs/format/Columnar.html#union-layout

A union with more than 127 possible types can be modeled as a union of unions.

To support this, we would need to do as the spec suggests and decode into a union of unions.

See initial Union support here: ee69b91

Re-evaluate exposed public types

We will need to go through the codebase and see which traits/structs/functions we want to expose and which to keep private. I'd like to keep the public interface as minimal as possible, to allow us to change the internals more frequently and easily. Not too high priority, as I consider the repo a bit experimental at the moment (and we aren't publishing yet, but we will need to consider this too).

Dictionary array handling

Spun off from discussion in #68

Find a way to preserve dictionary encoding in DictionaryArray without having to cast to StringArray, as introduced here:

// Cast back to StringArray to ensure all stripes have consistent datatype
// TODO: Is there anyway to preserve the dictionary encoding?
// This costs performance.
let array = cast(&array, &DataType::Utf8).context(ArrowSnafu)?;

Some reference from similar parquet issue: apache/arrow-rs#171

Switch codecov secret

This project used to use codecov to collect and show unit test coverage.

- name: Codecov upload
  uses: codecov/codecov-action@v2
  with:
    token: ${{ secrets.CODECOV_TOKEN }}
    files: ./lcov.info
    flags: rust
    fail_ci_if_error: false
    verbose: true

To make it work in this repository, we need to switch the repo in codecov as well. The codecov app is installed for this repo, but I can't generate a secret in codecov. Maybe this is because of some propagation delay; I'll check again tomorrow.

Short-term roadmap for this implementation

Previous discussion: apache/datafusion#4707

Though the ORC format is not as widely used as Parquet in arrow-rs and DataFusion related projects, there is still some (growing, it feels to me) interest in and demand for this format. As @Jefffrey said here, a noticeable and viable milestone for this project is being merged into arrow-rs. This draft roadmap is raised to help us discuss, arrange, and direct our efforts toward that milestone.

Even though the ORC format is less complex than Parquet, there is still much work to do in various areas. Here is a list of functionality that needs to be done if we consider making ORC files queryable from DataFusion the primary use case at this stage. Please feel free to add/remove/set priorities on these items. It's likely that we can't finish all of them in the short term, so marking what is going to be done is also important.

  • primitive data types (ORC refs)
    • tiny int #22
    • timestamp with local time zone #13
    • decimal #18
  • common compression methods #10
  • user metadata
  • other encodings #24
  • Benchmark #8

The items below are also related, but with lower priority:

  • compound data types #14
    • struct #26
    • list
    • map
    • union
  • file metadata and statistics
  • pruning #15

Long term items:

  • encryption

Then there are some things I'm not sure about and am looking for more information on. Also feel free to change the previous two lists.

Fix Timestamp to align with ORC spec

The ORC timestamp type is not straightforward: although it apparently represents a timestamp without a timezone, its encoding and decoding still depend on the writer timezone (encoded in the stripe) and the reader timezone.

This is why the 1900 & 2038 integration tests are currently disabled as our implementation is incorrect:

#[test]
#[ignore] // TODO: Incorrect timezone + representation differs
fn test_date_1900() {
    test_expected_file("TestOrcFile.testDate1900");
}

#[test]
#[ignore] // TODO: Incorrect timezone + representation differs
fn test_date_2038() {
    test_expected_file("TestOrcFile.testDate2038");
}
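
For background, a hedged sketch of the naive decoding (with no timezone adjustment, which is exactly what this issue is about): per the ORCv1 spec, seconds are stored relative to 2015-01-01 00:00:00, and the nanosecond field strips trailing decimal zeros into its low 3 bits (the spec's example encodes 1000 ns as 0x0a):

/// Seconds between the Unix epoch and the ORC timestamp base of 2015-01-01 00:00:00 UTC.
const ORC_TIMESTAMP_BASE_SECONDS: i64 = 1_420_070_400;

fn decode_nanos(encoded: u64) -> u64 {
    let zeros = encoded & 0x7;
    let value = encoded >> 3;
    if zeros == 0 {
        value
    } else {
        // `zeros` records the number of stripped trailing zeros minus one.
        value * 10u64.pow(zeros as u32 + 1)
    }
}

/// Naive conversion to Arrow's Timestamp(Nanosecond) representation; the
/// missing piece is adjusting for the writer/reader timezones.
fn naive_timestamp_nanos(seconds: i64, encoded_nanos: u64) -> i64 {
    (seconds + ORC_TIMESTAMP_BASE_SECONDS) * 1_000_000_000 + decode_nanos(encoded_nanos) as i64
}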

Executing the statement `select count(*)` always returns 0

The following always reports a count of 0:

    let ctx = SessionContext::new();
    ctx.register_orc(
        "table1",
        "tests/basic/data/alltypes.snappy.orc",
        OrcReadOptions::default(),
    )
    .await?;

    ctx.sql("select count(*) from table1")
        .await?
        .show()
        .await?;

Update README

Details to include:

  • Roadmap (short term and long term)
    • Vision is for this to eventually be donated to arrow-rs and/or datafusion
    • Also status of support for ORC features (such as bloom filters, statistics, etc.)
  • Guarantees about stability (aka none)
  • Contributing (guidelines and how to get started)

etc.

Bug: Fail to read column of type `array<float>`

What's wrong?

A panic is encountered when reading a column containing float values under array or map structures, such as:

  • array<float>
  • array<tuple<float, int>>

thread 'basic_test_nested_array_float' panicked at src/array_decoder/mod.rs:71:30:
array less than expected length

How to reproduce?

Add the following code to tests/basic/data/write.py:

nested_array_float = {
    "value": [
        [1.0, 2.0],
        [None, 2.0],
    ],
}

_write("struct<value:array<float>>", nested_array_float, "nested_array_float.orc")

Run the test against it, like the existing nested_array.orc test.

Reason

FloatIter uses the number of rows instead of the number of leaf values.

Refer to the code here: array_decoder/mod.rs

Quick Fix

FloatIter does not have to guard against the number of values; it can simply read until the end of the bytes in memory, similar to how integers are handled. A sketch follows after the PR link.

pr: #112
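
A hedged sketch of that quick-fix idea, assuming floats are laid out as 4-byte little-endian IEEE-754 values in the decompressed stream (the names are illustrative, not the crate's internals):

use std::io::Read;

struct FloatIter<R: Read> {
    reader: R,
}

impl<R: Read> Iterator for FloatIter<R> {
    type Item = std::io::Result<f32>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = [0u8; 4];
        match self.reader.read_exact(&mut buf) {
            // Stop cleanly when the in-memory bytes are exhausted instead of
            // tracking an expected row count.
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => None,
            Err(e) => Some(Err(e)),
            Ok(()) => Some(Ok(f32::from_le_bytes(buf))),
        }
    }
}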

Some questions about Column::number_of_rows

Is the Column::number_of_rows intended to represent number_of_values? If so, this code seems incorrect:

pub fn children(&self) -> Vec<Column> {
    .... 
    DataType::List { child, .. } => {
        vec![Column {
            number_of_rows: self.number_of_rows,
            footer: self.footer.clone(),
            name: "item

In my understanding, we have no way to derive the number of values from the parent column. However, we can retrieve the number of values from the stripe metadata; this requires a refactor.

Here’s an ugly but workable fix: Commit

Rigorous ORC integration tests

Integration tests added by #65

However, we have to compare actual vs expected data in JSON format, since that is how it is encoded in the Apache ORC repo.

An alternative would be to use the pyarrow/arrow ORC implementation to generate the expected files in a Parquet or Arrow file format, which could be more rigorous than JSON.

We lose some visibility on the expected data, but since these are integration tests with data from the Apache ORC repo, they wouldn't change often (if at all) anyway.
