
datafusion-orc's Introduction


orc-rust

A native Rust implementation of the Apache ORC file format, providing APIs to read data into Apache Arrow in-memory arrays.

See the documentation for examples on how to use this crate.
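
A minimal sketch of reading an ORC file into Arrow record batches, based on the ArrowReaderBuilder API that appears in the issues later on this page (the file name is hypothetical, and anyhow is used only for brevity):

use std::fs::File;

use anyhow::Result;
use orc_rust::arrow_reader::ArrowReaderBuilder;

fn main() -> Result<()> {
    let file = File::open("example.orc")?;
    let reader = ArrowReaderBuilder::try_new(file)?
        .with_batch_size(1024)
        .build();
    for batch in reader {
        // Each item is a Result wrapping an arrow RecordBatch.
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}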

Supported features

This crate currently supports only reading ORC files into Arrow arrays; write support is planned (see Roadmap). The features listed below relate only to reading ORC files. At this time, we aim to support only the ORCv1 specification.

  • Read synchronously & asynchronously (using Tokio)
  • All compression types (Zlib, Snappy, Lzo, Lz4, Zstd)
  • All ORC data types
  • All encodings
  • Rudimentary support for retrieving statistics
  • Retrieving user metadata into Arrow schema metadata

Roadmap

The long term vision for this crate is to be feature complete enough to be donated to the arrow-rs project.

The following lists the rough roadmap for features to be implemented, from highest to lowest priority.

  • Performance enhancements
  • DataFusion integration
  • Predicate pushdown
  • Row indices
  • Bloom filters
  • Write from Arrow arrays
  • Encryption

A non-Arrow API interface is not planned at the moment. Feel free to raise an issue if there is such a use case.

Version compatibility

No guarantees are provided about stability across versions. We will endeavour to keep the top-level APIs (ArrowReader and ArrowStreamReader) as stable as we can, but other APIs may change as we explore the interface we want the library to expose.

Versions will be released on an ad-hoc basis (with no fixed schedule).

Mapping ORC types to Arrow types

The following table lists how ORC data types are read into Arrow data types:

ORC Data Type      Arrow Data Type              Notes
Boolean            Boolean
TinyInt            Int8
SmallInt           Int16
Int                Int32
BigInt             Int64
Float              Float32
Double             Float64
String             Utf8
Char               Utf8
VarChar            Utf8
Binary             Binary
Decimal            Decimal128
Date               Date32
Timestamp          Timestamp(Nanosecond, None)  ¹
Timestamp instant  Timestamp(Nanosecond, UTC)   ¹
Struct             Struct
List               List
Map                Map
Union              Union(_, Sparse)             ²

¹: ArrowReaderBuilder::with_schema allows configuring a different time unit or decoding to Decimal128(38, 9) (an i128 of non-leap nanoseconds since the UNIX epoch). Overflow may occur when decoding to a non-Seconds time unit and results in OrcError. Loss of precision may occur when decoding to a non-Nanosecond time unit and also results in OrcError. Decimal128(38, 9) avoids both overflow and loss of precision.

²: Currently only supports a maximum of 127 variants
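
As a sketch of footnote ¹, the schema passed to ArrowReaderBuilder::with_schema can be derived from the builder's inferred schema with the timestamp columns rewritten to the desired type. This mirrors the transform_schema helper in a repro later on this page; the function name here is illustrative:

use std::sync::Arc;

use arrow::datatypes::{DataType, Schema, TimeUnit};

fn with_microsecond_timestamps(schema: &Schema) -> Arc<Schema> {
    Arc::new(Schema::new(
        schema
            .fields()
            .iter()
            .cloned()
            .map(|field| match field.data_type() {
                // Could instead be DataType::Decimal128(38, 9) to avoid both
                // overflow and loss of precision.
                DataType::Timestamp(_, tz) => (*field)
                    .clone()
                    .with_data_type(DataType::Timestamp(TimeUnit::Microsecond, tz.clone())),
                _ => (*field).clone(),
            })
            .collect::<Vec<_>>(),
    ))
}

The resulting schema is then applied with reader_builder.with_schema(with_microsecond_timestamps(&reader_builder.schema())) before calling build().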

Contributing

All contributions are welcome! Feel free to raise an issue if you have a feature request, bug report, or a question. Feel free to raise a Pull Request without raising an issue first, as long as the Pull Request is descriptive enough.

Some tools we use in addition to the standard cargo commands, and which require installation, are:

cargo install typos-cli
cargo install taplo-cli
# Building the crate
cargo build

# Running the test suite
cargo test

# Simple benchmarks
cargo bench

# Formatting TOML files
taplo format

# Detect any typos in the codebase
typos

To regenerate/update the proto.rs file, execute the regen.sh script.

./regen.sh

datafusion-orc's People

Contributors

alamb, harveyyue, jefffrey, klangner, progval, v0y4g3r, waynexia, wenyxu, xuanwo, youngsofun


datafusion-orc's Issues

Convert date/timestamp data direct to arrow without chrono

Currently in the date and timestamp arrow readers, we read the values and convert them to NaiveDate/NaiveDateTime values, which are then used to construct the relevant Arrow arrays.

We can skip this intermediate conversion to chrono types and instead convert the ORC date/timestamp values directly into the underlying Arrow representation for dates/timestamps (with any necessary bit/byte manipulation).
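
As a rough sketch of the idea for dates, assuming the ORC date representation of days since the Unix epoch (which matches Arrow's Date32 unit); the function name is hypothetical:

/// ORC date values are days since 1970-01-01, the same unit Arrow's Date32
/// uses, so no chrono round-trip is needed (a range check could be added).
fn orc_date_to_date32(days: i64) -> i32 {
    days as i32
}

Timestamps can similarly be combined into the Arrow epoch-based representation, though they carry extra subtleties (see the timestamp issue later on this page).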

Integer RLE v1 decoding support

I have some existing code for Integer RLE v1 decoding, so I'll work on porting it over.

This should also enable support for the DIRECT (v1) and DICTIONARY (v1) encodings. A rough sketch of the decoding follows.
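
For reference, a rough sketch of unsigned RLE v1 decoding as described by the ORCv1 spec: a control byte of 0..=127 introduces a run of control + 3 values generated from a base varint and a signed delta byte, while a negative control byte introduces -control literal varints. Signed columns additionally zigzag-decode each value; the helper names here are hypothetical:

use std::io::Read;

/// Unsigned base-128 varint, as used by ORC for integer values.
fn read_varint<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn decode_rle_v1<R: Read>(reader: &mut R, out: &mut Vec<u64>) -> std::io::Result<()> {
    let mut header = [0u8; 1];
    while reader.read_exact(&mut header).is_ok() {
        let control = header[0] as i8;
        if control >= 0 {
            // Run: `control + 3` values starting at `base`, stepping by `delta`.
            let length = i64::from(control) + 3;
            let mut delta = [0u8; 1];
            reader.read_exact(&mut delta)?;
            let delta = i64::from(delta[0] as i8);
            let base = read_varint(reader)? as i64;
            for i in 0..length {
                out.push((base + i * delta) as u64);
            }
        } else {
            // Literals: `-control` varint-encoded values.
            for _ in 0..-i64::from(control) {
                out.push(read_varint(reader)?);
            }
        }
    }
    Ok(())
}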

More Integer RLE tests

With the refactoring introduced by #57, I'd like to add a lot of tests for RLE behaviour. We will need to use the Apache ORC repo implementation to generate the test cases (using it as our oracle).

I want to get to a state where we are 100% certain that the integer decoding behaviour is aligned with the specification.

Write ORC from arrow recordbatches

Not a focus right now; just raising this issue here for tracking.

Currently in progress.

Initial support

Tracked by initial-write-support branch

Checklist:

  • High level ArrowWriter synchronous interface (accepts RecordBatches to write)
  • Basic configuration via builder
  • Stripe writer
  • Metadata writer
  • Value encoding
    • Integer RLEv2
      • Short repeat
      • Direct
      • Delta
      • Patched base
    • Base 128 varint
    • Byte RLE
  • Encode nullability
  • Float/Double array
  • Short/Int/Long array
  • String/Binary array
  • Boolean array
  • Byte array
  • Basic struct array support (for root)

Once complete, a PR will be raised for all of the above to provide a complete and usable writer (though lacking in features; see below).

Subsequent features

The following items will be added in smaller PRs once the base writer code is merged to main.

  • Asynchronous interface
  • Compression
    • Zlib
    • Snappy
    • Lzo
    • Lz4
    • Zstd
  • Statistics
    • Int
    • Double
    • String
    • Bucket
    • Decimal
    • Date
    • Binary
    • Timestamp
  • Dictionary array
  • Run length array
  • Decimal array
  • Date array
  • Timestamp array
  • Compound array
    • Union array
    • Map array
    • List array
    • Struct array
  • Index streams
    • Row group index
    • Bloom filters
  • Extension configuration (see Java config for examples)
  • User metadata
  • Arrow type hint (when writing with this Arrow -> ORC writer, encode the original Arrow type in metadata so when reading, we can recreate original Arrow array)
  • TODO: other Arrow types

Support selection pruning

Make use of file statistics, stripe statistics, column statistics, row group indexes, and bloom filters

We need a way to expose this functionality so users (like DataFusion) can utilize it to efficiently query large ORC files, e.g. via predicate pushdown. A small illustration follows.
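
As a small illustration of what statistics-based pruning enables (the types and fields here are hypothetical, not the crate's API):

/// Hypothetical min/max statistics for an integer column within one stripe.
struct IntColumnStats {
    min: i64,
    max: i64,
}

/// Returns false when the stripe can be skipped for a predicate `col > value`,
/// because even its maximum cannot satisfy the comparison.
fn stripe_may_match_gt(stats: &IntColumnStats, value: i64) -> bool {
    stats.max > value
}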

Benchmarks

To make performance improvements measurable, include benchmarks that can be run.

Requires benchmark programs (see https://github.com/apache/arrow-rs/tree/master/parquet/benches)

We also need large data files, ideally covering all supported data types.

Note that for the data files, completely random data may not be sufficient, as some encodings take advantage of patterns in the data (e.g. int v2 RLE), so keep that in mind if generating data for the benchmarks.

We could also use something like TPC-H or TPC-DS data, or the NYC taxi dataset, for more variety in the data. A sketch of a possible benchmark follows.
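
A hedged sketch of what such a benchmark could look like, assuming criterion is used (as in the linked parquet benches); the data file path is hypothetical:

use std::fs::File;

use criterion::{criterion_group, criterion_main, Criterion};
use orc_rust::arrow_reader::ArrowReaderBuilder;

fn bench_full_read(c: &mut Criterion) {
    c.bench_function("read alltypes", |b| {
        b.iter(|| {
            // Read every record batch from the benchmark file.
            let file = File::open("benches/data/alltypes.orc").unwrap();
            let reader = ArrowReaderBuilder::try_new(file).unwrap().build();
            for batch in reader {
                batch.unwrap();
            }
        })
    });
}

criterion_group!(benches, bench_full_read);
criterion_main!(benches);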

Out of spec, message: combined patch width and value width cannot exceed the size of the integer type being decoded

this check:

if (patch_bit_width + value_bit_width) > (N::BYTE_SIZE * 8) {
    return OutOfSpecSnafu {
        msg: "combined patch width and value width cannot exceed the size of the integer type being decoded",
    }
    .fail();
}

can be tripped while reading files created with pyorc. To reproduce:

wget https://softwareheritage.s3.amazonaws.com/graph/2024-05-16/orc/revision/revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc

(sorry, it's 4GB. I don't have a smaller example on hand)

then check out 28e911b (a commit from #96, as it's the only way not to hit an overflow crash before this bug) and apply this patch:

diff --git a/Cargo.toml b/Cargo.toml
index 2af67e1..ecec249 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -69,6 +70,10 @@ required-features = ["datafusion"]
 # Some issue when publishing and path isn't specified, so adding here
 path = "./examples/datafusion_integration.rs"
 
+[[example]]
+name = "repro"
+required-features = ["cli"]
+
 [[bin]]
 name = "orc-metadata"
 required-features = ["cli"]
diff --git a/src/reader/decode/rle_v2/patched_base.rs b/src/reader/decode/rle_v2/patched_base.rs
index c33815b..c149ea4 100644
--- a/src/reader/decode/rle_v2/patched_base.rs
+++ b/src/reader/decode/rle_v2/patched_base.rs
@@ -53,6 +53,7 @@ impl<N: NInt, R: Read> RleReaderV2<N, R> {
             .fail();
         }
         if (patch_bit_width + value_bit_width) > (N::BYTE_SIZE * 8) {
+            eprintln!("patch_bit_width= {} value_bit_width= {} N::BYTE_SIZE= {}", patch_bit_width, value_bit_width, N::BYTE_SIZE);
             return OutOfSpecSnafu {
                 msg: "combined patch width and value width cannot exceed the size of the integer type being decoded",
             }

then run this code with ./revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc as parameter:

use std::fs::File;
use std::path::PathBuf;
use std::sync::Arc;

use anyhow::{Context, Result};
use arrow::datatypes::{DataType, Decimal128Type, DecimalType, Schema};
use orc_rust::arrow_reader::ArrowReaderBuilder;
use orc_rust::projection::ProjectionMask;
//use rayon::prelude::*;

fn transform_schema(schema: &Schema) -> Arc<Schema> {
    Arc::new(Schema::new(
        schema
            .fields()
            .iter()
            .cloned()
            .map(|field| match field.data_type() {
                DataType::Timestamp(_, _) => (*field)
                    .clone()
                    //.with_data_type(DataType::Timestamp(TimeUnit::Microsecond, tz.clone())),
                    .with_data_type(DataType::Decimal128(Decimal128Type::MAX_SCALE as _, 9)),
                _ => (*field).clone(),
            })
            .collect::<Vec<_>>(),
    ))
}

pub fn main() -> Result<()> {
    std::env::args()
        .skip(1)
        .collect::<Vec<_>>()
        //.into_par_iter()
        .into_iter()
        .try_for_each(|arg| {
            let file_path = PathBuf::from(arg);
            println!("reading {}", file_path.display());
            let file = File::open(&file_path)?;
            let reader_builder = ArrowReaderBuilder::try_new(file)?;
            let projection = ProjectionMask::named_roots(
                reader_builder.file_metadata().root_data_type(),
                ["date"].as_slice(),
            );
            let reader_builder = reader_builder
                .with_projection(projection)
                .with_batch_size(1024);
            let schema = transform_schema(&reader_builder.schema());
            let reader = reader_builder.with_schema(schema).build();
            for (i, chunk) in reader.enumerate() {
                let chunk = chunk.with_context(|| {
                    format!("Could not read chunk {} of {}", i, file_path.display())
                })?;
                //println!("{:?}", chunk);
            }

            Ok(())
        })
}

which prints:

reading /srv/softwareheritage/ssd/data/vlorentz/datasets/2024-05-16/orc/revision/revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc
patch_bit_width= 40 value_bit_width= 30 N::BYTE_SIZE= 8
Error: Could not read chunk 29525 of /srv/softwareheritage/ssd/data/vlorentz/datasets/2024-05-16/orc/revision/revision-0c45576a-59f7-48d1-a9a8-2e5c64098905.orc

Caused by:
    0: External error: Out of spec, message: combined patch width and value width cannot exceed the size of the integer type being decoded
    1: Out of spec, message: combined patch width and value width cannot exceed the size of the integer type being decoded

Stack backtrace:
   0: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
   1: repro::main
   2: std::sys_common::backtrace::__rust_begin_short_backtrace
   3: std::rt::lang_start::{{closure}}
   4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:284:13
   5: std::panicking::try::do_call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:552:40
   6: std::panicking::try
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:516:19
   7: std::panic::catch_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panic.rs:142:14
   8: std::rt::lang_start_internal::{{closure}}
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/rt.rs:148:48
   9: std::panicking::try::do_call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:552:40
  10: std::panicking::try
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:516:19
  11: std::panic::catch_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panic.rs:142:14
  12: std::rt::lang_start_internal
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/rt.rs:148:20
  13: main
  14: __libc_start_main
             at ./csu/../csu/libc-start.c:308:16
  15: _start

Tracking issue for error handling

Minimize the use of unwraps in the codebase.

Errors should make visible to the user what failed and at what stage:

  • e.g. instead of saying "EOF reached", say "EOF reached when reading data for X column in Y stripe"
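
A hedged sketch of an error variant carrying that context, using snafu (which appears in code elsewhere on this page); the variant and field names are hypothetical:

use snafu::Snafu;

#[derive(Debug, Snafu)]
pub enum OrcError {
    #[snafu(display("EOF reached when reading data for column {column} in stripe {stripe}"))]
    UnexpectedEof { column: String, stripe: usize },
}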

Implement RLEv2 reader using generics

Currently the RLEv2 reader implements unsigned/signed reading using the same base code, but with a boolean to indicate whether the values are signed.

https://github.com/datafusion-contrib/datafusion-orc/blob/ebe96e070eafcb797cb33cb02aa2c935767a4825/src/reader/decode/rle_v2.rs

I want to explore using generics to enable this behaviour without that runtime check (when we construct the RLE reader we already know whether we want signed values, so a runtime check during decoding shouldn't be needed). A sketch follows.

Also, I'm unsure how well the current code handles overflows/saturating adds and subtracts.
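
A minimal sketch of the idea (the NInt name echoes later code on this page, but the trait method here is hypothetical):

/// Integer element type for the RLE reader; signedness is a property of the
/// type, so no runtime boolean is needed.
trait NInt: Copy {
    /// Convert the raw unsigned value read from the stream into the final value.
    fn from_raw(raw: u64) -> Self;
}

impl NInt for u64 {
    fn from_raw(raw: u64) -> Self {
        raw
    }
}

impl NInt for i64 {
    fn from_raw(raw: u64) -> Self {
        // Zigzag decode for signed integers.
        ((raw >> 1) as i64) ^ -((raw & 1) as i64)
    }
}

struct RleReaderV2<N: NInt> {
    decoded: Vec<N>,
}

impl<N: NInt> RleReaderV2<N> {
    fn push_raw(&mut self, raw: u64) {
        // Monomorphised per element type; the signed/unsigned branch disappears.
        self.decoded.push(N::from_raw(raw));
    }
}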

`cargo check --all-targets` fails because `stripe::Stripe` is private

:) cargo c
    Checking orc-rust v0.3.0 (/home/xuanwo/Code/datafusion-contrib/datafusion-orc)
error[E0603]: struct `Stripe` is private
  --> src/bin/orc-metadata.rs:4:57
   |
4  | use orc_rust::{reader::metadata::read_metadata, stripe::Stripe};
   |                                                         ^^^^^^ private struct
   |
note: the struct `Stripe` is defined here
  --> /home/xuanwo/Code/datafusion-contrib/datafusion-orc/src/stripe.rs:91:1
   |
91 | pub(crate) struct Stripe {
   | ^^^^^^^^^^^^^^^^^^^^^^^^

For more information about this error, try `rustc --explain E0603`.
error: could not compile `orc-rust` (bin "orc-metadata" test) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `orc-rust` (bin "orc-metadata") due to 1 previous error

Should we export it? The orc-metadata binary fails to build because of this.

`unreachable!()` in rle_v2_decode_bit_width is reachable

the unreachable!() statement in this function:

pub fn rle_v2_decode_bit_width(encoded: u8) -> usize {
    debug_assert!(encoded < 32, "encoded bit width cannot exceed 5 bits");
    match encoded {
        0..=23 => encoded as usize + 1,
        27 => 32,
        28 => 40,
        29 => 48,
        30 => 56,
        31 => 64,
        _ => unreachable!(),
    }
}

can be reached while reading files created with pyorc. To reproduce:

wget https://softwareheritage.s3.amazonaws.com/graph/2023-09-06/orc/release/release-00d4739e-c723-4843-863e-e4a895c58005.orc

then run this code with ./release-00d4739e-c723-4843-863e-e4a895c58005.orc as parameter:

use std::fs::File;
use std::path::PathBuf;

use anyhow::Result;
use orc_rust::arrow_reader::ArrowReaderBuilder;
use orc_rust::projection::ProjectionMask;


pub fn main() -> Result<()> {
    let file_path = PathBuf::from(std::env::args().skip(1).next().unwrap());
    println!("reading {}", file_path.display());
    let file = File::open(&file_path)?;
    let reader_builder = ArrowReaderBuilder::try_new(file)?;
    let projection = ProjectionMask::named_roots(
        reader_builder.file_metadata().root_data_type(),
        ["date"].as_slice(),
    );
    let reader = reader_builder
        .with_projection(projection)
        .with_batch_size(10)
        .build();
    for (i, _) in reader.enumerate() {
        println!("chunk {}", i);
    }

    Ok(())
}

and this small patch to datafusion-orc:

diff --git a/src/reader/decode/util.rs b/src/reader/decode/util.rs
index 468bd7d..2d5bad6 100644
--- a/src/reader/decode/util.rs
+++ b/src/reader/decode/util.rs
@@ -231,7 +231,7 @@ pub fn rle_v2_decode_bit_width(encoded: u8) -> usize {
         29 => 48,
         30 => 56,
         31 => 64,
-        _ => unreachable!(),
+        _ => unreachable!("rle_v2_decode_bit_width({})", encoded),
     }
 }
 

which prints:

[...]
chunk 1715
chunk 1716
chunk 1717
chunk 1718
chunk 1719
chunk 1720
chunk 1721
chunk 1722
chunk 1723
chunk 1724
thread 'main' panicked at /home/vlorentz/datafusion-orc/src/reader/decode/util.rs:234:14:
internal error: entered unreachable code: rle_v2_decode_bit_width(26)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: orc_rust::reader::decode::util::rle_v2_decode_bit_width
             at /home/vlorentz/datafusion-orc/src/reader/decode/util.rs:234:14
   3: orc_rust::reader::decode::rle_v2::patched_base::<impl orc_rust::reader::decode::rle_v2::RleReaderV2<N,R>>::read_patched_base
             at /home/vlorentz/datafusion-orc/src/reader/decode/rle_v2/patched_base.rs:32:31
   4: orc_rust::reader::decode::rle_v2::RleReaderV2<N,R>::decode_batch
             at /home/vlorentz/datafusion-orc/src/reader/decode/rle_v2/mod.rs:38:42
   5: <orc_rust::reader::decode::rle_v2::RleReaderV2<N,R> as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/reader/decode/rle_v2/mod.rs:53:19
   6: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:1949:9
   7: <&mut I as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/traits/iterator.rs:4169:9
   8: <core::iter::adapters::zip::Zip<A,B> as core::iter::adapters::zip::ZipImpl<A,B>>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/zip.rs:166:21
   9: <core::iter::adapters::zip::Zip<A,B> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/zip.rs:85:9
  10: <orc_rust::arrow_reader::column::timestamp::TimestampIterator as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/arrow_reader/column/timestamp.rs:31:13
  11: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:1949:9
  12: orc_rust::arrow_reader::decoder::PrimitiveArrayDecoder<T>::next_primitive_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:65:35
  13: <orc_rust::arrow_reader::decoder::timestamp::TimestampOffsetArrayDecoder as orc_rust::arrow_reader::decoder::ArrayBatchDecoder>::next_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/timestamp.rs:109:21
  14: orc_rust::arrow_reader::decoder::NaiveStripeDecoder::inner_decode_next_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:408:25
  15: orc_rust::arrow_reader::decoder::NaiveStripeDecoder::decode_next_batch
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:420:22
  16: <orc_rust::arrow_reader::decoder::NaiveStripeDecoder as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/arrow_reader/decoder/mod.rs:288:26
  17: <alloc::boxed::Box<I,A> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:1949:9
  18: <orc_rust::arrow_reader::ArrowReader<R> as core::iter::traits::iterator::Iterator>::next
             at /home/vlorentz/datafusion-orc/src/arrow_reader/mod.rs:159:23
  19: <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::next
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/iter/adapters/enumerate.rs:47:17
  20: repro::main
             at ./rust/src/bin/repro.rs:22:19
  21: core::ops::function::FnOnce::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
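
For reference, the ORCv1 spec's bit-width encoding table (mirrored by the Java implementation's FixedBitSizes) appears to also assign encoded values 24, 25 and 26 to widths 26, 28 and 30, which the match above omits; the rle_v2_decode_bit_width(26) panic would correspond to a pyorc file using a 30-bit width. A hedged sketch of the complete mapping under that assumption:

pub fn rle_v2_decode_bit_width(encoded: u8) -> usize {
    match encoded {
        0..=23 => encoded as usize + 1,
        24 => 26,
        25 => 28,
        26 => 30,
        27 => 32,
        28 => 40,
        29 => 48,
        30 => 56,
        31 => 64,
        // The encoded width occupies 5 bits, so values >= 32 cannot occur in a
        // well-formed header.
        _ => unreachable!("encoded bit width cannot exceed 5 bits"),
    }
}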

Consider switching to trunk based workflow?

First of all, sorry for the slow progress, I do still intend to work on this but have just been juggling other items at the moment.

I was considering making the workflow of this repo more trunk based, where those with write access can just push direct to main without requiring a PR and approval.

Benefits:

  • small set of contributors at the moment, so we avoid the overhead of requiring a PR for every change
  • this project doesn't need to worry about stability at the moment, so we can move faster

Thoughts? @waynexia @WenyXu

DataFusion integration

I implemented a very shallow example integration with DataFusion:

8c68f47

Will want to flesh this out and move this code into src, so we can provide support for things like predicate pushdown from DataFusion's point of view.

Could split this repo into two crates: one focused on reading to Arrow (arrow-orc, akin to parquet in arrow-rs) and another for DataFusion integration code (datafusion-orc), such as a trait which implements ExecutionPlan, etc.

Support nested projection

Be able to specifically project child columns of struct columns, instead of only being able to project the entire struct column at the root level.

Refactor top-level interface

(Where top-level interface refers to how DataFusion will use this library to read ORC files as that is the main intention of the crate)

Since we want this library to integrate with DataFusion, we should try to provide a cleaner interface for it to read ORC files as record batches.

The current way:

fn new_arrow_reader(path: &str, fields: &[&str]) -> ArrowReader<File> {
    let f = File::open(path).expect("no file found");
    let reader = Reader::new(f).unwrap();
    let cursor = Cursor::new(reader, fields).unwrap();
    ArrowReader::new(cursor, None)
}

  • Create reader (our struct) from file
  • Create cursor (our struct) from reader
  • Create arrow reader (our struct) from cursor

Similar can be said for async version.

We can take inspiration from how parquet does it:

User Story: Databend uses datafusion-orc for its async support

Hello, everyone! Thank you for this fantastic crate. I'm writing to share how databend is utilizing datafusion-orc for its async support.

With datafusion-orc, Databend now supports importing data from ORC files. Our colleague @youngsofun has been diligently working on this project. We plan to contribute our improvements upstream in the near future. Stay tuned!

Refactor metadata into our own classes

Currently the file metadata:

pub struct FileMetadata {
    pub postscript: PostScript,
    pub footer: Footer,
    pub metadata: Metadata,
    pub stripe_footers: Vec<StripeFooter>,
}

And stripe metadata:

pub struct Stripe {
pub(crate) footer: Arc<StripeFooter>,
pub(crate) columns: Vec<Column>,
pub(crate) stripe_offset: usize,
pub(crate) info: StripeInformation,
/// <(ColumnId, Kind), Bytes>
pub(crate) stream_map: Arc<StreamMap>,
}

These rely directly on the proto structs/types.

I will work on refactoring to add our own versions of these structs/types as a decoupling layer, and potentially to provide a nicer interface for what we need from the metadata.
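
A hedged sketch of the shape this could take (the field and type names are hypothetical, not the crate's eventual API):

use std::collections::HashMap;

/// Decoupled per-stripe description, copied out of the proto messages once.
struct StripeMetadata {
    offset: u64,
    number_of_rows: u64,
}

/// Decoupled file metadata, exposing only what readers actually need.
struct FileMetadata {
    number_of_rows: u64,
    stripes: Vec<StripeMetadata>,
    user_metadata: HashMap<String, Vec<u8>>,
}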

Requires nightly to compile

It seems we currently rely on some unstable features. Should we continue to do so and require nightly, or change to stable to be similar to arrow-rs?

Support decoding Union columns with up to and including 256 variants

According to https://orc.apache.org/specification/ORCv1/

Currently ORC union types are limited to 256 variants, which matches the Hive type model.

However in Arrow, UnionArrays are limited to 127 variants: https://arrow.apache.org/docs/format/Columnar.html#union-layout

A union with more than 127 possible types can be modeled as a union of unions.

To support this, we would need to do as the spec suggests and decode into a union of unions.

See initial Union support here: ee69b91

Re-evaluate exposed public types

We will need to go through the codebase and see which traits/structs/functions we want to expose and which to keep private. I'd like to keep the public interface as minimal as possible, to allow us to change the internals more frequently and easily. Not too high priority, as I consider the repo a bit experimental at the moment (and we aren't publishing yet, but we will need to consider this too).

Dictionary array handling

Spun off from discussion in #68

Find a way to preserve dictionary encoding in DictionaryArray without having to cast to StringArray, as introduced here:

// Cast back to StringArray to ensure all stripes have consistent datatype
// TODO: Is there anyway to preserve the dictionary encoding?
// This costs performance.
let array = cast(&array, &DataType::Utf8).context(ArrowSnafu)?;

Some reference from similar parquet issue: apache/arrow-rs#171

Switch codecov secret

This project used to use codecov to collect and show unit test coverage.

- name: Codecov upload
  uses: codecov/codecov-action@v2
  with:
    token: ${{ secrets.CODECOV_TOKEN }}
    files: ./lcov.info
    flags: rust
    fail_ci_if_error: false
    verbose: true

To make it work in this repository, we need to switch the repo in codecov as well. The codecov app is installed for this repo, but I can't generate a secret in codecov. Maybe this is because of some propagation delay; I'll check again tomorrow.

Short-term roadmap for this implementation

Previous discussion: apache/datafusion#4707

Though the ORC format is not as widely used as Parquet in arrow-rs and DataFusion related projects, there is still some (growing, it feels to me) interest in and demand for this format. As @Jefffrey said here, a noticeable and viable milestone for this project is being merged into arrow-rs. This draft roadmap is raised to help us discuss, arrange, and direct our efforts toward that milestone.

Even though the ORC format is less complex than Parquet, there is still much work to do in various areas. Here is a list of functionality that needs to be done if we consider making ORC files queryable from DataFusion the primary use case at this stage. Please feel free to add/remove/set priorities on these items. It's likely that we can't finish all of them in the short term, so marking what is going to be done is also important.

  • primitive data types (ORC refs)
    • tiny int #22
    • timestamp with local time zone #13
    • decimal #18
  • common compression methods #10
  • user metadata
  • other encodings #24
  • Benchmark #8

The items below are also related, but with lower priority:

  • compound data types #14
    • struct #26
    • list
    • map
    • union
  • file metadata and statistics
  • pruning #15

Long term items:

  • encryption

Then there are some things I'm not sure about and am looking for more information on. Also feel free to change the previous two lists.

Fix Timestamp to align with ORC spec

The ORC timestamp type is not straightforward: although it apparently represents a timestamp without a timezone, its encoding and decoding still depend on the writer timezone (encoded in the stripe) and the reader timezone.

This is why the 1900 & 2038 integration tests are currently disabled as our implementation is incorrect:

#[test]
#[ignore] // TODO: Incorrect timezone + representation differs
fn test_date_1900() {
    test_expected_file("TestOrcFile.testDate1900");
}

#[test]
#[ignore] // TODO: Incorrect timezone + representation differs
fn test_date_2038() {
    test_expected_file("TestOrcFile.testDate2038");
}
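
For background, a hedged sketch of the naive decoding (with no timezone adjustment, which is exactly what this issue is about): per the ORCv1 spec, seconds are stored relative to 2015-01-01 00:00:00, and the nanosecond field strips trailing decimal zeros into its low 3 bits (the spec's example encodes 1000 ns as 0x0a):

/// Seconds between the Unix epoch and the ORC timestamp base of 2015-01-01 00:00:00 UTC.
const ORC_TIMESTAMP_BASE_SECONDS: i64 = 1_420_070_400;

fn decode_nanos(encoded: u64) -> u64 {
    let zeros = encoded & 0x7;
    let value = encoded >> 3;
    if zeros == 0 {
        value
    } else {
        // `zeros` records the number of stripped trailing zeros minus one.
        value * 10u64.pow(zeros as u32 + 1)
    }
}

/// Naive conversion to Arrow's Timestamp(Nanosecond) representation; the
/// missing piece is adjusting for the writer/reader timezones.
fn naive_timestamp_nanos(seconds: i64, encoded_nanos: u64) -> i64 {
    (seconds + ORC_TIMESTAMP_BASE_SECONDS) * 1_000_000_000 + decode_nanos(encoded_nanos) as i64
}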

Executing the statement `select count(*)` always returns 0

The following always reports a count of 0:

    let ctx = SessionContext::new();
    ctx.register_orc(
        "table1",
        "tests/basic/data/alltypes.snappy.orc",
        OrcReadOptions::default(),
    )
    .await?;

    ctx.sql("select count(*) from table1")
        .await?
        .show()
        .await?;

Update README

Details to include:

  • Roadmap (short term and long term)
    • Vision is for this to eventually be donated to arrow-rs and/or datafusion
    • Also status of support for ORC features (such as bloom filters, statistics, etc.)
  • Guarantees about stability (aka none)
  • Contributing (guidelines and how to get started)

etc.

Bug: Fail to read column of type `array<float>`

What's wrong?

A panic is encountered when reading a column containing float values under array or map structures, such as:

  • array<float>
  • array<tuple<float, int>>

thread 'basic_test_nested_array_float' panicked at src/array_decoder/mod.rs:71:30:
array less than expected length

How to reproduce?

Add the following code to tests/basic/data/write.py:

nested_array_float = {
    "value": [
        [1.0, 2.0],
        [None, 2.0],
    ],
}

_write("struct<value:array<float>>", nested_array_float, "nested_array_float.orc")

Run the test against it, like the existing nested_array.orc test.

Reason

FloatIter uses the number of rows instead of the number of leaf values.

Refer to the code here: array_decoder/mod.rs

Quick Fix

FloatIter does not have to guard against the number of values; it can simply read until the end of the bytes in memory, similar to how integers are handled. A sketch follows after the PR link.

pr: #112
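
A hedged sketch of that quick-fix idea, assuming floats are laid out as 4-byte little-endian IEEE-754 values in the decompressed stream (the names are illustrative, not the crate's internals):

use std::io::Read;

struct FloatIter<R: Read> {
    reader: R,
}

impl<R: Read> Iterator for FloatIter<R> {
    type Item = std::io::Result<f32>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = [0u8; 4];
        match self.reader.read_exact(&mut buf) {
            // Stop cleanly when the in-memory bytes are exhausted instead of
            // tracking an expected row count.
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => None,
            Err(e) => Some(Err(e)),
            Ok(()) => Some(Ok(f32::from_le_bytes(buf))),
        }
    }
}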

Some questions about Column::number_of_rows

Is the Column::number_of_rows intended to represent number_of_values? If so, this code seems incorrect:

pub fn children(&self) -> Vec<Column> {
    .... 
    DataType::List { child, .. } => {
        vec![Column {
            number_of_rows: self.number_of_rows,
            footer: self.footer.clone(),
            name: "item

In my understanding, we have no way to derive the number of values from the parent column. However, we can retrieve the number of values from the stripe metadata; this requires a refactor.

Here’s an ugly but workable fix: Commit

Rigorous ORC integration tests

Integration tests added by #65

However, we have to compare actual vs expected data in JSON format, since that is how it is encoded in the Apache ORC repo.

An alternative would be to use the pyarrow/arrow ORC implementation to generate the expected files in a Parquet or Arrow file format, which could be more rigorous than JSON.

We lose some visibility on the expected data, but since these are integration tests with data from the Apache ORC repo, they wouldn't change often (if at all) anyway.
