
arrow-rs's Introduction

Native Rust implementation of Apache Arrow and Apache Parquet


Welcome to the Rust implementation of Apache Arrow, the popular in-memory columnar format.

This repo contains the following main components:

| Crate | Description | Latest API Docs | README |
| --- | --- | --- | --- |
| arrow | Core functionality (memory layout, arrays, low-level computations) | docs.rs | (README) |
| arrow-flight | Support for the Arrow Flight IPC protocol | docs.rs | (README) |
| object-store | Support for object store interactions (AWS, Azure, GCP, local, in-memory) | docs.rs | (README) |
| parquet | Support for the Parquet columnar file format | docs.rs | (README) |
| parquet_derive | A crate for deriving RecordWriter/RecordReader for arbitrary, simple structs | docs.rs | (README) |

The API documentation for the current development version of this repo can be found here.

Release Versioning and Schedule

arrow and parquet crates

The Arrow Rust project releases approximately monthly and follows Semantic Versioning.

Due to available maintainer and testing bandwidth, the arrow crates (arrow, arrow-flight, etc.) are released on the same schedule, with the same versions, as the parquet and parquet_derive crates.

Starting June 2024, we plan to release new major versions with potentially breaking API changes at most once a quarter, and release incremental minor versions in the intervening months. See this ticket for more details.

To keep our maintenance burden down, we do regularly scheduled releases (major and minor) from the master branch. How we handle PRs with breaking API changes is described in the contributing guide.

For example:

| Approximate Date | Version | Notes |
| --- | --- | --- |
| Jun 2024 | 52.0.0 | Major, potentially breaking API changes |
| Jul 2024 | 52.1.0 | Minor, NO breaking API changes |
| Aug 2024 | 52.2.0 | Minor, NO breaking API changes |
| Sep 2024 | 53.0.0 | Major, potentially breaking API changes |

object_store crate

The object_store crate is released independently of the arrow and parquet crates and follows Semantic Versioning. We aim to release new versions approximately every 2 months.

Related Projects

There are several related crates in different repositories:

| Crate | Description | Documentation |
| --- | --- | --- |
| datafusion | In-memory query engine with SQL support | (README) |
| ballista | Distributed query execution | (README) |
| object_store_opendal | Use opendal as an object_store backend | (README) |

Collectively, these crates support a wider array of functionality for analytic computations in Rust.

For example, you can write SQL queries or a DataFrame (using the datafusion crate) to read a parquet file (using the parquet crate), evaluate it in-memory using Arrow's columnar format (using the arrow crate), and send it to another process (using the arrow-flight crate).
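As a toy illustration of the columnar layout the arrow crate provides (the types below are hypothetical stand-ins, not arrow's API), a batch stores each column as its own contiguous, typed buffer, so a kernel only touches the buffers it needs:

```rust
// Toy mirror of Arrow's columnar layout (the real types live in the
// `arrow` crate): each column is a contiguous, typed buffer, so a
// per-column computation is a tight loop over a single vector.
struct ColumnarBatch {
    ids: Vec<i64>,    // column "id"
    scores: Vec<f64>, // column "score"
}

impl ColumnarBatch {
    // A columnar kernel reads only the buffer it needs.
    fn sum_scores(&self) -> f64 {
        self.scores.iter().sum()
    }
}

fn main() {
    let batch = ColumnarBatch {
        ids: vec![1, 2, 3],
        scores: vec![0.5, 1.5, 2.0],
    };
    assert_eq!(batch.ids.len(), 3);
    println!("sum = {}", batch.sum_scores());
}
```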

Generally speaking, the arrow crate offers functionality for using Arrow arrays, and datafusion offers most operations typically found in SQL, including joins and window functions.

You can find more details about each crate in their respective READMEs.

Arrow Rust Community

The dev@arrow.apache.org mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found on the Arrow Community page. All major announcements and communications happen there.

The Rust Arrow community also uses the official ASF Slack for informal discussions and coordination. This is a great place to meet other contributors and get guidance on where to contribute. Join us in the #arrow-rust channel and feel free to ask for an invite via:

  1. the dev@arrow.apache.org mailing list
  2. the GitHub Discussions
  3. the Discord channel

The Rust implementation uses GitHub issues as the system of record for new features and bug fixes and this plays a critical role in the release process.

For design discussions we generally collaborate on Google documents and file a GitHub issue linking to the document.

There is more information in the contributing guide.

arrow-rs's People

Contributors

alamb, andygrove, askoa, crepererum, dandandan, dependabot[bot], fsaintjacques, haoyang670, houqp, jefffrey, jhorstmann, jimexist, jorgecarleitao, kou, kszucs, liukun4515, nealrichardson, nevi-me, paddyhoran, pitrou, psvri, ritchie46, sunchao, ted-jiang, tustvold, viirya, weijun-h, wesm, wjones127, xhochy


arrow-rs's Issues

Resolve Issues with `prettytable-rs` dependency

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8637

prettytable-rs is a dependency of Arrow, used to render record batches as strings for tabular display (see the pretty util: https://github.com/apache/arrow/blob/c546eef41e6ab20c4ca29a2d836987959843896f/rust/arrow/src/util/pretty.rs#L24-L25). The crate, however, has some issues:

1. prettytable-rs depends on the term crate. term is under minimal maintenance, and it is advised to switch to another crate. This will probably pop up in an informational security advisory (https://rustsec.org/advisories/RUSTSEC-2018-0015) if the crates are ever audited.

2. The crate also depends on encode-unicode. While not problematic in its own right, that crate implements some traits which can bring about confusing type-inference issues. For example, after adding the prettytable-rs dependency to arrow, the following error occurred when attempting to compile the parquet crate:

 

{code}
let seed_vec: Vec =
    Standard.sample_iter(&mut rng).take(seed_len).collect();

error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
    |
833 |     Standard.sample_iter(&mut rng).take(seed_len).collect();
    |              ^^^^^^^^^^^ cannot infer type for T
{code}

 

Any user of the arrow crate would see a similar style of error.
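For reference, the error above is the standard E0282 `collect()` ambiguity: once extra trait impls make several target types possible, the compiler cannot pick one, and an explicit annotation or a turbofish resolves it. In this std-only sketch a plain integer range stands in for rand's `Standard.sample_iter`:

```rust
fn main() {
    // Without a target type this is error[E0282]: type annotations needed
    // let v = (0..4).map(|x| x * 2).collect();

    // Fix 1: annotate the binding.
    let v: Vec<i32> = (0..4).map(|x| x * 2).collect();
    // Fix 2: use the turbofish on collect.
    let w = (0..4).map(|x| x * 2).collect::<Vec<i32>>();

    assert_eq!(v, w);
    assert_eq!(v, vec![0, 2, 4, 6]);
}
```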

 

There are a few possible ways to resolve this:

1. Hopefully hear from the crate maintainer. There is a PR open for the encode-unicode issue: https://github.com/phsym/prettytable-rs/pull/125.
2. Find a different table-generating crate with fewer issues.
3. Fork and fix prettytable-rs.
4. ???

 
 

Fail to compile with unrecognized platform-specific intrinsic function

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5613

I'm testing a project which depends on the Arrow crate. It failed with the following error:
{code}
error[E0441]: unrecognized platform-specific intrinsic function: simd_bitmask
--> /Users/sunchao/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.3/src/codegen/llvm.rs:100:5
|
100 | crate fn simd_bitmask<T, U>(value: T) -> U;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try rustc --explain E0441.
error: Could not compile packed_simd.
{code}

Reading UTF-8/JSON/ENUM field results in a lot of vec allocation

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7252

Reading a very large parquet file (430 MB gzipped) with essentially all string fields was very slow. After profiling with macOS Instruments, I noticed that a lot of time is spent in convert_byte_array, in particular in "reserving" and allocating via Vec::with_capacity, which is done before String::from_utf8_unchecked.

It seems that using String as the underlying storage is causing this (String uses a Vec<u8> for its underlying storage), and it also requires copying from the slice into the Vec.

Field::Str is a pub enum variant, so I am not sure how "refactorable" the String part is; for example, it could be converted into a &str (we could then defer the conversion from &[u8] to String until the user really needs a String).

But of course, changing it to &str would result in quite a few interface changes, so I am wondering whether there are already plans or a solution on the way to improve the handling of the Field::Str case?
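A small std-only sketch of the deferral suggested above, using a hypothetical accessor (this is not parquet's actual API): validate the bytes in place and hand out a borrowed &str, allocating a String only when the caller asks for one.

```rust
use std::borrow::Cow;

// Hypothetical accessor: `from_utf8` validates in place without copying,
// so no Vec/String allocation happens here.
fn field_str(raw: &[u8]) -> Cow<'_, str> {
    Cow::Borrowed(std::str::from_utf8(raw).expect("valid UTF-8"))
}

fn main() {
    let page = b"hello world";
    let s = field_str(&page[..5]);
    assert_eq!(s, "hello");
    // Allocation is deferred until the caller truly needs an owned String.
    let owned: String = s.into_owned();
    assert_eq!(owned.len(), 5);
}
```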

 

Read temporal values from JSON

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-4803

Add the ability to parse strings that look like timestamps into a timestamp type. We need to consider whether only the timestamp type should be supported, as most JSON libraries stick to ISO 8601. It might also be inefficient to use a regex to detect timestamps, so the user should provide a hint of which columns to convert to timestamps.
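A std-only sketch of the conversion such a reader would perform for hinted columns: parse "YYYY-MM-DDTHH:MM:SS" strings into seconds since the Unix epoch using the standard days-from-civil-date algorithm (a real implementation would use a date library and handle offsets and fractional seconds):

```rust
// Days between a civil date and 1970-01-01 (proleptic Gregorian).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

// Parse "YYYY-MM-DDTHH:MM:SS" into epoch seconds; None if malformed.
fn parse_timestamp(s: &str) -> Option<i64> {
    let (date, time) = s.split_once('T')?;
    let mut d = date.split('-').map(|p| p.parse::<i64>().ok());
    let (y, mo, da) = (d.next()??, d.next()??, d.next()??);
    let mut t = time.split(':').map(|p| p.parse::<i64>().ok());
    let (h, mi, se) = (t.next()??, t.next()??, t.next()??);
    Some(days_from_civil(y, mo, da) * 86_400 + h * 3_600 + mi * 60 + se)
}

fn main() {
    assert_eq!(parse_timestamp("1970-01-01T00:00:00"), Some(0));
    assert_eq!(parse_timestamp("2000-03-01T00:00:30"), Some(11_017 * 86_400 + 30));
    assert_eq!(parse_timestamp("not a timestamp"), None);
}
```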

packed_simd requires nightly

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6718

See [https://github.com/rust-lang/rfcs/pull/2366] for more info on stabilization of this crate.

 
{code:java}
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/andy/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.3/src/lib.rs:202:1
|
202 | / #![feature(
203 | | repr_simd,
204 | | const_fn,
205 | | platform_intrinsics,
... |
215 | | custom_inner_attributes
216 | | )]
| |__^
{code}

DataType::Dictionary is out of spec

Describe the bug

The schema.fbs and the corresponding generated code have no concept of a Dictionary datatype.

However, we declare a DataType::Dictionary.

Additional context

As a user, I would like to not have to change my DataType whenever I want to change an array's encoding. In the context of DataFusion, which uses DataType to declare the schema of the logical plan, this forbids optimizations at the physical level that would, e.g., convert an array to a dictionary-encoded array, which is useful in any group-by or hashing operation.


Use IntoIter trait for write_batch/write_mini_batch

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5153

Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:

{code}
struct MyData {
    name: String,
    address: Option<String>,
}
{code}

Over the course of working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<&String>>. This puts extra memory pressure on the system; at a minimum we have to allocate a Vec the same size as the bulk data even when we are only using references.

What I'm proposing is an IntoIter style. This maintains backward compatibility, since a slice automatically implements IntoIter. ColumnWriterImpl::write_batch would go from values: &[T::T] to values: IntoIter<Item = T::T>. Then you can do things like:

{code}
write_batch(bulk.iter().map(|x| &x.name), None, None);
write_batch(bulk.iter().map(|x| &x.address), Some(bulk.iter().map(|x| x.address.is_some())), None);
{code}

As you can see, there is no need for an intermediate Vec, so no short-term allocations are required to write out the data.

I am writing data with many columns and I think this would really help to speed things up.
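A minimal std-only sketch of the proposed IntoIterator-based signature (the struct, field names, and write_batch stand-in below are hypothetical, not parquet's actual API):

```rust
struct MyData {
    name: String,
    age: Option<u32>,
}

// Proposed shape: accept any IntoIterator instead of a slice, so callers
// can stream a column straight out of their structs with no intermediate
// Vec. A real writer would encode the values; this toy only counts them.
fn write_batch<I>(values: I) -> usize
where
    I: IntoIterator<Item = u32>,
{
    values.into_iter().count()
}

fn main() {
    let bulk = vec![
        MyData { name: "alice".into(), age: Some(33) },
        MyData { name: "bob".into(), age: None },
    ];
    // Column values streamed lazily from the bulk data.
    assert_eq!(write_batch(bulk.iter().filter_map(|d| d.age)), 1);
    // Backward compatible: an owned Vec (or a slice) still works,
    // because those types implement IntoIterator too.
    assert_eq!(write_batch(vec![1, 2, 3]), 3);
    assert_eq!(bulk[0].name, "alice");
}
```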

All array types should have iterators and FromIterator support.

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7700

Array types should have an Iterable trait that generates plain or nullable iterators.

{code}
pub trait Iterable<'a>
where
    Self::IterType: std::iter::Iterator,
{
    type IterType;

    fn iter(&'a self) -> Self::IterType;
    fn iter_nulls(&'a self) -> NullableIterator<Self::IterType>;
}
{code}

IterType depends on the array type: standard slice iterators for primitive types, string iterators for UTF-8 types, and composite iterators (generating other iterators) for list, struct and dictionary types.

The NullableIterator type should bundle a null bitmap pointer with another iterator type to form a composite iterator that returns an option:

{code}
/// Convert any iterator to a nullable iterator by using the null bitmap.
#[derive(Debug, PartialEq, Clone)]
pub struct NullableIterator<T: Iterator> {
    iter: T,
    i: usize,
    null_bitmap: *const u8,
}

impl<T: Iterator> NullableIterator<T> {
    fn from(iter: T, null_bitmap: &Option<Bitmap>, offset: usize) -> Self;
}
{code}

For more details, some exploratory work has been done here: https://github.com/andy-thomason/arrow/blob/ARROW-iterators/rust/arrow/src/array/array.rs#L1711
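A safe, std-only sketch of the NullableIterator idea: pair a values iterator with an LSB-first validity bitmap (as in the Arrow spec) and yield Option values. The proposal above uses a raw bitmap pointer; a byte slice keeps this sketch safe.

```rust
struct NullableIter<'a, T: Iterator> {
    iter: T,
    null_bitmap: &'a [u8], // LSB-first validity bits, 1 = valid
    i: usize,
}

impl<'a, T: Iterator> Iterator for NullableIter<'a, T> {
    type Item = Option<T::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        let v = self.iter.next()?;
        // Bit i of the bitmap says whether element i is valid.
        let valid = (self.null_bitmap[self.i / 8] >> (self.i % 8)) & 1 == 1;
        self.i += 1;
        Some(if valid { Some(v) } else { None })
    }
}

fn main() {
    let values = [10, 20, 30, 40];
    // Validity bits 1,0,1,1 packed LSB-first -> 0b1101.
    let it = NullableIter { iter: values.iter().copied(), null_bitmap: &[0b1101], i: 0 };
    let out: Vec<Option<i32>> = it.collect();
    assert_eq!(out, vec![Some(10), None, Some(30), Some(40)]);
}
```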

[Rust][DataFusion] Improve like/nlike performance

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8681

Currently, the implementation of like_utf8 and nlike_utf8 is based on regex, which is simple and readable but performs poorly.

I did some benchmarking in https://github.com/TennyZhuang/like-bench/, comparing three like algorithms:

like (including partial_like): the first naive version, using a recursive approach, which has terrible performance on adversarial input such as a%a%a%a%a%a%a%a%b.

like_to_regex: almost the same as the current implementation in arrow.

like_optimize: the like problem is similar to shell glob, for which a clean solution is described in https://research.swtch.com/glob. The code in that article is written in Go, but I translated it to Rust.
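The glob-style algorithm from that article can be sketched in safe Rust as a single-backtrack wildcard matcher over ASCII bytes, where % matches any run of characters and _ matches exactly one (a simplified illustration, not the arrow kernel):

```rust
// SQL LIKE matcher in O(len(pattern) * len(text)) worst case with no
// recursion: remember the last `%` and retry from there on a mismatch.
// ASCII-only sketch; `_` matches one byte, `%` matches any run of bytes.
fn like_match(pattern: &[u8], text: &[u8]) -> bool {
    let (mut p, mut t) = (0usize, 0usize);
    let (mut star_p, mut star_t) = (usize::MAX, 0usize);
    while t < text.len() {
        if p < pattern.len() && (pattern[p] == b'_' || pattern[p] == text[t]) {
            p += 1;
            t += 1;
        } else if p < pattern.len() && pattern[p] == b'%' {
            star_p = p; // remember the wildcard position...
            star_t = t; // ...and where it started matching
            p += 1;
        } else if star_p != usize::MAX {
            // Mismatch: let the last `%` absorb one more byte and retry.
            p = star_p + 1;
            star_t += 1;
            t = star_t;
        } else {
            return false;
        }
    }
    // Only trailing `%` wildcards may remain unmatched.
    while p < pattern.len() && pattern[p] == b'%' {
        p += 1;
    }
    p == pattern.len()
}

fn main() {
    assert!(like_match(b"a%a%b", b"aXaYb"));
    assert!(!like_match(b"a%a%a%a%b", b"aaaa")); // no blow-up on attack input
}
```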

 

In my benchmark results, the recursive solution can be disregarded because of its bad worst-case time complexity.

The regex solution costs about 1000x the time of solution 3 when regex compilation is included, and about 4x without it. The code complexity of solution 3 seems acceptable.

Anyone can reproduce the benchmark results using the repo above with a few lines of code.

 

I have submitted a PR to TiKV to optimize like performance (https://github.com/tikv/tikv/pull/5866/files, without UTF-8 support) and added collation support in https://github.com/tikv/tikv/pull/6592, which could easily be ported to DataFusion.

Question and Request for Examples of Array Operations

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6583

Hi all, thank you for your excellent work on Arrow.

As I was going through the examples for the Rust Arrow implementation, specifically the read_csv example (https://github.com/apache/arrow/blob/master/rust/arrow/examples/read_csv.rs), as well as the generated rustdocs and unit tests, it was not quite clear what the intended usage is for operations such as filtering and masking over Arrays.

One particular use case I'm interested in is finding all values x in an Array such that x >= N. I came across arrow::compute::array_ops::filter, which seems to be close to what I want, although it expects a mask to already be constructed before performing the filter operation, and this was not obviously visible in the documentation, leading me to believe it might not be idiomatic usage.

More generally, is the expectation for Arrays on the Rust side that they are just simple data abstractions, without exposing higher-order methods such as filtering/masking? Is the intent to leave that to users? If I missed some piece of documentation, please let me know. For my use-case I ended up trying something like:

{code:java}
let column = batch.column(0).as_any().downcast_ref::<Float64Array>().unwrap();
let mut builder = BooleanBuilder::new(batch.num_rows());
let n = 5.0;
for i in 0..batch.num_rows() {
    if column.value(i) > n {
        builder.append_value(true).unwrap();
    } else {
        builder.append_value(false).unwrap();
    }
}

let mask = builder.finish();
let filtered_column = filter(column, &mask);
{code}

If possible, could you provide examples of intended usage of Arrays? Thank you!
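For what it's worth, the pattern the compute module expects is exactly mask-then-filter. A std-only sketch with simplified stand-ins for the comparison and filter kernels (these free functions are illustrative, not arrow's signatures):

```rust
// Stand-in for a comparison kernel: build a boolean mask element-wise.
fn gt_scalar(values: &[f64], n: f64) -> Vec<bool> {
    values.iter().map(|&v| v > n).collect()
}

// Stand-in for the filter kernel: keep values where the mask is true.
fn filter(values: &[f64], mask: &[bool]) -> Vec<f64> {
    values
        .iter()
        .zip(mask.iter())
        .filter(|&(_, &m)| m)
        .map(|(&v, _)| v)
        .collect()
}

fn main() {
    let column = [1.0, 7.5, 3.0, 9.0];
    let mask = gt_scalar(&column, 5.0); // step 1: build the mask
    let kept = filter(&column, &mask);  // step 2: filter with it
    assert_eq!(kept, vec![7.5, 9.0]);
}
```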

 

 

Strongly-typed reading of Parquet data

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-4314

See the proposal I made on sunchao's repository (https://github.com/sunchao/parquet-rs/issues/205) for more details.

This aims to let the user opt in to strong typing and substantial performance improvements (2x-7x; see https://github.com/sunchao/parquet-rs/issues/205#issuecomment-446016254) by optionally specifying the type of the records they are iterating over.

It is currently a work in progress. All pre-existing tests succeed, bar those in src/record/api.rs which are commented out as they require reworking. Where relevant, pre-existing tests and benchmarks have been duplicated to make new strongly-typed tests and benchmarks, which all also succeed. I've tried to maintain pre-existing APIs where possible. Some changes have been made to better align with prior art in the Rust ecosystem.

Any feedback while I continue working on it very welcome! Looking forward to hopefully seeing this merged when it's ready.

Support LargeUtf8 in sort kernel

I am trying to run parts of a Polars LogicalPlan on DataFusion, but DataFusion uses Utf8 while Polars uses LargeUtf8, which is not yet supported by the sort kernel.
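For context, the sort kernel follows the sort-to-indices pattern, which is easy to sketch with std only (plain &str values stand in for the LargeUtf8 array this issue asks about):

```rust
// Compute the permutation that sorts the values; applying it afterwards
// is the `take` step. Supporting another string layout (e.g. LargeUtf8,
// with 64-bit offsets) mainly means sorting through a different accessor.
fn sort_to_indices(values: &[&str]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_by_key(|&i| values[i]);
    idx
}

fn main() {
    let col = ["pear", "apple", "fig"];
    let indices = sort_to_indices(&col);
    assert_eq!(indices, vec![1, 2, 0]);
    // "take" with the indices to materialize the sorted column.
    let sorted: Vec<&str> = indices.iter().map(|&i| col[i]).collect();
    assert_eq!(sorted, vec!["apple", "fig", "pear"]);
}
```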

Generate flatbuffers code automatically

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5135

This depends on the following upstream flatbuffers issues:

Once they are resolved, we should generate the flatbuffers code automatically, as suggested by nevi-me:

{code}
// [arrow/rust/arrow/build.rs]

use std::path::Path;
use flatc_rust;

fn main() {
    flatc_rust::run(flatc_rust::Args {
        lang: "rust",
        inputs: &[
            Path::new("../../format/File.fbs"),
            Path::new("../../format/Message.fbs"),
            Path::new("../../format/Schema.fbs"),
            Path::new("../../format/Tensor.fbs"),
            Path::new("../../format/SparseTensor.fbs"),
        ],
        out_dir: Path::new("./src/ipc/gen/"),
        // doesn't seem to be honoured
        includes: &[Path::new("../../format/")],
        ..Default::default()
    })
    .expect("Unable to build flatbuffer files");
}

// [arrow/rust/arrow/Cargo.toml]
[package]
...
build = "build.rs"

...

[build-dependencies]
flatc-rust = "0.1"
{code}

Add temporal kernels

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5367

When creating the temporal arrays, we added a sample function that extracts the hour from a temporal array. This ticket is to add support for other common temporal functions such as minute and second, and might include temporal arithmetic such as adding dates and times, calculating durations, etc.
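A std-only sketch of the shape such kernels take: element-wise extraction over a buffer of i64 epoch-second timestamps (the real kernels operate on temporal arrays and must handle time zones and units):

```rust
// Element-wise minute-of-hour over epoch-second timestamps.
fn minute(ts: &[i64]) -> Vec<i64> {
    ts.iter().map(|s| s.rem_euclid(3_600) / 60).collect()
}

// Element-wise second-of-minute over epoch-second timestamps.
fn second(ts: &[i64]) -> Vec<i64> {
    ts.iter().map(|s| s.rem_euclid(60)).collect()
}

fn main() {
    let ts = [0, 3_725]; // 00:00:00 and 01:02:05 UTC
    assert_eq!(minute(&ts), vec![0, 2]);
    assert_eq!(second(&ts), vec![0, 5]);
}
```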

Reading parquet file is slow

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6774

Loading a parquet file following the example at https://github.com/apache/arrow/tree/master/rust/parquet is slow.

The snippet 
{code:none}
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
let start = Instant::now();
while let Some(record) = iter.next() {}
let duration = start.elapsed();
println!("{:?}", duration);
{code}
Runs for 17 seconds on a ~160 MB parquet file.

If there is a more effective way to load a parquet file, it would be nice to add it to the README.

P.S.: My goal is to construct an ndarray from it; I'd be happy for any tips.

[DataFusion] Implement optimizer rule to remove redundant projections

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6892

Currently we have code in the SQL query planner that wraps aggregate queries in a projection (if needed) to preserve the order of the final results. This is needed because the aggregate query execution always returns a result with grouping expressions first and then aggregate expressions.

It would be better (simpler, more readable code) to always wrap aggregates in projections and have an optimizer rule to remove redundant projections. There are likely other use cases where redundant projections might exist too.
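The rule is easy to sketch over a toy plan type (this enum is hypothetical and far simpler than DataFusion's LogicalPlan): a projection whose expressions are exactly its input's output columns is an identity and can be dropped.

```rust
#[derive(Debug, PartialEq, Clone)]
enum Plan {
    Scan { columns: Vec<String> },
    Projection { exprs: Vec<String>, input: Box<Plan> },
}

// What columns a node produces (toy: expressions are plain column names).
fn output_columns(plan: &Plan) -> &[String] {
    match plan {
        Plan::Scan { columns } => columns,
        Plan::Projection { exprs, .. } => exprs,
    }
}

// Optimizer rule: recurse bottom-up and drop identity projections.
fn remove_redundant_projection(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { exprs, input } => {
            let input = remove_redundant_projection(*input);
            if exprs.as_slice() == output_columns(&input) {
                input // redundant: same columns in the same order
            } else {
                Plan::Projection { exprs, input: Box::new(input) }
            }
        }
        other => other,
    }
}

fn main() {
    let scan = Plan::Scan { columns: vec!["a".into(), "b".into()] };
    let plan = Plan::Projection {
        exprs: vec!["a".into(), "b".into()],
        input: Box::new(scan.clone()),
    };
    assert_eq!(remove_redundant_projection(plan), scan);
}
```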

FFI ListArray leads to undefined behavior

Describe the bug
When sending an array with child data over FFI (e.g. via pyarrow), we encounter undefined behavior.

I have seen SIGILL and SEGFAULTs.

To Reproduce
The tests in arrow-pyarrow-integration and in arrow/src/ffi.rs.

I ran the tests under Miri (great tool!), but I have to admit I am stuck and seem to go in circles.

running 1 test
test ffi::tests::test_list ... error: Undefined Behavior: pointer to alloc278966 was dereferenced after this allocation got freed
   --> arrow/src/ffi.rs:126:21
    |
126 |         let child = &*child_ptr;
    |                     ^^^^^^^^^^^ pointer to alloc278966 was dereferenced after this allocation got freed
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
            
    = note: inside `ffi::release_schema` at arrow/src/ffi.rs:126:21
note: inside `<ffi::FFI_ArrowSchema as std::ops::Drop>::drop` at arrow/src/ffi.rs:198:39
   --> arrow/src/ffi.rs:198:39
    |
198 |             Some(release) => unsafe { release(self) },
    |                                       ^^^^^^^^^^^^^
    = note: inside `std::intrinsics::drop_in_place::<ffi::FFI_ArrowSchema> - shim(Some(ffi::FFI_ArrowSchema))` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:187:1
    = note: inside `std::sync::Arc::<ffi::FFI_ArrowSchema>::drop_slow` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/sync.rs:1039:18
    = note: inside `<std::sync::Arc<ffi::FFI_ArrowSchema> as std::ops::Drop>::drop` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/sync.rs:1571:13
    = note: inside `std::intrinsics::drop_in_place::<std::sync::Arc<ffi::FFI_ArrowSchema>> - shim(Some(std::sync::Arc<ffi::FFI_ArrowSchema>))` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:187:1
    = note: inside `std::intrinsics::drop_in_place::<ffi::ArrowArray> - shim(Some(ffi::ArrowArray))` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:187:1
note: inside `array::ffi::<impl std::convert::TryFrom<ffi::ArrowArray> for array::data::ArrayData>::try_from` at arrow/src/array/ffi.rs:60:5
   --> arrow/src/array/ffi.rs:60:5
    |
60  |     }
    |     ^
note: inside `ffi::tests::test_generic_list::<i32>` at arrow/src/ffi.rs:876:20
   --> arrow/src/ffi.rs:876:20
    |
876 |         let data = ArrayData::try_from(array)?;
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `ffi::tests::test_list` at arrow/src/ffi.rs:899:9
   --> arrow/src/ffi.rs:899:9
    |
899 |         test_generic_list::<i32>()
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure at arrow/src/ffi.rs:898:5
   --> arrow/src/ffi.rs:898:5
    |
898 | /     fn test_list() -> Result<()> {
899 | |         test_generic_list::<i32>()
900 | |     }
    | |_____^
    = note: inside `<[closure@arrow/src/ffi.rs:898:5: 900:6] as std::ops::FnOnce<()>>::call_once - shim` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
    = note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
    = note: inside `test::__rust_begin_short_backtrace::<fn()>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:567:5
    = note: inside closure at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:558:30
    = note: inside `<[closure@test::run_test::{closure#2}] as std::ops::FnOnce<()>>::call_once - shim(vtable)` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
    = note: inside `<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send> as std::ops::FnOnce<()>>::call_once` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:1546:9
    = note: inside `<std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>> as std::ops::FnOnce<()>>::call_once` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:344:9
    = note: inside `std::panicking::r#try::do_call::<std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>>, ()>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:379:40

alloc280196 (Rust heap, size: 16, align: 8) {
    00 00 00 00 00 00 00 00 ╾─a275857[<untagged>]─╼ │ ........╾──────╼
}
alloc280316 (Rust heap, size: 56, align: 8) {
    0x00 │ ╾─a279544[<untagged>]─╼ 02 00 00 00 00 00 00 00 │ ╾──────╼........
    0x10 │ 02 00 00 00 00 00 00 00 ╾─a280196[<untagged>]─╼ │ ........╾──────╼
    0x20 │ 02 00 00 00 00 00 00 00 ╾─a279702[<untagged>]─╼ │ ........╾──────╼
    0x30 │ 01 00 00 00 00 00 00 00                         │ ........
}
alloc280344 (Rust heap, size: 96, align: 8) {
    0x00 │ 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 │ ................
    0x10 │ 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 │ ................
    0x20 │ 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 │ ................
    0x30 │ 01 00 00 00 00 00 00 00 ╾─a280196[<untagged>]─╼ │ ........╾──────╼
    0x40 │ ╾─a279702[<untagged>]─╼ 00 00 00 00 00 00 00 00 │ ╾──────╼........
    0x50 │ ╾─a278994[<untagged>]─╼ ╾─a280316[<untagged>]─╼ │ ╾──────╼╾──────╼
}
alloc278656 (fn: ffi::release_schema)
alloc278994 (fn: ffi::release_array)
    = note: inside `std::panicking::r#try::<(), std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>>>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:343:19
    = note: inside `std::panic::catch_unwind::<std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>>, ()>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:431:14
    = note: inside `test::run_test_in_process` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:589:18
    = note: inside closure at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:486:39
    = note: inside `test::run_test::run_test_inner` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:522:13
    = note: inside `test::run_test` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:555:28
    = note: inside `test::run_tests::<[closure@test::run_tests_console::{closure#2}]>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:301:13
    = note: inside `test::run_tests_console` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/console.rs:289:5
    = note: inside `test::test_main` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:122:15
    = note: inside `test::test_main_static` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/test/src/lib.rs:141:5
    = note: inside `main`
    = note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
    = note: inside `std::sys_common::backtrace::__rust_begin_short_backtrace::<fn(), ()>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:125:18
    = note: inside closure at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/rt.rs:66:18
    = note: inside `std::ops::function::impls::<impl std::ops::FnOnce<()> for &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>::call_once` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:259:13
    = note: inside `std::panicking::r#try::do_call::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:379:40
    = note: inside `std::panicking::r#try::<i32, &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:343:19
    = note: inside `std::panic::catch_unwind::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:431:14
    = note: inside `std::rt::lang_start_internal` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/rt.rs:51:25
    = note: inside `std::rt::lang_start::<()>` at /home/ritchie46/.rustup/toolchains/nightly-2021-03-04-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/rt.rs:65:5
    = note: this error originates in an attribute macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

Invalid mem access in BufferBuilderTrait

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8627

Currently, there is an invalid memory access happening through the append_n method to a mutable location with multiple shared references. It happens when the benchmark code executes with bench_bool.

Happens on rustc 1.44.0-nightly (45d050cde 2020-04-21).

Backtrace shown below:

 * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1004e7000)
 * frame #0: 0x0000000100150d37 builder-6a49123b1fedb178`_$LT$arrow..array..builder..BufferBuilder$LT$arrow..datatypes..BooleanType$GT$$u20$as$u20$arrow..array..builder..BufferBuilderTrait$LT$arrow..datatypes..BooleanType$GT$$GT$::append_n::h6ae4d34cca93d03c + 311
 frame #1: 0x0000000100007303 builder-6a49123b1fedb178`arrow::array::builder::PrimitiveBuilder$LT$T$GT$::append_slice::h8d33144acea1616b + 51
 frame #2: 0x000000010001b143 builder-6a49123b1fedb178`criterion::Bencher$LT$M$GT$::iter::hfcae173a53b56e6f + 259
 frame #3: 0x0000000100003136 builder-6a49123b1fedb178`_$LT$criterion..routine..Function$LT$M$C$F$C$T$GT$$u20$as$u20$criterion..routine..Routine$LT$M$C$T$GT$$GT$::warm_up::h5b415f52c0951798 + 102
 frame #4: 0x000000010000373b builder-6a49123b1fedb178`criterion::routine::Routine::sample::h2802012b9b92a2a5 + 203
 frame #5: 0x00000001000287a2 builder-6a49123b1fedb178`criterion::analysis::common::h1eabf5af2afe42e5 + 834
 frame #6: 0x0000000100023a83 builder-6a49123b1fedb178`_$LT$criterion..benchmark..Benchmark$LT$M$GT$$u20$as$u20$criterion..benchmark..BenchmarkDefinition$LT$M$GT$$GT$::run::hf631a3f91617ae46 + 1507
 frame #7: 0x00000001000109b8 builder-6a49123b1fedb178`builder::main::he83c09c3b2c8f318 + 216
 frame #8: 0x0000000100021c96 builder-6a49123b1fedb178`std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hfb404fc983af2389 + 6
 frame #9: 0x00000001001e9499 builder-6a49123b1fedb178`std::rt::lang_start_internal::h434140244059d623 [inlined] std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::h096599b40842db82 at rt.rs:52:13 [opt]
 frame #10: 0x00000001001e948e builder-6a49123b1fedb178`std::rt::lang_start_internal::h434140244059d623 [inlined] std::panicking::try::do_call::h1c9f73590350b657 at panicking.rs:331 [opt]
 frame #11: 0x00000001001e948e builder-6a49123b1fedb178`std::rt::lang_start_internal::h434140244059d623 [inlined] std::panicking::try::hca6829be93a31f1b at panicking.rs:274 [opt]
 frame #12: 0x00000001001e948e builder-6a49123b1fedb178`std::rt::lang_start_internal::h434140244059d623 [inlined] std::panic::catch_unwind::hb3c8ad89db0960bd at panic.rs:394 [opt]
 frame #13: 0x00000001001e948e builder-6a49123b1fedb178`std::rt::lang_start_internal::h434140244059d623 at rt.rs:51 [opt]
 frame #14: 0x0000000100010b49 builder-6a49123b1fedb178`main + 41
 frame #15: 0x00007fff691c07fd libdyld.dylib`start + 1
 frame #16: 0x00007fff691c07fd libdyld.dylib`start + 1

SIGSEGV when using StringBuilder with jemalloc

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8202

I have a Rust app which is just appending strings into many StringBuilders. I tried using jemalloc, and the app crashes with SIGSEGV (address boundary error).

rust-lldb backtrace:


* frame #0: 0x00000001004073f1 memoird`_rjem_mallocx at sz.h:158:18 [opt]
  frame #1: 0x00000001004073e3 memoird`_rjem_mallocx [inlined] sz_s2u_lookup(size=<unavailable>) at sz.h:238 [opt]
  frame #2: 0x00000001004073e3 memoird`_rjem_mallocx [inlined] sz_s2u(size=<unavailable>) at sz.h:252 [opt]
  frame #3: 0x00000001004073d6 memoird`_rjem_mallocx [inlined] sz_sa2u(size=<unavailable>, alignment=64) at sz.h:283 [opt]
  frame #4: 0x00000001004073ac memoird`_rjem_mallocx [inlined] imalloc_body at jemalloc.c:1841 [opt]
  frame #5: 0x0000000100407394 memoird`_rjem_mallocx [inlined] imalloc(sopts=<unavailable>, dopts=<unavailable>) at jemalloc.c:2005 [opt]
  frame #6: 0x0000000100407345 memoird`_rjem_mallocx(size=<unavailable>, flags=<unavailable>) at jemalloc.c:2588 [opt]
  frame #7: 0x0000000100370187 memoird`arrow::array::builder::ListBuilder$LT$T$GT$::new::h16819112466ced47 [inlined] alloc::alloc::alloc_zeroed::hc53d8d0d6ed944ef(layout=<unavailable>) at alloc.rs:165:4 [opt]
  frame #8: 0x000000010037017a memoird`arrow::array::builder::ListBuilder$LT$T$GT$::new::h16819112466ced47 at memory.rs:29 [opt]
  frame #9: 0x000000010037017a memoird`arrow::array::builder::ListBuilder$LT$T$GT$::new::h16819112466ced47 at buffer.rs:419 [opt]
  frame #10: 0x000000010037017a memoird`arrow::array::builder::ListBuilder$LT$T$GT$::new::h16819112466ced47 at builder.rs:138 [opt]
  frame #11: 0x0000000100370169 memoird`arrow::array::builder::ListBuilder$LT$T$GT$::new::h16819112466ced47(values_builder=PrimitiveBuilder<arrow::datatypes::UInt8Type> {
      values_builder: BufferBuilder<arrow::datatypes::UInt8Type> {
          buffer: MutableBuffer {
              data: &0x100b96000,
              len: 0,
              capacity: 8192
          },
          len: 0,
          _marker: PhantomData<arrow::datatypes::UInt8Type> {}
      },
      bitmap_builder: BufferBuilder<arrow::datatypes::BooleanType> {
          buffer: MutableBuffer {
              data: &0x100be3000,
              len: 0,
              capacity: 1024
          },
          len: 0,
          _marker: PhantomData<arrow::datatypes::BooleanType> {}
      }
  }) at builder.rs:368 [opt]
  frame #12: 0x0000000100370d4c memoird`arrow::array::builder::BinaryBuilder::new::h8f11851f0863e756(capacity=<unavailable>) at builder.rs:670:21 [opt]

[Parquet] Too many open files (os error 24)

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6154

Used the [rust] parquet-read binary to read a deeply nested Parquet file and saw the stack trace below. Unfortunately I won't be able to upload the file.
stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::continue_panic_fmt
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: parquet::util::io::FileSource::new
   8: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_page_reader
   9: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_reader
  10: parquet::record::reader::TreeBuilder::reader_tree
  11: parquet::record::reader::TreeBuilder::reader_tree
  12: parquet::record::reader::TreeBuilder::reader_tree
  13: parquet::record::reader::TreeBuilder::reader_tree
  14: parquet::record::reader::TreeBuilder::reader_tree
  15: parquet::record::reader::TreeBuilder::build
  16: <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
  18: std::rt::lang_start::{{closure}}
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
  21: std::rt::lang_start_internal
  22: main

Allow Position to support arbitrary Cursor type

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8170

Hi, I'm currently writing an in-memory page writer in order to support a buffered row group writer (just like in the C++ version), and:

  • I'd like to reuse SerializedPageWriter
  • SerializedPageWriter requires that the sink supports util::Position (which is private)
  • There's a Position impl for Cursor, but it unnecessarily restricts the internal type to mutable references

So I'd like to make a one-line change to lift that type restriction and allow my implementation.
