connector_arrow's People

Contributors

aljazerzen, dependabot[bot]

Forkers

gopinathjcs

connector_arrow's Issues

perf: benchmarking

At the moment I have no idea how performant this connector is. Surely it is slower than using plain connections to the database and not converting to Arrow at all.

But how does it compare to:

  • pandas.read_sql,
  • polars.read_database,
  • connectorx,
  • ADBC,
  • dplyr?

Ideally, I would reuse the benchmarks from the connector-x project, but I'm not sure how portable they are.
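
For a starting point before porting the connector-x suite, a minimal local benchmark could look roughly like the sketch below. It assumes criterion, an in-memory SQLite database via rusqlite, and a `connector_arrow::query` convenience entry point; treat the exact call as an assumption rather than a confirmed API.

```rust
// Sketch only: assumes the `criterion`, `rusqlite`, and `connector_arrow`
// crates; `connector_arrow::query` is an assumption about the API, and the
// data volume is arbitrary.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_query(c: &mut Criterion) {
    // In-memory SQLite database with some rows to read back as Arrow.
    let mut conn = rusqlite::Connection::open_in_memory().unwrap();
    conn.execute_batch("CREATE TABLE t (a INTEGER, b TEXT);").unwrap();
    for i in 0..10_000 {
        conn.execute("INSERT INTO t VALUES (?1, 'x')", rusqlite::params![i])
            .unwrap();
    }

    c.bench_function("connector_arrow_select", |b| {
        b.iter(|| {
            // Assumed convenience entry point returning Arrow record batches.
            let batches = connector_arrow::query(&mut conn, "SELECT a, b FROM t").unwrap();
            black_box(batches)
        })
    });
}

criterion_group!(benches, bench_query);
criterion_main!(benches);
```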

Async support?

It looks like right now this crate is synchronous-only. Do you have any interest in expanding to async drivers? E.g. sqlx for postgres?

Expose upstream schema

👋 I saw you were looking for feedback, and I was curious whether you have, or are interested in exposing, support for user-defined data types? In particular, I'm working on building out support for geospatial data in Arrow in https://github.com/geoarrow/geoarrow-rs. We have a working but limited implementation that reads from PostGIS databases. It connects directly via sqlx, and it would be great to use a library like yours that focuses on converting database tables -> Arrow. But we need to be able to access the type name, which is "geometry" or "geography" on a PostGIS column.

I haven't looked through your API yet, but I think one way this could work is if you exposed both the Postgres schema and the inferred Arrow schema. That way geoarrow could consult the upstream schema and know that the BinaryArray column actually represents geospatial data.
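
One hedged way the "expose both schemas" idea could surface on the Arrow side is to attach the upstream type name as field metadata on the inferred schema. The `upstream_type` key below is a made-up convention for illustration, not something connector_arrow defines today.

```rust
// Sketch only: the "upstream_type" metadata key is a made-up convention,
// not something connector_arrow defines today.
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field, Schema};

/// An inferred Arrow field that remembers the upstream (PostGIS) type name.
fn geometry_field() -> Field {
    Field::new("geom", DataType::Binary, true).with_metadata(HashMap::from([(
        "upstream_type".to_string(),
        "geometry".to_string(),
    )]))
}

fn main() {
    let schema = Schema::new(vec![geometry_field()]);

    // A consumer like geoarrow could inspect the metadata and reinterpret the
    // Binary column as geometries instead of opaque bytes.
    let is_geo = schema.field(0).metadata().get("upstream_type").map(String::as_str)
        == Some("geometry");
    println!("column 0 is geospatial: {is_geo}");
}
```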

can PostgreSQL and SQLite store `LargeUtf8`?

In other words, do they support storing text/blob values whose length is larger than 2^32 bytes (4 GiB)?

If yes, then the type of TEXT should always be LargeUtf8 (and Utf8 is coerced into LargeUtf8).
If no, then the type of TEXT should always be Utf8 (and coercion is reversed).
Similar for binary types.

DuckDB kindly answers this question by returning a schema that contains Utf8 when you declare a column as VARCHAR or TEXT.

This means that it might not be possible to store some Arrow arrays in a database. We need an error for that, and we need to indicate that in `coerce_type`.
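
For illustration only, here is a standalone sketch of such a coercion policy; the `Coercion` enum and this `coerce_type` signature are hypothetical and not the crate's actual API.

```rust
// Illustrative only: a hypothetical coercion policy, not connector_arrow's
// actual `coerce_type`.
use arrow::datatypes::DataType;

/// What to do with an Arrow type before storing it in a given database.
enum Coercion {
    /// Store as-is.
    Keep(DataType),
    /// Store as a different (compatible) type.
    Coerce(DataType),
    /// The database cannot represent this type at all.
    Unsupported,
}

/// Example policy for a database whose TEXT/BLOB columns use 32-bit offsets.
fn coerce_type(ty: &DataType) -> Coercion {
    match ty {
        // 32-bit offset types map directly onto TEXT/BLOB.
        DataType::Utf8 | DataType::Binary => Coercion::Keep(ty.clone()),
        // 64-bit offset types are narrowed; a value that does not fit would
        // then have to surface as a runtime error when writing.
        DataType::LargeUtf8 => Coercion::Coerce(DataType::Utf8),
        DataType::LargeBinary => Coercion::Coerce(DataType::Binary),
        // Other types are out of scope for this sketch.
        _ => Coercion::Unsupported,
    }
}

fn main() {
    match coerce_type(&DataType::LargeUtf8) {
        Coercion::Keep(ty) => println!("keep {ty}"),
        Coercion::Coerce(ty) => println!("coerce to {ty}"),
        Coercion::Unsupported => println!("cannot be stored"),
    }
}
```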

Future of this crate and ADBC

ADBC has recently updated their Rust API. It is quite similar to the ADBC C API, so it would be possible to export a Rust ADBC driver as a C library to be used from any language via FFI.

The downside is that it is "low-level" and easy to misuse (e.g. by not setting an SQL query on a statement, by passing in incorrectly formatted options, or by not closing a statement but instead just dropping it). Such misuse would lead to run-time errors or undefined behavior.

This API is therefore "unsafe" and not suitable for general use in the Rust ecosystem. Instead, I suggest we create a high-level wrapper API that would be very hard or impossible to misuse but would use the "unsafe" API internally.

This way, it would also be possible to use non-Rust ADBC drivers from Rust via FFI.


Now, regarding this crate, I plan to:

  • use connector_arrow::api as a base to define the high-level ADBC API,
  • convert the connectors in the crate to implement the "unsafe" ADBC API.

(writing this, I realize we need a good name for the high-level ADBC API)
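
As a rough sketch of the "hard to misuse" wrapper idea: the SQL query is required at construction time and closing is tied to `Drop`, so the misuses listed above become unrepresentable. The `raw` module below stands in for a hypothetical low-level ADBC binding; none of these names are real APIs.

```rust
// Sketch only: `raw` stands in for a hypothetical low-level ("unsafe to
// misuse") ADBC binding; none of these names are real APIs.
mod raw {
    pub struct Statement;
    impl Statement {
        pub fn new() -> Self {
            Statement
        }
        pub fn set_sql_query(&mut self, _sql: &str) {}
        pub fn execute(&mut self) -> Vec<String> {
            vec![]
        }
        pub fn close(&mut self) {}
    }
}

/// High-level statement: cannot exist without a query, cannot be left open.
pub struct Statement {
    inner: raw::Statement,
}

impl Statement {
    /// The SQL text is required up front, so "execute without setting a
    /// query" is unrepresentable.
    pub fn prepare(sql: &str) -> Self {
        let mut inner = raw::Statement::new();
        inner.set_sql_query(sql);
        Statement { inner }
    }

    pub fn execute(&mut self) -> Vec<String> {
        self.inner.execute()
    }
}

impl Drop for Statement {
    /// Closing is tied to ownership: dropping the wrapper always closes.
    fn drop(&mut self) {
        self.inner.close();
    }
}

fn main() {
    let mut stmt = Statement::prepare("SELECT 1");
    let _rows = stmt.execute();
    // `stmt` is closed automatically when it goes out of scope.
}
```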

perf: transport should do branching ahead of the hot path

When moving values from a producer to a consumer, we use the `crate::utils::transport::transport` function.

It contains branching that depends on the type of the field being transported.
In most cases, we iterate over an array (or multiple arrays), which means the fields repeat a lot.

It would (probably) be much more performant to have transporter functions instead, each specialized for a specific type and created prior to iterating over the arrays.

In the case of converting rows into RecordBatch, we could have these functions in an array.
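
Below is a hedged sketch of that idea using boxed closures, with simplified stand-in value and builder handling rather than the crate's actual transport code: the type branch happens once per column, and the row loop only calls the prepared closures.

```rust
// Sketch only: simplified stand-ins for the producer's values; the real
// transport code in the crate looks different.
use arrow::array::{ArrayBuilder, Int64Builder, StringBuilder};
use arrow::datatypes::DataType;

/// A cell value as produced by some row-oriented driver.
enum Value {
    Int(i64),
    Text(String),
}

/// A transporter moves one cell into one column builder.
type Transporter = Box<dyn Fn(&Value, &mut dyn ArrayBuilder)>;

/// Branch on the field type once, ahead of the hot loop.
fn make_transporter(ty: &DataType) -> Transporter {
    match ty {
        DataType::Int64 => Box::new(|v: &Value, b: &mut dyn ArrayBuilder| {
            let b = b.as_any_mut().downcast_mut::<Int64Builder>().unwrap();
            if let Value::Int(i) = v {
                b.append_value(*i);
            }
        }),
        DataType::Utf8 => Box::new(|v: &Value, b: &mut dyn ArrayBuilder| {
            let b = b.as_any_mut().downcast_mut::<StringBuilder>().unwrap();
            if let Value::Text(s) = v {
                b.append_value(s);
            }
        }),
        _ => unimplemented!("sketch only covers Int64 and Utf8"),
    }
}

fn main() {
    let types = [DataType::Int64, DataType::Utf8];
    // One transporter per column, created before iterating over the rows.
    let transporters: Vec<Transporter> = types.iter().map(make_transporter).collect();

    let mut builders: Vec<Box<dyn ArrayBuilder>> =
        vec![Box::new(Int64Builder::new()), Box::new(StringBuilder::new())];

    let rows = vec![
        vec![Value::Int(1), Value::Text("a".into())],
        vec![Value::Int(2), Value::Text("b".into())],
    ];

    // The per-field type match is gone from the hot loop; it only calls the
    // prepared closures. Finishing the builders into a RecordBatch is omitted.
    for row in &rows {
        for (i, cell) in row.iter().enumerate() {
            transporters[i](cell, builders[i].as_mut());
        }
    }
}
```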
