aljazerzen / connector_arrow

Apache Arrow database client for many databases.
Home Page: https://docs.rs/connector_arrow
License: MIT License
This will panic (I think; I didn't test it):
a     | b
------|-----
'val' | 1.0
'str' | 2
Make sure to catch this case too:
a     | b
------|-----
NULL  | 1.0
'str' | 2
Maybe fall back to an arrow::datatypes::DataType::Null field?
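The fallback idea above could be sketched like this. This is a hypothetical sketch, not connector_arrow's actual code, and it uses a simplified stand-in enum instead of arrow's real `DataType`: infer the column type from the per-cell types and fall back to `Null` when the cells disagree, instead of panicking.

```rust
// Simplified stand-in for arrow::datatypes::DataType; illustrative only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum InferredType {
    Utf8,
    Float64,
    Null,
}

// Hypothetical helper: walk a column's cells and pick one type.
// NULL cells carry no type information; conflicting cells fall back to Null.
fn infer_column_type(cells: &[Option<InferredType>]) -> InferredType {
    let mut result: Option<InferredType> = None;
    for cell in cells {
        match (result, cell) {
            (_, None) => {}                       // NULL cell: skip
            (None, Some(t)) => result = Some(*t), // first typed cell
            (Some(r), Some(t)) if r == *t => {}   // consistent with earlier cells
            _ => return InferredType::Null,       // conflict: fall back to Null
        }
    }
    result.unwrap_or(InferredType::Null)
}
```

With this rule, both tables above produce a usable schema: the mixed `'val'`/`1.0` column becomes `Null`, and the `NULL`/`'str'` column becomes `Utf8`.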
ATM I have no idea how performant this connector is. Surely it is slower than using plain connections to the database without converting to Arrow at all.
But how does it compare to pandas.read_sql, polars.read_database, connectorx, ADBC, and dplyr?
Ideally, I would reuse the benchmarks from the connector-x project, but I'm not sure how portable they are.
It looks like right now this crate is synchronous-only. Do you have any interest in expanding to async drivers? E.g. sqlx for postgres?
👋 I saw you were looking for feedback and I was curious whether you have support for, or are interested in exposing support for, user-defined data types? In particular, I'm working on building out support for geospatial in arrow in https://github.com/geoarrow/geoarrow-rs. We have a working but limited implementation that reads from PostGIS databases. It directly connects to sqlx, and it would be great to use a library like yours that focuses on converting database tables -> arrow. But we need to be able to access the type name, which is "geometry" or "geography" on a PostGIS column.
I haven't looked through your API yet, but I think one way this could work is if you exposed both the Postgres schema and the inferred Arrow schema? Because then geoarrow could access the upstream schema and know that the BinaryArray column actually represents geospatial data.
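To make the suggestion concrete, here is a minimal sketch of what exposing both schemas could look like. All names here are hypothetical (this is not connector_arrow's actual API): each column carries the database's own type name alongside the inferred Arrow type, so a downstream crate like geoarrow can recognize that a binary column is really PostGIS geometry.

```rust
// Hypothetical column descriptor pairing the upstream database type name
// with the Arrow type it was mapped to (a string stand-in here, for brevity).
#[derive(Debug, Clone)]
struct ColumnSchema {
    name: String,
    db_type_name: String, // e.g. "geometry", "geography", "text"
    arrow_type: String,   // e.g. "Binary", "Utf8"
}

// A consumer like geoarrow could then detect geospatial columns by the
// upstream type name, even though the Arrow type is just Binary.
fn is_geospatial(col: &ColumnSchema) -> bool {
    matches!(col.db_type_name.as_str(), "geometry" | "geography")
}
```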
In other words, do they support storing text/blob values with length larger than 2^32 bytes (4 GiB)?
If yes, then the type of TEXT should always be LargeUtf8 (and Utf8 is coerced into LargeUtf8).
If no, then the type of TEXT should always be Utf8 (and the coercion is reversed).
Similar for binary types.
DuckDB kindly answers this question by returning a schema that contains Utf8 when you declare a column as VARCHAR or TEXT.
This means that it might not be possible to store an Arrow array in a database. We need an error for that, and we need to indicate that in coerce_type.
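The rule described above could be sketched as follows. This is a hypothetical simplification (the real coerce_type works on arrow's `DataType`, not this stand-in enum): the direction of the Utf8/LargeUtf8 coercion depends only on whether the store supports values larger than 4 GiB.

```rust
// Simplified stand-in for the string variants of arrow::datatypes::DataType.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TextType {
    Utf8,
    LargeUtf8,
}

// Hypothetical coercion rule: a store that supports >4 GiB values always
// reads TEXT as LargeUtf8; a store that does not always reads it as Utf8.
// (A store that cannot hold the data at all would need to raise an error
// instead, which is the gap noted above.)
fn coerce_text_type(_declared: TextType, supports_large_values: bool) -> TextType {
    if supports_large_values {
        TextType::LargeUtf8
    } else {
        TextType::Utf8
    }
}
```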
ADBC has recently updated their Rust API. It is quite similar to the ADBC C API, so it would be possible to export a Rust ADBC driver as a C library to be used from any language via FFI.
The downside is that it is "low-level" and easy to misuse (e.g. by not setting an SQL query on a statement, by passing in incorrectly formatted options, or by not closing a statement but instead just dropping it). Such misuse would lead to run-time errors or undefined behavior.
This API is therefore "unsafe" and not suitable for general use in the Rust ecosystem. Instead, I suggest we create a high-level wrapper API that would be very hard or impossible to misuse, but would use the "unsafe" API internally.
This way, it would also be possible to use non-Rust ADBC drivers from Rust via FFI.
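As a sketch of what "hard or impossible to misuse" could mean here (all names hypothetical, and the raw handle is a toy stand-in for the low-level ADBC statement): the wrapper only lets you construct a statement with a query already set, and its Drop implementation closes the underlying handle, so "just dropping it" can no longer leak resources.

```rust
// Toy stand-in for the low-level, easy-to-misuse statement handle.
struct RawStatement {
    closed: bool,
}

impl RawStatement {
    fn close(&mut self) {
        self.closed = true;
    }
}

// Hypothetical high-level wrapper: cannot exist without a query,
// and cannot be dropped without closing the underlying handle.
struct Statement {
    raw: RawStatement,
    query: String,
}

impl Statement {
    // The only constructor requires a query, so "execute with no SQL set"
    // is unrepresentable.
    fn new(query: impl Into<String>) -> Self {
        Statement {
            raw: RawStatement { closed: false },
            query: query.into(),
        }
    }

    fn execute(&self) -> String {
        format!("executed: {}", self.query)
    }
}

impl Drop for Statement {
    // Dropping the wrapper closes the raw handle on the user's behalf.
    fn drop(&mut self) {
        self.raw.close();
    }
}
```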
Now, regarding this crate, I plan to use connector_arrow::api as a base to define the high-level ADBC API. (Writing this, I realize we need a good name for the high-level ADBC API.)
When moving values from a producer to a consumer, we use the crate::utils::transport::transport function.
It branches on the type of the field being transported.
In most cases, we iterate over an array (or multiple arrays), which means that fields repeat a lot.
It would (probably) be much more performant to have specialized transporter functions instead, one per concrete type, created prior to iterating over the arrays.
In the case of converting rows into a RecordBatch, we could keep these functions in an array.
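The idea above can be sketched like this. It is a simplified illustration with toy types (not the crate's real transport machinery): the branch on the field type happens once, when the transporter closure is built, and the hot loop then calls the already-specialized closure.

```rust
// Toy stand-ins for field types and transported values.
#[derive(Clone, Copy)]
enum FieldType {
    Int64,
    Utf8,
}

#[derive(Debug, PartialEq)]
enum Value {
    Int64(i64),
    Utf8(String),
}

// A transporter specialized for one field type.
type Transporter = Box<dyn Fn(&str) -> Value>;

// Branch on the field type once, before iteration, instead of per cell.
fn make_transporter(ty: FieldType) -> Transporter {
    match ty {
        FieldType::Int64 => Box::new(|raw| Value::Int64(raw.parse().unwrap())),
        FieldType::Utf8 => Box::new(|raw| Value::Utf8(raw.to_string())),
    }
}

fn transport_column(ty: FieldType, raw_cells: &[&str]) -> Vec<Value> {
    let transport = make_transporter(ty);
    // Hot loop: no type dispatch here, only the specialized closure call.
    raw_cells.iter().map(|cell| transport(cell)).collect()
}
```

For row-to-RecordBatch conversion, one such closure per column could be stored in a `Vec<Transporter>` indexed by column position.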
The following table names will fail on all data stores:
my"table
my'table
my"""table
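A minimal sketch of identifier quoting that should survive the names above, following the SQL standard's rule: wrap the name in double quotes and double any embedded double quote (single quotes need no escaping inside a double-quoted identifier). The function name is illustrative, not an existing API of this crate.

```rust
// Quote a table/column name as a SQL delimited identifier:
// my"table  ->  "my""table"
// my'table  ->  "my'table"
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}
```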
The latest release of this crate uses arrow 49; as of writing, the latest arrow release is version 52. Could we get a new release with a bumped arrow dependency?