connector_arrow's People

Contributors

aljazerzen, dependabot[bot]

Forkers

gopinathjcs

connector_arrow's Issues

perf: benchmarking

At the moment I have no idea how performant this connector is. Surely it is slower than using plain connections to the database and not converting to Arrow at all.

But how does it compare to:

  • pandas.read_sql,
  • polars.read_database,
  • connectorx,
  • ADBC,
  • dplyr?

Ideally, I would reuse the benchmarks from the connector-x project, but I'm not sure how portable they are.
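
For a starting point before porting the connector-x suite, a minimal local benchmark could look roughly like the sketch below. It assumes criterion, an in-memory SQLite database via rusqlite, and a `connector_arrow::query` convenience entry point; treat the exact call as an assumption rather than a confirmed API.

```rust
// Sketch only: assumes the `criterion`, `rusqlite`, and `connector_arrow`
// crates; `connector_arrow::query` is an assumption about the API, and the
// data volume is arbitrary.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_query(c: &mut Criterion) {
    // In-memory SQLite database with some rows to read back as Arrow.
    let mut conn = rusqlite::Connection::open_in_memory().unwrap();
    conn.execute_batch("CREATE TABLE t (a INTEGER, b TEXT);").unwrap();
    for i in 0..10_000 {
        conn.execute("INSERT INTO t VALUES (?1, 'x')", rusqlite::params![i])
            .unwrap();
    }

    c.bench_function("connector_arrow_select", |b| {
        b.iter(|| {
            // Assumed convenience entry point returning Arrow record batches.
            let batches = connector_arrow::query(&mut conn, "SELECT a, b FROM t").unwrap();
            black_box(batches)
        })
    });
}

criterion_group!(benches, bench_query);
criterion_main!(benches);
```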

Async support?

It looks like right now this crate is synchronous-only. Do you have any interest in expanding to async drivers? E.g. sqlx for postgres?

Expose upstream schema

👋 I saw you were looking for feedback, and I was curious whether you have, or are interested in exposing, support for user-defined data types? In particular, I'm working on building out support for geospatial data in Arrow in https://github.com/geoarrow/geoarrow-rs. We have a working but limited implementation that reads from PostGIS databases. It connects directly via sqlx, and it would be great to use a library like yours that focuses on converting database tables -> Arrow. But we need to be able to access the type name, which is "geometry" or "geography" on a PostGIS column.

I haven't looked through your API yet, but I think one way this could work is if you exposed both the Postgres schema and the inferred Arrow schema. That way geoarrow could consult the upstream schema and know that the BinaryArray column actually represents geospatial data.
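
One hedged way the "expose both schemas" idea could surface on the Arrow side is to attach the upstream type name as field metadata on the inferred schema. The `upstream_type` key below is a made-up convention for illustration, not something connector_arrow defines today.

```rust
// Sketch only: the "upstream_type" metadata key is a made-up convention,
// not something connector_arrow defines today.
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field, Schema};

/// An inferred Arrow field that remembers the upstream (PostGIS) type name.
fn geometry_field() -> Field {
    Field::new("geom", DataType::Binary, true).with_metadata(HashMap::from([(
        "upstream_type".to_string(),
        "geometry".to_string(),
    )]))
}

fn main() {
    let schema = Schema::new(vec![geometry_field()]);

    // A consumer like geoarrow could inspect the metadata and reinterpret the
    // Binary column as geometries instead of opaque bytes.
    let is_geo = schema.field(0).metadata().get("upstream_type").map(String::as_str)
        == Some("geometry");
    println!("column 0 is geospatial: {is_geo}");
}
```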

can PostgreSQL and SQLite store `LargeUtf8`?

In other words, do they support storing text/blob values whose length is larger than 2^32 bytes (4 GiB)?

If yes, then the type of TEXT should always be LargeUtf8 (and Utf8 is coerced into LargeUtf8).
If no, then the type of TEXT should always be Utf8 (and coercion is reversed).
Similar for binary types.

DuckDB kindly answers this question by returning a schema that contains Utf8 when you declare a column as VARCHAR or TEXT.

This means that it might not be possible to store some Arrow arrays in a database. We need an error for that, and we need to indicate that in `coerce_type`.
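
For illustration only, here is a standalone sketch of such a coercion policy; the `Coercion` enum and this `coerce_type` signature are hypothetical and not the crate's actual API.

```rust
// Illustrative only: a hypothetical coercion policy, not connector_arrow's
// actual `coerce_type`.
use arrow::datatypes::DataType;

/// What to do with an Arrow type before storing it in a given database.
enum Coercion {
    /// Store as-is.
    Keep(DataType),
    /// Store as a different (compatible) type.
    Coerce(DataType),
    /// The database cannot represent this type at all.
    Unsupported,
}

/// Example policy for a database whose TEXT/BLOB columns use 32-bit offsets.
fn coerce_type(ty: &DataType) -> Coercion {
    match ty {
        // 32-bit offset types map directly onto TEXT/BLOB.
        DataType::Utf8 | DataType::Binary => Coercion::Keep(ty.clone()),
        // 64-bit offset types are narrowed; a value that does not fit would
        // then have to surface as a runtime error when writing.
        DataType::LargeUtf8 => Coercion::Coerce(DataType::Utf8),
        DataType::LargeBinary => Coercion::Coerce(DataType::Binary),
        // Other types are out of scope for this sketch.
        _ => Coercion::Unsupported,
    }
}

fn main() {
    match coerce_type(&DataType::LargeUtf8) {
        Coercion::Keep(ty) => println!("keep {ty}"),
        Coercion::Coerce(ty) => println!("coerce to {ty}"),
        Coercion::Unsupported => println!("cannot be stored"),
    }
}
```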

Future of this crate and ADBC

ADBC has recently updated their Rust API. It is quite similar to the ADBC C API, so it would be possible to export a Rust ADBC driver as a C library to be used from any language via FFI.

The downside is that it is "low-level" and easy to misuse (e.g. by not setting an SQL query on a statement, by passing in incorrectly formatted options, or by not closing a statement but instead just dropping it). Such misuse would lead to run-time errors or undefined behavior.

This API is therefore "unsafe" and not suitable for general use in the Rust ecosystem. Instead, I suggest we create a high-level wrapper API that would be very hard or impossible to misuse but would use the "unsafe" API internally.

This way, it would also be possible to use non-Rust ADBC drivers from Rust via FFI.


Now, regarding this crate, I plan to:

  • use connector_arrow::api as a base to define the high-level ADBC API,
  • convert the connectors in the crate to implement the "unsafe" ADBC API.

(writing this, I realize we need a good name for the high-level ADBC API)
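
As a rough sketch of the "hard to misuse" wrapper idea: the SQL query is required at construction time and closing is tied to `Drop`, so the misuses listed above become unrepresentable. The `raw` module below stands in for a hypothetical low-level ADBC binding; none of these names are real APIs.

```rust
// Sketch only: `raw` stands in for a hypothetical low-level ("unsafe to
// misuse") ADBC binding; none of these names are real APIs.
mod raw {
    pub struct Statement;
    impl Statement {
        pub fn new() -> Self {
            Statement
        }
        pub fn set_sql_query(&mut self, _sql: &str) {}
        pub fn execute(&mut self) -> Vec<String> {
            vec![]
        }
        pub fn close(&mut self) {}
    }
}

/// High-level statement: cannot exist without a query, cannot be left open.
pub struct Statement {
    inner: raw::Statement,
}

impl Statement {
    /// The SQL text is required up front, so "execute without setting a
    /// query" is unrepresentable.
    pub fn prepare(sql: &str) -> Self {
        let mut inner = raw::Statement::new();
        inner.set_sql_query(sql);
        Statement { inner }
    }

    pub fn execute(&mut self) -> Vec<String> {
        self.inner.execute()
    }
}

impl Drop for Statement {
    /// Closing is tied to ownership: dropping the wrapper always closes.
    fn drop(&mut self) {
        self.inner.close();
    }
}

fn main() {
    let mut stmt = Statement::prepare("SELECT 1");
    let _rows = stmt.execute();
    // `stmt` is closed automatically when it goes out of scope.
}
```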

perf: transport should do branching ahead of the hot path

When moving values from a producer to a consumer, we use the `crate::utils::transport::transport` function.

It contains branching that depends on the type of the field being transported.
In most cases, we iterate over an array (or multiple arrays), which means the fields repeat a lot.

It would (probably) be much more performant to have transporter functions instead, each specialized for a specific type and created prior to iterating over the arrays.

In the case of converting rows into RecordBatch, we could have these functions in an array.
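
Below is a hedged sketch of that idea using boxed closures, with simplified stand-in value and builder handling rather than the crate's actual transport code: the type branch happens once per column, and the row loop only calls the prepared closures.

```rust
// Sketch only: simplified stand-ins for the producer's values; the real
// transport code in the crate looks different.
use arrow::array::{ArrayBuilder, Int64Builder, StringBuilder};
use arrow::datatypes::DataType;

/// A cell value as produced by some row-oriented driver.
enum Value {
    Int(i64),
    Text(String),
}

/// A transporter moves one cell into one column builder.
type Transporter = Box<dyn Fn(&Value, &mut dyn ArrayBuilder)>;

/// Branch on the field type once, ahead of the hot loop.
fn make_transporter(ty: &DataType) -> Transporter {
    match ty {
        DataType::Int64 => Box::new(|v: &Value, b: &mut dyn ArrayBuilder| {
            let b = b.as_any_mut().downcast_mut::<Int64Builder>().unwrap();
            if let Value::Int(i) = v {
                b.append_value(*i);
            }
        }),
        DataType::Utf8 => Box::new(|v: &Value, b: &mut dyn ArrayBuilder| {
            let b = b.as_any_mut().downcast_mut::<StringBuilder>().unwrap();
            if let Value::Text(s) = v {
                b.append_value(s);
            }
        }),
        _ => unimplemented!("sketch only covers Int64 and Utf8"),
    }
}

fn main() {
    let types = [DataType::Int64, DataType::Utf8];
    // One transporter per column, created before iterating over the rows.
    let transporters: Vec<Transporter> = types.iter().map(make_transporter).collect();

    let mut builders: Vec<Box<dyn ArrayBuilder>> =
        vec![Box::new(Int64Builder::new()), Box::new(StringBuilder::new())];

    let rows = vec![
        vec![Value::Int(1), Value::Text("a".into())],
        vec![Value::Int(2), Value::Text("b".into())],
    ];

    // The per-field type match is gone from the hot loop; it only calls the
    // prepared closures. Finishing the builders into a RecordBatch is omitted.
    for row in &rows {
        for (i, cell) in row.iter().enumerate() {
            transporters[i](cell, builders[i].as_mut());
        }
    }
}
```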
