Comments (10)
@gopik it will be part of datafusion, see apache/arrow-datafusion#907
from delta-rs.
Yeah, unfortunately, datafusion uses the arrow parquet readers, which only support local files at the moment: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L181. I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.
@nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storage, I would recommend collaborating with him :)
from delta-rs.
> I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

Makes sense to me!

> @nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storage, I would recommend collaborating with him :)

Sounds good, I'll keep an eye on it and try to contribute an Azure reader when the time comes.
from delta-rs.
What could work in the interim is to use DataFusion's in-memory datasource (https://docs.rs/datafusion/2.0.0/datafusion/datasource/memory/index.html). Once we have async support in the Parquet reader, we can switch to the relevant methods.
from delta-rs.
@nevi-me is there an issue anywhere to track S3 support? I took a brief look in the Arrow and Datafusion repos and didn't find anything. If you're open to it, it's something that we could potentially look into contributing.
from delta-rs.
@meastham feel free to start a discussion about S3 support in the upstream datafusion github repo or on the arrow dev mailing list.
from delta-rs.
Given the object store support in datafusion, can a blob path integration be implemented, assuming we have an appropriate blob store implementation of the object_store interface?
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/object_store/mod.rs
I understand that, given this, we can pass file names prefixed with the appropriate storage handler name from delta-rs. My question is: is the datafusion execution plan integration with this data source complete, or is it still in progress?
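To make the idea of prefixed file names concrete, here is a minimal sketch of scheme-based dispatch. It uses only the standard library. The `ObjectStore` trait, `InMemoryStore`, `Registry`, and the `memory` scheme are all hypothetical names made up for illustration; this is not the actual datafusion object_store interface, only the general shape of routing a prefixed path to a registered backend.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object store interface.
/// NOT the real datafusion trait; it only illustrates the idea.
trait ObjectStore {
    fn read(&self, path: &str) -> Result<Vec<u8>, String>;
}

/// Toy backend that keeps objects in memory, standing in for a blob store.
struct InMemoryStore {
    files: HashMap<String, Vec<u8>>,
}

impl ObjectStore for InMemoryStore {
    fn read(&self, path: &str) -> Result<Vec<u8>, String> {
        self.files
            .get(path)
            .cloned()
            .ok_or_else(|| format!("no such object: {}", path))
    }
}

/// Maps a URI scheme (the "storage handler name") to the store that serves it.
#[derive(Default)]
struct Registry {
    stores: HashMap<String, Box<dyn ObjectStore>>,
}

impl Registry {
    fn register(&mut self, scheme: &str, store: Box<dyn ObjectStore>) {
        self.stores.insert(scheme.to_string(), store);
    }

    /// Splits "scheme://path" and returns the matching store plus the remaining path.
    fn resolve<'a>(&self, uri: &'a str) -> Result<(&dyn ObjectStore, &'a str), String> {
        let (scheme, path) = uri
            .split_once("://")
            .ok_or_else(|| format!("missing scheme in {}", uri))?;
        let store = self
            .stores
            .get(scheme)
            .ok_or_else(|| format!("no store registered for scheme {}", scheme))?;
        Ok((store.as_ref(), path))
    }
}
```

With something like this, a caller would register one backend per scheme and hand the engine fully prefixed file names; a path whose scheme has no registered backend fails at resolution time rather than silently falling back to the local file system.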
from delta-rs.
@gopik yes, we are waiting on upstream object store support for S3. The datafusion execution plan integration is complete apart from partition column support, which should be fairly straightforward to add.
from delta-rs.
@houqp When you say upstream object store support for S3, will that be part of the datafusion project, or will it be part of an integration that embeds datafusion?
from delta-rs.
With the adoption of object_store, the datafusion integration now supports all storage backends - there are integration tests as well :).
https://github.com/delta-io/delta-rs/blob/main/rust/tests/integration_datafusion.rs
from delta-rs.