Comments (10)

houqp commented on May 17, 2024

@gopik it will be part of datafusion, see apache/arrow-datafusion#907

from delta-rs.

houqp commented on May 17, 2024

Yeah, unfortunately, datafusion uses the arrow parquet readers, which only support local files at the moment: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L181. I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

@nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)

sd2k commented on May 17, 2024

> I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

Makes sense to me!

> @nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)

Sounds good, I'll keep an eye on it and try to contribute an Azure reader when the time comes.

nevi-me commented on May 17, 2024

What could work in the interim is to use DataFusion's in-memory datasource (https://docs.rs/datafusion/2.0.0/datafusion/datasource/memory/index.html). Once we have async support in the Parquet reader, we can switch to the relevant methods.

meastham commented on May 17, 2024

@nevi-me is there a bug anywhere to track S3 support? I took a brief look in the Arrow and DataFusion repos and didn't find anything. If you're open to it, it's something that we could potentially look into contributing.

houqp commented on May 17, 2024

@meastham feel free to start a discussion about S3 support in the upstream datafusion GitHub repo or on the arrow dev mailing list.

gopik commented on May 17, 2024

Given the object store support in datafusion, can a blob path integration be implemented, assuming we have an appropriate blob store implementation of the object_store interface?
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/object_store/mod.rs

I understand that, given this, we can pass file names prefixed with the appropriate storage handler name from delta-rs. My question is: is the datafusion execution plan integration with this data source complete, or is it still in progress?
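
To make the idea of prefixing file names with a storage handler name concrete, here is a minimal, standard-library-only sketch of routing a path to a backend by its scheme prefix. Everything in it is hypothetical and for illustration only: `BlobStore`, `InMemoryStore`, and `StoreRegistry` are made-up names, not DataFusion's `ObjectStore` trait or any delta-rs API, and the in-memory backend merely stands in for a real S3 or Azure client.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for an object store interface.
/// This is NOT DataFusion's actual trait; it only illustrates the idea.
pub trait BlobStore {
    /// Return the full contents of the object stored at `path`.
    fn read(&self, path: &str) -> Result<Vec<u8>, String>;
}

/// Toy in-memory backend, used here in place of a real cloud client.
pub struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl InMemoryStore {
    pub fn new() -> Self {
        InMemoryStore { objects: HashMap::new() }
    }

    /// Store `data` under `path`, replacing any previous object.
    pub fn put(&mut self, path: &str, data: &[u8]) {
        self.objects.insert(path.to_string(), data.to_vec());
    }
}

impl BlobStore for InMemoryStore {
    fn read(&self, path: &str) -> Result<Vec<u8>, String> {
        self.objects
            .get(path)
            .cloned()
            .ok_or_else(|| format!("object not found: {}", path))
    }
}

/// Maps a scheme prefix such as "s3" to the store that handles it.
pub struct StoreRegistry {
    stores: HashMap<String, Box<dyn BlobStore>>,
}

impl StoreRegistry {
    pub fn new() -> Self {
        StoreRegistry { stores: HashMap::new() }
    }

    /// Register `store` as the handler for `scheme`.
    pub fn register(&mut self, scheme: &str, store: Box<dyn BlobStore>) {
        self.stores.insert(scheme.to_string(), store);
    }

    /// Split `uri` at "://" into scheme and path, then read the path
    /// from whichever store is registered for that scheme.
    pub fn read(&self, uri: &str) -> Result<Vec<u8>, String> {
        let (scheme, path) = uri
            .split_once("://")
            .ok_or_else(|| format!("missing scheme in URI: {}", uri))?;
        let store = self
            .stores
            .get(scheme)
            .ok_or_else(|| format!("no store registered for scheme: {}", scheme))?;
        store.read(path)
    }
}
```

With an `InMemoryStore` registered under `"s3"`, calling `read("s3://bucket/table/part-0.parquet")` on the registry returns the stored bytes, while a URI with an unregistered scheme or no scheme at all returns an error. A real integration would also need listing and ranged or async reads, which this sketch leaves out.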

houqp commented on May 17, 2024

@gopik yes, we are waiting on upstream object store support for S3. The datafusion execution plan integration is complete apart from partition column support, which should be fairly straightforward to add.

gopik commented on May 17, 2024

@houqp When you say upstream object store support for S3, will that be part of the datafusion project, or will it be part of an integration that embeds datafusion?

roeap commented on May 17, 2024

With the adoption of object_store, the datafusion integration now supports all storage backends - there are integration tests as well :).

https://github.com/delta-io/delta-rs/blob/main/rust/tests/integration_datafusion.rs
