Code Monkey home page Code Monkey logo

Comments (11)

roeap avatar roeap commented on June 18, 2024 2

it would be a real game changer the more I think about it.

@MrPowers - agreed.

Looked into it a little bit and as it turns out, it is more feasible then I thought. Most critical I think is to come up with an API that we want to commit to.

There are also two related things that we can deal with separately:

  1. executing queries
  2. specifying expressions (partition filters etc.)

Since we already make use of a full blown SQL engine (datafusion) we already have the capability to execute SQL and just need to expose an API. However executing SQL always (AFAIK) requires some sort of session, which we have tried to avoid so far.

looking at the example above, when we do

DeltaTable(...).query("select * where id1 = 'id016'")

what we are actually executing is

select * from <some relation name> where id1 = 'id016'

which would require us to inject some missing parts into the query. This is quite error prone as users might come up with complex queries.

In polars they introduced SQLContext which seems to be an "ad-hoc" session where you register relations.

Without having thought about it too deeply, just throwing some spontaneous thoughts out there on what abstractions we may offer ... DeltaLog, DeltaTable, DeltaLake.

DeltaLog would be targeted at users that want to inspect the log to answer questions like - "is there a new version available", "what files are included in the log", e.t.c.. Some methods currently on DeltaTable - e.g. files, file_uris, get_add_actions - and some new ones, like peek_next_commit could be exposed there.

This could of course easily be exposed on DeltaTable

class DeltaTable:
  def log(self) -> DeltaLog: ...

but also created separately, to avoid doing an actual scan of the log to get protocol and metadata (some users asked for this specifically).

DeltaLake would be our session-like thing and allow you to register tables and or catalogs as well as execute queries.

class DeltaLake:
  def register_table(self, table: PathLike | DeltaTable): ...
  def register_catalog(self, ...):
  def sql(self, query: str) -> pyarrow.Table: ...

Again, just throwing things out there for discussion :)

from delta-rs.

djouallah avatar djouallah commented on June 18, 2024 1

just expose a SQL interface, for Python API no one seems to agree on anything !!!

from delta-rs.

emcake avatar emcake commented on June 18, 2024 1

Other options may be to follow the DuckDB pattern of letting you name the relation:

table = DeltaTable(...)
table.sql("my_name", "select * from my_name where id = 'foo'")

Or kwargs style parsing:

delta_table_a = DeltaTable(...)
delta_table_b = DeltaTable(...)

deltalake.sql("select a.id, a.foo, b.bar from relation_a a join relation_b b on a.id = b.id", relation_a = delta_table_a, relation_b = delta_table_b")

from delta-rs.

emcake avatar emcake commented on June 18, 2024 1

The thing that still confuses me though is the hierarchy ... in @emcake's example we are assuming we have to higher level DeltaLake thing, right? since its deltalake.sql(...)

I was imagining just exposing it as a function in the package. In a sense it's nicer that way than doing it on a particular table, because it allows you to specify a query with multiple tables.

from delta-rs.

roeap avatar roeap commented on June 18, 2024

With the gradual move to logical plans, we are also moving closer and closer to being able properly expose a SQL API on the rust level as well. After which exposing this in Python should be straight forward :).

There is however some work to be done before we can get that to work. IF we want something quicker in python we could likely find a way by leveraging some other library. We would of course need to be careful not to loose pushdown optimizations etc...

from delta-rs.

MrPowers avatar MrPowers commented on June 18, 2024

@roeap - no rush on this one. But it would be a real game changer the more I think about it.

from delta-rs.

MrPowers avatar MrPowers commented on June 18, 2024

Great points @roeap!

Here's the syntax for querying a table with DataFusion:

from datafusion import SessionContext
ctx = SessionContext()
table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
ctx.register_dataset("my_dataset", table.to_pyarrow_dataset())

I like the idea of this user interface:

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
table.register_table("my_dataset")
table.query("select id1, id1, v1 from my_dataset where id1 = 'id016'")

Let me know what you think!

from delta-rs.

MrPowers avatar MrPowers commented on June 18, 2024

@emcake - I don't like the DuckDB pattern of letting you name the relation, but the kwargs style parsing is a great idea. Spark just added this too. This would be really nice:

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
table.query("select id1, id1, v1 from {my_dataset} where id1 = 'id016'", my_dataset=table)

from delta-rs.

roeap avatar roeap commented on June 18, 2024

Personally I like the kwargs-syle idea very much! its clean, and users can make the decision right where they need to ...

The thing that still confuses me though is the hierarchy ... in @emcake's example we are assuming we have to higher level DeltaLake thing, right? since its deltalake.sql(...)

@MrPowers - i very much understand the urge to keep things simple and flat, but this notion would always require a reference to itself, so I could not write something like ...

result = DeltaTable(...).sql("select ...", relation=<what goes here?>)

which would work with @emcake first suggestion

result = DeltaTable(...).sql("my_table", "select ...")

If we brings back catalogues into python (and I hope we do) I think we will need a notion of something above table, since a catalogue usually contains more then one table.

from delta-rs.

roeap avatar roeap commented on June 18, 2024

I was imagining just exposing it as a function in the package

ah that actually makes a lot of sense, and is very much in line with what we are doing with write_deltalake ---

from delta-rs.

djouallah avatar djouallah commented on June 18, 2024

how about write, insert , merge, will those scenarios be supported ?

from delta-rs.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.