Comments (5)
Update here: we merged #9593 and now will work on increasing the test coverage to enable it by default (tracked in #10336)
from arrow-datafusion.
Would this ticket be an appropriate place to add tickets related to pushing down sorts to federated query engines? I know that this was discussed previously (i.e. #7871) and it seems that writing a custom optimizer is the current way to handle that.
I added #7871 to the list above -- thank you.
Yes I think this would be a good place to discuss
I will need to do this soon (federated sort pushdown) and it initially wasn't clear to me how to make this work in DataFusion. I can volunteer to write some docs on how to do this once I have an implementation that works.
That would be great, thanks @phillipleblanc
Right now, once TableProvider::execute
gets called, the returned ExecutionPlan
can report how it is already sorted.
What we don't have is any way to have the optimizer tell a ExecutionPlan
that it could reduce the work required in the DataFusion plan if the data was already sorted.
I wonder if we could add something to ExecutionPlan
trait similar to ExecutionPlan::repartitioned
like
trait ExecutionPlan {
...
/// return other possible orders that this ExecutionPlan could return
/// (the DataFusion optimizer will use this information to potentially push Sorts
/// into the Node
fn pushable_sorts(&self) -> Result<Option<PotentialSortOrders>>> {
return Ok(None)
}
/// return a node like this one except that it its output is sorted according to exprs
fn resorted(&self) -> Result<Option<Arc<dyn ExecutionPlan>>> {
return Ok(None)
}
And then add a new optimizer pass that tries to push sorts into the plan nodes that report they can provide sorted data 🤔
from arrow-datafusion.
After digging into and understanding how the datafusion-federation
crate works, I don't think we need anything additional for sort pushdown. I basically came to the same realization that @backkem had in #7871 (comment).
My realization essentially comes down to (please correct me if this is incorrect):
DataFusion is a library that provides both query planning (LogicalPlan
) and query execution (ExecutionPlan
). When we are connecting a set of tables from a remote query engine into DataFusion, what we really want is the ability to get an optimized logical plan and send that plan to be executed by the remote query engine - in its entirety, bypassing the query execution of DataFusion as much as possible. (In reality we still want the query execution DataFusion provides for more complex queries that involve custom UDFs, joins between two different remote query engines, etc).
The TableProvider
construct is part of the query execution (ExecutionPlan
level) machinery of DataFusion, so trying to teach it to be smarter for the query federation case is an anti-pattern in my mind. But we still need a TableProvider
to be registered so we can take advantage of the logical planning (via the auto-transformation of a TableProvider
to a TableSource
in said planning). The datafusion-federation
repo solves this by using a thin wrapper around a TableProvider
called a FederatedTableProviderAdaptor
whose entire job is to provide a TableSource
during logical planning. And through a custom FederationQueryPlanner
- it recognizes when there are TableScan
s of a FederatedTableProviderAdaptor
and knows to delegate the query execution for the largest LogicalPlan sub-tree that includes only TableScan
s from the same source to that source (via the deparsing back to SQL).
from arrow-datafusion.
Would this ticket be an appropriate place to add tickets related to pushing down sorts to federated query engines? I know that this was discussed previously (i.e. #7871) and it seems that writing a custom optimizer is the current way to handle that.
I will need to do this soon (federated sort pushdown) and it initially wasn't clear to me how to make this work in DataFusion. I can volunteer to write some docs on how to do this once I have an implementation that works.
from arrow-datafusion.
FYI @NGA-TRAN is working on porting ProgressiveEval to DataFusion: #10488
from arrow-datafusion.
Related Issues (20)
- A valid SQL query returned 'Schema error: ...' (SQLancer) HOT 1
- Optimize CASE expression for "expr or expr" usage HOT 1
- Allow custom planning behavior for selecting wildcard expression HOT 11
- functions_aggregate::expr_fn::approx_distinct should expose the function, not the module
- Support per-option value normalization
- Clippy CI failures on main after Rust 1.80 release HOT 2
- Allow comparison of Timestamps with different Timezones HOT 4
- Get Clippy clean for Rust 1.80 and run it on CI HOT 5
- `join_inner_two`, `join_inner_one_two_parts_left` and `join_inner_one_two_parts_right` fail with `force_hash_collisions` HOT 1
- `equijoin_full_and_condition_from_both` slt test fails with `force_hash_collisions`
- Window function test fails when `force_hash_collisions` is enabled
- Use `AccumulatorArgs::is_reversed` in `NthValueAgg`
- circular dependency check CI check is failing with compile error HOT 1
- `pushdown_sorts` pushes a SortExec through a node in violation of its stated input ordering requirements HOT 1
- Implement native `StringView` support for CharacterLength HOT 2
- [Epic] High cardinality aggregation performance wishlist HOT 2
- Improve performance of high cardinality grouping by reusing hash values HOT 15
- Enable `datafusion.execution.parquet.schema_force_string_view` by default
- Remove unnecessary allocations in `struct` and `named_struct`
- Eliminate distinct of min/max and rewrite count wildcard family in ExprBuilder
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.