Comments (10)
Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!
In order to make it work, you might be able to implement a custom iterator like
enum InputIterator<'a> {
SingleInput(&LogicalPlan),
VecInput(&[LogicalPlan])),
..
}
And then implement the Iterator
trait appropriately
from arrow-datafusion.
If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait function.
I agree your (clearly very impressive skills) might be better spent on other projects.
Is there any particular issue or area you are interested in working on?
from arrow-datafusion.
This is very cool -- is the code somehere pubic?
I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.
I also think adding an example of SQL analysis would be nice. I'll move to #10871 so this can be closed.
from arrow-datafusion.
I can make my implementation public AcyclicJoin Node. In short, it introduces a new logical node (acyclic join) that is coupled with a physical node (join impl of that acyclic join), but the physical node is part of that PhD project, which I can't share. To create those acyclic join nodes, I had to detect if a particular join tree (i.e. subsequent joins in a LogicalPlan) is acyclic or not.
That is very cool -- thank you for sharing @LorrensP-2158466 . Since DataFusion is totally open it is always cool to see what people are doing with it.
(BTW if you need a paper about DataFusion to cite -- we now have one https://dl.acm.org/doi/10.1145/3626246.3653368 :) )
from arrow-datafusion.
I believe @peter-toth and @ozankabak and I discussed something similar in #10543 (comment)
The new reference walking APIs added in #10543 (which I don't think are released yet) may also be related -- specifically TreeNode::visit
can now return references that can be kepy of the underlying nodes
from arrow-datafusion.
@LorrensP-2158466, I as far as I get you just want to change the return type of LogicalPlan::inputs()
to use iterators instead of the current Vec
. I think that is not really related to #10543 and actually can be a good idea to avoid unnecessary Vec
allocations.
But I feel you will run into problems with the impl Iterator<Item = &LogicalPlan>
return type as it doesn't allow returning different iterator implementations on different LogicalPlan
s. (And I think we want to avoid dyn Iterator
too.)
But please give it a try if you have some time...
from arrow-datafusion.
Oh yeah, you are right, I don't know why I didn't think about that. I will try it anyway, maybe I will come up with a different solution. Thanks Peter, for letting me know!
I will also look at #10543 for extra info.
from arrow-datafusion.
So I have found a solution that i can compile, but at this stage I'm not very happy with it.
I will first try to explain my reasoning as to how i got to my current solution, it is a bit much, so if you are only interested the final you can Jump to the end. I did underestimate this problem...
Explanation
I first tried the solution Andrew mentioned, but it runs into following problems:
- The Join Plans and Recursive Query
Both the Join and CrossJoin hold the left and right inputs separately in an Arc, so we need another variantDouble()
which needs to know if the first input is retrieved and than the second. This is because you can't transform both the Arc's into a&[LogicalPlan]
. This also applies to Recursive Queries which have 2 inputs. - The
Union
node holds aVec<Arc<LogicalPlan>>
which you also can't transform safely into a&[LogicalPlan]
. - The
Extension
node doesn't allow us to know how the inputs are stored, we have to assume this can have multiple inputs all contained in Arc's, so we get the same problem as theUnion
Node but a little worse because some Extension nodes just have 1 input but we treat them as having multiple.
To support all these cases i came up with:
pub enum InputIterator<'parent> {
Empty(Empty<&'parent LogicalPlan>),
Single(Once<&'parent LogicalPlan>),
Double(DoubleIter<&'parent LogicalPlan>), // DoubleIter<T> = std::array::IntoIter<T, 2>
Multiple(std::vec::IntoIter<&'parent LogicalPlan>),
}
Built-In LogicalPlans
Most cases can be handled by the Empty
, Single
or Double
variant.
The Multiple
variant has a Vec::IntoIter
, so we can handle the Union
case easier, this is because we have to transform the Vec<Arc<LogicalPlan>>
into a collection of references, which results in us allocating and transforming it back into a iterator. This also happens in the inputs()
function but with just the collect()
call.
Extension Plans
To handle the Extension
nodes i added the following function into the UserDefinedLogicalNode
trait:
fn inputs_iter(&self) -> InputIterator<'_>;
The user of this trait can then choose their own iterator, and we only have to call extension.node.inputs_iter()
.
More Problems
But there are still some problems with this (i think):
- The
Multiple
variant still causes us to allocate in a lot of cases, some UserDefined plans in examples have their inputs stored asVec<LogicalPlan>
, so we need to doinputs.iter().collect().into_iter()
.
We could add another Iterator type:
Slice(slice::IntoIter<&'parent LogicalPlan>)
which is an slice iterator, so the above would be inputs.as_slice().into_iter()
. But adding another variant that still says "i have multiple inputs" does not look nice.
2. Some Nodes, like Union
hold their inputs in Vec<Arc<LogicalPlan>>
which also causes us to transform and collect this into a Vec<&LogicalPlan>
to turn this into a iterator the Multiple
variant accepts.
Solutions
To fix this i can split up the Multiple
variant into 2 new variants:
pub enum InputIterator{
// ...
Slice(slice::IntoIter<&LogicalPlan>),
FromArcs(Map<Iter<'_, Arc<LogicalPlan>>, AsRef::as_ref>), // maps Arc<LogicalPlan> into &LogicalPlan
}
This does sum up to a total of 5 different iterator types, but i don't know how i can cover every possible way of holding onto multiple inputs. For example if node implementation holds their inputs in some other collection (like BTreeMap
, HashMap
, ...) their respective Iter
implementations have different types, so if someone wants to return a InputIterator
they are forced to also keep their inputs in a Vec<LogicalPlan>
or Vec<Arc<LogicalPlan>>
.
Currently
So currently the InputIterator
looks like this...
pub enum InputIterator<'parent> {
Empty(Empty<&'parent LogicalPlan>),
Single(Once<&'parent LogicalPlan>),
Double(DoubleIter<&'parent LogicalPlan>),
Slice(SliceIter<'parent, LogicalPlan>),
FromArcs(FromArcsIter<'parent, LogicalPlan>),
}
I have made a macro which let's me "delegate" a call to this iterator to any of the inner iterators, e.g. fn next()
:
fn next(&mut self) -> Option<Self::Item> {
delegate!(self, iter => iter.next())
}
Things to do:
- find a way to accept many more IteratorTypes, (maybe just have a fallback in the form of
Vec::into_iter
ordyn Iterator
). - Minimize Iter variants while still keeping the implementation clear, maybe we can collapse the
Empty
,Single
andDouble
into one variant, but this does cause more checks.
If you guys really think this can be helpful i can open up a PR so you can look at some details, but it is maybe worthwhile to just wait until Rust allows us to return multiple types in a -> impl Trait
function.
from arrow-datafusion.
Thanks, that means a lot. I'm interested in anything data science or database related. I came in contact with DataFusion because of my Bachelor's project this year, and I liked it so much that I wanted to contribute in any way I could. But because of school, I can't find the time to do this actively. The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.
As I said, I like to help wherever I can, but I'm not familiar enough with DataFusion to know where I can help. Maybe you guys know some places where I can look/help?
Thanks for the interest and help in this issue!
from arrow-datafusion.
The project was about detecting α-acyclic joins to help out a PhD project at my University, so I had to use LogicalPlans and the logical optimizer, where I came up with this "issue." Now that it was done, I wanted to try it out.
This is very cool -- is the code somehere pubic ? Hopefully doing this kind of analysis would be easier now with the nicer tree node API from @peter-toth .
One thing that might be interesting / relevant perhaps then would be to add an example of that kind of analysis.
For example, this ticket describes an example for an analyzer rule #10855, but writing an example of sql_analaysis.rs
that shows how to parse some sql and then use DataFusion structures to do an analysis (like maybe join counts, or predicate analysis or something) would be really neat -- I filed #10871 to track this
from arrow-datafusion.
Related Issues (20)
- DataFusion weekly project plan (Andrew Lamb) - Aug 12, 2024
- Disable `create_default_catalog` when the exist session state has default catalog for `SessionStateBuilder` HOT 1
- Improve performance of `REPEAT` functions HOT 2
- Internal error: Empty iterator passed to ScalarValue::iter_to_array HOT 1
- `optimize_projections` fails when eagerly selecting from a join with ambigious column names HOT 1
- Speed up rpad implementation to use StringBuilder HOT 1
- Remove unused config value `scalar_update_factor` HOT 1
- 在文件prost.rs文件中定义的"pub struct Schema"等结构体,需要使用"#[derive(Serialize)]"属性生成序列化的方法。 HOT 3
- Improve error message for invalid aggregate queries
- Support different aggregate expression (outside SELECT list) in ORDER BY list HOT 2
- Allow suppling a table schema to ParquetExec HOT 7
- Panics in `MIN()/MAX()` aggregate functions (SQLancer) HOT 3
- Internal error in `approx_percentile_cont()` aggregate function (SQLancer) HOT 2
- Invalid aggregate SQL query with `HAVING` can be executed without error (SQLancer-TLP) HOT 3
- Unable to access LogicalPlanBuilder.plan inside a new trail implementation
- Make properties in phyiscal-plan/aggregates public
- COUNT(expr) always returns the COUNT(colname) HOT 7
- Convert built-in `row_number` to user-defined window function
- Improve performance of SUBSTR for StringViewArray HOT 3
- Empty strings in CSV files aren't being interpreted as null when using a `Dictionary(_, Utf8)` HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.