Comments (19)
I think this can be controlled by the consumer -- for example, if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF, and depending on where you do your analysis you only have to check for one.
I agree (and thanks for the example). This works for us. All of our exprs from the user start as &str. However, if we were receiving Expr directly then it wouldn't work, because we wouldn't know which approach the user used. This is not a problem for us -- we aren't expecting the user to provide DataFusion structs directly in any of our roadmap features; I'm just thinking through possibilities.
I think the idea is that people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example.
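The rewrite being discussed can be sketched with toy types (these are not DataFusion's actual Expr or ScalarUDF definitions; the enum and the canonicalize function are invented for illustration): a dedicated field-access node is rewritten bottom-up into a generic function call, so downstream code only has to match on one representation, and a consumer could target a different function (e.g. a JSON accessor) to override the semantics.

```rust
// Toy model of rewriting a dedicated field-access variant into a generic
// scalar-function call. Not DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    /// Dedicated field-access node, e.g. `a['b']`
    GetStructField { base: Box<Expr>, field: String },
    /// Generic function call, e.g. `get_field(a, 'b')`
    ScalarUdf { name: String, args: Vec<Expr> },
}

/// Bottom-up rewrite: canonicalize every GetStructField into a
/// `get_field` ScalarUdf call. After this pass, analysis code only
/// needs to recognize the ScalarUdf form of field access.
fn canonicalize(expr: Expr) -> Expr {
    match expr {
        Expr::GetStructField { base, field } => Expr::ScalarUdf {
            name: "get_field".to_string(),
            args: vec![canonicalize(*base), Expr::Literal(field)],
        },
        Expr::ScalarUdf { name, args } => Expr::ScalarUdf {
            name,
            args: args.into_iter().map(canonicalize).collect(),
        },
        other => other,
    }
}
```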
What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?
from arrow-datafusion.
I don't think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔
I think, for me, it's just a general unease with having multiple ways of expressing the same thing. I feel like this can lead to "implicit layers" of the plan. For example, there is already some notion of a "parse plan", an "unoptimized logical plan", an "optimized logical plan", and a "physical plan". The middle two are both represented by Expr, which can be subtle. Do we now add "rewritten logical plan" to the list? Or maybe "rewritten" and "simplified" are just very transient states between "unoptimized" and "optimized" and I am blowing things out of proportion.
Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate it). This would force the conversion to be done between the parse plan and the logical plan.
That being said, my needs are met (thanks again for your help), and perfect is the enemy of the good, so I'm happy to leave well enough alone.
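The two-layer idea above can be sketched with toy types (invented for illustration; ParsedExpr, LogicalExpr, and lower are not DataFusion APIs): the parse tree keeps a dedicated index-access node, the logical representation has none, and the type system then forces the conversion to happen during lowering.

```rust
// Toy model of "GetIndexedField exists only at the parsing layer".
#[derive(Debug, Clone, PartialEq)]
enum ParsedExpr {
    Column(String),
    StringLit(String),
    /// `a['b']` as produced by the SQL parser
    GetIndexedField { base: Box<ParsedExpr>, index: String },
}

#[derive(Debug, Clone, PartialEq)]
enum LogicalExpr {
    Column(String),
    StringLit(String),
    /// The only way to express field access after lowering
    ScalarUdf { name: String, args: Vec<LogicalExpr> },
}

/// Lower the parse tree into the logical representation. Because
/// LogicalExpr has no GetIndexedField variant, the conversion *must*
/// happen here, and a consumer could substitute a different function
/// name (e.g. a JSON accessor) at this single point.
fn lower(expr: ParsedExpr) -> LogicalExpr {
    match expr {
        ParsedExpr::Column(c) => LogicalExpr::Column(c),
        ParsedExpr::StringLit(s) => LogicalExpr::StringLit(s),
        ParsedExpr::GetIndexedField { base, index } => LogicalExpr::ScalarUdf {
            name: "get_field".to_string(),
            args: vec![lower(*base), LogicalExpr::StringLit(index)],
        },
    }
}
```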
done between the parse plan and the logical plan
I had also thought about deprecating Expr and using functions directly in the parsing phase. I think it might be a good idea. The downside of the current impl is that one needs to register a function rewrite rule to convert Expr to Function. I think registering should be optional, not a necessary step for the default behavior. If we have functions in the parsing phase, no additional step (registering a rewrite) is needed.
What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?
One good reason is that we don't need to care about physical exprs if we convert to functions at the logical level. I think the earlier we optimize, the less duplication we leave behind. If we move this rewrite to the logical-to-physical step, we need to map it to some kind of physical expr, which does not seem like a better idea.
I think the idea is people might want to override Expr::GetStructFields semantics and they way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB for example
If we have functions after parsing, consumers can rewrite those functions into their expected ones by registering their own rewrite rules, so I think this is also not a problem.
Do we now add "rewritten logical plan" to the list?
A "rewritten / simplified logical plan" is actually an "optimized logical plan" to me; "optimized" is the more general term.
I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this -- maybe @jayzhan211 has some thoughts
I think we are talking about better API design for get_field, right? I think we can take inspiration from others.
Duckdb has struct_extract and map_extract
I am thinking I'll try and make a PR with such an API over the next day or two to see how it might look
Example from @ion-elgreco
@alamb this is the code:
let (table, _metrics) = DeltaOps(table)
    .delete()
    .with_predicate("props['a'] = '2021-02-02'")
    .await
    .unwrap();
Which comes from here: https://github.com/delta-io/delta-rs/blob/main/crates%2Fcore%2Fsrc%2Foperations%2Fdelete.rs#L770-L774
I see a few options:
- We can put get_field back into the core crate (and move it out of datafusion-functions). This might be the least surprising, but it would not allow people to customize how field access works (e.g. for implementing JSON support).
- We can make a better API / more examples of how to get the core rewrites working in datafusion/datafusion-examples/examples/expr_api.rs (lines 84 to 110 in 0780438).
I would be happy to work on 2 if someone could point me at how people are creating Physical exprs today
I think I'd be happy with 2. The example you linked is how we are using datafusion. Here is an updated example that fails with the error:
// For example, let's say you have some integers in an array
let b = Arc::new(Int32Array::from(vec![4, 5, 6, 7, 8, 7, 4]));
let a = Arc::new(StructArray::new(
    vec![Field::new("b", DataType::Int32, false)].into(),
    vec![b],
    None,
));
let batch = RecordBatch::try_from_iter([("a", a as _)])?;

// If you want to find all rows where the expression `a['b'] < 5 OR a['b'] = 8` is true
let expr = col("a")
    .field("b")
    .lt(lit(5))
    .or(col("a").field("b").eq(lit(8)));

// First, you make a "physical expression" from the logical `Expr`
let physical_expr = physical_expr(&batch.schema(), expr)?;
@alamb this is how we create expressions:
/// Parse a string predicate into an `Expr`
pub(crate) fn parse_predicate_expression(
    schema: &DFSchema,
    expr: impl AsRef<str>,
    df_state: &SessionState,
) -> DeltaResult<Expr> {
    let dialect = &GenericDialect {};
    let mut tokenizer = Tokenizer::new(dialect, expr.as_ref());
    let tokens = tokenizer
        .tokenize()
        .map_err(|err| DeltaTableError::GenericError {
            source: Box::new(err),
        })?;
    let sql = Parser::new(dialect)
        .with_tokens(tokens)
        .parse_expr()
        .map_err(|err| DeltaTableError::GenericError {
            source: Box::new(err),
        })?;
    let context_provider = DeltaContextProvider { state: df_state };
    let sql_to_rel =
        SqlToRel::new_with_options(&context_provider, DeltaParserOptions::default().into());
    Ok(sql_to_rel.sql_to_expr(sql, schema, &mut Default::default())?)
}
I'll work on creating an example shortly
This also leads to a sort of "variant problem" for any code we write that handles Expr. For example, we have code that walks through an Expr and calculates which (potentially nested) columns are referenced. The "Expr walking" code now has to be aware of both the GetStructField and the ScalarUDF variants of field access.

It would be nicer, I think, if there were a single canonical way to represent a nested field access in Expr. For example, maybe the translation from GetStructField to ScalarUDF happens as part of the translation from Expr to PhysicalExpr? That way Expr only has GetStructField, but there is still the flexibility to customize field access?
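The "variant problem" can be made concrete with a toy walker (the Expr enum and referenced_paths function are invented for this sketch, not DataFusion's types): the same nested field access can arrive in two shapes, so the analysis must pattern-match on both.

```rust
// Toy illustration of the "variant problem": collecting the (possibly
// nested) column paths an expression references requires handling both
// representations of field access.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(String),
    GetStructField { base: Box<Expr>, field: String },
    ScalarUdf { name: String, args: Vec<Expr> },
}

/// Collect "col.field"-style paths referenced by `expr` into `out`.
fn referenced_paths(expr: &Expr, out: &mut Vec<String>) {
    match expr {
        Expr::Column(c) => out.push(c.clone()),
        Expr::Literal(_) => {}
        // Variant 1: the dedicated field-access node
        Expr::GetStructField { base, field } => {
            if let Expr::Column(c) = base.as_ref() {
                out.push(format!("{c}.{field}"));
            } else {
                referenced_paths(base, out);
            }
        }
        // Variant 2: the same access spelled as get_field(col, 'field')
        Expr::ScalarUdf { name, args } if name == "get_field" => {
            if let [Expr::Column(c), Expr::Literal(f)] = args.as_slice() {
                out.push(format!("{c}.{f}"));
            } else {
                for a in args {
                    referenced_paths(a, out);
                }
            }
        }
        Expr::ScalarUdf { args, .. } => {
            for a in args {
                referenced_paths(a, out);
            }
        }
    }
}
```

If one variant were rewritten into the other before this analysis runs, the walker would only need one of the two arms.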
Here is an example of how to make Expr::struct work in 37.1.0: #10183
I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this -- maybe @jayzhan211 has some thoughts
The "Expr walking" code now has to be aware of both the GetStructField and the ScalarUDF variants of field access.
I think this can be controlled by the consumer -- for example, if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF, and depending on where you do your analysis you only have to check for one.

It would be nicer I think if there was a single canonical way to represent a nested field access in Expr.

I think the idea is that people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example.
What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?
I don't think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔
Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate). This would force the conversion to be done between the parse plan and the logical plan.
I agree this would be clearest (basically remove Expr::GetIndexedField and always use a function).
I think we are talking about better API design of get_field right?
I was actually thinking about an API for the use case of "I created an Expr programmatically and I want to convert it to a PhysicalExpr I can execute." We do this in InfluxDB IOx. I didn't realize that both LanceDB and Delta did the same thing.
Currently there is an example of how to do this in expr_api.rs:

datafusion/datafusion-examples/examples/expr_api.rs (lines 251 to 252 in 711239f)

Maybe it is time to "promote" that function to a real API in datafusion-core somewhere, where it can be tested and better documented.
I think we can take reference from others.
Duckdb has struct_extract and map_extract
Those are interesting functions -- now that we have the notion of function packages, adding better support for the Map datatype would be sweet.
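A rough sketch of what map_extract-style semantics could mean, using plain Rust collections rather than Arrow arrays (the function name and types here are invented for illustration; real Map support would operate on Arrow's MapArray): for each row's map, return the value for the key, or None when the key is absent.

```rust
use std::collections::HashMap;

/// Toy per-row map extraction: for each row, look up `key` and return
/// its value, or None if the key is not present in that row's map.
fn map_extract(rows: &[HashMap<String, i64>], key: &str) -> Vec<Option<i64>> {
    rows.iter().map(|m| m.get(key).copied()).collect()
}
```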
@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?
@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?
I suggest for the time being you use the pattern here (is this possible)?
Here is an example of how to make Expr::struct work in 37.1.0: #10183
For the next release (38.0.0) I was thinking we would move the create_physical_expr function into the core (rather than keeping it as an example).
Thoughts?
@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?
I suggest for the time being you use the pattern here (is this possible)?
Here is an example of how to make Expr::struct work in 37.1.0: #10183
For the next release (38.0.0) I was thinking we would move the create_physical_expr function into the core (rather than keeping it as an example). Thoughts?
I am not entirely sure; in delta-rs we just parse SQL strings into logical exprs and then convert them to physical exprs without doing any adjustments there.

I think it makes sense for these expr conversions to be done automatically in core, because logically nothing changed in the intent of the operation between v36 and v37. That the physical impl is different should be abstracted away.
I am not entirely sure, in delta-rs we just parse SQL strings into logical exprs and then convert it to physical exprs without doing any adjustments there.
I think in 37.0.0 you'll need to update the code that converts to a physical expr to call the code in #10183
I think it makes sense that these expr conversion are automatically done in core.
Yes I agree. I will make a PR over the next few days with a proposed API
Here is a PR with a proposed new API: #10330