Comments (19)

westonpace commented on August 16, 2024

I think this can be controlled by the consumer -- for example if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF and depending on where you do your analysis you only have to check for one

I agree (and thanks for the example). This works for us. All of our exprs from the user start as &str. However, if we were receiving Expr directly then it wouldn't work, because we wouldn't know which approach the user used. This is not a problem for us; we aren't expecting the user to provide direct DF structs in any of our roadmap features. I'm just thinking through possibilities.

I think the idea is people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example

What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

westonpace commented on August 16, 2024

I don't think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔

I think, for me, it's just a general unease with having multiple ways of expressing the same thing. I feel like this can lead to "implicit layers" of the plan. For example, there is already some notion of "parse plan", "unoptimized logical plan", "optimized logical plan", and "physical plan". The middle two are both represented by Expr, which can be subtle. Do we now add "rewritten logical plan" to the list? Or maybe "rewritten" and "simplified" are just very transient states between "unoptimized" and "optimized" and I am blowing things out of proportion.

Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate it). This would force the conversion to be done between the parse plan and the logical plan.

That being said, my needs are met (thanks again for your help), and perfect is the enemy of the good, so I'm happy to leave well enough alone.

jayzhan211 commented on August 16, 2024

done between the parse plan and the logical plan

I had also thought about deprecating the Expr variant and using functions directly in the parsing phase. I think it might be a good idea. The downside of the current impl is that one needs to register a function rewrite rule to convert the Expr to a function. I think registration should not be a necessary step for the default behavior. If we have functions in the parsing phase, no additional step (registering a rewrite) is needed.

What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

One good reason is that we don't need to care about physical exprs if we convert to functions at the logical level. I think the earlier we rewrite, the less duplication we leave behind. If we move this rewrite to the logical-to-physical step, we need to map it to some kind of physical expr, which does not seem like a better idea.

I think the idea is people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example

If we have functions after parsing, people can rewrite them to their expected ones by registering their own rewrite rules, so I think this is also not a problem
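
A rough sketch of what the core of such a rewrite rule could look like, assuming the TreeNode::transform API of recent DataFusion releases (json_udf is a hypothetical user-provided ScalarUDF, and the exact Transformed / ScalarFunction names differ between versions):

    use std::sync::Arc;
    use datafusion::common::tree_node::{Transformed, TreeNode};
    use datafusion::common::Result;
    use datafusion::logical_expr::expr::ScalarFunction;
    use datafusion::logical_expr::{Expr, ScalarUDF};

    /// Sketch only: retarget `get_field` calls at a caller-provided JSON UDF
    fn rewrite_field_access(expr: Expr, json_udf: Arc<ScalarUDF>) -> Result<Expr> {
        expr.transform(|e| {
            if let Expr::ScalarFunction(f) = &e {
                if f.name() == "get_field" {
                    // Same arguments, different (JSON-aware) implementation
                    return Ok(Transformed::yes(Expr::ScalarFunction(
                        ScalarFunction::new_udf(Arc::clone(&json_udf), f.args.clone()),
                    )));
                }
            }
            Ok(Transformed::no(e))
        })
        .map(|t| t.data)
    }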

Do we now add "rewritten logical plan" to the list?

"rewritten / simplified logical plan" is actually "optimized logical plan" to me, "optimized" is more general term.

I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this -- maybe @jayzhan211 has some thoughts

I think we are talking about a better API design for get_field, right? I think we can take a reference from others: DuckDB has struct_extract and map_extract.

alamb commented on August 16, 2024

I am thinking I'll try to make a PR with such an API over the next day or two to see how it might look.

alamb commented on August 16, 2024

Example from @ion-elgreco

@alamb this is the code:

    let (table, _metrics) = DeltaOps(table)
        .delete()
        .with_predicate("props['a'] = '2021-02-02'")
        .await
        .unwrap();

Which comes from here: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/operations/delete.rs#L770-L774

alamb commented on August 16, 2024

I see a few options:

  1. We can put get_field back into the core crate (and move it out of datafusion-functions) - might be the least surprising but would not allow people to customize how field access worked (e.g. for implementing JSON support)
  2. We can make a better API / more examples showing how to get the core rewrites working, e.g. in:
    fn evaluate_demo() -> Result<()> {
        // For example, let's say you have some integers in an array
        let batch = RecordBatch::try_from_iter([(
            "a",
            Arc::new(Int32Array::from(vec![4, 5, 6, 7, 8, 7, 4])) as _,
        )])?;

        // If you want to find all rows where the expression `a < 5 OR a = 8` is true
        let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));

        // First, you make a "physical expression" from the logical `Expr`
        // (`physical_expr` is the helper from the expr_api.rs example)
        let physical_expr = physical_expr(&batch.schema(), expr)?;

        // Now, you can evaluate the expression against the RecordBatch
        let result = physical_expr.evaluate(&batch)?;

        // The result contains an array that is true only where `a < 5 OR a = 8`
        let expected_result = Arc::new(BooleanArray::from(vec![
            true, false, false, false, true, false, true,
        ])) as _;
        assert!(
            matches!(&result, ColumnarValue::Array(r) if r == &expected_result),
            "result: {:?}",
            result
        );
        Ok(())
    }

I would be happy to work on option 2 if someone could point me at how people are creating PhysicalExprs today.

westonpace commented on August 16, 2024

I think I'd be happy with option 2. The example you linked is how we are using DataFusion. Here is an updated example that fails with the error:

    // For example, let's say you have some integers in a struct field
    let b = Arc::new(Int32Array::from(vec![4, 5, 6, 7, 8, 7, 4]));
    let a = Arc::new(StructArray::new(
        vec![Field::new("b", DataType::Int32, false)].into(),
        vec![b as _],
        None,
    ));
    let batch = RecordBatch::try_from_iter([("a", a as _)])?;

    // If you want to find all rows where the expression `a['b'] < 5 OR a['b'] = 8` is true
    let expr = col("a")
        .field("b")
        .lt(lit(5))
        .or(col("a").field("b").eq(lit(8)));

    // First, you make a "physical expression" from the logical `Expr`
    let physical_expr = physical_expr(&batch.schema(), expr)?;

ion-elgreco commented on August 16, 2024

@alamb this is how we create expressions:

/// Parse a string predicate into an `Expr`
pub(crate) fn parse_predicate_expression(
    schema: &DFSchema,
    expr: impl AsRef<str>,
    df_state: &SessionState,
) -> DeltaResult<Expr> {
    let dialect = &GenericDialect {};
    let mut tokenizer = Tokenizer::new(dialect, expr.as_ref());
    let tokens = tokenizer
        .tokenize()
        .map_err(|err| DeltaTableError::GenericError {
            source: Box::new(err),
        })?;
    let sql = Parser::new(dialect)
        .with_tokens(tokens)
        .parse_expr()
        .map_err(|err| DeltaTableError::GenericError {
            source: Box::new(err),
        })?;

    let context_provider = DeltaContextProvider { state: df_state };
    let sql_to_rel =
        SqlToRel::new_with_options(&context_provider, DeltaParserOptions::default().into());

    Ok(sql_to_rel.sql_to_expr(sql, schema, &mut Default::default())?)
}
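
For context, the delete example above then boils down to a call like this (sketch only; `df_schema` and `session_state` are stand-ins for the surrounding delta-rs context):

    // Sketch: parse the predicate from the delete example above into an Expr
    let predicate: Expr = parse_predicate_expression(
        &df_schema,
        "props['a'] = '2021-02-02'",
        &session_state,
    )?;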

alamb commented on August 16, 2024

I'll work on creating an example shortly

westonpace commented on August 16, 2024

This also leads to a sort of "variant problem" for any code we write that handles Expr. For example, we have code that walks through an Expr and calculates which (potentially nested) columns are referenced. The "Expr walking" code now has to be aware of both the GetStructField and the ScalarUDF variants of field access.

It would be nicer, I think, if there were a single canonical way to represent a nested field access in Expr. For example, maybe the translation from GetStructField to ScalarUDF happens as part of the translation from Expr to PhysicalExpr? This way Expr only has GetStructField, but there is still the flexibility to customize field access?
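
For illustration, here is a sketch of the kind of walker I mean (visitor names follow recent DataFusion releases and are version-dependent):

    use datafusion::common::tree_node::{TreeNode, TreeNodeRecursion};
    use datafusion::common::Result;
    use datafusion::logical_expr::Expr;

    /// Sketch: collect the columns an expression references. With field access
    /// spelled as a function, the walker must recognize the `get_field` call in
    /// addition to any dedicated variant (in older versions an
    /// Expr::GetIndexedField arm would be needed here too).
    fn referenced_columns(expr: &Expr) -> Result<Vec<String>> {
        let mut columns = vec![];
        expr.apply(|e| {
            match e {
                Expr::Column(c) => columns.push(c.flat_name()),
                Expr::ScalarFunction(f) if f.name() == "get_field" => {
                    // Nested access: the base column appears as a child
                    // Expr::Column and is collected by this same walk
                }
                _ => {}
            }
            Ok(TreeNodeRecursion::Continue)
        })?;
        Ok(columns)
    }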

alamb commented on August 16, 2024

Here is an example of how to make Expr::struct work in 37.1.0: #10183

I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this -- maybe @jayzhan211 has some thoughts

alamb commented on August 16, 2024

The "Expr walking" code now has to be aware of both the GetStructField and the ScalarUDF variants of field access.

I think this can be controlled by the consumer -- for example if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF and depending on where you do your analysis you only have to check for one

It would be nicer I think if there was a single canonical way to represent a nested field access in Expr.

I think the idea is people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example

alamb commented on August 16, 2024

What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

I don't think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔

alamb commented on August 16, 2024

Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate). This would force the conversion to be done between the parse plan and the logical plan.

I agree this would be clearest (basically remove Expr::GetIndexedField and always use a function).

I think we are talking about better API design of get_field right?

I was actually thinking about an API for the use case of "I created an Expr programmatically and I want to convert it to a PhysicalExpr I can execute." We do this in InfluxDB IOx. I didn't realize that both LanceDB and Delta did the same thing.

Currently there is an example of how to do this in expr_api.rs:

    /// Build a physical expression from a logical one, after applying simplification and type coercion
    pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> {

Maybe it is time to "promote" that function to a real API in datafusion-core somewhere where it can be tested and better documented.
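
For reference, the body of that example function is roughly the following (paraphrased from expr_api.rs; exact signatures vary between DataFusion releases):

    use std::sync::Arc;
    use arrow::datatypes::Schema;
    use datafusion::common::{Result, ToDFSchema};
    use datafusion::logical_expr::Expr;
    use datafusion::optimizer::simplify_expressions::{ExprSimplifier, SimplifyContext};
    use datafusion::physical_expr::{create_physical_expr, PhysicalExpr};
    use datafusion_expr::execution_props::ExecutionProps;

    /// Build a physical expression from a logical one, after applying
    /// simplification and type coercion
    pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> {
        let df_schema = schema.clone().to_dfschema_ref()?;

        // Simplify and coerce the expression against the schema
        let props = ExecutionProps::new();
        let simplifier =
            ExprSimplifier::new(SimplifyContext::new(&props).with_schema(df_schema.clone()));
        let expr = simplifier.coerce(expr, df_schema.clone())?;

        create_physical_expr(&expr, df_schema.as_ref(), &props)
    }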

I think we can take a reference from others: DuckDB has struct_extract and map_extract.

Those are interesting functions -- now that we have the notion of function packages, adding better support for the Map datatype would be sweet.

ion-elgreco commented on August 16, 2024

@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?

alamb commented on August 16, 2024

@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?

I suggest for the time being you use the pattern here (is this possible)?

Here is an example of how to make Expr::struct work in 37.1.0: #10183

For the next release (38.0.0) I was thinking we would move the create_physical_expr function into the core (rather than keeping it as an example).
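
Purely as a sketch of what callers might then write (the method name and signature here are hypothetical until that work settles):

    // Hypothetical: with the helper promoted into core, delta-rs / lancedb
    // style callers could build physical exprs without copying example code
    let physical = ctx.create_physical_expr(logical_expr, &df_schema)?;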

Thoughts?

ion-elgreco commented on August 16, 2024

@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?

I suggest for the time being you use the pattern here (is this possible)?

Here is an example of how to make Expr::struct work in 37.1.0: #10183

For the next release (38.0.0) I was thinking we would move the create_physical_expr function into the core (rather than keeping it as an example).

Thoughts?

I am not entirely sure; in delta-rs we just parse SQL strings into logical exprs and then convert them to physical exprs without doing any adjustments there.

I think it makes sense that these expr conversions are automatically done in core, because logically nothing changed in the intent of the operation between v36 and v37. That the physical impl is different should be abstracted away.

alamb commented on August 16, 2024

I am not entirely sure; in delta-rs we just parse SQL strings into logical exprs and then convert them to physical exprs without doing any adjustments there.

I think in 37.0.0 you'll need to update the code that converts to a physical expr to call the code in #10183

I think it makes sense that these expr conversions are automatically done in core.

Yes, I agree. I will make a PR over the next few days with a proposed API.

alamb commented on August 16, 2024

Here is a PR with a proposed new API: #10330
