Comments (15)
The one issue with moving this functionality into a Spark module is that, for that to really be valid, the formats would have to be Spark compatible, which they currently are not. I do not have the spare time in the near future to implement a parser to do that.
from arrow-datafusion.
After thinking about this a fair bit, the builder approach like what @jayzhan211 did with aggregate functions seems to be the best way forward for this feature imho. While I do like the idea of separate crate(s) for mirroring functionality from other systems, I think that is a much larger project and encompasses a lot more functionality than this specific feature entails. I don't believe putting this feature into core limits DF from later extracting this and other similar behaviour 'traits' and functionality into system-specific crates.
I'll start work on this and see how that works out. If it does, then I'll add safe support via a trait to the to_timestamp*, to_date and to_unixtime functions. If there are other UDFs that could benefit from having a 'safe' mode (return null on error), please let me know and I'll see about adding safe mode to those as well.
Thank you everyone for your feedback and guidance on this feature request! 👍
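To make the 'safe' mode semantics concrete, here is a minimal, self-contained sketch of the behavioral difference being proposed. The function names and the integer-epoch parsing are purely illustrative stand-ins, not DataFusion's actual to_timestamp implementation:

```rust
/// Strict mode: fail fast on bad input, like DataFusion's current
/// to_timestamp behavior (an error propagates to the caller).
fn parse_epoch_strict(s: &str) -> Result<i64, String> {
    s.trim()
        .parse::<i64>()
        .map_err(|e| format!("parse error: {e}"))
}

/// Safe mode: swallow the error and return None (SQL NULL),
/// the lenient processing-engine behavior being requested.
fn parse_epoch_safe(s: &str) -> Option<i64> {
    parse_epoch_strict(s).ok()
}
```

The whole feature request boils down to letting a UDF choose between these two behaviors at runtime.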
from arrow-datafusion.
We just merged the aggregate builder in #10560 -- I am quite happy with how it turned out, in case you want to take a friendly look
from arrow-datafusion.
After attempting to implement the builder approach, it became apparent to me that it will touch too many things and really won't work well without changing the signature of ScalarUDFImpl anyway. It works for the aggregate functions because the functions defined in the AggregateUDFImpl trait have arguments through which the additional information (distinct, sort, ordering, etc.) is provided to the UDF implementation. In the case of ScalarUDFImpl, though, that is not the case.
After some more thought, I think the cleanest approach may be to add a get_config_options function to the SimplifyInfo trait and add a scalar_udf_safe_mode: bool, default = false option to the ExecutionOptions struct. Doing that will allow functions that require configuration (including, but obviously not limited to, the 'safe' mode I'm working on) to access it while changing as little as possible with respect to trait signatures.
err, scratch that. Onto the next idea :/
from arrow-datafusion.
I think options 1 and 3 would be straightforward. You could even potentially implement
pub fn to_timestamp_safe(args: Vec<Expr>) -> Expr {
...
}
directly in your application (rather than in the core of DataFusion).
Another crazy thought might be to implement a rewrite pass (e.g. an AnalyzerRule) that rewrites all expressions in the plan when safe mode is needed... I think they have access to all the state necessary.
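The rewrite-pass idea above can be sketched in miniature. The Expr enum and rewrite_safe function here are toy stand-ins for DataFusion's logical expression type and AnalyzerRule machinery, not the real API; the sketch just shows the shape of the transformation (swap every to_timestamp call for a hypothetical to_timestamp_safe):

```rust
// Toy stand-in for a logical expression tree.
#[derive(Debug, PartialEq)]
enum Expr {
    Literal(String),
    Call { name: String, args: Vec<Expr> },
}

/// Recursively rewrite the tree, replacing `to_timestamp` calls
/// with `to_timestamp_safe` calls (the safe-mode variant).
fn rewrite_safe(expr: Expr) -> Expr {
    match expr {
        Expr::Call { name, args } => {
            // Rewrite children first, then this node's function name.
            let args = args.into_iter().map(rewrite_safe).collect();
            let name = if name == "to_timestamp" {
                "to_timestamp_safe".to_string()
            } else {
                name
            };
            Expr::Call { name, args }
        }
        other => other,
    }
}
```

An actual AnalyzerRule would do the same walk over the LogicalPlan's expressions, gated on a safe-mode config flag.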
from arrow-datafusion.
I think the key thing to figure out is "will safemode to_timestamp be part of the datafusion core"?
Maybe it is time to make a datafusion-functions-scalar-spark or something that has functions with the Spark behaviors 🤔
from arrow-datafusion.
I think it is possible to extend safe_mode to the datafusion core like what you mentioned; it should be similar to the distinct mode for aggregate functions. We can have different behavior based on whether safe_mode is set.
For the expression API, we can either:
- Introduce to_timestamp_safe()
- Extend to_timestamp(args, safe: bool)
- Introduce a builder mode to avoid breaking changes: to_timestamp(args).safe().build()

I prefer the third one.
Also, there are many to_timestamp functions with different time units. I'm not sure why they are split into different functions; they could be collapsed into a single to_timestamp function that casts to different time units based on the given argument. We could have to_timestamp(args).time_unit(Milli).safe().build() if we need the timestamp in milliseconds and to_timestamp(args).safe().build() for nanoseconds.
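As a rough illustration of the builder option above, here is a self-contained sketch. ToTimestampBuilder, the TimeUnit variants, and the build return value are hypothetical names for illustration only, not DataFusion's actual API (the real build would produce an Expr):

```rust
// Hypothetical time units for the builder; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Milli,
    Nano,
}

// Builder that accumulates options without breaking the existing
// to_timestamp(args) call signature.
struct ToTimestampBuilder {
    args: Vec<String>, // stand-in for Vec<Expr>
    time_unit: TimeUnit,
    safe: bool,
}

/// Entry point keeps the familiar shape: to_timestamp(args)...
fn to_timestamp(args: Vec<String>) -> ToTimestampBuilder {
    ToTimestampBuilder { args, time_unit: TimeUnit::Nano, safe: false }
}

impl ToTimestampBuilder {
    fn time_unit(mut self, unit: TimeUnit) -> Self {
        self.time_unit = unit;
        self
    }
    fn safe(mut self) -> Self {
        self.safe = true;
        self
    }
    /// Would produce the final expression; here it just
    /// reports the accumulated configuration.
    fn build(self) -> (usize, TimeUnit, bool) {
        (self.args.len(), self.time_unit, self.safe)
    }
}
```

The appeal of this shape is that existing call sites stay valid as long as the builder (or its build output) converts into an Expr.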
"will safemode to_timestamp be part of the datafusion core"
It is an interesting question; we could think of implementing functions based on other DBs in the first place. For example, we usually follow Postgres, DuckDB, and others.
We could have functions-postgres, functions-duckdb and functions-spark crates that aim to mirror the behavior of other DBs, and we wouldn't even need the functions crate (we could keep it for backward compatibility). They would be considered extension crates implemented by third parties; datafusion would not need to implement any datafusion-specific functions (and we prefer not to), we would just make sure datafusion core is compatible with the different functions crates. And we could register the most-used functions with datafusion for the end-to-end SQL workflow!
from arrow-datafusion.
We can have functions-postgres, functions-duckdb and functions-spark
Most of the functions have the same behavior across different DBs, so we could also implement the different variants in one functions crate, but register the expected functions based on the DB we want to mimic. Instead of having a default function set for the datafusion SQL workflow, we could have spark_functions and postgres_functions that register different to_timestamp functions based on the configuration we set.
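The dialect-based registration idea can be sketched as follows. The dialect names, register_functions helper, and the trivial UDF bodies are all hypothetical, chosen only to show one registry serving multiple behaviors:

```rust
use std::collections::HashMap;

// Stand-in UDF signature: takes a string, returns Some(value) or None (NULL).
type Udf = fn(&str) -> Option<i64>;

fn to_timestamp_postgres(s: &str) -> Option<i64> {
    // Postgres-style fail-fast: a panic here stands in for
    // returning an execution error on bad input.
    Some(s.trim().parse::<i64>().expect("invalid timestamp"))
}

fn to_timestamp_spark(s: &str) -> Option<i64> {
    // Spark legacy-style leniency: bad input becomes NULL.
    s.trim().parse::<i64>().ok()
}

/// Build the function registry for the dialect we want to mimic.
fn register_functions(dialect: &str) -> HashMap<&'static str, Udf> {
    let mut registry: HashMap<&'static str, Udf> = HashMap::new();
    match dialect {
        "spark" => registry.insert("to_timestamp", to_timestamp_spark as Udf),
        _ => registry.insert("to_timestamp", to_timestamp_postgres as Udf),
    };
    registry
}
```

The same name resolves to different behavior depending on which registry the session was built with, which is the crux of the proposal.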
from arrow-datafusion.
@andygrove are there UDFs already in the Comet project that handle Spark-specific behaviour? If so, is that a separate project or embedded in Comet currently? (I haven't looked at that codebase myself since the initial drop.)
from arrow-datafusion.
@Omega359 so far we have been implementing custom PhysicalExpr directly in the datafusion-comet project as needed for Spark-specific behavior, with support for Spark's different evaluation modes (legacy, try, ansi), and we are using fuzz testing to ensure compatibility across multiple Spark versions.
I think we need to have a discussion about whether it makes sense to upstream these into the core datafusion project, or whether we publish a datafusion-spark-compat crate from Comet, or some other option.
from arrow-datafusion.
The one issue with moving this functionality into a Spark module is that, for that to really be valid, the formats would have to be Spark compatible, which they currently are not. I do not have the spare time in the near future to implement a parser to do that.
We are porting Spark parsing logic as part of Comet.
from arrow-datafusion.
I think we need to have the discussion of whether it makes sense to upstream these into the core datafusion project or not, or whether we publish a datafusion-spark-compat crate from Comet, or some other option.
Thank you for chiming in. While I wouldn't mind Spark compatibility, it really isn't the focus of this request, as I've already converted all the Spark expressions and function usages to DF-compatible ones. It's the general system behaviour that I would like to address: being able to essentially switch from a db-focused perspective (fail fast) to a processing-engine one (nominally lenient: return null) for some (or all) of the UDFs.

If the general consensus is to separate out this desired behaviour, then I would think a separate crate might be the best approach. However, from searching the issues here, there seems to have been some past discussion of how to handle mirroring the behaviour of other databases, and it also includes SQL syntax, so it's not quite as simple as having a db-specific crate full of UDFs and calling it a day.
from arrow-datafusion.
I have read the context now and understand that this is about safe mode, or what Spark calls ANSI mode.
Isn't this just a case of adding a new flag to the session context that UDFs can choose to use when deciding whether to return null or throw an error?
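A minimal sketch of that flag idea, assuming a config handle could be threaded through to the UDF (which, as noted later in the thread, isn't possible with today's ScalarUDFImpl). SessionConfig, safe_mode, and to_timestamp_udf are hypothetical names:

```rust
// Hypothetical session-level configuration the UDF could consult.
struct SessionConfig {
    safe_mode: bool,
}

/// A UDF that checks the session flag to decide between returning
/// NULL (Ok(None)) and raising an execution error (Err).
fn to_timestamp_udf(config: &SessionConfig, input: &str) -> Result<Option<i64>, String> {
    match input.trim().parse::<i64>() {
        Ok(v) => Ok(Some(v)),
        // Safe mode: swallow the parse error, produce NULL.
        Err(_) if config.safe_mode => Ok(None),
        // Strict mode: fail fast with an error.
        Err(e) => Err(format!("invalid timestamp: {e}")),
    }
}
```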
from arrow-datafusion.
I have read the context now and understand that this is about safe mode, or what Spark calls ANSI mode. Isn't this just a case of adding a new flag to the session context that UDFs can choose to use when deciding whether to return null or throw an error?
That would be nice ... except UDFs don't have a way to access the session context currently :( Options #2 and #3 provide that via different mechanisms.
from arrow-datafusion.
I wonder if we could take a page from what @jayzhan211 is implementing in #10560 and go with a trait. So we could implement something like:
let expr = to_timestamp(lit("2021-01-01"))
// set the to_timestamp mode to "safe"
.safe();
I realize that this would require changing the call sites, so maybe it isn't viable.
from arrow-datafusion.