Comments (4)
@lucamattiazzi yep, seems like a reasonable feature -- there might be a bit of surgery required to pass that information through (I haven't scoped it). @elijahbenizzy might also have thoughts.
Some questions to get your input on:
- Do you like having this validator become a specific part of the DAG? Would it matter if it wasn't?
- Do you capture any results of this? Would you want to?
- What drives the validator setup on your end? Is it configuration? Code? Something else?
Reason I ask is that there are other implementation options:
- Extend the current validators to enable people to register custom ones, versus needing to use `@check_output_custom` -- but we'd need to expose passing in function metadata information.
- Use the lifecycle API functionality (see blog) combined with, say, `@tag` on features requiring validation.
So just figuring out which of the three approaches would be better to invest in / look at.
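To make the first option concrete, here is a minimal sketch of what a pluggable validator registry could look like. All names here (`VALIDATOR_REGISTRY`, `register_validator`, `run_validators`) are hypothetical illustrations, not Hamilton's actual API; the key point is that node metadata (here just the name) is passed alongside the value, which is what the current validators don't expose.

```python
from typing import Any, Callable, Dict, List

# (node_name, value) -> passed?
ValidatorFn = Callable[[str, Any], bool]

# Hypothetical registry users could extend instead of subclassing
# via @check_output_custom.
VALIDATOR_REGISTRY: Dict[str, ValidatorFn] = {}

def register_validator(name: str):
    """Decorator that registers a validator under a lookup key."""
    def wrap(fn: ValidatorFn) -> ValidatorFn:
        VALIDATOR_REGISTRY[name] = fn
        return fn
    return wrap

@register_validator("non_negative")
def non_negative(node_name: str, value: Any) -> bool:
    # Node metadata is available here, so a validator could vary
    # its behavior per feature.
    return all(v >= 0 for v in value)

def run_validators(node_name: str, value: Any, names: List[str]) -> Dict[str, bool]:
    """Run the requested validators; return {validator_name: passed}."""
    return {n: VALIDATOR_REGISTRY[n](node_name, value) for n in names}
```

Usage would then look like `run_validators("my_feature", [1, 2, 3], ["non_negative"])`, returning a dict of pass/fail results per validator.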
from hamilton.
Thank you for the quick answer!

1. It does not matter; it's nice, but it would not change much for us.
2. Yes, we capture all of the validation failures and surface the results as warnings for the end user.
3. We use GreatExpectations suites and a single validator class that selects the specific suite for each feature (that's why it would be useful for us to have the name of the feature at init or execute time).
The lifecycle API seems great, and I think we might make it work with our current implementation, but the results would be stored elsewhere than in the driver, which looks a little bit worse in my opinion.

Moreover, we would like in the future to be able to skip invalid rows from the computations down the DAG, and since this method only allows for side effects it wouldn't work, I'm afraid (that's not an issue for us now).

Also, the current structure of `DataValidator` does not seem to allow the `current_node` value to be passed easily without breaking current child classes.
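The suite-per-feature setup described above can be sketched in plain Python (the stand-in `SUITES` dict and `SuiteValidator` class are illustrative, not real GreatExpectations code). It shows why the feature name is needed at init/execute time: without it, one shared validator class cannot know which suite applies.

```python
from typing import Any, Callable, Dict, List

# Stand-ins for GreatExpectations suites: feature name -> list of checks.
SUITES: Dict[str, List[Callable[[Any], bool]]] = {
    "age": [
        lambda s: all(v >= 0 for v in s),
        lambda s: all(v < 130 for v in s),
    ],
    "income": [lambda s: all(v >= 0 for v in s)],
}

class SuiteValidator:
    """A single validator class; the feature name selects the suite."""

    def __init__(self, feature_name: str):
        # This lookup is the step that requires knowing the node name.
        self.suite = SUITES.get(feature_name, [])

    def validate(self, data: Any) -> bool:
        return all(check(data) for check in self.suite)
```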
Thanks for the responses, makes sense. Some follow-ups:

> the lifecycle API seems great, and I think we might make it work with our current implementation, but the results would be stored elsewhere than in the driver, which looks a little bit worse in my opinion.

Yes, that lifecycle adapter could house the results that you would then inspect. How are you getting the results now? (Or how would you like to get them?)

> moreover, we would like in the future to be able to skip invalid rows from the computations down the DAG, and since this method only allows for side effects it wouldn't work, I'm afraid (that's not an issue for us now)

I don't think any of the approaches would directly enable this. My Hamilton philosophy here is that this should be an explicit part of the DAG -- or, if part of a decorator, maybe a different one or a flag to enable it.
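The "explicit part of the DAG" philosophy can be sketched as follows: in Hamilton, each function is a node named after its output, and downstream functions declare dependencies via parameter names. The node names here (`raw_rows`, `valid_rows`, `doubled`) are hypothetical, and the functions are shown as plain Python so the idea stands on its own.

```python
from typing import List

def raw_rows() -> List[int]:
    """Upstream data source (stand-in for a real loader)."""
    return [1, -2, 3, -4]

def valid_rows(raw_rows: List[int]) -> List[int]:
    """Explicit filtering node: downstream nodes depend on this
    instead of raw_rows, so invalid rows never reach them."""
    return [r for r in raw_rows if r >= 0]

def doubled(valid_rows: List[int]) -> List[int]:
    """A downstream node that only ever sees validated rows."""
    return [2 * r for r in valid_rows]
```

The design point is that row-skipping becomes a visible, testable step in the dataflow rather than a hidden side effect of a validator.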
> the current structure of `DataValidator` does not seem to allow the `current_node` value to be passed easily without breaking current child classes

Yep. But nothing that can't be changed, since I think we can do it in a backwards-compatible way. :)
Chiming in -- so yes, this does make inherent sense. We could expose `HamiltonNode` to be the node it decorates; we'd just have to make it backwards compatible. Could be as simple as checking if the class implements `validate_with_metadata`, which would have a default implementation that calls to `validate`, or something like that.
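That backwards-compatible dispatch could look something like the sketch below. The names mirror the discussion (`DataValidator`, `validate`, `validate_with_metadata`), but the implementation is a hypothetical illustration, not Hamilton's actual code.

```python
from typing import Any

class DataValidator:
    """Base class: new framework code always calls validate_with_metadata."""

    def validate(self, data: Any) -> bool:
        raise NotImplementedError

    def validate_with_metadata(self, data: Any, node_name: str) -> bool:
        # Default implementation ignores the metadata and falls back to
        # validate(), so existing child classes keep working unchanged.
        return self.validate(data)

class LegacyValidator(DataValidator):
    """An existing child class that only implements validate()."""

    def validate(self, data: Any) -> bool:
        return len(data) > 0

class NodeAwareValidator(DataValidator):
    """A new-style validator that actually uses the node name."""

    def validate_with_metadata(self, data: Any, node_name: str) -> bool:
        return node_name.startswith("feature_") and len(data) > 0
```

Because the framework only ever calls `validate_with_metadata`, legacy subclasses transparently route through `validate()` while new ones get the node name.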
It's interesting -- usually we recommend that users put the configuration closer to their code (i.e., in the parameters themselves), but if it's a lot, this can get tricky. So I think this is a reasonable approach. Your workaround is pretty good for now, but it's reasonable to have the name of the node as an input.