galileo-galilei / kedro-pandera
A kedro plugin to use pandera in your kedro projects
Home Page: https://kedro-pandera.readthedocs.io/en/latest/
License: Apache License 2.0
Instead of failing immediately when one check fails, pandera supports performing all checks before raising an error.
Make debugging easier by getting all errors in a single run.
Pass kwargs to schema.validate() through a config file or the dataset's metadata extra key, e.g.:
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/iris.csv
  metadata:
    pandera:
      schema: ${pa.yaml: _iris_schema}
      validation_kwargs:
        lazy: True
This key can ultimately support all the arguments available in the validate method: https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.api.pandas.container.DataFrameSchema.validate.html
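For reference, a minimal standalone sketch of what lazy=True does in pandera (the schema and data below are illustrative, not taken from the plugin):

```python
import pandas as pd
import pandera as pa

# Illustrative schema: two columns, each with one check.
schema = pa.DataFrameSchema(
    {
        "sepal_length": pa.Column(float, pa.Check.gt(0)),
        "species": pa.Column(str, pa.Check.isin(["setosa", "versicolor", "virginica"])),
    }
)

df = pd.DataFrame({"sepal_length": [-1.0], "species": ["unknown"]})

try:
    # lazy=True collects every failure instead of raising on the first one
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # failure_cases is a dataframe with one row per failed check
    print(exc.failure_cases)
```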
The more I think about the importance of data contracts, the more ensuring coverage checks as part of a team's workflow feels like a natural evolution of this pattern.
The way I see it, there are two standards a user should aim for:
TBD
TBD
TBD
kedro run --pandera.off
A CLI flag would be nice, but it is not currently possible to add flags to the CLI via plugins.
I believe there is a leftover print statement in this method at this line:
When loading a Pandera DataFrameModel this print isn't very verbose. But you can use the same resolver to load a Pandera DataFrameSchema defined in Python, in which case the print statement outputs the entire schema and causes clutter. Could the print statement be removed, or changed to a debug log message?
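A possible replacement, assuming the statement prints the resolved schema (the variable name here is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

# instead of: print(schema)
logger.debug("Resolved pandera schema: %s", schema)  # only emitted when DEBUG logging is enabled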
Pandera accepts validating against one schema or another.
To be compatible with pandera, accept a list of schemas in metadata?
iris:
  type: ...
  filepath: ...
  metadata:
    pandera:
      schemas:
        - schema1: <schema1>
        - schema2: <schema2>
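If such a key were accepted, the plugin could apply each declared schema in turn; a hypothetical sketch (validate_against_schemas does not exist in the plugin):

```python
import logging

logger = logging.getLogger(__name__)

def validate_against_schemas(data, schemas):
    """Apply several pandera schemas in sequence, raising on the first failure.

    `schemas` maps a schema name to a DataFrameSchema, mirroring the
    proposed metadata layout above.
    """
    for name, schema in schemas.items():
        logger.debug("Validating against schema %s", name)
        data = schema.validate(data)  # the validated output feeds the next schema
    return data
```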
Code to copy-paste is worth a thousand words!
You can infer the schema of a dataframe with the kedro pandera infer -d example_iris_data command.
Schemas defined in yaml or python are very explicit, but hard to show to managers / stakeholders / business teams. Being able to convert schemas to prettier and more organized HTML documents would definitely help documentation efforts and consistency. It would be great if kedro-pandera could generate these docs automatically.
Quoting @datajoely:
Again dbt has had this for years and it's just a no-brainer; we could easily generate static docs describing what data is in the catalog, associated metadata and tests.
There is also an obvious integration point with enterprise catalogs like Alation/Collibra/Amundsen.
Dataset documentation is a much-needed feature for interacting with non-technical teams.
Add a kedro pandera doc CLI command which would perform the conversion for all datasets with schemas.
The real question lies in who is responsible for generating the HTML from a schema. This likely belongs to pandera itself.
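In the meantime, a rough conversion can be hand-rolled from DataFrameSchema's public attributes; a minimal sketch (schema_to_html is a hypothetical helper, and the selected fields are illustrative):

```python
import pandas as pd
import pandera as pa

def schema_to_html(schema: pa.DataFrameSchema) -> str:
    """Render a DataFrameSchema as a crude HTML table, one row per column."""
    rows = [
        {
            "column": name,
            "dtype": str(col.dtype),
            "nullable": col.nullable,
            "checks": ", ".join(str(check) for check in col.checks),
        }
        for name, col in schema.columns.items()
    ]
    return pd.DataFrame(rows).to_html(index=False)
```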
In addition to the YAML API, we should support the class-based API, DataFrameModel (pydantic-style).
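For reference, a class-based equivalent of the iris schema above might look like this (column names and checks are illustrative):

```python
import pandera as pa
from pandera.typing import Series

class IrisSchema(pa.DataFrameModel):
    sepal_length: Series[float] = pa.Field(gt=0)
    species: Series[str] = pa.Field(isin=["setosa", "versicolor", "virginica"])

    class Config:
        coerce = True  # coerce dtypes instead of failing on mismatches

# IrisSchema.validate(df) then behaves like DataFrameSchema.validate
```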
TBD
TBD
TBD
Enable data checking in Jupyter Notebook.
Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure.
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...
Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Note that you can already do something like this (ugly, but still easy):
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
With the same logic, maybe a CLI command kedro pandera validate would be helpful too; I guess you sometimes just want to check a new dataset quickly.
This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project; a config file + DataCatalog + a pyproject.toml may be enough to make it work.
In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.
It is already possible to validate data against a given schema defined in the catalog with the pandera metadata key.
In addition to schema.validate, pandera also supports decorators for pipelines (a minimal sketch follows).
This requires inspecting the function signature and then parsing which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline; we should start with notebooks first.)
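A minimal sketch of pandera's decorator API applied to a node-like function (the schemas are illustrative):

```python
import pandera as pa
from pandera.typing import DataFrame

class InSchema(pa.DataFrameModel):
    x: float

class OutSchema(InSchema):
    y: float

@pa.check_types  # validates the annotated input and output at call time
def node_func(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(y=df["x"] * 2)
```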
There are a few options (a sketch of the last one follows):
- a catalog.validate method on a custom DataCatalog class - requires a change in settings.py to enable it.
- a kedro_pandera.validate(catalog, schema) function.
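A hypothetical sketch of the free-function option; none of these names exist in the plugin yet, and the private _data_sets access mirrors the interactive snippet quoted above:

```python
from typing import Optional

import pandera as pa
from kedro.io import DataCatalog

def validate(catalog: DataCatalog, dataset_name: str, schema: Optional[pa.DataFrameSchema] = None):
    """Load a dataset from the catalog and validate it.

    Falls back to the schema declared under the dataset's pandera metadata key.
    """
    data = catalog.load(dataset_name)
    if schema is None:
        schema = catalog._data_sets[dataset_name].metadata.pandera.schema
    return schema.validate(data)
```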
TBD
TBD
TBD
TBD
TBD
Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do, and what shouldn't it do? Why is it important to you?
I'd like to validate data from the CLI.
When I have changed a filepath in my catalog, I'd like to be able to validate this new dataset before running a whole pipeline.
Create a kedro pandera validate --dataset <dataset_name> command which will load and validate the data.
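A hypothetical sketch of such a command, wired with click the way kedro plugin CLIs usually are; the command body reuses the metadata lookup shown earlier:

```python
import click
from kedro.framework.session import KedroSession

@click.group(name="pandera")
def pandera_commands():
    """kedro pandera command group."""

@pandera_commands.command()
@click.option("--dataset", "-d", required=True, help="Dataset name as declared in the catalog.")
def validate(dataset):
    """Load a dataset and validate it against its declared pandera schema."""
    with KedroSession.create() as session:
        context = session.load_context()
        data = context.catalog.load(dataset)
        schema = context.catalog._data_sets[dataset].metadata.pandera.schema
        schema.validate(data)
        click.echo(f"Dataset '{dataset}' passed validation.")
```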
I want to be able to run a pipeline with fake data generated from a dataset schema, mainly for pipeline unit testing or debugging with a small dataset.
Unit testing for data pipelines is hard, and this may be a helpful solution. There are two options (see the data synthesis sketch below):
- A kedro pandera dryrun --pipeline <pipeline_name> command (name to be defined) which would generate data for all input datasets thanks to pandera data synthesis and run the pipeline.
- A PanderaRunner, used as kedro run --runner=PanderaRunner --pipeline <pipeline_name>. The advantage is to stick to the kedro CLI and eventually enable "composition" with other logic; the drawback is that this solution is not compatible with a custom config file we may introduce.

Runtime validation is performed in before_node_run. This means we validate only datasets which are loaded (e.g. inputs or intermediate outputs). We should also validate terminal outputs before saving them.
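The data synthesis mentioned in the dryrun option already exists in pandera (it requires the hypothesis-backed strategies extra); a minimal sketch with an illustrative schema:

```python
import pandera as pa

# illustrative schema; synthesis works for schemas whose checks have strategies
schema = pa.DataFrameSchema(
    {
        "sepal_length": pa.Column(float, pa.Check.in_range(0.0, 10.0)),
        "species": pa.Column(str, pa.Check.isin(["setosa", "versicolor", "virginica"])),
    }
)

fake = schema.example(size=5)  # synthesize a valid 5-row dataframe from the schema
print(fake)
```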
Users expect every dataset to be validated exactly once.
Create an after_dataset_saved or an after_node_run hook which checks whether the dataset is a terminal output of the pipeline before validating it; a hypothetical sketch follows.
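This sketch assumes the terminal outputs can be captured in before_pipeline_run, and leaves the schema lookup elided:

```python
from kedro.framework.hooks import hook_impl

class PanderaTerminalOutputsHook:
    """Validate datasets that are only ever saved, never re-loaded by a node."""

    def __init__(self):
        self._terminal_outputs = set()

    @hook_impl
    def before_pipeline_run(self, pipeline):
        # outputs that no node consumes are the pipeline's terminal outputs
        self._terminal_outputs = set(pipeline.outputs())

    @hook_impl
    def after_dataset_saved(self, dataset_name, data):
        if dataset_name in self._terminal_outputs:
            ...  # look up the dataset's pandera schema and call schema.validate(data)
```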
I'd like to add a "schema preview" in kedro-viz (maybe behind a toggle button, as for code?), as such previews already exist for code and datasets, see:
This would help document datasets directly from code.
Documenting data pipelines and comprehensive checks is hard, and kedro-viz is a great tool to show what exists in the code. I think it would be really useful to have "self-documented" pipelines and to enhance collaboration and maintenance.
I have absolutely no idea how to extend kedro-viz, happy to hear suggestions here :)
Necessary before a PyPI release; waiting for kedro==0.18.13.
This plugin's requirements pin the kedro version to <0.19. I'm using a newer kedro version and would like to use this plugin with it.