galileo-galilei / kedro-pandera
A kedro plugin to use pandera in your kedro projects
Home Page: https://kedro-pandera.readthedocs.io/en/latest/
License: Apache License 2.0
Instead of failing immediately when one check fails, pandera supports performing all checks before raising an error.
Make debugging easier by getting all errors in a single run.
Pass kwargs to schema.validate() through a config file or the dataset's metadata extra key, e.g.:
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/iris.csv
  metadata:
    pandera:
      schema: ${pa.yaml: _iris_schema}
      validation_kwargs:
        lazy: True
This key can ultimately support all the arguments available in the validate method: https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.api.pandas.container.DataFrameSchema.validate.html
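For reference, a minimal standalone sketch of what lazy=True does in pandera (the schema and data below are illustrative, not taken from the plugin):

```python
import pandas as pd
import pandera as pa

# Illustrative schema: two columns, each with one check.
schema = pa.DataFrameSchema(
    {
        "sepal_length": pa.Column(float, pa.Check.gt(0)),
        "species": pa.Column(str, pa.Check.isin(["setosa", "versicolor", "virginica"])),
    }
)

df = pd.DataFrame({"sepal_length": [-1.0], "species": ["unknown"]})

try:
    # lazy=True collects every failure instead of raising on the first one
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # failure_cases is a dataframe with one row per failed check
    print(exc.failure_cases)
```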
The more I think about the importance of data contracts, the more ensuring coverage checks as part of a team's workflow feels like a natural evolution of this pattern.
The way I see it, there are two standards a user should aim for:
TBD
TBD
TBD
kedro run --pandera.off
A CLI flag would be nice, but it is not currently possible to add flags to the CLI via plugins.
I believe there is a leftover print statement in this method at this line:
When loading a Pandera DataFrameModel this print isn't very verbose. But you can use the same resolver to load a Pandera DataFrameSchema defined in Python, in which case the print statement outputs the entire schema and causes clutter. Could the print statement be removed, or changed to a debug log message?
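A possible replacement, assuming the statement prints the resolved schema (the variable name here is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

# instead of: print(schema)
logger.debug("Resolved pandera schema: %s", schema)  # only emitted when DEBUG logging is enabled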
Pandera accepts validating against one schema or another.
To be compatible with pandera, accept a list of schemas in metadata?
iris:
  type: ...
  filepath: ...
  metadata:
    pandera:
      schemas:
        - schema1: <schema1>
        - schema2: <schema2>
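If such a key were accepted, the plugin could apply each declared schema in turn; a hypothetical sketch (validate_against_schemas does not exist in the plugin):

```python
import logging

logger = logging.getLogger(__name__)

def validate_against_schemas(data, schemas):
    """Apply several pandera schemas in sequence, raising on the first failure.

    `schemas` maps a schema name to a DataFrameSchema, mirroring the
    proposed metadata layout above.
    """
    for name, schema in schemas.items():
        logger.debug("Validating against schema %s", name)
        data = schema.validate(data)  # the validated output feeds the next schema
    return data
```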
Code to copy-paste is worth a thousand words!
You can infer the schema of a dataframe with the kedro pandera infer -d example_iris_data command.
Schemas defined in yaml or python are very explicit, but hard to show to managers / stakeholders / business teams. Being able to convert schemas to prettier and more organized HTML documents would definitely help documentation efforts and consistency. It would be great if kedro-pandera could generate these docs automatically.
Quoting @datajoely:
Again dbt has had this for years and it's just a no-brainer; we could easily generate static docs describing what data is in the catalog, associated metadata and tests.
There is also an obvious integration point with enterprise catalogs like Alation/Collibra/Amundsen.
Dataset documentation is a much-needed feature for interacting with non-technical teams.
Add a kedro pandera doc CLI command which would perform the conversion for all datasets with schemas.
The real question lies in who is responsible for generating the HTML from a schema. This likely belongs to pandera itself.
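In the meantime, a rough conversion can be hand-rolled from DataFrameSchema's public attributes; a minimal sketch (schema_to_html is a hypothetical helper, and the selected fields are illustrative):

```python
import pandas as pd
import pandera as pa

def schema_to_html(schema: pa.DataFrameSchema) -> str:
    """Render a DataFrameSchema as a crude HTML table, one row per column."""
    rows = [
        {
            "column": name,
            "dtype": str(col.dtype),
            "nullable": col.nullable,
            "checks": ", ".join(str(check) for check in col.checks),
        }
        for name, col in schema.columns.items()
    ]
    return pd.DataFrame(rows).to_html(index=False)
```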
In addition to the YAML API, we should support the class-based API, DataFrameModel (pydantic-style).
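For reference, a class-based equivalent of the iris schema above might look like this (column names and checks are illustrative):

```python
import pandera as pa
from pandera.typing import Series

class IrisSchema(pa.DataFrameModel):
    sepal_length: Series[float] = pa.Field(gt=0)
    species: Series[str] = pa.Field(isin=["setosa", "versicolor", "virginica"])

    class Config:
        coerce = True  # coerce dtypes instead of failing on mismatches

# IrisSchema.validate(df) then behaves like DataFrameSchema.validate
```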
TBD
TBD
TBD
Enable data checking in Jupyter Notebook.
Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure.
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...
Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Note that you can already do something like this (ugly, but still easy):
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
With the same logic, maybe a CLI command kedro pandera validate would be helpful too; I guess you sometimes just want to check a new dataset quickly.
This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project; a config file + DataCatalog + a pyproject.toml may be enough to make it work.
In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.
It is already possible to validate data against a given schema defined in the catalog with the pandera metadata key.
In addition to schema.validate, pandera also supports decorators for pipelines (a minimal sketch follows).
This requires inspecting the function signature and then parsing which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline; we should start with notebooks first.)
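A minimal sketch of pandera's decorator API applied to a node-like function (the schemas are illustrative):

```python
import pandera as pa
from pandera.typing import DataFrame

class InSchema(pa.DataFrameModel):
    x: float

class OutSchema(InSchema):
    y: float

@pa.check_types  # validates the annotated input and output at call time
def node_func(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
    return df.assign(y=df["x"] * 2)
```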
There are a few options (a sketch of the last one follows):
- a catalog.validate method on a custom DataCatalog class - requires a change in settings.py to enable it.
- a kedro_pandera.validate(catalog, schema) function.
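A hypothetical sketch of the free-function option; none of these names exist in the plugin yet, and the private _data_sets access mirrors the interactive snippet quoted above:

```python
from typing import Optional

import pandera as pa
from kedro.io import DataCatalog

def validate(catalog: DataCatalog, dataset_name: str, schema: Optional[pa.DataFrameSchema] = None):
    """Load a dataset from the catalog and validate it.

    Falls back to the schema declared under the dataset's pandera metadata key.
    """
    data = catalog.load(dataset_name)
    if schema is None:
        schema = catalog._data_sets[dataset_name].metadata.pandera.schema
    return schema.validate(data)
```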
TBD
TBD
TBD
TBD
TBD
Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do, and what shouldn't it do? Why is it important to you?
I'd like to validate data from the CLI.
When I have changed a filepath in my catalog, I'd like to be able to validate this new dataset before running a whole pipeline.
Create a kedro pandera validate --dataset <dataset_name> command which will load and validate the data.
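A hypothetical sketch of such a command, wired with click the way kedro plugin CLIs usually are; the command body reuses the metadata lookup shown earlier:

```python
import click
from kedro.framework.session import KedroSession

@click.group(name="pandera")
def pandera_commands():
    """kedro pandera command group."""

@pandera_commands.command()
@click.option("--dataset", "-d", required=True, help="Dataset name as declared in the catalog.")
def validate(dataset):
    """Load a dataset and validate it against its declared pandera schema."""
    with KedroSession.create() as session:
        context = session.load_context()
        data = context.catalog.load(dataset)
        schema = context.catalog._data_sets[dataset].metadata.pandera.schema
        schema.validate(data)
        click.echo(f"Dataset '{dataset}' passed validation.")
```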
I want to be able to run a pipeline with fake data generated from a dataset schema, mainly for pipeline unit testing or debugging with a small dataset.
Unit testing for data pipelines is hard, and this may be a helpful solution. There are two options (see the data synthesis sketch below):
- A kedro pandera dryrun --pipeline <pipeline_name> command (name to be defined) which would generate data for all input datasets thanks to pandera data synthesis and run the pipeline.
- A PanderaRunner, used as kedro run --runner=PanderaRunner --pipeline <pipeline_name>. The advantage is to stick to the kedro CLI and eventually enable "composition" with other logic; the drawback is that this solution is not compatible with a custom config file we may introduce.

Runtime validation is performed in before_node_run. This means we validate only datasets which are loaded (e.g. inputs or intermediate outputs). We should also validate terminal outputs before saving them.
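The data synthesis mentioned in the dryrun option already exists in pandera (it requires the hypothesis-backed strategies extra); a minimal sketch with an illustrative schema:

```python
import pandera as pa

# illustrative schema; synthesis works for schemas whose checks have strategies
schema = pa.DataFrameSchema(
    {
        "sepal_length": pa.Column(float, pa.Check.in_range(0.0, 10.0)),
        "species": pa.Column(str, pa.Check.isin(["setosa", "versicolor", "virginica"])),
    }
)

fake = schema.example(size=5)  # synthesize a valid 5-row dataframe from the schema
print(fake)
```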
Users expect every dataset to be validated exactly once.
Create an after_dataset_saved or an after_node_run hook which checks whether the dataset is a terminal output of the pipeline before validating it; a hypothetical sketch follows.
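This sketch assumes the terminal outputs can be captured in before_pipeline_run, and leaves the schema lookup elided:

```python
from kedro.framework.hooks import hook_impl

class PanderaTerminalOutputsHook:
    """Validate datasets that are only ever saved, never re-loaded by a node."""

    def __init__(self):
        self._terminal_outputs = set()

    @hook_impl
    def before_pipeline_run(self, pipeline):
        # outputs that no node consumes are the pipeline's terminal outputs
        self._terminal_outputs = set(pipeline.outputs())

    @hook_impl
    def after_dataset_saved(self, dataset_name, data):
        if dataset_name in self._terminal_outputs:
            ...  # look up the dataset's pandera schema and call schema.validate(data)
```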
I'd like to add a "schema preview" in kedro-viz (maybe behind a toggle button, as for code?), as such previews already exist for code and datasets, see:
This would help document datasets directly from code.
Documenting data pipelines and comprehensive checks is hard, and kedro-viz is a great tool to show what exists in the code. I think it would be really useful to have "self-documented" pipelines and to enhance collaboration and maintenance.
I have absolutely no idea how to extend kedro-viz, happy to hear suggestions here :)
Necessary before a PyPI release; waiting for kedro==0.18.13.
This plugin's requirements pin the kedro version to <0.19. I'm using a newer kedro version and would like to use this plugin with it.