kedro-pandera's People

Contributors

dependabot[bot], galileo-galilei, noklam

Forkers

noklam, mjspier

kedro-pandera's Issues

Release v0.1.0

Description

Context

Possible Implementation

Possible Alternatives

Enable lazy validation at a dataset level

Description

Instead of failing immediately when a single check fails, pandera supports performing all checks before raising an error.

Context

Make debugging easier by getting all errors in a single run

Possible Implementation

Pass kwargs to schema.validate() through a config file or a dataset.metadata extra key, e.g.:

iris:
    type: pandas.CSVDataSet
    filepath: /path/to/iris.csv
    metadata:
        pandera:
            schema: ${pa.yaml: _iris_schema}
            validation_kwargs:
                lazy: True

This key can ultimately support all the arguments available in the validate method: https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.api.pandas.container.DataFrameSchema.validate.html
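
On the hook side, a minimal sketch of how these kwargs could be forwarded, assuming (as a simplification, not the plugin's actual implementation) that validation happens in a before_node_run hook, that the resolved schema and kwargs live under the dataset's metadata["pandera"] key, and using the private _get_dataset accessor for brevity:

    from kedro.framework.hooks import hook_impl


    class PanderaValidationHook:
        @hook_impl
        def before_node_run(self, node, catalog, inputs):
            for name, data in inputs.items():
                # metadata is optional on kedro datasets, hence the fallback
                metadata = getattr(catalog._get_dataset(name), "metadata", None) or {}
                pandera_meta = metadata.get("pandera", {})
                schema = pandera_meta.get("schema")
                kwargs = pandera_meta.get("validation_kwargs", {})  # e.g. {"lazy": True}
                if schema is not None:
                    # with lazy=True, pandera collects all failures in a single SchemaErrors
                    schema.validate(data, **kwargs)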

`kedro pandera coverage`

Description

The more I think about the importance of data contracts, the more ensuring coverage checks as part of a team's workflow feels like a natural evolution of this pattern.

Context

The way I see this, there are two standards a user should aim for:

  • A "gold standard ๐Ÿฅ‡" pattern where every dataset in your project has pandera schemas attached (all parameter inputs also have pandera/pydantic definitions too)
  • A "silver standard ๐Ÿฅˆ" pattern where just the free-inputs/outputs of a pipeline are properly validated and the rest is treated a closed box.

Possible Implementation

  • Build an AST introspection utility which uses an instantiated KedroSession object to validate state
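
A simpler starting point than full AST introspection could be a report over an instantiated DataCatalog. A minimal sketch (function name hypothetical, metadata layout assumed as in the catalog example above):

    def pandera_coverage(catalog):
        """Report which catalog datasets declare a pandera schema in their metadata."""
        covered, uncovered = [], []
        for name in catalog.list():
            metadata = getattr(catalog._get_dataset(name), "metadata", None) or {}
            if metadata.get("pandera", {}).get("schema") is not None:
                covered.append(name)
            else:
                uncovered.append(name)
        total = len(covered) + len(uncovered)
        print(f"pandera coverage: {len(covered)}/{total} datasets")
        return covered, uncovered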

Possible Alternatives

  • Look at building a Pylint plugin to do the same thing

Temporarily deactivate runtime validation

Description

TBD

Context

TBD

Possible Implementation

TBD

A kedro run --pandera.off CLI flag would be nice, but it is not currently possible for plugins to add flags to existing CLI commands.
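
In the meantime, a hedged workaround sketch: the validation hook could honour an environment variable (KEDRO_PANDERA_DISABLE is a hypothetical name), since environment variables need no new CLI flags:

    import os

    from kedro.framework.hooks import hook_impl


    class PanderaValidationHook:
        @hook_impl
        def before_node_run(self, node, catalog, inputs):
            # hypothetical escape hatch: KEDRO_PANDERA_DISABLE=1 kedro run
            if os.environ.get("KEDRO_PANDERA_DISABLE"):
                return
            # ...usual runtime validation goes here...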

Remove leftover print statement in `resolve_dataframe_model`

I believe there is a leftover print statement in this method at this line:

https://github.com/Galileo-Galilei/kedro-pandera/blob/a631c1ab5710152b6afc9f1fd0e230a7cfab7a95/kedro_pandera/framework/config/resolvers.py#L35C51-L35C51

When loading a pandera DataFrameModel, the printed output isn't very verbose. But you can use the same resolver to load a pandera DataFrameSchema defined in Python, in which case this print statement outputs the entire schema and causes clutter. Could the print statement be removed or changed to a debug log message?
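
A sketch of the suggested change, assuming the resolver module defines (or can define) a standard module-level logger; the schema variable name is illustrative:

    import logging

    logger = logging.getLogger(__name__)

    # instead of: print(schema)
    logger.debug("Resolved pandera schema: %s", schema)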

Add a hello-world tutorial

Description

Code to copy-paste is worth a thousand words!

Possible Implementation

  • Clone the pandas iris tutorial
  • Create the schema for the example_iris_data dataset with the kedro pandera infer -d example_iris_data command
  • Add a manual validation to the schema (e.g. check that target is in ["setosa", "versicolor", "virginica"]); see the sketch after this list
  • Run on a fake dataset
  • Enjoy the failure message
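
For the manual validation step, a sketch of what the added check could look like with the pandera Python API (column names follow the issue's example and may differ from the actual iris starter):

    import pandera as pa

    iris_schema = pa.DataFrameSchema(
        {
            "sepal_length": pa.Column(float),
            "sepal_width": pa.Column(float),
            "petal_length": pa.Column(float),
            "petal_width": pa.Column(float),
            # the manually added check: target must be one of the three species
            "target": pa.Column(str, pa.Check.isin(["setosa", "versicolor", "virginica"])),
        }
    )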

Other features

  • use the CLI to see the schema
  • generate a fake dataset and run the pipeline with it
  • add extra configuration (lazy failure...)

Generate HTML documentation from schema

Description

YAML or Python schemas are very explicit, but hard to show to managers / stakeholders / business teams. Being able to convert schemas to prettier and more organized HTML documents would definitely help documentation efforts and consistency. It would be great if kedro-pandera could generate these docs automatically.

quoting @datajoely

Again dbt has had this for years and it's just a no brainer, we could easily generate static docs describing what data is in the catalog, associated metadata and tests.
There is also an obvious integration point with enterprise catalogs like Alation/Colibra/Amundsen

Context

Dataset documentation is a much-needed feature for interacting with non-technical teams.

Possible Implementation

Add a CLI command kedro pandera doc which would perform the conversion for all datasets with schemas.

The real question is where the responsibility of generating the HTML from a schema should live; this likely belongs to pandera itself.
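
If the generation ended up on the plugin side anyway, a minimal sketch could build an HTML table from the schema's columns with pandas (the attributes shown are only a subset of what pandera exposes):

    import pandas as pd


    def schema_to_html(schema):
        """Render a pandera DataFrameSchema as a simple HTML table."""
        rows = [
            {
                "column": name,
                "dtype": str(column.dtype),
                "nullable": column.nullable,
                "checks": ", ".join(str(check) for check in column.checks),
            }
            for name, column in schema.columns.items()
        ]
        return pd.DataFrame(rows).to_html(index=False)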

Enable Offline Data Check with Jupyter

Description

Enable data checking in Jupyter Notebook.

Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...

Interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Notice you can already do something like this (ugly but still easy):

data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)

With the same logic, maybe a CLI command kedro pandera validate would be helpful too, I guess you sometimes just want to check a new dataset quickly.

Context

This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project; a config file + DataCatalog + a pyproject.toml may be enough to make it work.

In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.

Possible Implementation

It is already possible to validate data against a given schema defined in the catalog with the pandera metadata key.

In addition to schema.validate, pandera also supports decorators for pipelines.
This would require inspecting function signatures and then working out which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline; we should start with the notebook workflow first.)

There are a few options:

  1. Monkeypatch a catalog.validate method
  2. Inherit from the current DataCatalog class - requires a change in settings.py to enable it
  3. A standalone kedro_pandera.validate(catalog, schema) helper (see the sketch after this list)
  4. ??
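
A minimal sketch of option 3, reusing the schema already declared in the catalog metadata (the helper name and metadata layout are assumptions):

    def validate(catalog, dataset_name, schema=None):
        """Load a dataset and validate it against an explicit schema or the one in its catalog metadata."""
        data = catalog.load(dataset_name)
        if schema is None:
            metadata = getattr(catalog._get_dataset(dataset_name), "metadata", None) or {}
            schema = metadata["pandera"]["schema"]
        return schema.validate(data)

In a notebook where catalog is already available, this becomes a one-liner, e.g. validate(catalog, "iris").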

Possible Alternatives

TBD

What do you want to see in `kedro-pandera`?

Description

Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do and what shouldn't it do? Why is it important to you?

Add kedro catalog validate command

Description

I'd like to validate data from the CLI.

Context

When I have changed a filepath in my catalog, I'd like to be able to validate this new dataset before running a whole pipeline.

Possible Implementation

Create a kedro pandera validate --dataset <dataset_name> command which will load and validate data.
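
A hedged sketch of what such a command could look like with click, assuming it runs from within a Kedro project and that the schema is stored under metadata.pandera.schema in the catalog:

    from pathlib import Path

    import click
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project


    @click.group(name="pandera")
    def pandera_commands():
        pass


    @pandera_commands.command()
    @click.option("--dataset", "-d", required=True, help="Dataset name to validate")
    def validate(dataset):
        """Load a dataset from the catalog and validate it against its pandera schema."""
        bootstrap_project(Path.cwd())
        with KedroSession.create() as session:
            catalog = session.load_context().catalog
            data = catalog.load(dataset)
            metadata = getattr(catalog._get_dataset(dataset), "metadata", None) or {}
            metadata["pandera"]["schema"].validate(data)
            click.echo(f"'{dataset}' passed pandera validation")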

Run the pipeline with fake pandera-generated data

Description

I want to be able to run a pipeline with fake data generated from dataset schemas, mainly for pipeline unit testing or debugging with a small dataset.

Context

Unit testing for data pipelines is hard, and this may be a helpful solution.

Possible Implementation(s)

  • create a kedro pandera dryrun --pipeline <pipeline_name> (name to be defined) command which would generate data for all input datasets thanks to pandera data synthesis and run the pipeline (see the sketch after this list)
  • create a PanderaRunner to run the pipeline with kedro run --runner=PanderaRunner --pipeline <pipeline_name>. The advantage is sticking to the kedro CLI and potentially enabling "composition" with other logic; the drawback is that this solution is not compatible with a custom config file we may introduce
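
A rough sketch of the first option's core, using pandera's data synthesis (schema.example, which needs the hypothesis-based strategies extra) and assuming every free input of the pipeline has a schema; MemoryDataset is the kedro>=0.19 spelling (MemoryDataSet before that):

    from kedro.io import DataCatalog, MemoryDataset
    from kedro.runner import SequentialRunner


    def pandera_dry_run(pipeline, catalog):
        """Run `pipeline` on synthetic data generated from the pandera schemas of its free inputs."""
        feed = {}
        for name in pipeline.inputs():
            metadata = getattr(catalog._get_dataset(name), "metadata", None) or {}
            schema = metadata.get("pandera", {}).get("schema")
            if schema is not None:
                # pandera data synthesis: draw a small synthetic dataframe from the schema
                feed[name] = MemoryDataset(schema.example(size=5))
        return SequentialRunner().run(pipeline, DataCatalog(feed))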

Add data validation to terminal outputs

Description

Runtime validation is performed in a before_node_run hook. This means we only validate datasets that get loaded (i.e. inputs and intermediate outputs). We should also validate terminal outputs before saving them.

Context

Users expect all datasets to be validated once.

Possible Implementation

Create an after_dataset_saved or an after_node_run hook which checks whether the dataset is a terminal output of the pipeline before validating it.
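
A sketch of the after_node_run variant, where the set of terminal (free) outputs is captured once in before_pipeline_run, assuming the schema lives under metadata["pandera"]["schema"]:

    from kedro.framework.hooks import hook_impl


    class TerminalOutputValidationHook:
        @hook_impl
        def before_pipeline_run(self, run_params, pipeline, catalog):
            # free outputs = datasets produced by the pipeline that no other node consumes
            self._free_outputs = pipeline.outputs()

        @hook_impl
        def after_node_run(self, node, catalog, outputs):
            for name, data in outputs.items():
                if name not in self._free_outputs:
                    continue  # intermediate outputs are already validated on load
                metadata = getattr(catalog._get_dataset(name), "metadata", None) or {}
                schema = metadata.get("pandera", {}).get("schema")
                if schema is not None:
                    schema.validate(data)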

Add a preview of the schema in kedro-viz

Description

I'd like to add a "schema preview" (maybe with a toggle button, as for code?) in kedro-viz, similar to the previews that already exist for code and datasets.

(screenshot: existing kedro-viz preview panel)

This would help document datasets directly from code.

Context

Documenting data pipelines and comprehensive checks is hard, and kedro-viz is a great tool to show what exists in the code. I think it would be really useful to have "self-documented" pipelines, and this would enhance collaboration and maintenance.

Possible Implementation

Absolutely no idea how to extend kedro-viz, happy to hear suggestions here :)

Add kedro~=0.19.0 compatibility

Description

This plugin's requirements pin kedro to <0.19. I'm using a newer kedro version and would like to use this plugin with it.
