unionai-oss / pandera
A light-weight, flexible, and expressive statistical data testing library
Home Page: https://www.union.ai/pandera
License: MIT License
Based on the conversation here:
https://github.com/cosmicBboy/pandera/pull/34
Should Pandera get a logo?
Many public projects have logos to help with recognisability.
There's a free logo generator on https://hatchful.shopify.com/
To make the logos below, I used the options: Get Started -> Services -> Reliable -> pandera+pandas schema validation -> Online store or website.
It might be good to get more contributors/users/feedback, and to speed up development and adoption of this tool, by having pandera carry the stamp of approval of the pyOpenSci community.
Express the schema as a dictionary:
DataFrameSchema({
    "column1": Column(Int, Validator)
})
For MultiIndex columns, use a tuple as the key:
DataFrameSchema({
    ("c1_level0", "c1_level1"): Column(Int, Validator, nullable=True)
})
If strict=True, then all columns in the dataframe must have a corresponding Column in the dataframe schema. If strict=False, raise a UserWarning that indicates which columns in the df aren't being validated.
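As a sketch of the intended behavior (assuming strict is a DataFrameSchema keyword, per this proposal):
import pandas as pd
from pandera import Column, DataFrameSchema, Int

schema = DataFrameSchema({"column1": Column(Int)}, strict=True)
df = pd.DataFrame({"column1": [1, 2], "extra": [3, 4]})
schema.validate(df)  # strict=True: fails because "extra" has no corresponding Column
                     # strict=False would instead emit a UserWarning for "extra"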
Describe the bug
If a dataframe contains null values in a column, the dataframe name is not returned when the errors in the series are reported. When multiple checks are being run in the same script, this makes debugging difficult as only the column name is returned.
To Reproduce
import pandas as pd
import numpy as np
from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int

schema = DataFrameSchema({
    "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
})

df = pd.DataFrame({
    "a": [1, 2, np.nan]
})

schema.validate(df)
This results in an error:
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
(full error report included at the bottom of this issue).
I think this is the expected behaviour, but it makes debugging very difficult for a user. Whilst I have raised a PR which enhances the SeriesSchemaBase class error reporting by including the series/column name, I'm unsure how best to also return the dataframe name so that a user can debug.
Expected behavior:
The name of the dataframe and the column should be returned to the user to enable fast debugging of why the error occurs.
e.g. for the above code, an error like this would be very helpful:
SchemaError: in dataframe 'df', column 'a' expected to have type int64, got float64 and expected to be non-nullable but contains null values: {2: nan}
Python version: 3.6
Pandera version: 0.12
Full error message:
SchemaError Traceback (most recent call last)
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
402 try:
--> 403 if s(data):
404 return data
/panderadev/pandera/pandera.py in __call__(self, df)
340 "need to `set_name` of column before calling it.")
--> 341 return super(Column, self).__call__(df[self._name])
342
/panderadev/pandera/pandera.py in __call__(self, series)
226 "expected series '%s' to have type %s, got %s and non-nullable series contains null values: %s" %
--> 227 (series.name, self._pandas_dtype.value, series.dtype, series[nulls].head(N_FAILURE_CASES).to_dict()))
228 else:
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
During handling of the above exception, another exception occurred:
SchemaError Traceback (most recent call last)
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
392 try:
--> 393 return s.validate(data)
394 except SchemaError as x:
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
121 for s in [self._schema(s, error=self._error, ignore_extra_keys=self._ignore_extra_keys) for s in self._args]:
--> 122 data = s.validate(data)
123 return data
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
405 except SchemaError as x:
--> 406 raise SchemaError([None] + x.autos, [e] + x.errors)
407 except BaseException as x:
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
During handling of the above exception, another exception occurred:
SchemaError Traceback (most recent call last)
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df)
/panderadev/pandera/pandera.py in validate(self, dataframe)
176 if not isinstance(dataframe, pd.DataFrame):
177 raise TypeError("expected dataframe, got %s" % type(dataframe))
--> 178 return self.schema.validate(dataframe)
179
180
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
393 return s.validate(data)
394 except SchemaError as x:
--> 395 raise SchemaError([None] + x.autos, [e] + x.errors)
396 except BaseException as x:
397 message = "%r.validate(%r) raised %r" % (s, data, x)
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
I'm not sure why, but I get the below when looking at PyPI. I have tried a few times and get the same error.
The image address on PyPI is odd: https://warehouse-camo.cmh1.psfhosted.org/8a54bccff4f1c84259a30f7d18afae06ce64b607/68747470733a2f2f6769746875622e636f6d2f636f736d696342626f792f70616e646572612f626c6f622f6d61737465722f646f63732f736f757263652f5f7374617469632f70616e646572612d62616e6e65722e737667
And it doesn't match the GitHub address for the same image, which I'd expect to be used.
I'll investigate properly at some point, but I'm time-constrained at the moment.
For a more intuitive API, two-sample Hypothesis test class method definitions should look like this:
def two_sample_hypothesis_test(
        cls, groupby, group1, group2, relationship, alpha=0.01,
        equal_var=True, nan_policy="propagate"):
    ...
which is more intuitive than specifying a list of groups that may or may not have 2 elements in it.
The user can subset by head, tail, or random sample (with a settable seed) in order to validate a subset of a potentially large dataframe.
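Given a DataFrameSchema instance schema and a dataframe df, a minimal sketch of the proposed call sites (the keyword names head, tail, sample, and random_state are assumptions):
schema.validate(df, head=100)                    # validate only the first 100 rows
schema.validate(df, tail=100)                    # ...or the last 100 rows
schema.validate(df, sample=100, random_state=0)  # ...or a seeded random sample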
Discuss the use case for partitioning a dataframe into valid and invalid portions.
Akash Gupta, in this pandera blogpost comment (https://disqus.com/by/disqus_jiG9N3PPd8), expressed interest in separating the dataframe into a valid and invalid portion.
Not exactly sure what use case he had in mind, but wondering if this might be a good idea.
The error message of an invalid dataframe contains some information about which indices were invalid, but perhaps debugging can be made easier by providing additional data in the SchemaError raised when calling schema.validate(df).
One proposal would be to extend SchemaError to include a failure_cases attribute and expose it to the user:
class SchemaError(Exception):
    def __init__(self, message, failure_cases):
        super(SchemaError, self).__init__(message)
        # some TBD data structure containing invalid cases. Maybe by `Column` + `Index`?
        self.failure_cases = failure_cases
Then the user can catch these errors in their code:
schema = DataFrameSchema(...)
df = ...

try:
    schema.validate(df)
except SchemaError as e:
    # suppose that `failure_cases` is a dict mapping Column names to a list of indexes in
    # the dataframe that didn't pass a particular `Check`, where each element
    # in the list corresponds to the `Check`.
    idx = e.failure_cases["column_name"][0]
    # access specific failure cases in the dataframe
    df.loc[idx, "column_name"]
If the user needs to validate some column x conditioned on the values of column y, the user can specify multiple column names as the key:
DataFrameSchema({
    ("x", "y"): Columns([Int, String], Check(lambda x, y: x[y == "foo"] > 10))
})
These should be methods that correspond to pandas dataframe operations.
For example, if the user adds a column to a dataframe, also support changing the corresponding schema to account for that change:
df = pd.DataFrame({"a": [1, 2, 3]})
schema = DataFrameSchema([Column("a", PandasDtype.Int)])
df = schema.validate(df)

# add a column to the dataframe
df["b"] = ["x", "y", "z"]
# add a column to the dataframe schema
schema = schema.add_column(Column("b", PandasDtype.String))
df = schema.validate(df)

# same with removing columns
df = df.drop("a", axis=1)
schema = schema.remove_column("a")
df = schema.validate(df)

# or reflecting changes in an existing column
df["a"] = df["a"].astype(float)
schema = schema.change_column(Column("a", PandasDtype.Float))
df = schema.validate(df)
Check if the wrapped function is a method of some object:
https://stackoverflow.com/questions/5963729/check-if-a-function-is-a-method-of-some-object
If so, ignore the first argument.
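A minimal sketch of the detection logic using inspect (the helper name is hypothetical):
import inspect

def first_data_arg_index(fn):
    """Return 1 if fn's first parameter looks like self/cls, else 0."""
    if inspect.ismethod(fn):
        return 0  # bound methods have self/cls bound already, nothing to skip
    params = list(inspect.signature(fn).parameters)
    return 1 if params and params[0] in ("self", "cls") else 0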
pandera offers a lot of flexibility in terms of the kinds of assertions you can make about a dataframe. However, it would be useful to have built-in checks for common operations. This issue is a proposal to define what those common operations are.
These should all use vectorized pandas operations.
Proposed built-in checks:
- comparisons against a value x: >, >=, <, <=, ==, !=
- value in a range x - y (inclusive or exclusive)
- string matches a "<some_format>" format
- string matches a "regex" pattern
- string starts with, ends with, or contains a "string"
The CompareColumns subclasses of Check are designed to work nicely with dataframe-level checks (#14).
The CompareColumns class enables built-in comparisons of two columns.
The proposed API for this would be something like:
DataFrameSchema(
    columns={...},
    checks=[
        Compare("col1").greater_than("col2"),
        Compare("col2").less_than_equal("col3"),
    ]
)
This will be an experimental API.
Could aid the credibility of the repo; it's fairly common for other projects to be associated with an org rather than a specific user, e.g.:
https://github.com/scikit-learn
https://github.com/pandas-dev
https://github.com/matplotlib
Move the documentation currently maintained in the readme to https://readthedocs.org, and use sphinx to auto-generate it.
This feature enables checks on the dataframe that assert properties about multiple columns in the dataframe, for example, to assert that the ratio of two columns is <, >, etc.
This should be expressed as a list of checks supplied to the DataFrameSchema constructor:
DataFrameSchema(
    columns={...},
    checks=[
        Check(lambda df: (df["col1"] / df["col2"]) > 1),
        Check(lambda df: (df["col1"] * df["col3"]) < 1),
    ]
)
Note a subtlety here: if we supply element_wise=True, then the function signature should apply to a row in the dataframe, as if we were doing the following:
for _, row in dataframe.iterrows():  # iterrows, so that row["col1"] indexing works
    (row["col1"] / row["col2"]) > 1
In pandera, the above schema would look like:
DataFrameSchema(
    columns={...},
    checks=[
        Check(lambda r: (r["col1"] / r["col2"]) > 1, element_wise=True),
        Check(lambda r: (r["col1"] * r["col3"]) < 1, element_wise=True),
    ]
)
This is a special class of multi-column validator that performs a hypothesis test on the dataset.
For two-sample hypothesis tests, the minimum requirement is that a groupby argument is specified.
For a one-sample hypothesis test, if the groupby argument is supplied, then the groups argument must be a string or a one-element list specifying the group that you want to test. If groupby is None, then the hypothesis test will apply to the entire column Series.
If we want to statistically assert that the average weight for men is higher than for women, then we can do something like:
DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Hypothesis(
                test="one_sided_two_sample_t_test", groupby="sex",
                groups=["men", "women"], relationship="gt",
                raise_warning=True)
        ])
})
# raise_warning makes it so that runtime raises a warning instead of an exception
# in cases where breaking the hypothesis test shouldn't block runtime
Need to figure out a suite of use cases (e.g. confidence interval assertions, etc.)
Make the interface more natural to pandas users:
schema = DataFrameSchema({
    "column1": Int,
    "column2": Float,
    "column3": String
})
No need for the Column class in this case.
Note that Int, Float, and String are now just validators.
Or provide a list of validators:
schema = DataFrameSchema({
    "column1": [Int, Nullable, Assert(lambda x: x > 1, element_wise=True)],
    ...
})
The Assert signature should be:
Assert(*callables, element_wise=True)
Make callables for each data type, e.g. Int, Float, etc., and also for Nullable.
The validator interface should be the same.
This is similar to #6, except this enables the user to merge schemas (using a similar API to DataFrame.merge).
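A hypothetical sketch of what that could look like (the merge method name mirrors the proposal; overlap semantics are TBD):
schema1 = DataFrameSchema({"col1": Column(Int)})
schema2 = DataFrameSchema({"col2": Column(Float)})
merged = schema1.merge(schema2)  # validates both col1 and col2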
This feature supports the use case where I want to be able to declare schema transformations as methods of a DataFrameSchema, so that I can express the dataframe schema transformations that I expect in a pipeline of transformations.
schema1 = DataFrameSchema({
    "col1": Column(Int, Check(lambda s: s >= 0)),
    "col2": Column(Int, Check(lambda s: s >= 0)),
})
schema2 = schema1.add_columns({
    "col3": Column(Int, Check(lambda s: s >= 0))
})
schema3 = schema2.remove_columns(["col1"])

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1, 2, 3],
})
df = schema1.validate(df)

df["col3"] = df["col1"] + df["col2"]
df = schema2.validate(df)

del df["col1"]
df = schema3.validate(df)
The add_columns and remove_columns methods should return a deep copy of the dataframe schema so that they don't mutate the original schema.columns dict.
Edit: move SchemaPipeline to its own issue #162
Currently, if coerce=True, the validation logic will first coerce the column to the specified dtype, and then the column checks are applied.
This causes an issue where coercing to type str will turn None into 'None', thus invalidating the nullable logic.
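A quick demonstration of the underlying pandas behavior:
import pandas as pd

s = pd.Series([None, "foo"])
s.astype(str).tolist()        # ['None', 'foo']: the null became the literal string 'None'
s.astype(str).isnull().any()  # False, so a nullable check can no longer detect it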
I'm having trouble with this section of SeriesSchemaBase:
https://github.com/pandera-dev/pandera/blob/ca5b39d329b3572d2889dd52a41ea97da6ef1534/pandera/pandera.py#L748-L751
It raises an error if I give it a series with a name. Here is a minimal example:
>>> import pandas as pd
>>> import pandera
>>> sample_df = pd.DataFrame({
...     "int_col": [1, 2, 3],
...     "float_col": [1.1, 2.5, 9.9]
... })
>>> series = pd.Series(sample_df['int_col'])  # Necessary here: name=None
>>> schema = pandera.SeriesSchema(
...     pandas_dtype='int', checks=pandera.Check(lambda s: s > 0)
... )
>>> schema.validate(series)
Traceback (most recent call last):
File "<input>", line 10, in <module>
File "C:\Users\ckrudewi\PycharmProjects\pandera\pandera\pandera.py", line 839, in validate
if super(SeriesSchema, self).__call__(series):
File "C:\Users\ckrudewi\PycharmProjects\pandera\pandera\pandera.py", line 751, in __call__
(type(self), self._name, series.name))
pandera.pandera.SchemaError: Expected <class 'pandera.pandera.SeriesSchema'> to have name 'None', found 'int_col'
I can try to suggest a change as a PR. But first I would like to ask: is there a particular reason for this behaviour?
Since we're working with pandas here, it makes sense to make vectorized checks the default setting, to encourage users to take advantage of the performance gains of vectorized checks.
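For contrast, a sketch of the two modes (consistent with the Check examples elsewhere in this document):
Check(lambda s: s > 0)                     # vectorized: s is the whole pandas Series
Check(lambda x: x > 0, element_wise=True)  # applied value by value, much slower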
What started as a small ~200 LOC project is now... larger.
We should modularize schemas, checks, hypotheses, errors, and decorators into their own modules.
I suggest allowing schemas to be passed as yaml files. That way it wouldn't be necessary to hardcode all the checks when using pandera; instead, they would be defined in the yaml schema.
There are two use cases I see:
The yaml format needs to be designed thoroughly in this case to offer optimal flexibility. I could think of something like this:
YAML schema definition:
# General section for dataframe-wide checks
dataframe:
  - min_length: 1000

# Checks per column
columns:
  column1:
    # List of checks, each one is a dictionary;
    # this allows parametrization
    - type: int
    - max: 10
    - allow_null: False
  column2:
    - type: float
    - max: -1.2
  column3:
    - type: str
    - match: "^value_"
    # Allow custom functions (here with arguments)
    - custom_function: split_shape
      split_char: "_"
      expected_splits: 2
Python code:
def split_shape(s, split_char, expected_splits):
    """Custom check function"""
    return s.str.split(split_char, expand=True).shape[1] == expected_splits

schema = DataFrameSchema.from_yaml(
    path="path_to_yaml",
    custom_functions=[split_shape]
)
validated_df = schema.validate(df)
As we probably don't want arbitrary Python code to be executed from the yaml file with the !!python syntax, I suggest that we rather go with a mix of built-in checks and the option to add user-defined functions, as in the example above.
This should coerce the type of all columns to the specified dtype before validation.
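A hedged sketch of the proposed behavior (assuming coerce is a DataFrameSchema keyword):
schema = DataFrameSchema({"a": Column(Int)}, coerce=True)
df = pd.DataFrame({"a": ["1", "2", "3"]})  # strings that are castable to int
df = schema.validate(df)                   # "a" is coerced to int64 before checks run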
This should raise a warning when nullable Int columns are defined... these should just default to Float.
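This follows from pandas itself, where a null forces an integer column to float:
import pandas as pd

pd.Series([1, 2, None]).dtype  # float64, not int64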
print(DataFrameSchemaObj) should output a human-readable string, e.g.:
DataFrameSchema({
    "column1": Column(type, checks),
    "column2": Column(type, checks),
    ...
})
This gist is useful:
https://gist.github.com/zshaheen/fe76d1507839ed6fbfbccef6b9c13ed9
@mastersplinter do you have any ideas on blogposts to demonstrate the utility of this library?
So far I've written this one:
https://cosmicbboy.github.io/2018/12/28/validating-pandas-dataframes.html
I'm thinking other topics could be:
Re: exposure:
Since python 2.7 will no longer be maintained as of Jan 1, 2020, we should also no longer maintain support for it moving forward.
pandera v0.1.3 will be released on June 9th, 2019.
Should plan to follow up with a v0.2.0 version bump to deprecate support for python 2.7 after resolving a few more issues.
For failure cases, print them out in the following form:
# suppose validator is lambda x: x > 3
failure case: 0 - index: [1, 2, 3, 4]
failure case: 1 - index: [10, 11]
failure case: 2 - index: [13]
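A minimal sketch of how the grouping could be computed (the variable names and exact output format are assumptions):
import pandas as pd

s = pd.Series([0, 1, 1, 1, 1, 5, 2, 2, 5, 3])
validator = lambda x: x > 3
failed = s[~s.map(validator)]
# group the failing indexes by each distinct failing value
for i, (value, index) in enumerate(failed.groupby(failed).groups.items()):
    print("failure case: %d - index: %s" % (i, list(index)))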
This is an idea I want to put out for discussion.
Together with a yaml schema format (#91), one could introduce schema inference to create a "draft schema" automatically. Currently it seems like a lot of manual work to thoroughly specify a schema for a dataframe with many features. With two methods, infer_schema(df) and to_yaml(), the work could be a little easier, because then the schema would only need some additional fine-tuning.
TensorFlow data validation offers such functionality, for example. However, their implementation also shows the complexity of such a feature. It seems to be a two-step approach:
But maybe pandera doesn't need to offer the same flexibility and could go with a simpler infer_schema function. What do you think?
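A hypothetical sketch of the proposed workflow (method names taken from the proposal above):
schema = infer_schema(df)      # draft schema inferred from observed dtypes/values
schema.to_yaml("schema.yml")   # serialize the draft for manual fine-tuning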
Should act in pretty much the same way.
This should happen after dropping 2.7 support.
It would be nice to be able to see the code coverage via a badge on the readme.
A free/commonly used approach is via:
https://codecov.io/
Hi Niels,
Awesome library! I pip-installed it, and tried to use it, but for some reason the library can't be found.
I get the same error whether I open python from the command line or open a Jupyter notebook. (I also tried opening a new Terminal window).
I'm using Miniconda3 with pandera I believe installed under:
/miniconda3/lib/python3.6/site-packages (0.0.3)
However, all I can find there is this folder:
pandera-0.0.3.dist-info
...with these files:
METADATA
RECORD
top_level.txt
WHEEL
Could it be that the install doesn't work properly on Miniconda?
Let me know if I need to provide more information.
Currently, only the *64-bit dtypes are supported by pandera.
category dtype: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
The following example:
from pandera import DataFrameSchema, Int, DateTime, String, Check, Column, Float, Bool, check_output, check_input
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "column1": [10.0, 20.0, 30.0],
    "column2": [1, 2, 3],
})

schema = DataFrameSchema({
    "column1": Column(Float),
    "column2": Column(Int),
})

@check_input(schema, 1)
def original_function(some_arg, df):
    return df

def wrapper(function_as_parameter, positional_arguments):
    df = function_as_parameter(positional_arguments)
    return df

def function_that_calls_original_function_via_wrapper(df):
    new_df = wrapper(
        function_as_parameter=original_function,
        positional_arguments=df)

function_that_calls_original_function_via_wrapper(df)
Results in this list index out of range error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-100-da3315e2c30b> in <module>
----> 1 function_that_calls_original_function_via_wrapper(df)
<ipython-input-99-0075cf7550ed> in function_that_calls_original_function_via_wrapper(df)
2 new_df = wrapper(
3 function_as_parameter=original_function,
----> 4 positional_arguments=df)
<ipython-input-98-c6fbf50f8087> in wrapper(function_as_parameter, positional_arguments)
1 def wrapper(function_as_parameter,positional_arguments):
----> 2 df = function_as_parameter(positional_arguments)
3 return df
/pandera_dev/pandera/pandera.py in _wrapper(fn, instance, args, kwargs)
766 zip(arg_spec_args, args))
767 args_dict[obj_getter] = schema.validate(args_dict[obj_getter])
--> 768 args = list(args_dict.values())
769 elif obj_getter is None:
770 try:
IndexError: list index out of range
The API for MultiIndex columns could be something like:
DataFrameSchema({
    ("col1_levela", "col1_levelb"): Column(...)
})
The API for MultiIndex indexes could be something like:
MultiIndex(
    Index(Int, ...),
    Index(String, ...),
)
The user enables "debug" mode (maybe as an ENV variable) in order to enter the scope of the schema validator where the error occurs.
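One possible sketch of that mechanism (the environment variable name is an assumption):
import os
import pdb

if os.getenv("PANDERA_DEBUG"):
    pdb.set_trace()  # drop into the validator's scope to inspect the failing series/check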