unionai-oss / pandera
A light-weight, flexible, and expressive statistical data testing library
Home Page: https://www.union.ai/pandera
License: MIT License
Based on the conversation here:
https://github.com/cosmicBboy/pandera/pull/34
Should Pandera get a logo?
Many public projects have logos to help with recognisability.
There's a free logo generator on https://hatchful.shopify.com/
To make the logos below, I used the options: Get Started -> Services -> Reliable -> pandera+pandas schema validation -> Online store or website.
It might be good to get more contributors/users/feedback, and to speed up development and adoption of this tool, by having pandera carry the stamp of approval of the pyOpenSci community.
Express the schema as a dictionary:
DataFrameSchema({
    "column1": Column(Int, Validator)
})
For MultiIndex columns, use a tuple as the key:
DataFrameSchema({
    ("c1_level0", "c1_level1"): Column(Int, Validator, nullable=True)
})
If strict=True, then all columns in the dataframe must have a corresponding Column in the dataframe schema. If strict=False, raise a UserWarning that indicates which columns in the df aren't being validated.
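As a sketch of the intended behavior (assuming strict is a DataFrameSchema keyword, per this proposal):
import pandas as pd
from pandera import Column, DataFrameSchema, Int

schema = DataFrameSchema({"column1": Column(Int)}, strict=True)
df = pd.DataFrame({"column1": [1, 2], "extra": [3, 4]})
schema.validate(df)  # strict=True: fails because "extra" has no corresponding Column
                     # strict=False would instead emit a UserWarning for "extra"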
Describe the bug
If a dataframe contains null values in a column, the dataframe name is not returned when the errors in the series are reported. When multiple checks are being run in the same script, this makes debugging difficult as only the column name is returned.
To Reproduce
import pandas as pd
import numpy as np
from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int

schema = DataFrameSchema({
    "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
})

df = pd.DataFrame({
    "a": [1, 2, np.nan]
})

schema.validate(df)
This results in an error:
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
(full error report included at the bottom of this issue).
I think this is the expected behaviour, but it makes debugging very difficult for a user. Whilst I have raised a PR which enhances the SeriesSchemaBase class error reporting by including the series/column name, I'm unsure how best to also return the dataframe name so that a user can debug.
Expected behavior:
The name of the dataframe and the column should be returned to the user to enable fast debugging of why the error occurs.
e.g. for the above code, an error like this would be very helpful:
SchemaError: in dataframe 'df', column 'a' expected to have type int64, got float64 and expected to be non-nullable but contains null values: {2: nan}
Python version: 3.6
Pandera version: 0.12
Full error message:
SchemaError Traceback (most recent call last)
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
402 try:
--> 403 if s(data):
404 return data
/panderadev/pandera/pandera.py in __call__(self, df)
340 "need to `set_name` of column before calling it.")
--> 341 return super(Column, self).__call__(df[self._name])
342
/panderadev/pandera/pandera.py in __call__(self, series)
226 "expected series '%s' to have type %s, got %s and non-nullable series contains null values: %s" %
--> 227 (series.name, self._pandas_dtype.value, series.dtype, series[nulls].head(N_FAILURE_CASES).to_dict()))
228 else:
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
During handling of the above exception, another exception occurred:
SchemaError Traceback (most recent call last)
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
392 try:
--> 393 return s.validate(data)
394 except SchemaError as x:
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
121 for s in [self._schema(s, error=self._error, ignore_extra_keys=self._ignore_extra_keys) for s in self._args]:
--> 122 data = s.validate(data)
123 return data
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
405 except SchemaError as x:
--> 406 raise SchemaError([None] + x.autos, [e] + x.errors)
407 except BaseException as x:
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
During handling of the above exception, another exception occurred:
SchemaError Traceback (most recent call last)
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df)
/panderadev/pandera/pandera.py in validate(self, dataframe)
176 if not isinstance(dataframe, pd.DataFrame):
177 raise TypeError("expected dataframe, got %s" % type(dataframe))
--> 178 return self.schema.validate(dataframe)
179
180
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
393 return s.validate(data)
394 except SchemaError as x:
--> 395 raise SchemaError([None] + x.autos, [e] + x.errors)
396 except BaseException as x:
397 message = "%r.validate(%r) raised %r" % (s, data, x)
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
I'm not sure why, but I get the below when looking at PyPI. I have tried a few times and get the same error.
The image address on PyPI is odd: https://warehouse-camo.cmh1.psfhosted.org/8a54bccff4f1c84259a30f7d18afae06ce64b607/68747470733a2f2f6769746875622e636f6d2f636f736d696342626f792f70616e646572612f626c6f622f6d61737465722f646f63732f736f757263652f5f7374617469632f70616e646572612d62616e6e65722e737667
And it doesn't match the GitHub address for the same image, which I'd expect to be used.
I'll investigate properly at some point, but I'm time-constrained at the moment.
For a more intuitive API, two-sample Hypothesis test class method definitions should look like this:
def two_sample_hypothesis_test(
        cls, groupby, group1, group2, relationship, alpha=0.01,
        equal_var=True, nan_policy="propagate"):
    ...
which is more intuitive than specifying a list of groups that may or may not have 2 elements in it.
The user can subset by head, tail, or random sample (with a settable seed) in order to validate a subset of a potentially large dataframe.
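Given a DataFrameSchema instance schema and a dataframe df, a minimal sketch of the proposed call sites (the keyword names head, tail, sample, and random_state are assumptions):
schema.validate(df, head=100)                    # validate only the first 100 rows
schema.validate(df, tail=100)                    # ...or the last 100 rows
schema.validate(df, sample=100, random_state=0)  # ...or a seeded random sample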
Discuss the use case for partitioning a dataframe into valid and invalid portions.
Akash Gupta, in this pandera blogpost comment (https://disqus.com/by/disqus_jiG9N3PPd8), expressed interest in separating the dataframe into a valid and invalid portion.
Not exactly sure what use case he had in mind, but wondering if this might be a good idea.
The error message of an invalid dataframe contains some information about which indices were invalid, but perhaps debugging can be made easier by providing additional data in the SchemaError raised when calling schema.validate(df).
One proposal would be to extend SchemaError to include a failure_cases attribute and expose it to the user:
class SchemaError(Exception):
    def __init__(self, message, failure_cases):
        super(SchemaError, self).__init__(message)
        # some TBD data structure containing invalid cases. Maybe by `Column` + `Index`?
        self.failure_cases = failure_cases
Then the user can catch these errors in their code:
schema = DataFrameSchema(...)
df = ...

try:
    schema.validate(df)
except SchemaError as e:
    # suppose that `failure_cases` is a dict mapping Column names to a list of indexes in
    # the dataframe that didn't pass a particular `Check`, where each element
    # in the list corresponds to the `Check`.
    idx = e.failure_cases["column_name"][0]
    # access specific failure cases in the dataframe
    df.loc[idx, "column_name"]
If the user needs to validate some column x conditioned on the values of column y, the user can specify multiple column names as the key:
DataFrameSchema({
    ("x", "y"): Columns([Int, String], Check(lambda x, y: x[y == "foo"] > 10))
})
These should be methods that correspond to pandas dataframe operations.
For example, if the user adds a column to a dataframe, also support changing the corresponding schema to account for that change:
df = pd.DataFrame({"a": [1, 2, 3]})
schema = DataFrameSchema([Column("a", PandasDtype.Int)])
df = schema.validate(df)

# add a column to the dataframe
df["b"] = ["x", "y", "z"]
# add a column to the dataframe schema
schema = schema.add_column(Column("b", PandasDtype.String))
df = schema.validate(df)

# same with removing columns
df = df.drop("a", axis=1)
schema = schema.remove_column("a")
df = schema.validate(df)

# or reflecting changes in an existing column
df["a"] = df["a"].astype(float)
schema = schema.change_column(Column("a", PandasDtype.Float))
df = schema.validate(df)
Check if the wrapped function is a method of some object:
https://stackoverflow.com/questions/5963729/check-if-a-function-is-a-method-of-some-object
If so, ignore the first argument.
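A minimal sketch of the detection logic using inspect (the helper name is hypothetical):
import inspect

def first_data_arg_index(fn):
    """Return 1 if fn's first parameter looks like self/cls, else 0."""
    if inspect.ismethod(fn):
        return 0  # bound methods have self/cls bound already, nothing to skip
    params = list(inspect.signature(fn).parameters)
    return 1 if params and params[0] in ("self", "cls") else 0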
pandera offers a lot of flexibility in terms of the kinds of assertions you can make about a dataframe. However, it would be useful to have built-in checks for common operations. This issue is a proposal to define what those common operations are.
These should all use vectorized pandas operations.
Proposed built-in checks:
- comparisons against a value x: >, >=, <, <=, ==, !=
- value in a range x - y (inclusive or exclusive)
- string matches a "<some_format>" format
- string matches a "regex" pattern
- string starts with, ends with, or contains a "string"
The CompareColumns subclasses of Check are designed to work nicely with dataframe-level checks (#14).
The CompareColumns class enables built-in comparisons of two columns.
The proposed API for this would be something like:
DataFrameSchema(
    columns={...},
    checks=[
        Compare("col1").greater_than("col2"),
        Compare("col2").less_than_equal("col3"),
    ]
)
This will be an experimental API.
Could aid the credibility of the repo; it's fairly common for other projects to be associated with an org rather than a specific user, e.g.:
https://github.com/scikit-learn
https://github.com/pandas-dev
https://github.com/matplotlib
Move the documentation currently maintained in the readme to https://readthedocs.org, and use sphinx to auto-generate it.
This feature enables checks on the dataframe that assert properties about multiple columns in the dataframe, for example, to assert that the ratio of two columns is <, >, etc.
This should be expressed as a list of checks supplied to the DataFrameSchema constructor:
DataFrameSchema(
    columns={...},
    checks=[
        Check(lambda df: (df["col1"] / df["col2"]) > 1),
        Check(lambda df: (df["col1"] * df["col3"]) < 1),
    ]
)
Note a subtlety here: if we supply element_wise=True, then the function signature should apply to a row in the dataframe, as if we were doing the following:
for _, row in dataframe.iterrows():  # iterrows, so that row["col1"] indexing works
    (row["col1"] / row["col2"]) > 1
In pandera, the above schema would look like:
DataFrameSchema(
    columns={...},
    checks=[
        Check(lambda r: (r["col1"] / r["col2"]) > 1, element_wise=True),
        Check(lambda r: (r["col1"] * r["col3"]) < 1, element_wise=True),
    ]
)
This is a special class of multi-column validator that performs a hypothesis test on the dataset.
For two-sample hypothesis tests, the minimum requirement is that a groupby argument is specified.
For a one-sample hypothesis test, if the groupby argument is supplied, then the groups argument must be a string or a one-element list specifying the group that you want to test. If groupby is None, then the hypothesis test will apply to the entire column Series.
If we want to statistically assert that the average weight for men is higher than for women, then we can do something like:
DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Hypothesis(
                test="one_sided_two_sample_t_test", groupby="sex",
                groups=["men", "women"], relationship="gt",
                raise_warning=True)
        ])
})
# raise_warning makes it so that runtime raises a warning instead of an exception
# in cases where breaking the hypothesis test shouldn't block runtime
Need to figure out a suite of use cases (e.g. confidence interval assertions, etc.)
Make the interface more natural to pandas users:
schema = DataFrameSchema({
    "column1": Int,
    "column2": Float,
    "column3": String
})
No need for the Column class in this case.
Note that Int, Float, and String are now just validators.
Or provide a list of validators:
schema = DataFrameSchema({
    "column1": [Int, Nullable, Assert(lambda x: x > 1, element_wise=True)],
    ...
})
The Assert signature should be:
Assert(*callables, element_wise=True)
Make callables for each data type, e.g. Int, Float, etc., and also for Nullable.
The validator interface should be the same.
This is similar to #6, except this enables the user to merge schemas (using a similar API to DataFrame.merge).
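A hypothetical sketch of what that could look like (the merge method name mirrors the proposal; overlap semantics are TBD):
schema1 = DataFrameSchema({"col1": Column(Int)})
schema2 = DataFrameSchema({"col2": Column(Float)})
merged = schema1.merge(schema2)  # validates both col1 and col2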
This feature supports the use case where I want to be able to declare schema transformations as methods of a DataFrameSchema, so that I can express the dataframe schema transformations that I expect in a pipeline of transformations.
schema1 = DataFrameSchema({
    "col1": Column(Int, Check(lambda s: s >= 0)),
    "col2": Column(Int, Check(lambda s: s >= 0)),
})
schema2 = schema1.add_columns({
    "col3": Column(Int, Check(lambda s: s >= 0))
})
schema3 = schema2.remove_columns(["col1"])

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1, 2, 3],
})
df = schema1.validate(df)

df["col3"] = df["col1"] + df["col2"]
df = schema2.validate(df)

del df["col1"]
df = schema3.validate(df)
The add_columns and remove_columns methods should return a deep copy of the dataframe schema so that they don't mutate the original schema.columns dict.
Edit: move SchemaPipeline to its own issue #162
Currently, if coerce=True, the validation logic will first coerce the column to the specified dtype, and then the column checks are applied.
This causes an issue where coercing to type str will turn None into 'None', thus invalidating the nullable logic.
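A quick demonstration of the underlying pandas behavior:
import pandas as pd

s = pd.Series([None, "foo"])
s.astype(str).tolist()        # ['None', 'foo']: the null became the literal string 'None'
s.astype(str).isnull().any()  # False, so a nullable check can no longer detect it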
I'm having trouble with this section of SeriesSchemaBase:
https://github.com/pandera-dev/pandera/blob/ca5b39d329b3572d2889dd52a41ea97da6ef1534/pandera/pandera.py#L748-L751
It raises an error if I give it a series with a name. Here is a minimal example:
>>> import pandas as pd
>>> import pandera
>>> sample_df = pd.DataFrame({
...     "int_col": [1, 2, 3],
...     "float_col": [1.1, 2.5, 9.9]
... })
>>> series = pd.Series(sample_df['int_col'])  # Necessary here: name=None
>>> schema = pandera.SeriesSchema(
...     pandas_dtype='int', checks=pandera.Check(lambda s: s > 0)
... )
>>> schema.validate(series)
Traceback (most recent call last):
File "<input>", line 10, in <module>
File "C:\Users\ckrudewi\PycharmProjects\pandera\pandera\pandera.py", line 839, in validate
if super(SeriesSchema, self).__call__(series):
File "C:\Users\ckrudewi\PycharmProjects\pandera\pandera\pandera.py", line 751, in __call__
(type(self), self._name, series.name))
pandera.pandera.SchemaError: Expected <class 'pandera.pandera.SeriesSchema'> to have name 'None', found 'int_col'
I can try to suggest a change as a PR. But first I would like to ask: is there a particular reason for this behaviour?
Since we're working with pandas here, it makes sense to make vectorized checks the default setting, to encourage users to take advantage of the performance gains of vectorized checks.
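For contrast, a sketch of the two modes (consistent with the Check examples elsewhere in this document):
Check(lambda s: s > 0)                     # vectorized: s is the whole pandas Series
Check(lambda x: x > 0, element_wise=True)  # applied value by value, much slower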
What started as a small ~200 LOC project is now... larger.
We should modularize schemas, checks, hypotheses, errors, and decorators into their own modules.
I suggest allowing schemas to be passed as yaml files. That way it wouldn't be necessary to hardcode all the checks when using pandera; instead, they would be defined in the yaml schema.
There are two use cases I see:
The yaml format needs to be designed thoroughly in this case to offer optimal flexibility. I could think of something like this:
YAML schema definition:
# General section for dataframe-wide checks
dataframe:
  - min_length: 1000

# Checks per column
columns:
  column1:
    # List of checks, each one is a dictionary;
    # this allows parametrization
    - type: int
    - max: 10
    - allow_null: False
  column2:
    - type: float
    - max: -1.2
  column3:
    - type: str
    - match: "^value_"
    # Allow custom functions (here with arguments)
    - custom_function: split_shape
      split_char: "_"
      expected_splits: 2
Python code:
def split_shape(s, split_char, expected_splits):
    """Custom check function"""
    return s.str.split(split_char, expand=True).shape[1] == expected_splits

schema = DataFrameSchema.from_yaml(
    path="path_to_yaml",
    custom_functions=[split_shape]
)
validated_df = schema.validate(df)
As we probably don't want arbitrary Python code to be executed from the yaml file with the !!python syntax, I suggest that we rather go with a mix of built-in checks and the option to add user-defined functions, as in the example above.
This should coerce the type of all columns to the specified dtype before validation.
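A hedged sketch of the proposed behavior (assuming coerce is a DataFrameSchema keyword):
schema = DataFrameSchema({"a": Column(Int)}, coerce=True)
df = pd.DataFrame({"a": ["1", "2", "3"]})  # strings that are castable to int
df = schema.validate(df)                   # "a" is coerced to int64 before checks run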
This should raise a warning when nullable Int columns are defined... these should just default to Float.
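This follows from pandas itself, where a null forces an integer column to float:
import pandas as pd

pd.Series([1, 2, None]).dtype  # float64, not int64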
print(DataFrameSchemaObj) should output a human-readable string, e.g.:
DataFrameSchema({
    "column1": Column(type, checks),
    "column2": Column(type, checks),
    ...
})
This gist is useful:
https://gist.github.com/zshaheen/fe76d1507839ed6fbfbccef6b9c13ed9
@mastersplinter do you have any ideas on blogposts to demonstrate the utility of this library?
So far I've written this one:
https://cosmicbboy.github.io/2018/12/28/validating-pandas-dataframes.html
I'm thinking other topics could be:
Re: exposure:
Since python 2.7 will no longer be maintained as of Jan 1, 2020, we should also no longer maintain support for it moving forward.
pandera v0.1.3 will be released on June 9th, 2019.
Should plan to follow up with a v0.2.0 version bump to deprecate support for python 2.7 after resolving a few more issues.
For failure cases, print them out in the following form:
# suppose validator is lambda x: x > 3
failure case: 0 - index: [1, 2, 3, 4]
failure case: 1 - index: [10, 11]
failure case: 2 - index: [13]
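A minimal sketch of how the grouping could be computed (the variable names and exact output format are assumptions):
import pandas as pd

s = pd.Series([0, 1, 1, 1, 1, 5, 2, 2, 5, 3])
validator = lambda x: x > 3
failed = s[~s.map(validator)]
# group the failing indexes by each distinct failing value
for i, (value, index) in enumerate(failed.groupby(failed).groups.items()):
    print("failure case: %d - index: %s" % (i, list(index)))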
This is an idea I want to put out for discussion.
Together with a yaml schema format (#91), one could introduce schema inference to create a "draft schema" automatically. Currently it seems like a lot of manual work to thoroughly specify a schema for a dataframe with many features. With two methods, infer_schema(df) and to_yaml(), the work could be a little easier, because then the schema would only need some additional fine-tuning.
TensorFlow data validation offers such functionality, for example. However, their implementation also shows the complexity of such a feature. It seems to be a two-step approach:
But maybe pandera doesn't need to offer the same flexibility and could go with a simpler infer_schema function. What do you think?
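A hypothetical sketch of the proposed workflow (method names taken from the proposal above):
schema = infer_schema(df)      # draft schema inferred from observed dtypes/values
schema.to_yaml("schema.yml")   # serialize the draft for manual fine-tuning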
Should act in pretty much the same way.
This should happen after dropping 2.7 support.
It would be nice to be able to see the code coverage via a badge on the readme.
A free/commonly used approach is via:
https://codecov.io/
Hi Niels,
Awesome library! I pip-installed it, and tried to use it, but for some reason the library can't be found.
I get the same error whether I open python from the command line or open a Jupyter notebook. (I also tried opening a new Terminal window).
I'm using Miniconda3 with pandera I believe installed under:
/miniconda3/lib/python3.6/site-packages (0.0.3)
However, all I can find there is this folder:
pandera-0.0.3.dist-info
...with these files:
METADATA
RECORD
top_level.txt
WHEEL
Could it be that the install doesn't work properly on Miniconda?
Let me know if I need to provide more information.
Currently, only the *64-bit dtypes are supported by pandera.
category dtype: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
The following example:
from pandera import DataFrameSchema, Int, DateTime, String, Check, Column, Float, Bool, check_output, check_input
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "column1": [10.0, 20.0, 30.0],
    "column2": [1, 2, 3],
})

schema = DataFrameSchema({
    "column1": Column(Float),
    "column2": Column(Int),
})

@check_input(schema, 1)
def original_function(some_arg, df):
    return df

def wrapper(function_as_parameter, positional_arguments):
    df = function_as_parameter(positional_arguments)
    return df

def function_that_calls_original_function_via_wrapper(df):
    new_df = wrapper(
        function_as_parameter=original_function,
        positional_arguments=df)

function_that_calls_original_function_via_wrapper(df)
Results in this list index out of range error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-100-da3315e2c30b> in <module>
----> 1 function_that_calls_original_function_via_wrapper(df)
<ipython-input-99-0075cf7550ed> in function_that_calls_original_function_via_wrapper(df)
2 new_df = wrapper(
3 function_as_parameter=original_function,
----> 4 positional_arguments=df)
<ipython-input-98-c6fbf50f8087> in wrapper(function_as_parameter, positional_arguments)
1 def wrapper(function_as_parameter,positional_arguments):
----> 2 df = function_as_parameter(positional_arguments)
3 return df
/pandera_dev/pandera/pandera.py in _wrapper(fn, instance, args, kwargs)
766 zip(arg_spec_args, args))
767 args_dict[obj_getter] = schema.validate(args_dict[obj_getter])
--> 768 args = list(args_dict.values())
769 elif obj_getter is None:
770 try:
IndexError: list index out of range
The API for MultiIndex columns could be something like:
DataFrameSchema({
    ("col1_levela", "col1_levelb"): Column(...)
})
The API for MultiIndex indexes could be something like:
MultiIndex(
    Index(Int, ...),
    Index(String, ...),
)
The user enables "debug" mode (maybe as an ENV variable) in order to enter the scope of the schema validator where the error occurs.
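One possible sketch of that mechanism (the environment variable name is an assumption):
import os
import pdb

if os.getenv("PANDERA_DEBUG"):
    pdb.set_trace()  # drop into the validator's scope to inspect the failing series/check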