
deepchecks / deepchecks

3.4K stars · 18 watchers · 240 forks · 274.26 MB

Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.

Home Page: https://docs.deepchecks.com/stable

License: Other

Makefile 0.41% Python 99.07% HTML 0.53%
machine-learning ml model-validation data-validation mlops data-science python jupyter-notebook model-monitoring data-drift

deepchecks's People

Contributors

allcontributors[bot], arterm-sedov, benisraeldan, cemalgurpinar, danarlowski, danbasson, deepchecks-bot, github-actions[bot], harsh-deepchecks, hjain5164, idow09, itaygabbay, jkl98isr, kishore-s-15, matanper, michaelmarien, nadav-barak, nirchecks, nirhutnik, nissimofir, noamzbr, rcwoolston, ronitay, shaypal5, shayts7, shir22, shiritdvir, shivshankardayal, thesoly, yromanyshyn


deepchecks's Issues

"Quickinstall" Collab is complaining of a warning when installing deepchecks

!pip install deepchecks

...
      Successfully uninstalled plotly-4.4.1
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.2.2
    Uninstalling matplotlib-3.2.2:
      Successfully uninstalled matplotlib-3.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed category-encoders-2.3.0 deepchecks-0.1.1 fonttools-4.28.5 matplotlib-3.5.1 plotly-5.5.0 scipy-1.7.3 tenacity-8.0.1
WARNING: The following packages were previously imported in this runtime:
  [matplotlib,mpl_toolkits]
You must restart the runtime in order to use newly installed versions.

Support free-form columns in dataset

Currently

We distinguish between categorical and non-categorical columns by defining (or identifying) them in the Dataset object. In some checks, non-categorical columns are assumed to be numeric, which causes failures.

Feature request

Identify free-form columns (automagically or defined by the user) to allow us to process or ignore this data in some checks.
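For illustration, a hedged heuristic for the "automagic" part (not the deepchecks API; the function name and thresholds are arbitrary): treat object columns with a high unique-value ratio and long average string length as free-form text.

import pandas as pd

def infer_free_text_columns(df: pd.DataFrame,
                            min_unique_ratio: float = 0.8,
                            min_avg_length: int = 20) -> list:
    """Guess which object columns hold free-form text (illustrative heuristic only)."""
    text_columns = []
    for col in df.select_dtypes(include='object').columns:
        values = df[col].dropna().astype(str)
        if values.empty:
            continue
        unique_ratio = values.nunique() / len(values)
        avg_length = values.str.len().mean()
        if unique_ratio >= min_unique_ratio and avg_length >= min_avg_length:
            text_columns.append(col)
    return text_columns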

Load and Compare Datasets

Howdy!

I'm a maintainer of the DataProfiler library; we built DataProfiler as an improved replacement for pandas-profiling. Some additional features:

  • Auto-detect & load CSV, AVRO, Parquet, JSON, text, and URL data: data = Data("your_filepath_or_url.csv")
  • Profile data in a multi-threaded way with robust PII detection: profile = Profiler(data)
  • Merge profiles: profile3 = profile1 + profile2
  • Take differences between profiles: profile_diff = profile1.diff(profile2)
  • Generate reports: readable_report = profile.report(report_options={"output_format": "compact"})

Would love to connect and / or see how we could be useful to this project!

import json
from dataprofiler import Data, Profiler

data = Data("your_filepath_or_url.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

SimpleModelComparison add_condition_ratio_not_less_than doesn't work well on imbalanced data

add_condition_ratio_not_less_than() is defined by default on this check in the default suite. When data is highly imbalanced, it is possible, for example, that the simple model's recall is 0.96 and the user model's recall is 0.99. The two are really close as a ratio, but in the imbalanced case this can mean that the user model is actually pretty good.

The fix should be to suggest a different condition that can recognize that, or alternatively to detect that this may be the case and explain it to the user.
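One hedged alternative, sketched below with made-up names (not the deepchecks API): compare the relative reduction of the simple model's error instead of the raw score ratio, which separates 0.96-vs-0.99 much more clearly on imbalanced data.

def error_reduction_gain(user_score: float, simple_score: float) -> float:
    """Fraction of the simple model's remaining error that the user model removes.

    For simple=0.96 and user=0.99 the raw ratio is ~1.03, but the error reduction
    is (0.04 - 0.01) / 0.04 = 0.75, which better reflects the gain on imbalanced data.
    """
    simple_error = 1.0 - simple_score
    if simple_error == 0:
        return 0.0
    return (simple_error - (1.0 - user_score)) / simple_error

# e.g. a condition could require at least 30% error reduction over the simple model
assert error_reduction_gain(user_score=0.99, simple_score=0.96) >= 0.3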

[BUG] Model Identification in sklearn Pipeline should look at the last step

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipe = Pipeline(steps=[
    ('transform', transformers),
    ('handle_nans', SimpleImputer(strategy='most_frequent')),
    ('model', clf)
])

The Boosting Overfit check raises: DeepchecksValueError: Unsupported model of type: SimpleImputer

The Boosting Overfit check should have identified the model as the pipeline's last step (clf, which was CatBoost).
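A minimal sketch of how the check could resolve the underlying estimator, using the standard sklearn Pipeline API (the helper name is made up for illustration):

from sklearn.pipeline import Pipeline

def final_estimator(model):
    """Return the last step of a Pipeline, or the model itself otherwise."""
    if isinstance(model, Pipeline):
        return model.steps[-1][1]   # equivalent to model[-1] on recent sklearn versions
    return model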

Conda package doesn't enforce dependency versions

May cause bugs such as:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_34/1021525983.py in <module>
      1 from deepchecks.checks import WholeDatasetDrift
      2 
----> 3 WholeDatasetDrift().run(ds_train, ds_test)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/distribution/whole_dataset_drift.py in run(self, train_dataset, test_dataset, model)
    114         self._cat_features = cat_features
    115 
--> 116         domain_classifier = self._generate_model(list(set(features) - set(cat_features)), cat_features)
    117 
    118         sample_size = min(self.sample_size, train_dataset.n_samples, test_dataset.n_samples)

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/distribution/whole_dataset_drift.py in _generate_model(self, numerical_columns, categorical_columns)
    247         categorical_transformer = Pipeline(
    248             steps=[('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan,
--> 249                                               dtype=np.float64))]
    250         )
    251 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

TypeError: __init__() got an unexpected keyword argument 'handle_unknown'
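Until the conda recipe pins dependency versions, one hedged mitigation is a runtime guard that verifies minimum versions at import time. A minimal sketch; the package list and minimum versions here are placeholders, not the official requirements:

from importlib.metadata import version   # standard library on Python 3.8+
from packaging.version import Version

# Placeholder minimums for illustration only
MINIMUM_VERSIONS = {'scikit-learn': '0.24.0', 'pandas': '1.3.0'}

for package, minimum in MINIMUM_VERSIONS.items():
    installed = Version(version(package))
    if installed < Version(minimum):
        raise ImportError(f'{package}>={minimum} is required, found {installed}')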

Exception when running ModelInfo check with catboost

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_34/3556338172.py in <module>
      1 from deepchecks.checks import ModelInfo
      2 
----> 3 ModelInfo().run(model)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/overview/model_info.py in run(self, model)
     32             CheckResult: value is dictionary in format {type: <model_type>, params: <model_params_dict>}
     33         """
---> 34         return self._model_info(model)
     35 
     36     def _model_info(self, model: BaseEstimator):

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/overview/model_info.py in _model_info(self, model)
     42         # Create dataframe to show
     43         model_param_df = pd.DataFrame(model_params.items(), columns=['Parameter', 'Value'])
---> 44         model_param_df['Default'] = model_param_df['Parameter'].map(lambda x: default_params[x])
     45 
     46         def highlight_not_default(data):

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   4159         dtype: object
   4160         """
-> 4161         new_values = super()._map_values(arg, na_action=na_action)
   4162         return self._constructor(new_values, index=self.index).__finalize__(
   4163             self, method="map"

/opt/conda/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
    868 
    869         # mapper is a function
--> 870         new_values = map_f(values, mapper)
    871 
    872         return new_values

/opt/conda/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/overview/model_info.py in <lambda>(x)
     42         # Create dataframe to show
     43         model_param_df = pd.DataFrame(model_params.items(), columns=['Parameter', 'Value'])
---> 44         model_param_df['Default'] = model_param_df['Parameter'].map(lambda x: default_params[x])
     45 
     46         def highlight_not_default(data):

KeyError: 'iterations'
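The KeyError happens because catboost's get_params() returns parameters (such as iterations) that are absent from the default-params mapping the check builds. A hedged sketch of a tolerant lookup, with toy stand-in data (the values below are hypothetical, not real catboost output):

import pandas as pd

# Toy stand-ins for the objects in the traceback
model_params = {'iterations': 500, 'depth': 6}
default_params = {'depth': 6}   # 'iterations' is missing, as with catboost

model_param_df = pd.DataFrame(model_params.items(), columns=['Parameter', 'Value'])
# .get() avoids the KeyError that plain indexing raises for parameters without a recorded default
model_param_df['Default'] = model_param_df['Parameter'].map(
    lambda name: default_params.get(name, 'N/A')
)
print(model_param_df)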

[DEE-205] Dataset Info Check

Purpose:

As a data scientist I want a simple check that will give me an overview of the structure of my data.

Inputs:

A single Dataset

Output:

Tables or graphs containing:

  • General information about the Dataset
  • Metadata about special Dataset columns that are defined (label column name, index column, date column)
  • For each column

Requirements:

  • The Dataset Info check should run quickly
  • It should work quickly and not clog the display even for a large number of features (e.g. 200)
  • Output should be easily digestible

Check Category:

Overview

DEE-205
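For illustration only, a minimal pandas sketch of the kind of per-column overview this check could produce (not the deepchecks implementation):

import pandas as pd

def dataset_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Compact per-column summary that stays readable even for ~200 features."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'null_ratio': df.isna().mean().round(3),
        'unique': df.nunique(),
    })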

pandas.Styler.format use assumes pandas 1.3.0 while requirements include pandas>=1.1.5

In line 28 of deepchecks.base.display_pandas.dataframe_to_html() we call df_styler.format(precision=2).

However, the format() method of the pandas.io.formats.style.Styler class only got the precision kwarg in pandas==1.3.0. Before that, precision could only be set in the Styler constructor.

As the title says, this makes the code rely on pandas 1.3.0, while requirements.txt states pandas>=1.1.5.

One option is to update requirements.txt. Another is, perhaps, to somehow initialize a custom Styler object with precision=2 and then inject it into the input dataframe (I guess that's possible, but I don't know that it is).


By the way, to keep tabs on leaky minimum requirements one can write a custom test that sets up a minimum-requirements virtualenv and runs all tests in it, perhaps like so:

sed "s/[>\~]/=/g" requirements.txt > min_requirements.txt
pip install -r min_requirements.txt
pytest ...

Discovered when running

OverallGenericCheckSuite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf, check_datasets_policy='both')

in the Iris notebook in my local environment (Python 3.8.5 with pandas==1.2.5) and getting the following error and stack trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.pyenv/versions/3.8.5/envs/py3/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True
    920 

~/clones/deepchecks/deepchecks/base/suite.py in _ipython_display_(self)
     29 
     30     def _ipython_display_(self):
---> 31         display_suite_result_2(self.name, self.results)
     32 
     33 

~/clones/deepchecks/deepchecks/base/display_suite.py in display_suite_result_2(suite_name, results)
    129         conditions_table.sort_values(by=['sort'], inplace=True)
    130         conditions_table.drop('sort', axis=1, inplace=True)
--> 131         display_dataframe(conditions_table, hide_index=True)
    132     else:
    133         display_html('<p>No conditions defined on checks in the suite.</p>', raw=True)

~/clones/deepchecks/deepchecks/base/display_pandas.py in display_dataframe(df, hide_index)
     12         hide_index (bool): Whether to hide or not the dataframe index.
     13     """
---> 14     display_html(dataframe_to_html(df, hide_index), raw=True)
     15 
     16 

~/clones/deepchecks/deepchecks/base/display_pandas.py in dataframe_to_html(df, hide_index)
     26         df_styler = df.style
     27         df_styler.set_table_styles([dict(selector='table,thead,tbody,th,td', props=[('text-align', 'left')])])
---> 28         df_styler.format(precision=2)
     29         if hide_index:
     30             df_styler.hide_index()

TypeError: format() got an unexpected keyword argument 'precision'
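A hedged compatibility shim (a sketch, assuming the project keeps supporting pandas 1.1.x): on pandas older than 1.3 fall back to the earlier set_precision API instead of format(precision=...).

import pandas as pd
from packaging.version import Version

def set_styler_precision(df_styler, precision: int = 2):
    """Apply numeric precision on both old and new pandas (illustrative shim)."""
    if Version(pd.__version__) >= Version('1.3.0'):
        df_styler.format(precision=precision)     # precision kwarg added in pandas 1.3.0
    else:
        df_styler.set_precision(precision)        # older API, deprecated in 1.3
    return df_styler

dataframe_to_html() could then call such a helper instead of invoking df_styler.format(precision=2) directly.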

Exception when running DataDuplicates check

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_34/3771541214.py in <module>
      1 from deepchecks.checks import DataDuplicates
      2 
----> 3 DataDuplicates().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/data_duplicates.py in run(self, dataset, model)
     71             raise DeepchecksValueError('Dataset does not contain any data')
     72 
---> 73         group_unique_data = df[data_columns].groupby(data_columns, dropna=False).size()
     74         n_unique = len(group_unique_data)
     75 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in size(self)
   1834             result = result.rename("size").reset_index()
   1835 
-> 1836         return self._reindex_output(result, fill_value=0)
   1837 
   1838     @final

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   3163         levels_list = [ping.group_index for ping in groupings]
   3164         index, _ = MultiIndex.from_product(
-> 3165             levels_list, names=self.grouper.names
   3166         ).sortlevel()
   3167 

/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    618 
    619         # codes are all ndarrays, so cartesian_product is lossless
--> 620         codes = cartesian_product(codes)
    621         return cls(levels, codes, sortorder=sortorder, names=names)
    622 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in cartesian_product(X)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in <listcomp>(.0)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

<__array_function__ internals> in repeat(*args, **kwargs)

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    477 
    478     """
--> 479     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    480 
    481 

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

MemoryError: Unable to allocate 589. PiB for an array with shape (331491645779374080,) and data type int16
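The explosion comes from pandas reindexing the groupby result over the cartesian product of all group levels. For illustration, a hedged way to count duplicates that avoids the multi-column groupby entirely (a sketch, not the actual deepchecks fix):

import pandas as pd

def duplicate_stats(df: pd.DataFrame, columns=None):
    """Count exact duplicate rows without a multi-column groupby (illustrative)."""
    data = df[columns] if columns is not None else df
    dup_mask = data.duplicated(keep=False)        # True for every row that has a twin
    n_unique = len(data.drop_duplicates())
    return n_unique, int(dup_mask.sum())

# Example
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(duplicate_stats(df))  # (2, 2): 2 unique rows, 2 rows involved in duplication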

BUG: New Label Train Test in regression task

When running the full suite on a regression task, I encounter 2 different bugs:

  1. This check appears twice
  2. This check appears at all (in regression I don't expect it to be there)

(screenshot in the original issue)

Output Result of Full Suite In PDF / HTML

Hi,

This tool is excellent.
Is there a way to generate a report of the full suite (i.e. in PDF / PPT / any other format that can be presented outside a notebook and sent by mail to managers etc. after model training / research)?
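Until a built-in export exists, one hedged workaround is to render the notebook that ran the suite into a standalone HTML file with standard Jupyter tooling (nbconvert); the notebook filename below is a placeholder:

import nbformat
from nbconvert import HTMLExporter

# 'full_suite_report.ipynb' is a placeholder for the notebook that ran the suite
notebook = nbformat.read('full_suite_report.ipynb', as_version=4)
body, _resources = HTMLExporter().from_notebook_node(notebook)

with open('full_suite_report.html', 'w', encoding='utf-8') as f:
    f.write(body)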

Solve user guide docs dependencies

Update "dataset object" documentation to have the concepts, and link to API reference for the parameters.


From original discussion:
Yes, I agree that this isn't sustainable over time. It can be useful for now (assuming it's still updated).

I discussed this with @matanper, who is the one who wrote this specific page :) (I just appended it as part of the PR because the context was relevant and he was away). We agreed that we should shorten this description and add a link to the API reference. However, for now let's keep it as is and work on it as a separate small task (until we update it, it's still better to have this version than to have nothing).

Originally posted by @shir22 in #340 (comment)

[DEE-206] [feat] Allow Users to Create Checks On the Fly

Background

Since the checks in the package are pre-configured, a user may want a specific check that has not been created yet, or a check tailored to their personal needs. This might cause users to pass on the package altogether because it won't fully answer their needs.

Considerations

  • Creating a check at runtime should be as easy as possible, while still answering the needs of the user & the package (output-wise, interface-wise)
  • Checks that are created on the fly should declare whether they are data checks or model checks, as that changes some of the behaviour
  • Creating a check by the user needs to be fully documented with examples, as it may become a key factor in the user's decision to adopt the package

Proposal Concept

Allow users to define a check at runtime (or "on the fly"), whether in an ipynb or a py file.
That way, even if the package does not answer the user's whole needs, it might still answer enough while allowing the user to fully customize the specifics.

Thought on implementations

At most, all checks have the following:

  • a "check" which actually validates against the data/model
  • a "validation" which makes sure that the input data is correct
  • an "output" which depends on future implementations and changes, needs to correlate with the global package usage

As such, I feel the main option is to create a class (let's call it CustomCheck for this proposal).
CustomCheck should have a function that returns a check, which can later be injected into suites or run independently.

Our CustomCheck may look a bit like this:

...
## imports

# CustomDatasetBaseCheck should have ...
from mlchecks.base.check import CheckResult, CustomDatasetBaseCheck
...


class CustomCheck(CustomDatasetBaseCheck):
    # variable to hold the "requires" field
    # variable to hold the "checkFunction" field
    # variable to know if we need to parse the output or not
    # basic usage functions

    @classmethod
    def new(cls, *, checkFunction, requires):
        # Returns an initialized CustomCheck based on the function requirements.
        # The "requires" param should change the class behaviour in terms of validation and output,
        # as they differ by the input requirements.
        # A param could control whether we parse the output of the check or the user does it
        # (if it's a plot or something complex).
        ...

    def run(self, dataset=None, additional_dataset=None, model=None) -> CheckResult:
        # Based on the "requires" param, call the appropriate helper in mlchecks.base.check to validate
        output = self.checkFunction(dataset=dataset, additional_dataset=additional_dataset, model=model)
        # If needed:
        # based on the "requires" param, call the appropriate helper in mlchecks.base.check to "output" the data
        ...

The vision of implementation by the user

As a basic check, let's assume a user wants to check the row count:

...
### imports
from typing import Union

import pandas as pd

from mlchecks import Dataset  # assumed import path
from mlchecks.checks import CustomCheck
...

def checkRowCount(dataset: Union[pd.DataFrame, Dataset]):
    return len(dataset.data)

rowCountCheck = CustomCheck.new(checkFunction=checkRowCount,
                                requires=CustomCheck.SINGLE_DATASET,  # TWO_DATASETS / DATASET_MODEL / MODEL, etc.
                                )  # in the future we may add "output_format=html/json/yaml/cli"


data = {'col1': ['foo', 'bar', 'cat']}
dataframe = pd.DataFrame(data=data)

rowCountCheck.run(dataframe)

Changes that will be required

  • there should be global validation functions for the input params (these should also be used by the internal checks)
  • there should be a global output function for the params (this should also be used by the internal checks)
  • the two points above will force changes to all the internal checks

EDIT:

We've thought about allowing the user to "choose" whether we should parse the output of the check (if it's something simple like a dataset, etc.) or whether they want to add this logic to the check itself (which would make it easier for them to copy-paste their existing code almost 1:1).

DEE-206

[DEE-204] Suggestion to increase `Checks` flexibility

I think that the TrainValidationBaseCheck check type and the way CheckSuite is implemented make the entire library less flexible.

From my perspective, it would be better to add an additional parameter to the SingleDatasetBaseCheck and CompareDatasetsBaseCheck types which determines to which datasets any particular check should be applied. Take a look at the example below to understand what I mean.

import abc
import typing as t

# "base", "BaseCheck", "CheckResult" and "CheckSuite" below refer to existing deepchecks internals

SingleDatasetCheckPolicy = t.Callable[[base.Dataset], bool]
TwoDatasetsCheckPolicy = t.Callable[[base.Dataset, base.Dataset], bool]

class SingleDatasetCheck(BaseCheck):

    def __init__(self, policies: t.Optional[t.Sequence[SingleDatasetCheckPolicy]] = None):
        self.policies = policies
    
    @abc.abstractmethod
    def run(self, dataset: base.Dataset, model: object = None) -> CheckResult:
        """Define run signature."""
        raise NotImplementedError()

class TwoDatasetsCheck(BaseCheck):

    def __init__(self, policies: t.Optional[t.Sequence[TwoDatasetsCheckPolicy]] = None):
        self.policies = policies
    
    @abc.abstractmethod
    def run(self, first: base.Dataset, second: base.Dataset, model: object = None) -> CheckResult:
        """Define run signature."""
        raise NotImplementedError()

# built-in checks

class StringMismatchComparison(TwoDatasetsCheck):
    def __init__(
        self, 
        policies: t.Optional[t.Sequence[TwoDatasetsCheckPolicy]] = None,
        columns: t.Union[str, t.Iterable[str]] = None, 
        ignore_columns: t.Union[str, t.Iterable[str]] = None
    ):
        super().__init__()
        self.policies = policies or [lambda *args, **kwargs: True]
        ...

class CategoryMismatchTrainValidation(TwoDatasetsCheck):
    def __init__(
        self, 
        policies: t.Optional[t.Sequence[TwoDatasetsCheckPolicy]] = None,
        columns: t.Union[str, t.Iterable[str]] = None, 
        ignore_columns: t.Union[str, t.Iterable[str]] = None
    ):
        super().__init__()
        self.policies = policies or [lambda first_dataset, second_dataset: first_dataset.role == "train"]
        ...

# how run method of the CheckSuite will look

single_dataset_checks = filter(lambda: ..., self.checks)
two_datasets_checks = filter(lambda: ..., self.checks)

for dataset in (train_dataset, validation_dataset):
    for check in single_dataset_checks:
        if all(p(dataset) for p in check.policies):
            check_result = check.run(dataset, model)

for check in two_datasets_checks:
    if all(p(train_dataset, validation_dataset) for p in check.policies):
        check_result = check.run(train_dataset, validation_dataset, model)
    elif all(p(validation_dataset, train_dataset) for p in check.policies):
        check_result = check.run(validation_dataset, train_dataset, model)

...

# built-in suites

ComparativeIntegrityCheckSuite = CheckSuite(
    'Comparative Integrity Suite',
    StringMismatchComparison(),
    CategoryMismatchTrainValidation()
)

# built-in checks policies/policies builders

def dataset_role(role: str) -> SingleDatasetCheckPolicy:
    def policy(dataset: base.Dataset) -> bool:
        return dataset.role == role
    return policy

def datasets_roles(first_dataset: str, second_dataset: str) -> TwoDatasetsCheckPolicy:
    def policy(first: base.Dataset, second: base.Dataset) -> bool:
        return (first_dataset, second_dataset) == (first.role, second.role)
    return policy

def has_timestamp(dataset: base.Dataset) -> bool:
    return dataset.has_timestamp_column()

# custom user-defined checks

class MyCustomCheck(SingleDatasetCheck):
    ...

class MySecondCustomCheck(TwoDatasetsCheck):
    ...

# instantiation of suite

CheckSuite(
    "<suit name>",
    # NOTE: policies will be used by the Suite to determine if it should apply check to the dataset
    MyCustomCheck([dataset_role("train"), has_timestamp]),
    MySecondCustomCheck([datasets_roles("validation", "train")]),
)

DEE-204

Regression error distribution is missing the prediction column

When I look at the results of the regression error distribution check, I can see the features, the target, and the target-prediction difference. It would be really helpful to also see the prediction itself, since it is hard to understand the direction of the error from the wording "Largest over estimated error". Also, users often skip reading long texts.

(screenshot in the original issue)

Deleting all the conditions of a check

Is there a function such as suite[0].deleteAllConds() or something like that, or do I have to manually iterate over the conditions and delete all of them?
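Hedged answer, assuming a deepchecks version whose checks expose a clean_conditions() method and whose suites expose a checks mapping (verify both against your installed version; the names below are assumptions):

# `suite` is whichever deepchecks suite object you already built
for check in suite.checks.values():   # assumed: suite.checks maps index -> check instance
    check.clean_conditions()          # assumed: removes every condition from the check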

`utils.features._built_in_importance` misuses `model.coef_`

In some sklearn linear models, the coef_ attribute of a fitted estimator might be either a 1d array or a 2d one (e.g. in the case of LinearRegression it depends on the number of targets passed). In others, it is guaranteed to always be 2d, like for LogisticRegression:

coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
 Coefficient of the features in the decision function.

 coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

The utils.features._built_in_importance method, used by several checks (including at the very least TrainTestFeatureDrift, DominantFrequencyChange and StringMismatchComparison), misuses this attribute by assuming it is always 1d in the case of linear models. See the line after the elif here:

def _built_in_importance(model: t.Any, dataset: 'base.Dataset') -> t.Optional[pd.Series]:
    """Get feature importance member if present in model."""
    if 'feature_importances_' in dir(model):  # Ensembles
        normalized_feature_importance_values = model.feature_importances_/model.feature_importances_.sum()
        return pd.Series(normalized_feature_importance_values, index=dataset.features)
    elif 'coef_' in dir(model):  # Linear models
        coef = np.abs(model.coef_)
        coef = coef / coef.sum()
        return pd.Series(coef, index=dataset.features)
    else:
        return

In case of a 2d array, this should use model.coef_[0], of course.
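A hedged version of that branch which tolerates both shapes (a sketch, not the actual patch; for multiclass models one still has to decide how to aggregate the per-class rows):

import numpy as np
import pandas as pd

def linear_model_importance(model, feature_names) -> pd.Series:
    """Normalize |coef_| into per-feature importance, tolerating 1d or 2d shapes."""
    coef = np.abs(np.asarray(model.coef_))
    if coef.ndim == 2:
        coef = coef.mean(axis=0)   # collapses class rows; coef[0] also works for the binary case
    coef = coef / coef.sum()
    return pd.Series(coef, index=list(feature_names))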

Minimal example to reproduce:

import pandas as pd
import deepchecks
from sklearn.linear_model import LogisticRegression
from deepchecks.utils.features import _built_in_importance

train_df = pd.DataFrame([[23, True], [19, False], [15, False], [5, True]], columns=['age', 'smoking'], index=[0, 1, 2, 3])
train_y = pd.Series([1, 1, 0, 0])
test_df = pd.DataFrame([[21, True], [40, False], [12, False], [50, True]], columns=['age', 'smoking'], index=[0, 1, 2, 3])
test_y = pd.Series([1, 0, 1, 0])

logreg = LogisticRegression()
logreg.fit(train_df, train_y)

ds_train = deepchecks.Dataset(df=train_df, label=train_y)
ds_test = deepchecks.Dataset(df=test_df, label=test_y)

_built_in_importance(logreg, ds_train)

This will yield:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/2q/pxh7dn6s1nq4fb8hw_x4lqd80000gn/T/ipykernel_42799/3322691940.py in <module>
----> 1 _built_in_importance(logreg, ds_train)

~/clones/deepchecks/deepchecks/utils/features.py in _built_in_importance(model, dataset)
    113         coef = np.abs(model.coef_)
    114         coef = coef / coef.sum()
--> 115         return pd.Series(coef, index=dataset.features)
    116     else:
    117         return

~/.pyenv/versions/3.8.5/envs/py3/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    428                 index = ibase.default_index(len(data))
    429             elif is_list_like(data):
--> 430                 com.require_length_match(data, index)
    431 
    432             # create/copy the manager

~/.pyenv/versions/3.8.5/envs/py3/lib/python3.8/site-packages/pandas/core/common.py in require_length_match(data, index)
    529     """
    530     if len(data) != len(index):
--> 531         raise ValueError(
    532             "Length of values "
    533             f"({len(data)}) "

ValueError: Length of values (1) does not match length of index (2)

Exception when running StringLengthOutOfBounds check

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_34/59083449.py in <module>
      1 from deepchecks.checks import StringLengthOutOfBounds
      2 
----> 3 StringLengthOutOfBounds().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/string_length_out_of_bounds.py in run(self, dataset, model)
     75         """
     76         feature_importances = calculate_feature_importance_or_none(model, dataset)
---> 77         return self._string_length_out_of_bounds(dataset, feature_importances)
     78 
     79     def _string_length_out_of_bounds(self, dataset: Union[pd.DataFrame, Dataset],

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/string_length_out_of_bounds.py in _string_length_out_of_bounds(self, dataset, feature_importances)
     92                 continue
     93 
---> 94             string_length_column = column.map(lambda x: len(str(x)), na_action='ignore')
     95 
     96             # If not a lot of unique values, calculate the percentiles for existing values.

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   4159         dtype: object
   4160         """
-> 4161         new_values = super()._map_values(arg, na_action=na_action)
   4162         return self._constructor(new_values, index=self.index).__finalize__(
   4163             self, method="map"

/opt/conda/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
    850             values = self._values
    851             if na_action is not None:
--> 852                 raise NotImplementedError
    853             map_f = lambda values, f: values.map(f)
    854         else:

NotImplementedError: 
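The failure is pandas refusing na_action='ignore' for that column's dtype. A hedged workaround sketch that computes string lengths without map/na_action (not necessarily the deepchecks fix):

import pandas as pd

def string_lengths(column: pd.Series) -> pd.Series:
    """Length of each non-null value as a string, skipping NaNs without na_action."""
    non_null = column.dropna()
    return non_null.astype(str).str.len()

# Example: works for object and categorical columns alike
col = pd.Series(['abc', None, 'de'], dtype='category')
print(string_lengths(col).tolist())  # [3, 2]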

[DEE-209] Add an option to fix possible string mismatches

When I see the output of the string mismatch check, it really helps me understand that I have some integrity issues in my data.

I think it would be really useful to have a function that can fix the mismatches for me, according to the matching algorithm that already exists in the check.
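For illustration only, a hedged sketch of what such a fixer could do, using a simple lowercase/alphanumeric base form rather than the check's actual matching algorithm: replace every variant with the most common original spelling of its base form.

import re
import pandas as pd

def unify_string_variants(column: pd.Series) -> pd.Series:
    """Map variants that share a base form to the most frequent spelling (illustrative)."""
    base = column.astype(str).map(lambda v: re.sub(r'[^a-z0-9]', '', v.lower()))
    # most frequent original spelling per base form
    canonical = (
        pd.DataFrame({'base': base, 'value': column.astype(str)})
        .groupby('base')['value']
        .agg(lambda s: s.value_counts().idxmax())
    )
    return base.map(canonical)

col = pd.Series(['Deep', 'deep', 'DEEP!', 'other'])
print(unify_string_variants(col).tolist())  # e.g. ['Deep', 'Deep', 'Deep', 'other']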

DEE-209

Exception when running LabelAmbiguity check

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_34/2864323761.py in <module>
      1 from deepchecks.checks import LabelAmbiguity
      2 
----> 3 LabelAmbiguity().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/label_ambiguity.py in run(self, dataset, model)
     64 
     65         group_unique_data = dataset.data.groupby(dataset.features, dropna=False)
---> 66         group_unique_labels = group_unique_data.nunique()[label_col]
     67 
     68         num_ambiguous = 0

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in nunique(self, dropna)
   1803         obj = self._obj_with_exclusions
   1804         results = self._apply_to_column_groupbys(
-> 1805             lambda sgb: sgb.nunique(dropna), obj=obj
   1806         )
   1807 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _apply_to_column_groupbys(self, func, obj)
   1709         columns = obj.columns
   1710         results = [
-> 1711             func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
   1712         ]
   1713 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in <listcomp>(.0)
   1709         columns = obj.columns
   1710         results = [
-> 1711             func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
   1712         ]
   1713 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in <lambda>(sgb)
   1803         obj = self._obj_with_exclusions
   1804         results = self._apply_to_column_groupbys(
-> 1805             lambda sgb: sgb.nunique(dropna), obj=obj
   1806         )
   1807 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in nunique(self, dropna)
    671 
    672         result = self.obj._constructor(res, index=ri, name=self.obj.name)
--> 673         return self._reindex_output(result, fill_value=0)
    674 
    675     @doc(Series.describe)

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   3163         levels_list = [ping.group_index for ping in groupings]
   3164         index, _ = MultiIndex.from_product(
-> 3165             levels_list, names=self.grouper.names
   3166         ).sortlevel()
   3167 

/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    618 
    619         # codes are all ndarrays, so cartesian_product is lossless
--> 620         codes = cartesian_product(codes)
    621         return cls(levels, codes, sortorder=sortorder, names=names)
    622 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in cartesian_product(X)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in <listcomp>(.0)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

<__array_function__ internals> in repeat(*args, **kwargs)

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    477 
    478     """
--> 479     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    480 
    481 

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

MemoryError: Unable to allocate 45.3 PiB for an array with shape (25499357367644160,) and data type int16

feature importance crashes if not supplied with a standard sklearn model

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator my_dumb_model() does not.

Permutation importance assumes the model has fit and score attributes. We need to check whether it has these attributes, and if it does not have a score function, use a default scorer based on the assumed task type.
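A hedged sketch of that guard around sklearn's permutation_importance, passing an explicit scorer when the model lacks a score method (the helper name and per-task defaults are assumptions):

from sklearn.inspection import permutation_importance

def safe_permutation_importance(model, X, y, task_type: str = 'classification'):
    """Fall back to a default scorer when the model has no score() method (sketch)."""
    if hasattr(model, 'score'):
        scoring = None   # let sklearn use model.score
    else:
        scoring = 'accuracy' if task_type == 'classification' else 'neg_mean_squared_error'
    return permutation_importance(model, X, y, scoring=scoring, n_repeats=5, random_state=0)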
