
deepchecks / deepchecks

3.4K stars · 18 watchers · 240 forks · 274.26 MB

Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.

Home Page: https://docs.deepchecks.com/stable

License: Other

Makefile 0.41% Python 99.07% HTML 0.53%
machine-learning ml model-validation data-validation mlops data-science python jupyter-notebook model-monitoring data-drift

deepchecks's People

Contributors

allcontributors[bot], arterm-sedov, benisraeldan, cemalgurpinar, danarlowski, danbasson, deepchecks-bot, github-actions[bot], harsh-deepchecks, hjain5164, idow09, itaygabbay, jkl98isr, kishore-s-15, matanper, michaelmarien, nadav-barak, nirchecks, nirhutnik, nissimofir, noamzbr, rcwoolston, ronitay, shaypal5, shayts7, shir22, shiritdvir, shivshankardayal, thesoly, yromanyshyn


deepchecks's Issues

"Quickinstall" Collab is complaining of a warning when installing deepchecks

!pip install deepchecks

...
      Successfully uninstalled plotly-4.4.1
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.2.2
    Uninstalling matplotlib-3.2.2:
      Successfully uninstalled matplotlib-3.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed category-encoders-2.3.0 deepchecks-0.1.1 fonttools-4.28.5 matplotlib-3.5.1 plotly-5.5.0 scipy-1.7.3 tenacity-8.0.1
WARNING: The following packages were previously imported in this runtime:
  [matplotlib,mpl_toolkits]
You must restart the runtime in order to use newly installed versions.

Support free-form columns in dataset

Currently

We distinguish between categorical and non-categorical columns by defining (or identifying) them in the Dataset object. In some checks, non-categorical columns are assumed to be numeric, which causes failures.

Feature request

Identify free-form columns (automagically or defined by the user) to allow us to process or ignore this data in some checks.
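For illustration, a hedged heuristic for the "automagic" part (not the deepchecks API; the function name and thresholds are arbitrary): treat object columns with a high unique-value ratio and long average string length as free-form text.

import pandas as pd

def infer_free_text_columns(df: pd.DataFrame,
                            min_unique_ratio: float = 0.8,
                            min_avg_length: int = 20) -> list:
    """Guess which object columns hold free-form text (illustrative heuristic only)."""
    text_columns = []
    for col in df.select_dtypes(include='object').columns:
        values = df[col].dropna().astype(str)
        if values.empty:
            continue
        unique_ratio = values.nunique() / len(values)
        avg_length = values.str.len().mean()
        if unique_ratio >= min_unique_ratio and avg_length >= min_avg_length:
            text_columns.append(col)
    return text_columns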

Load and Compare Datasets

Howdy!

I'm a maintainer of the DataProfiler library; we built DataProfiler as an improved replacement for pandas-profiling. Some additional features:

  • Auto-detect & load CSV, AVRO, Parquet, JSON, text, and URL data: data = Data("your_filepath_or_url.csv")
  • Profile data in a multi-threaded way with robust PII detection: profile = Profiler(data)
  • Merge profiles: profile3 = profile1 + profile2
  • Take differences between profiles: profile_diff = profile1.diff(profile2)
  • Generate reports: readable_report = profile.report(report_options={"output_format": "compact"})

Would love to connect and / or see how we could be useful to this project!

import json
from dataprofiler import Data, Profiler

data = Data("your_filepath_or_url.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

SimpleModelComparison add_condition_ratio_not_less_than doesn't work well on imbalanced data

add_condition_ratio_not_less_than() is defined by default on this check in the default suite. When data is highly imbalanced, it is possible, for example, that the simple model's recall is 0.96 and the user model's recall is 0.99. The two are really close as a ratio, but in the imbalanced case this can mean that the user model is actually pretty good.

The fix should be to suggest a different condition that can recognize that, or alternatively to detect that this may be the case and explain it to the user.
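One hedged alternative, sketched below with made-up names (not the deepchecks API): compare the relative reduction of the simple model's error instead of the raw score ratio, which separates 0.96-vs-0.99 much more clearly on imbalanced data.

def error_reduction_gain(user_score: float, simple_score: float) -> float:
    """Fraction of the simple model's remaining error that the user model removes.

    For simple=0.96 and user=0.99 the raw ratio is ~1.03, but the error reduction
    is (0.04 - 0.01) / 0.04 = 0.75, which better reflects the gain on imbalanced data.
    """
    simple_error = 1.0 - simple_score
    if simple_error == 0:
        return 0.0
    return (simple_error - (1.0 - user_score)) / simple_error

# e.g. a condition could require at least 30% error reduction over the simple model
assert error_reduction_gain(user_score=0.99, simple_score=0.96) >= 0.3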

[BUG] Model Identification in sklearn Pipeline should look at the last step

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipe = Pipeline(steps=[
    ('transform', transformers),
    ('handle_nans', SimpleImputer(strategy='most_frequent')),
    ('model', clf)
])

The Boosting Overfit check raises: DeepchecksValueError: Unsupported model of type: SimpleImputer

The Boosting Overfit check should have identified the model as the pipeline's last step (clf, which was CatBoost).
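A minimal sketch of how the check could resolve the underlying estimator, using the standard sklearn Pipeline API (the helper name is made up for illustration):

from sklearn.pipeline import Pipeline

def final_estimator(model):
    """Return the last step of a Pipeline, or the model itself otherwise."""
    if isinstance(model, Pipeline):
        return model.steps[-1][1]   # equivalent to model[-1] on recent sklearn versions
    return model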

Conda package doesn't enforce dependency versions

May cause bugs such as:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_34/1021525983.py in <module>
      1 from deepchecks.checks import WholeDatasetDrift
      2 
----> 3 WholeDatasetDrift().run(ds_train, ds_test)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/distribution/whole_dataset_drift.py in run(self, train_dataset, test_dataset, model)
    114         self._cat_features = cat_features
    115 
--> 116         domain_classifier = self._generate_model(list(set(features) - set(cat_features)), cat_features)
    117 
    118         sample_size = min(self.sample_size, train_dataset.n_samples, test_dataset.n_samples)

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/distribution/whole_dataset_drift.py in _generate_model(self, numerical_columns, categorical_columns)
    247         categorical_transformer = Pipeline(
    248             steps=[('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan,
--> 249                                               dtype=np.float64))]
    250         )
    251 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

TypeError: __init__() got an unexpected keyword argument 'handle_unknown'
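Until the conda recipe pins dependency versions, one hedged mitigation is a runtime guard that verifies minimum versions at import time. A minimal sketch; the package list and minimum versions here are placeholders, not the official requirements:

from importlib.metadata import version   # standard library on Python 3.8+
from packaging.version import Version

# Placeholder minimums for illustration only
MINIMUM_VERSIONS = {'scikit-learn': '0.24.0', 'pandas': '1.3.0'}

for package, minimum in MINIMUM_VERSIONS.items():
    installed = Version(version(package))
    if installed < Version(minimum):
        raise ImportError(f'{package}>={minimum} is required, found {installed}')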

Exception when running ModelInfo check with catboost

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_34/3556338172.py in <module>
      1 from deepchecks.checks import ModelInfo
      2 
----> 3 ModelInfo().run(model)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/overview/model_info.py in run(self, model)
     32             CheckResult: value is dictionary in format {type: <model_type>, params: <model_params_dict>}
     33         """
---> 34         return self._model_info(model)
     35 
     36     def _model_info(self, model: BaseEstimator):

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/overview/model_info.py in _model_info(self, model)
     42         # Create dataframe to show
     43         model_param_df = pd.DataFrame(model_params.items(), columns=['Parameter', 'Value'])
---> 44         model_param_df['Default'] = model_param_df['Parameter'].map(lambda x: default_params[x])
     45 
     46         def highlight_not_default(data):

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   4159         dtype: object
   4160         """
-> 4161         new_values = super()._map_values(arg, na_action=na_action)
   4162         return self._constructor(new_values, index=self.index).__finalize__(
   4163             self, method="map"

/opt/conda/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
    868 
    869         # mapper is a function
--> 870         new_values = map_f(values, mapper)
    871 
    872         return new_values

/opt/conda/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/overview/model_info.py in <lambda>(x)
     42         # Create dataframe to show
     43         model_param_df = pd.DataFrame(model_params.items(), columns=['Parameter', 'Value'])
---> 44         model_param_df['Default'] = model_param_df['Parameter'].map(lambda x: default_params[x])
     45 
     46         def highlight_not_default(data):

KeyError: 'iterations'
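The KeyError happens because catboost's get_params() returns parameters (such as iterations) that are absent from the default-params mapping the check builds. A hedged sketch of a tolerant lookup, with toy stand-in data (the values below are hypothetical, not real catboost output):

import pandas as pd

# Toy stand-ins for the objects in the traceback
model_params = {'iterations': 500, 'depth': 6}
default_params = {'depth': 6}   # 'iterations' is missing, as with catboost

model_param_df = pd.DataFrame(model_params.items(), columns=['Parameter', 'Value'])
# .get() avoids the KeyError that plain indexing raises for parameters without a recorded default
model_param_df['Default'] = model_param_df['Parameter'].map(
    lambda name: default_params.get(name, 'N/A')
)
print(model_param_df)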

[DEE-205] Dataset Info Check

Purpose:

As a data scientist I want a simple check that will give me an overview of the structure of my data.

Inputs:

A single Dataset

Output:

Tables or graphs containing:

  • General information about the Dataset
  • Metadata about special Dataset columns that are defined (label column name, index column, date column)
  • For each column

Requirements:

  • The Dataset Info check should run quickly
  • It should work quickly and not clog the display even for a large number of features (e.g. 200)
  • Output should be easily digestible

Check Category:

Overview

DEE-205
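For illustration only, a minimal pandas sketch of the kind of per-column overview this check could produce (not the deepchecks implementation):

import pandas as pd

def dataset_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Compact per-column summary that stays readable even for ~200 features."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'null_ratio': df.isna().mean().round(3),
        'unique': df.nunique(),
    })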

pandas.Styler.format use assumes pandas 1.3.0 while requirements include pandas>=1.1.5

In line 28 of deepchecks.base.display_pandas.dataframe_to_html() we call df_styler.format(precision=2).

However, the format() method of the pandas.io.formats.style.Styler class only got the precision kwarg in pandas==1.3.0. Before that, precision could only be set in the Styler constructor.

As the title says, this makes the code rely on pandas 1.3.0, while requirements.txt states pandas>=1.1.5.

One option is to update requirements.txt. Another is, perhaps, to somehow initialize a custom Styler object with precision=2 and then inject it into the input dataframe (I guess that's possible, but I don't know that it is).


By the way, to keep tabs on leaky minimum requirements one can write a custom test that sets up a minimum-requirements virtualenv and runs all tests in it, perhaps like so:

sed "s/[>\~]/=/g" requirements.txt > min_requirements.txt
pip install -r min_requirements.txt
pytest ...

Discovered when running

OverallGenericCheckSuite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf, check_datasets_policy='both')

in the Iris notebook in my local environment (Python 3.8.5 with pandas==1.2.5) and getting the following error and stack trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.pyenv/versions/3.8.5/envs/py3/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True
    920 

~/clones/deepchecks/deepchecks/base/suite.py in _ipython_display_(self)
     29 
     30     def _ipython_display_(self):
---> 31         display_suite_result_2(self.name, self.results)
     32 
     33 

~/clones/deepchecks/deepchecks/base/display_suite.py in display_suite_result_2(suite_name, results)
    129         conditions_table.sort_values(by=['sort'], inplace=True)
    130         conditions_table.drop('sort', axis=1, inplace=True)
--> 131         display_dataframe(conditions_table, hide_index=True)
    132     else:
    133         display_html('<p>No conditions defined on checks in the suite.</p>', raw=True)

~/clones/deepchecks/deepchecks/base/display_pandas.py in display_dataframe(df, hide_index)
     12         hide_index (bool): Whether to hide or not the dataframe index.
     13     """
---> 14     display_html(dataframe_to_html(df, hide_index), raw=True)
     15 
     16 

~/clones/deepchecks/deepchecks/base/display_pandas.py in dataframe_to_html(df, hide_index)
     26         df_styler = df.style
     27         df_styler.set_table_styles([dict(selector='table,thead,tbody,th,td', props=[('text-align', 'left')])])
---> 28         df_styler.format(precision=2)
     29         if hide_index:
     30             df_styler.hide_index()

TypeError: format() got an unexpected keyword argument 'precision'
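A hedged compatibility shim (a sketch, assuming the project keeps supporting pandas 1.1.x): on pandas older than 1.3 fall back to the earlier set_precision API instead of format(precision=...).

import pandas as pd
from packaging.version import Version

def set_styler_precision(df_styler, precision: int = 2):
    """Apply numeric precision on both old and new pandas (illustrative shim)."""
    if Version(pd.__version__) >= Version('1.3.0'):
        df_styler.format(precision=precision)     # precision kwarg added in pandas 1.3.0
    else:
        df_styler.set_precision(precision)        # older API, deprecated in 1.3
    return df_styler

dataframe_to_html() could then call such a helper instead of invoking df_styler.format(precision=2) directly.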

Exception when running DataDuplicates check

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_34/3771541214.py in <module>
      1 from deepchecks.checks import DataDuplicates
      2 
----> 3 DataDuplicates().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/data_duplicates.py in run(self, dataset, model)
     71             raise DeepchecksValueError('Dataset does not contain any data')
     72 
---> 73         group_unique_data = df[data_columns].groupby(data_columns, dropna=False).size()
     74         n_unique = len(group_unique_data)
     75 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in size(self)
   1834             result = result.rename("size").reset_index()
   1835 
-> 1836         return self._reindex_output(result, fill_value=0)
   1837 
   1838     @final

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   3163         levels_list = [ping.group_index for ping in groupings]
   3164         index, _ = MultiIndex.from_product(
-> 3165             levels_list, names=self.grouper.names
   3166         ).sortlevel()
   3167 

/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    618 
    619         # codes are all ndarrays, so cartesian_product is lossless
--> 620         codes = cartesian_product(codes)
    621         return cls(levels, codes, sortorder=sortorder, names=names)
    622 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in cartesian_product(X)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in <listcomp>(.0)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

<__array_function__ internals> in repeat(*args, **kwargs)

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    477 
    478     """
--> 479     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    480 
    481 

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

MemoryError: Unable to allocate 589. PiB for an array with shape (331491645779374080,) and data type int16
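The explosion comes from pandas reindexing the groupby result over the cartesian product of all group levels. For illustration, a hedged way to count duplicates that avoids the multi-column groupby entirely (a sketch, not the actual deepchecks fix):

import pandas as pd

def duplicate_stats(df: pd.DataFrame, columns=None):
    """Count exact duplicate rows without a multi-column groupby (illustrative)."""
    data = df[columns] if columns is not None else df
    dup_mask = data.duplicated(keep=False)        # True for every row that has a twin
    n_unique = len(data.drop_duplicates())
    return n_unique, int(dup_mask.sum())

# Example
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(duplicate_stats(df))  # (2, 2): 2 unique rows, 2 rows involved in duplication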

BUG: New Label Train Test in regression task

When running the full suite on a regression task, I encounter 2 different bugs:

  1. This check appears twice
  2. This check appears at all (in regression I don't expect it to be there)

(screenshot in the original issue)

Output Result of Full Suite In PDF / HTML

Hi,

This tool is excellent.
Is there a way to generate a report of the full suite (i.e. in PDF / PPT / any other format that can be presented outside a notebook and sent by mail to managers etc. after model training / research)?
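Until a built-in export exists, one hedged workaround is to render the notebook that ran the suite into a standalone HTML file with standard Jupyter tooling (nbconvert); the notebook filename below is a placeholder:

import nbformat
from nbconvert import HTMLExporter

# 'full_suite_report.ipynb' is a placeholder for the notebook that ran the suite
notebook = nbformat.read('full_suite_report.ipynb', as_version=4)
body, _resources = HTMLExporter().from_notebook_node(notebook)

with open('full_suite_report.html', 'w', encoding='utf-8') as f:
    f.write(body)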

Solve user guide docs dependencies

Update "dataset object" documentation to have the concepts, and link to API reference for the parameters.


From original discussion:
Yes, I agree that this isn't sustainable over time. It can be useful for now (assuming it's still updated).

I discussed this with @matanper, who is the one who wrote this specific page :) (I just appended it as part of the PR because the context was relevant and he was away). We agreed that we should shorten this description and add a link to the API reference. However, for now let's keep it as is and work on it as a separate small task (until we update it, it's still better to have this version than to have nothing).

Originally posted by @shir22 in #340 (comment)

[DEE-206] [feat] Allow Users to Create Checks On the Fly

Background

Since the checks in the package are pre-configured, a user may want a specific check that has not been created yet, or a check tailored to their personal needs. This might cause users to pass on the package altogether because it won't fully answer their needs.

Considerations

  • Creating a check at runtime should be as easy as possible, while still answering the needs of the user & the package (output-wise, interface-wise)
  • Checks that are created on the fly should declare whether they are data checks or model checks, as that changes some of the behaviour
  • Creating a check by the user needs to be fully documented with examples, as it may become a key factor in the user's decision to adopt the package

Proposal Concept

Allow users to define a check at runtime (or "on the fly"), whether in an ipynb or a py file.
That way, even if the package does not answer the user's whole needs, it might still answer enough while allowing the user to fully customize the specifics.

Thought on implementations

At most, all checks have the following:

  • a "check" which actually validates against the data/model
  • a "validation" which makes sure that the input data is correct
  • an "output" which depends on future implementations and changes, needs to correlate with the global package usage

As such, I feel the main option is to create a class (let's call it CustomCheck for this proposal).
CustomCheck should have a function that returns a check, which can later be injected into suites or run independently.

Our CustomCheck may look a bit like this:

...
## imports

# CustomDatasetBaseCheck should have ...
from mlchecks.base.check import CheckResult, CustomDatasetBaseCheck
...


class CustomCheck(CustomDatasetBaseCheck):
    # variable to hold the "requires" field
    # variable to hold the "checkFunction" field
    # variable to know if we need to parse the output or not
    # basic usage functions

    @classmethod
    def new(cls, *, checkFunction, requires):
        # Returns an initialized CustomCheck based on the function requirements.
        # The "requires" param should change the class behaviour in terms of validation and output,
        # as they differ by the input requirements.
        # A param could control whether we parse the output of the check or the user does it
        # (if it's a plot or something complex).
        ...

    def run(self, dataset=None, additional_dataset=None, model=None) -> CheckResult:
        # Based on the "requires" param, call the appropriate helper in mlchecks.base.check to validate
        output = self.checkFunction(dataset=dataset, additional_dataset=additional_dataset, model=model)
        # If needed:
        # based on the "requires" param, call the appropriate helper in mlchecks.base.check to "output" the data
        ...

The vision of implementation by the user

As a basic check, let's assume a user wants to check the row count:

...
### imports
from typing import Union

import pandas as pd

from mlchecks import Dataset  # assumed import path
from mlchecks.checks import CustomCheck
...

def checkRowCount(dataset: Union[pd.DataFrame, Dataset]):
    return len(dataset.data)

rowCountCheck = CustomCheck.new(checkFunction=checkRowCount,
                                requires=CustomCheck.SINGLE_DATASET,  # TWO_DATASETS / DATASET_MODEL / MODEL, etc.
                                )  # in the future we may add "output_format=html/json/yaml/cli"


data = {'col1': ['foo', 'bar', 'cat']}
dataframe = pd.DataFrame(data=data)

rowCountCheck.run(dataframe)

Changes that will be required

  • there should be global validation functions for the input params (these should also be used by the internal checks)
  • there should be a global output function for the params (this should also be used by the internal checks)
  • the two points above will force changes to all the internal checks

EDIT:

We've thought about allowing the user to "choose" whether we should parse the output of the check (if it's something simple like a dataset, etc.) or whether they want to add this logic to the check itself (which would make it easier for them to copy-paste their existing code almost 1:1).

DEE-206

[DEE-204] Suggestion to increase `Checks` flexibility

I think that the TrainValidationBaseCheck check type and the way CheckSuite is implemented make the entire library less flexible.

From my perspective, it would be better to add an additional parameter to the SingleDatasetBaseCheck and CompareDatasetsBaseCheck types which determines to which datasets any particular check should be applied. Take a look at the example below to understand what I mean.

import abc
import typing as t

# "base", "BaseCheck", "CheckResult" and "CheckSuite" below refer to existing deepchecks internals

SingleDatasetCheckPolicy = t.Callable[[base.Dataset], bool]
TwoDatasetsCheckPolicy = t.Callable[[base.Dataset, base.Dataset], bool]

class SingleDatasetCheck(BaseCheck):

    def __init__(self, policies: t.Optional[t.Sequence[SingleDatasetCheckPolicy]] = None):
        self.policies = policies
    
    @abc.abstractmethod
    def run(self, dataset: base.Dataset, model: object = None) -> CheckResult:
        """Define run signature."""
        raise NotImplementedError()

class TwoDatasetsCheck(BaseCheck):

    def __init__(self, policies: t.Optional[t.Sequence[TwoDatasetsCheckPolicy]] = None):
        self.policies = policies
    
    @abc.abstractmethod
    def run(self, first: base.Dataset, second: base.Dataset, model: object = None) -> CheckResult:
        """Define run signature."""
        raise NotImplementedError()

# built-in checks

class StringMismatchComparison(TwoDatasetsCheck):
    def __init__(
        self, 
        policies: t.Optional[t.Sequence[TwoDatasetsCheckPolicy]] = None,
        columns: t.Union[str, t.Iterable[str]] = None, 
        ignore_columns: t.Union[str, t.Iterable[str]] = None
    ):
        super().__init__()
        self.policies = policies or [lambda *args, **kwargs: True]
        ...

class CategoryMismatchTrainValidation(TwoDatasetsCheck):
    def __init__(
        self, 
        policies: t.Optional[t.Sequence[TwoDatasetsCheckPolicy]] = None,
        columns: t.Union[str, t.Iterable[str]] = None, 
        ignore_columns: t.Union[str, t.Iterable[str]] = None
    ):
        super().__init__()
        self.policies = policies or [lambda first_dataset, second_dataset: first_dataset.role == "train"]
        ...

# how run method of the CheckSuite will look

single_dataset_checks = filter(lambda: ..., self.checks)
two_datasets_checks = filter(lambda: ..., self.checks)

for dataset in (train_dataset, validation_dataset):
    for check in single_dataset_checks:
        if all(p(dataset) for p in check.policies):
            check_result = check.run(dataset, model)

for check in two_datasets_checks:
    if all(p(train_dataset, validation_dataset) for p in check.policies):
        check_result = check.run(train_dataset, validation_dataset, model)
    elif all(p(validation_dataset, train_dataset) for p in check.policies):
        check_result = check.run(validation_dataset, train_dataset, model)

...

# built-in suites

ComparativeIntegrityCheckSuite = CheckSuite(
    'Comparative Integrity Suite',
    StringMismatchComparison(),
    CategoryMismatchTrainValidation()
)

# built-in checks policies/policies builders

def dataset_role(role: str) -> SingleDatasetCheckPolicy:
    def policy(dataset: base.Dataset) -> bool:
        return dataset.role == role
    return policy

def datasets_roles(first_dataset: str, second_dataset: str) -> TwoDatasetsCheckPolicy:
    def policy(first: base.Dataset, second: base.Dataset) -> bool:
        return (first_dataset, second_dataset) == (first.role, second.role)
    return policy

def has_timestamp(dataset: base.Dataset) -> bool:
    return dataset.has_timestamp_column()

# custom user-defined checks

class MyCustomCheck(SingleDatasetCheck):
    ...

class MySecondCustomCheck(TwoDatasetsCheck):
    ...

# instantiation of suite

CheckSuite(
    "<suit name>",
    # NOTE: policies will be used by the Suite to determine if it should apply check to the dataset
    MyCustomCheck([dataset_role("train"), has_timestamp]),
    MySecondCustomCheck([datasets_roles("validation", "train")]),
)

DEE-204

Regression error distribution is missing the prediction column

When I look at the results of the regression error distribution check, I can see the features, the target, and the target-prediction difference. It would be really helpful to also see the prediction itself, since it is hard to understand the direction of the error from the wording "Largest over estimated error". Also, users often skip reading long texts.

(screenshot in the original issue)

Deleting all the conditions of a check

Is there a function such as suite[0].deleteAllConds() or something like that, or do I have to manually iterate over the conditions and delete all of them?
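Hedged answer, assuming a deepchecks version whose checks expose a clean_conditions() method and whose suites expose a checks mapping (verify both against your installed version; the names below are assumptions):

# `suite` is whichever deepchecks suite object you already built
for check in suite.checks.values():   # assumed: suite.checks maps index -> check instance
    check.clean_conditions()          # assumed: removes every condition from the check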

`utils.features._built_in_importance` misuses `model.coef_`

In some sklearn linear models, the coef_ attribute of a fitted estimator might be either a 1d array or a 2d one (e.g. in the case of LinearRegression it depends on the number of targets passed). In others, it is guaranteed to always be 2d, like for LogisticRegression:

coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
 Coefficient of the features in the decision function.

 coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

The utils.features._built_in_importance method, used by several checks (including at the very least TrainTestFeatureDrift, DominantFrequencyChange and StringMismatchComparison), misuses this attribute by assuming it is always 1d in the case of linear models. See the line after the elif here:

def _built_in_importance(model: t.Any, dataset: 'base.Dataset') -> t.Optional[pd.Series]:
    """Get feature importance member if present in model."""
    if 'feature_importances_' in dir(model):  # Ensembles
        normalized_feature_importance_values = model.feature_importances_/model.feature_importances_.sum()
        return pd.Series(normalized_feature_importance_values, index=dataset.features)
    elif 'coef_' in dir(model):  # Linear models
        coef = np.abs(model.coef_)
        coef = coef / coef.sum()
        return pd.Series(coef, index=dataset.features)
    else:
        return

In case of a 2d array, this should use model.coef_[0], of course.
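A hedged version of that branch which tolerates both shapes (a sketch, not the actual patch; for multiclass models one still has to decide how to aggregate the per-class rows):

import numpy as np
import pandas as pd

def linear_model_importance(model, feature_names) -> pd.Series:
    """Normalize |coef_| into per-feature importance, tolerating 1d or 2d shapes."""
    coef = np.abs(np.asarray(model.coef_))
    if coef.ndim == 2:
        coef = coef.mean(axis=0)   # collapses class rows; coef[0] also works for the binary case
    coef = coef / coef.sum()
    return pd.Series(coef, index=list(feature_names))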

Minimal example to reproduce:

import pandas as pd
import deepchecks
from sklearn.linear_model import LogisticRegression
from deepchecks.utils.features import _built_in_importance

train_df = pd.DataFrame([[23, True], [19, False], [15, False], [5, True]], columns=['age', 'smoking'], index=[0, 1, 2, 3])
train_y = pd.Series([1, 1, 0, 0])
test_df = pd.DataFrame([[21, True], [40, False], [12, False], [50, True]], columns=['age', 'smoking'], index=[0, 1, 2, 3])
test_y = pd.Series([1, 0, 1, 0])

logreg = LogisticRegression()
logreg.fit(train_df, train_y)

ds_train = deepchecks.Dataset(df=train_df, label=train_y)
ds_test = deepchecks.Dataset(df=test_df, label=test_y)

_built_in_importance(logreg, ds_train)

This will yield:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/2q/pxh7dn6s1nq4fb8hw_x4lqd80000gn/T/ipykernel_42799/3322691940.py in <module>
----> 1 _built_in_importance(logreg, ds_train)

~/clones/deepchecks/deepchecks/utils/features.py in _built_in_importance(model, dataset)
    113         coef = np.abs(model.coef_)
    114         coef = coef / coef.sum()
--> 115         return pd.Series(coef, index=dataset.features)
    116     else:
    117         return

~/.pyenv/versions/3.8.5/envs/py3/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    428                 index = ibase.default_index(len(data))
    429             elif is_list_like(data):
--> 430                 com.require_length_match(data, index)
    431 
    432             # create/copy the manager

~/.pyenv/versions/3.8.5/envs/py3/lib/python3.8/site-packages/pandas/core/common.py in require_length_match(data, index)
    529     """
    530     if len(data) != len(index):
--> 531         raise ValueError(
    532             "Length of values "
    533             f"({len(data)}) "

ValueError: Length of values (1) does not match length of index (2)

Exception when running StringLengthOutOfBounds check

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_34/59083449.py in <module>
      1 from deepchecks.checks import StringLengthOutOfBounds
      2 
----> 3 StringLengthOutOfBounds().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/string_length_out_of_bounds.py in run(self, dataset, model)
     75         """
     76         feature_importances = calculate_feature_importance_or_none(model, dataset)
---> 77         return self._string_length_out_of_bounds(dataset, feature_importances)
     78 
     79     def _string_length_out_of_bounds(self, dataset: Union[pd.DataFrame, Dataset],

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/string_length_out_of_bounds.py in _string_length_out_of_bounds(self, dataset, feature_importances)
     92                 continue
     93 
---> 94             string_length_column = column.map(lambda x: len(str(x)), na_action='ignore')
     95 
     96             # If not a lot of unique values, calculate the percentiles for existing values.

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in map(self, arg, na_action)
   4159         dtype: object
   4160         """
-> 4161         new_values = super()._map_values(arg, na_action=na_action)
   4162         return self._constructor(new_values, index=self.index).__finalize__(
   4163             self, method="map"

/opt/conda/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
    850             values = self._values
    851             if na_action is not None:
--> 852                 raise NotImplementedError
    853             map_f = lambda values, f: values.map(f)
    854         else:

NotImplementedError: 
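The failure is pandas refusing na_action='ignore' for that column's dtype. A hedged workaround sketch that computes string lengths without map/na_action (not necessarily the deepchecks fix):

import pandas as pd

def string_lengths(column: pd.Series) -> pd.Series:
    """Length of each non-null value as a string, skipping NaNs without na_action."""
    non_null = column.dropna()
    return non_null.astype(str).str.len()

# Example: works for object and categorical columns alike
col = pd.Series(['abc', None, 'de'], dtype='category')
print(string_lengths(col).tolist())  # [3, 2]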

[DEE-209] Add an option to fix possible string mismatches

When I see the output of the string mismatch check, it really helps me understand that I have some integrity issues in my data.

I think it would be really useful to have a function that can fix the mismatches for me, according to the matching algorithm that already exists in the check.
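For illustration only, a hedged sketch of what such a fixer could do, using a simple lowercase/alphanumeric base form rather than the check's actual matching algorithm: replace every variant with the most common original spelling of its base form.

import re
import pandas as pd

def unify_string_variants(column: pd.Series) -> pd.Series:
    """Map variants that share a base form to the most frequent spelling (illustrative)."""
    base = column.astype(str).map(lambda v: re.sub(r'[^a-z0-9]', '', v.lower()))
    # most frequent original spelling per base form
    canonical = (
        pd.DataFrame({'base': base, 'value': column.astype(str)})
        .groupby('base')['value']
        .agg(lambda s: s.value_counts().idxmax())
    )
    return base.map(canonical)

col = pd.Series(['Deep', 'deep', 'DEEP!', 'other'])
print(unify_string_variants(col).tolist())  # e.g. ['Deep', 'Deep', 'Deep', 'other']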

DEE-209

Exception when running LabelAmbiguity check

To reproduce:
https://www.kaggle.com/itay94/notebookf8c78e84d7

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_34/2864323761.py in <module>
      1 from deepchecks.checks import LabelAmbiguity
      2 
----> 3 LabelAmbiguity().run(ds_train)

/opt/conda/lib/python3.7/site-packages/deepchecks/base/check.py in wrapped(*args, **kwargs)
    275     @wraps(func)
    276     def wrapped(*args, **kwargs):
--> 277         result = func(*args, **kwargs)
    278         if not isinstance(result, CheckResult):
    279             raise DeepchecksValueError(f'Check {class_instance.name()} expected to return CheckResult bot got: '

/opt/conda/lib/python3.7/site-packages/deepchecks/checks/integrity/label_ambiguity.py in run(self, dataset, model)
     64 
     65         group_unique_data = dataset.data.groupby(dataset.features, dropna=False)
---> 66         group_unique_labels = group_unique_data.nunique()[label_col]
     67 
     68         num_ambiguous = 0

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in nunique(self, dropna)
   1803         obj = self._obj_with_exclusions
   1804         results = self._apply_to_column_groupbys(
-> 1805             lambda sgb: sgb.nunique(dropna), obj=obj
   1806         )
   1807 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _apply_to_column_groupbys(self, func, obj)
   1709         columns = obj.columns
   1710         results = [
-> 1711             func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
   1712         ]
   1713 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in <listcomp>(.0)
   1709         columns = obj.columns
   1710         results = [
-> 1711             func(col_groupby) for _, col_groupby in self._iterate_column_groupbys(obj)
   1712         ]
   1713 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in <lambda>(sgb)
   1803         obj = self._obj_with_exclusions
   1804         results = self._apply_to_column_groupbys(
-> 1805             lambda sgb: sgb.nunique(dropna), obj=obj
   1806         )
   1807 

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/generic.py in nunique(self, dropna)
    671 
    672         result = self.obj._constructor(res, index=ri, name=self.obj.name)
--> 673         return self._reindex_output(result, fill_value=0)
    674 
    675     @doc(Series.describe)

/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   3163         levels_list = [ping.group_index for ping in groupings]
   3164         index, _ = MultiIndex.from_product(
-> 3165             levels_list, names=self.grouper.names
   3166         ).sortlevel()
   3167 

/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    618 
    619         # codes are all ndarrays, so cartesian_product is lossless
--> 620         codes = cartesian_product(codes)
    621         return cls(levels, codes, sortorder=sortorder, names=names)
    622 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in cartesian_product(X)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/util.py in <listcomp>(.0)
     52         b = np.zeros_like(cumprodX)
     53 
---> 54     return [tile_compat(np.repeat(x, b[i]), np.product(a[i])) for i, x in enumerate(X)]
     55 
     56 

<__array_function__ internals> in repeat(*args, **kwargs)

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
    477 
    478     """
--> 479     return _wrapfunc(a, 'repeat', repeats, axis=axis)
    480 
    481 

/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56 
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:
     60         # A TypeError occurs if the object does have such a method in its

MemoryError: Unable to allocate 45.3 PiB for an array with shape (25499357367644160,) and data type int16

feature importance crashes if not supplied with a standard sklearn model

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator my_dumb_model() does not.

Permutation importance assumes the model has fit and score attributes. We need to check whether it has these attributes, and if it does not have a score function, use a default scorer based on the assumed task type.
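A hedged sketch of that guard around sklearn's permutation_importance, passing an explicit scorer when the model lacks a score method (the helper name and per-task defaults are assumptions):

from sklearn.inspection import permutation_importance

def safe_permutation_importance(model, X, y, task_type: str = 'classification'):
    """Fall back to a default scorer when the model has no score() method (sketch)."""
    if hasattr(model, 'score'):
        scoring = None   # let sklearn use model.score
    else:
        scoring = 'accuracy' if task_type == 'classification' else 'neg_mean_squared_error'
    return permutation_importance(model, X, y, scoring=scoring, n_repeats=5, random_state=0)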
