
boruta-shap's Introduction

(DOI and PyPI version badges)

Boruta-Shap

BorutaShap is a wrapper feature selection method that combines the Boruta feature selection algorithm with Shapley values. This combination has proven to outperform the original permutation importance method in both speed and the quality of the feature subset produced. Not only does this algorithm provide a better subset of features, it can also simultaneously provide the most accurate and consistent global feature rankings, which can be used for model inference too. Unlike the original R package, which limits the user to a Random Forest model, BorutaShap allows the user to choose any tree-based learner as the base model in the feature selection process.

Despite BorutaShap's runtime improvements, the SHAP TreeExplainer scales linearly with the number of observations, making its use cumbersome for large datasets. To combat this, BorutaShap includes a sampling procedure which uses the smallest possible subsample of the available data at each iteration of the algorithm. It finds this sample by comparing the isolation-forest score distributions of the sample and the full data using a KS-test. In experiments, this procedure reduced the run time by up to 80% while still producing a valid approximation of the entire data set. Even with these improvements the user might still want a faster solution, so BorutaShap includes an option to use the mean decrease in Gini impurity instead. This importance measure is independent of the dataset's size, as it uses the tree's structure to compute a global feature ranking, making it much faster than SHAP on larger datasets. Although this metric returns somewhat comparable feature subsets, it is not a reliable measure of global feature importance despite its widespread use. Thus, I would recommend using the SHAP metric whenever possible.
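A minimal sketch of that sampling idea (illustrative only, not the package's exact implementation; the function name and candidate fractions are made up):

import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import IsolationForest

def smallest_representative_sample(X: pd.DataFrame, fractions=(0.05, 0.1, 0.2, 0.35, 0.5)):
    # score the full data once with an isolation forest
    iforest = IsolationForest(random_state=0).fit(X)
    scores_full = iforest.decision_function(X)
    for frac in fractions:
        sample = X.sample(frac=frac, random_state=0)
        # a KS-test p-value above 0.05 gives no evidence the score distributions differ
        if ks_2samp(scores_full, iforest.decision_function(sample)).pvalue > 0.05:
            return sample
    return X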

Algorithm

  1. Start by creating new copies of all the features in the data set and name them shadow + feature_name, then shuffle these newly added features to remove their correlations with the response variable.

  2. Run a classifier on the extended data with the random shadow features included. Then rank the features using a feature importance metric (the original algorithm used permutation importance as its metric of choice).

  3. Create a threshold using the maximum importance score from the shadow features. Then assign a hit to any feature that exceeded this threshold.

  4. For every unassigned feature, perform a two-sided T-test of equality.

  5. Attributes which have an importance significantly lower than the threshold are deemed 'unimportant' and are removed from the process. Attributes which have an importance significantly higher than the threshold are deemed 'important'.

  6. Remove all shadow attributes and repeat the procedure until an importance has been assigned to each feature, or the algorithm has reached the previously set limit of runs.

If the algorithm has reached its set limit of runs and an importance has not been assigned to each feature, the user has two choices: either increase the number of runs, or use the tentative rough fix function, which compares the median importance values of the unassigned features against the maximum shadow feature to make the decision.
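As a compressed sketch of steps 1-3 (illustrative Python only; the hit-counting statistics of steps 4-6 are omitted, and a Random Forest stands in for the chosen tree-based learner):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def boruta_iteration(X: pd.DataFrame, y, hits: pd.Series) -> pd.Series:
    # 1. copy every feature and shuffle the copies to break their link with y
    shadow = X.apply(np.random.permutation)
    shadow.columns = ['shadow_' + c for c in X.columns]
    X_boruta = pd.concat([X, shadow], axis=1)

    # 2. fit the learner on the extended data and rank all features
    model = RandomForestRegressor(n_estimators=100).fit(X_boruta, y)
    importance = pd.Series(model.feature_importances_, index=X_boruta.columns)

    # 3. threshold = best shadow importance; real features above it score a hit
    threshold = importance[shadow.columns].max()
    return hits + (importance[X.columns] > threshold).astype(int)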

Installation

Use the package manager pip to install BorutaShap.

pip install BorutaShap

Usage

For more use cases such as alternative models, sampling or changing the importance metric please view the notebooks here.

Using Shap and Basic Random Forest

from BorutaShap import BorutaShap, load_data
  
X, y = load_data(data_type='regression')
X.head()

# No model selected, so the default Random Forest is used.
# classification=False marks this as a regression problem.
Feature_Selector = BorutaShap(importance_measure='shap',
                              classification=False)

'''
sample: Boolean
    if true then a row-wise sample of the data will be used to calculate the feature importance values

sample_fraction: float
    the fraction of the original data used in calculating the feature importance values, only
    used if sample==True

train_or_test: string
    decides whether the feature importance should be calculated on out-of-sample data; see the discussion here:
    https://slds-lmu.github.io/iml_methods_limitations/pfi-data.html

normalize: boolean
    if true the importance values will be normalized using the z-score formula

verbose: Boolean
    a flag indicator to print out all the rejected or accepted features
'''
Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
                     train_or_test='test', normalize=True,
                     verbose=True)

# Returns Boxplot of features
Feature_Selector.plot(which_features='all')

# Returns a subset of the original data with the selected features
subset = Feature_Selector.Subset()

Using BorutaShap with another model XGBoost

from BorutaShap import BorutaShap, load_data
from xgboost import XGBClassifier

X, y = load_data(data_type='classification')
X.head()

model = XGBClassifier()

# classification=True marks this as a classification problem (False would mean regression)
Feature_Selector = BorutaShap(model=model,
                              importance_measure='shap',
                              classification=True)

Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
                     train_or_test='test', normalize=True,
                     verbose=True)

# Returns Boxplot of features
Feature_Selector.plot(which_features='all')

# Returns a subset of the original data with the selected features
subset = Feature_Selector.Subset()

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

If you wish to cite this work, please click on the Zenodo badge at the top of this README file.

MIT

boruta-shap's People

Contributors

adalseno, ag-tcm, dependabot[bot], ekeany, erikvdp, golfbrother, gzuin, ianword, jckkvs, jgreene, kmedved, lgmoneda, mauritsdescamps, ploriaux, rorybyrne, sboshin, sungreong


boruta-shap's Issues

[ENH] Feature_Selector.plot() with or without shadow features

Current Situation

Feature_Selector.plot(..., which_features='all') shows shadow features.

Enhancement

Feature_Selector.plot(..., which_features='all', shadow=True/False)

Reasoning

Shadow features expand the Y-axis limits (Z-score), which makes it hard to see the rest of the features clearly.


[ENH] Support for pairwise learning-to-rank models

Current Situation

The Train_model function does not support training pairwise learning-to-rank models, as these models require a grouping of observations to be passed to their model.fit function.

An example of model.fit function for pairwise learning can be found here

Enhancement

Improve fit function by adding another function argument for grouping of observations.

Use this grouping of data in Train_model function and also in create_shadow_feature function while shuffling the columns.

Also, as mentioned in issue-57, the grouping of observations needs to be taken into account for the train-test split of the data.

Reasoning

This will add support for all the pairwise learning-to-rank models.

Implementation

  • Add function argument for a list/np.array of groups in the BorutaShap.fit function.
  • Use this array of groups in Train_model function while calling model.fit.
  • Add capability to create_shadow_feature function to shuffle data inside these pre-defined groups.
  • Add capability to split train-test data based on grouping of observations.

I'll be happy to share a PR in some time given the go-ahead; a sketch of the group-aware shuffle idea is below.
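A sketch of that shuffle (illustrative; groups is a hypothetical per-row label array, not an existing BorutaShap argument):

import numpy as np
import pandas as pd

def create_grouped_shadow(X: pd.DataFrame, groups) -> pd.DataFrame:
    # permute each column within its group so between-group structure survives
    shadow = X.groupby(pd.Series(groups, index=X.index)).transform(np.random.permutation)
    shadow.columns = ['shadow_' + c for c in X.columns]
    return shadow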

[ENH] Create a support_ attribute to avoid error in generating a dataframe report or to select in a sklearn way

Current Situation

Currently, the selector has multiple column attributes.
It is also not easy to apply the selection of features through a "support_"-style attribute similar to the sklearn framework.
It is also not clear, without reading the code, whether the order of hits matches the order of columns of the dataframe provided as an argument.

Enhancement

  • Create a "support_" True/False mask like the "support_" attribute of 'sklearn.feature_selection' objects (a minimal sketch follows this list).
  • Provide an example, or add to the documentation, the means to create a "report" dataframe of hits and accepted features, in the style of the f_classif test: ([list of hit values], [True/False list of accepted features]), the whole ordered like the X provided as an argument.
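A minimal sketch of such a mask, assuming a fitted selector (BorutaShap exposes accepted/rejected/tentative name lists after fit) and the original X:

import numpy as np

# boolean mask over X.columns, True where the feature was accepted
support_ = np.isin(X.columns, Feature_Selector.accepted)
X_selected = X.loc[:, support_]   # sklearn-style selection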

Can I use permutation importance, and another base as test base in ranking phase?

Question 1)
Can I use permutation importance as algorithm for rank importances?

Question 2)
Can I use a fully out-of-sample base as the test base in the evaluating phase of ranking importance (at each step)? I checked that there is a parameter on fit() called "train_or_test", but this is hardcoded to call the default "train_test_split()" from sklearn with 30% for test (using the same base). I would like to use another base for the test, since my train base has features generated from the whole train base.

[BUG] `check_missing_values` warning not done correctly

Describe the bug

I'm using a LightGBM model, and was surprised to see BorutaShap complains about missing values in my data.

To Reproduce

  1. Make classification dataset with missing values
  2. Create LGBMClassifier
  3. Run fit()

Expected behavior

The warning as defined in check_missing_values should be triggered instead of the error message, as this is an LGBM model:
print('Warning there are missing values in your data !')

Fix

This: if model_name.startswith(models_to_check): (line 184) won't work, as the name of an initialised LGBM model (currently computed with model_name = str(type(self.model)).lower()) is <class 'lightgbm.sklearn.lgbmclassifier'>.
That clearly doesn't start with lgbm.

Instead, one could do, for example: if any([x in model_name for x in models_to_check]): to fix this issue.

[FEATURE]

Hi,

thank you very much for this great enhancement of the Boruta selection in Python!
I am working with thusands of features and would like to have some kind ob verbose option. I just comment out the prints, but that should not really be the way to go.

Description

Option to prevent printing all selected and not-selected features.

Reasoning

With thousands of features the output gets pretty messy.

Implementation

An argument to disable:

        print(str(len(self.accepted))  + ' attributes confirmed important: ' + str(self.accepted))
        print(str(len(self.rejected))  + ' attributes confirmed unimportant: ' + str(self.rejected))
        print(str(len(self.tentative)) + ' tentative attributes remains: ' + str(self.tentative))

How best to interpret the borutashap plots?

(feature importance plot attached)

Within my selected features output by BorutaShap, I have 4 features that are selected, but according to the plot (shown above) these 4 are less important than the maximum shadow feature. How do I interpret why these features were selected? Originally I thought that the maximum shadow feature serves as the cut-off for selecting important features.

[ENH] Reference for isolation forest sampling

Current Situation

First of all, thanks for sharing this library!

By reading the library doc and code, I've got interested in the sampling strategy to reduce the time:

It finds this sample by comparing the distributions produced by an isolation forest of the sample and the data using ks-test. From experiments, this procedure can reduce the run time up to 80% while still creating a valid approximation of the entire data set.

Is there any reference on how this sampling strategy is better than random sampling for estimating the average SHAP importance of the features?

I'm thinking about extracting this piece of logic to use it directly when computing SHAP values on a sample, to reduce run time.

Enhancement

Add references, if they exist, for the isolation forest sampling.

Reasoning

Make docs more complete.

Implementation

Add a link to the readme.

Speeding up of the shap calculations[ENH]

Current Situation

As it stands, BorutaShap works perfectly on smaller datasets with importance_measure = "shap". You would have to resort to gini if the dataset was even slightly on the larger side.

Enhancement

Simply adding an approximate = True flag in each of the lines where explainer.shap_values() is called speeds up the process dramatically. This could be added as an option. During my hacky experimentation, it reduced the time of processing the shap values by almost 50x.
This solution was also used by the creators of the shap library to make it more usable.

Reasoning

The library will be usable on much larger datasets.

Implementation

As mentioned before, add the approximate = True flag to the shap_values calls.
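For illustration, the call the suggestion amounts to (shap's TreeExplainer.shap_values does accept an approximate flag; the model and X_boruta names here are placeholders):

import shap

explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
shap_values = explainer.shap_values(X_boruta, approximate=True)  # much faster on large data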


Sample Weight Support for Regression Problems [ENH]

Current Situation

Most scikit-learn estimators allow for passing sample weights along with the X and y values as an optional parameter. These weights are often key for regression problems. Currently it seems like Boruta-Shap does not support this (unless I'm missing something).

Enhancement

Add support for sample_weights.

Implementation

Given most scikit-learn compatible regression estimators already allow this, including RF and Xgboost, I think this should be possible to add just by passing a sample_weight parameter to the .fit() call of the relevant estimator. I may be missing some complexity here however. I appreciate the help.
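A sketch of what the pass-through could look like (the sample_weight argument on the estimator's fit is real sklearn/XGBoost API; the wrapper function itself is hypothetical):

def fit_with_weights(model, X, y, sample_weight=None):
    # RF, XGBoost, LightGBM, etc. all accept sample_weight in fit
    if sample_weight is not None:
        model.fit(X, y, sample_weight=sample_weight)
    else:
        model.fit(X, y)
    return model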

[ENH] Add target permutation

An alternative approach is to permute the target instead of permuting features, so you don't have to add shadow features. Although this is different from the Boruta approach, it could be an interesting option.

See:

According to the paper

To preserve the relations between features, we use permutations of the outcome.
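A sketch of the target-permutation (null importance) idea described above (illustrative; the 95th-percentile threshold is an arbitrary choice, and a Random Forest stands in for the model):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def null_importance_threshold(X, y, n_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_rounds):
        y_perm = rng.permutation(y)  # break the X -> y relation, keep X intact
        model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y_perm)
        null.append(model.feature_importances_)
    # features whose real importance exceeds this per-feature 95th percentile
    # of the null distribution would be kept
    return np.percentile(null, 95, axis=0)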

[BUG]TypeError: __init__() got an unexpected keyword argument 'approximate'

TypeError Traceback (most recent call last)
in ()
2 classification=True)
3
----> 4 Feature_Selector.fit(X=X, y=y, n_trials=100, random_state=0)

/opt/conda/lib/python3.6/site-packages/BorutaShap.py in fit(self, X, y, n_trials, random_state, sample)
268 self.model.fit(self.X_boruta, self.y)
269
--> 270 self.X_feature_import, self.Shadow_feature_import = self.feature_importance()
271 self.update_importance_history()
272 self.hits += self.calculate_hits()

/opt/conda/lib/python3.6/site-packages/BorutaShap.py in feature_importance(self)
495 if self.importance_measure == 'shap':
496
--> 497 self.explain()
498 vals = np.abs(self.shap_values).mean(0)
499 vals = self.calculate_Zscore(vals)

/opt/conda/lib/python3.6/site-packages/BorutaShap.py in explain(self)
579
580
--> 581 explainer = shap.TreeExplainer(self.model, approximate=True, feature_perturbation = "tree_path_dependent")
582
583

TypeError: __init__() got an unexpected keyword argument 'approximate'

[BUG] Exception - 'Series' object has no attribute 'select_dtypes'

Describe the bug

When BorutaShap removes all features because none of them are helpful, it throws an error due to pandas apply returning a Series object rather than a DataFrame object. This occurs here:

obj_col = self.X_shadow.select_dtypes("object").columns.tolist()

To Reproduce

Steps to reproduce the behavior:

import numpy as np
import pandas as pd
from BorutaShap import BorutaShap

np.random.seed(4)

shape=(100, 4)
df = pd.DataFrame(np.random.randint(0, 100, size=shape), columns=list('ABCD'))
df['E'] = np.random.randint(0, 1, size=(100, 1))

target = 'E'
y_train = df[target]
X_train = df.drop(target, axis=1)

featurizer = BorutaShap()
featurizer.fit(X_train, y_train)

Expected behavior

BorutaShap shouldn't throw an error.

Additional context

Pandas' DataFrame.apply method returns a Series rather than a DataFrame when there are no columns in the DataFrame. This seems like a poor design decision on pandas' side, but it is impacting BorutaShap.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
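A minimal guard for this failure mode might look like the following (illustrative, not the package's actual fix):

import numpy as np
import pandas as pd

X_shadow = X.apply(np.random.permutation)
if not isinstance(X_shadow, pd.DataFrame):
    # apply returned a Series because X had no columns left
    X_shadow = pd.DataFrame(columns=X.columns, index=X.index)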

Configure CI/CD, issues, and Dependabot

Description

There's some configuration files we can put into .github/ to configure Github Actions, default issue templates, and Dependabot.

Reasoning

It's good to configure Github Actions so that tests run automatically every time a commit is pushed or a PR is opened.

It's good to configure Dependabot to run monthly because it'll submit new PRs every fucking day otherwise and make you want to jump out the window.

It's handy to configure issue templates so that issues automatically look pretty - like this one!

Implementation

It's just a .github/ directory in the root of the repository.

Tasks

  • Create the .github/ dir
  • Add the dependabot configuration
  • Add the default issue templates
  • Add some workflows for Github Actions

[ENH] HDBSCAN for downsampling

Hello. Thank you for this library!

I found your idea of using an Isolation Forest for downsampling the observations passed to SHAP very interesting. I'm wondering if you also tried HDBSCAN clustering. It is used, for example, as the default downsampling method in the interpret_community library, and it has the advantage of automatically selecting the optimal number of clusters (i.e., the optimal number of samples to better represent the dataset). There is a Python implementation of HDBSCAN in this package.

Regards,
Bruno

Noob issue: Having trouble loading up my own dataset

Sorry for the noob question, but I'm trying to load up my own dataset and it's not working...

This is how I am trying to do it

from BorutaShap import BorutaShap, load_data

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd




# Importing the dataset
dataset = pd.read_csv('D:/test/AAT.csv')
dataset = pd.get_dummies(dataset) 

# Saving feature names for later use
dataset_list = list(dataset.columns)

X = dataset.iloc[:, 1:].values 
y = dataset.iloc[:, 0].values 


# no model selected default is Random Forest, if classification is True it is a Classification problem
Feature_Selector = BorutaShap(importance_measure='shap',
                              classification=False)

Feature_Selector.fit(X=X, y=y, n_trials=100, random_state=42)




# Returns Boxplot of features
Feature_Selector.plot(which_features='all')
Feature_Selector.results_to_csv(filename='feature_importance')

This gives me the error AttributeError: 'numpy.ndarray' object has no attribute 'columns'. Any idea what I am doing wrong? Thank you for your help in the matter.
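For what it's worth, the traceback suggests BorutaShap reads X.columns and therefore expects pandas objects, while .values strips them down to bare NumPy arrays. Keeping the slices as a DataFrame/Series should avoid this particular error:

X = dataset.iloc[:, 1:]   # DataFrame: keeps column names
y = dataset.iloc[:, 0]    # Series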

Maybe not is correct drop nans in self.history_x

In line 446 (self.history_x.dropna(axis=0, inplace=True)) this statement drops all rows that contain NaNs from the history, but some features (those not yet removed) have their values dropped as well, losing part of their history. So, in the end, the plot keeps the same appearance every time, even when increasing the n_trials parameter, and the means are not computed correctly (over all computed values).

I hope this is understandable.

[BUG] missing values check on training data shouldn't be done for Catboost and XGBoost

Describe the bug

Catboost and XGBoost accept missing values in the training set. The training set shouldn't be checked for missing values when using these algorithms.

To Reproduce

run:

from BorutaShap import BorutaShap, load_data
from catboost import CatBoostClassifier
import numpy as np

X, y = load_data(data_type="classification")
# introduce one missing value
X.iloc[4, 5] = np.nan

model = CatBoostClassifier()
Feature_Selector = BorutaShap(model, importance_measure="shap", classification=True)
Feature_Selector.fit(X=X, y=y, n_trials=100, random_state=0)

Output: ValueError: There are missing values in your Data

Expected behavior

The fit shouldn't throw an error. The model can cope with missing values.

[BUG]

Describe the bug

There is an issue using xgb 1.1.0 and shap 0.35.0
See: shap/shap#1215

To Reproduce

See link to issue 1215

Expected behavior

There have been changes in the XGBTreeModelLoader of xgb 1.1.0 that lead to an error with decode('utf-8').

Solution

A workaround is upgrading to new shap version (0.35) from master:
pip install https://github.com/slundberg/shap/archive/master.zip

However, BorutaShap warns that it depends on shap 0.34, but it works.

Having trouble in running[BUG]

Hello! I'm trying to run the run_tests and it's not working...
I got a TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type.
I am a novice. Could you tell me how to solve this problem?

[BUG] create_shadow_features () - LightGBM + BorutaShap - AttributeError: 'Series' object has no attribute 'select_dtypes'

I am trying to run BorutaShap with a bunch of models. One of them being LightGBM
When I run BorutaShap with a small number of rows, I get this attribute error in create_shadow_features()

... in <module>
     18 
     19 boruta_.fit(X=toy_X, y=toy_y.Label, n_trials=n_trials, normalize=normalize, sample=sample,
---> 20                 random_state=random_state, verbose=False)
     21 
     22 # results, fs = feature_selection.boruta(toy_df, french_f_y.Label, model=clf, n_trials=10, percentile=percentile,

.../BorutaShap.py in fit(self, X, y, n_trials, random_state, sample, train_or_test, normalize, verbose)
    344             self.remove_features_if_rejected()
    345             self.columns = self.X.columns.to_numpy()
--> 346             self.create_shadow_features()
    347 
    348             # early stopping

.../BorutaShap.py in create_shadow_features(self)
    541         self.X_shadow = self.X.apply(np.random.permutation)
    542         # append
--> 543         obj_col = self.X_shadow.select_dtypes("object").columns.tolist()
    544         if obj_col ==[] :
    545              pass

.../pandas/core/generic.py in __getattr__(self, name)
   5476         ):
   5477             return self[name]
-> 5478         return object.__getattribute__(self, name)
   5479 
   5480     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'select_dtypes'

The following code will reproduce the error:

import pandas as pd
from lightgbm import LGBMClassifier
from BorutaShap import BorutaShap


toy_X = pd.read_csv('toy_df.csv', index_col=0)
toy_y = pd.read_csv('toy_labels.csv', index_col=0)

model = LGBMClassifier()
importance_measure = 'shap'
percentile = 70
random_state = 0
normalize = True
sample = False
n_trials = 100

boruta_ = BorutaShap(model=model, importance_measure=importance_measure,
                     classification=True, percentile=percentile)

boruta_.fit(X=toy_X, y=toy_y.Label, n_trials=n_trials, normalize=normalize,
            sample=sample, random_state=random_state, verbose=False)

toy_df.csv
toy_labels.csv

The same data will work with other classifiers, and BorutaShap works with LightGBM on other data sets. I suspect it has something to do with the dataset size, as I've seen this happen only with small subsets of the data.

Boruta doesn't finish set runs (and ends with 0 important features)

Describe the bug

At times, when I run Boruta with different data, it doesn't finish the set runs (100) and ends with 0 important features and 0 tentative features. It usually stops at the 16th run and no important features are found.

TentativeRoughFix() did not help. I also increased the runs from 100 to 200, set sample = True and False, and changed random states, but still got the same result.

To Reproduce

Steps to reproduce the behavior:
Feature_Selector = BorutaShap(model=model,
                              importance_measure='shap',
                              classification=True)

Feature_Selector.fit(X=X, y=y, n_trials=200, sample=False,
                     normalize=True, verbose=True, random_state=1)

Feature_Selector.TentativeRoughFix()

Expected behavior

Boruta should at least finish its runs. In addition, at times it doesn't find any important or tentative features. Sometimes it works using a very similar dataset with the same number of features and observations.

[BUG] TypeError: __init__() got an unexpected keyword argument 'feature_perturbation'

TypeError Traceback (most recent call last)
in
8 classification=True)
9
---> 10 Feature_Selector.fit(X=X_train_rs, y=y_train_rs, n_trials=100, random_state=0)

~/.local/lib/python3.6/site-packages/BorutaShap.py in fit(self, X, y, n_trials, random_state, sample)
281
282
--> 283 self.X_feature_import, self.Shadow_feature_import = self.feature_importance()
284 self.update_importance_history()
285 self.hits += self.calculate_hits()

~/.local/lib/python3.6/site-packages/BorutaShap.py in feature_importance(self)
511 if self.importance_measure == 'shap':
512
--> 513 self.explain()
514 vals = self.shap_values
515 vals = self.calculate_Zscore(vals)

~/.local/lib/python3.6/site-packages/BorutaShap.py in explain(self)
595
596
--> 597 explainer = shap.TreeExplainer(self.model, feature_perturbation = "tree_path_dependent")
598
599

TypeError: __init__() got an unexpected keyword argument 'feature_perturbation'

I use shap 0.32.1

[FEATURE] compute significance for a subset of features

Description

Add the possibility to limit the features to test

Reasoning

Often some features are obviously important and there is no need to test their contribution. When the data set is large and the number of features is large, limiting the number of shadow features may lead to a significantly shorter computation time.

Shadow features in the final features

In the end we get a DataFrame with features and their importance + number of selected features. Why do we still have shadow features at this point? Shouldn't we eliminate them?

Also, suppose Boruta says that we should select the top-10 features. If some shadow features are in the top-10, they are also counted, right?

[BUG] SHAP values using treeExplainer and lightgbm for classification are wrong because of the different output type

Describe the bug

Hi, here a short description

  • TreeExplainer behaves differently given the passed model (namely, the output is not always of the same type, np.array or a list of np.array's). See the lightgbm issue
  • SHAP and permutation importance should be computed on unseen data
  • SHAP importances are mean( | shap.values | ), so for classification, before taking the mean/sum, the abs value should be applied.

To Reproduce

See from line 589 to 608

Expected behavior

Something like the following worked for me (related to the PR I made for Boruta_Py). It can be beautified.

self.is_cat = 'catboost' in str(type(self.estimator))
self.is_lgb = 'lightgbm' in str(type(self.estimator))

def _is_tree_based(self):
    """
    checking if the estimator is tree-based (kernel SHAP is too slow to be used here, unless using sampling)
    """
    tree_based_models = ['lightgbm', 'xgboost', 'catboost', '_forest']
    return any(i in str(type(self.estimator)) for i in tree_based_models)

if not self.classification:
    X_tr, X_tt, y_tr, y_tt = train_test_split(X, y, random_state=42)
else:
    X_tr, X_tt, y_tr, y_tt = train_test_split(X, y, stratify=y, random_state=42)

X_tr = pd.DataFrame(X_tr)
X_tt = pd.DataFrame(X_tt)
obj_feat = list(set(list(X_tr.columns)) - set(list(X_tr.select_dtypes(include=[np.number]))))

if obj_feat:
    X_tr[obj_feat] = X_tr[obj_feat].astype('str').astype('category')
    X_tt[obj_feat] = X_tt[obj_feat].astype('str').astype('category')

if self._is_tree_based():
    try:
        if self.is_cat:
            model = self.estimator.fit(X_tr, y_tr, cat_features=obj_feat)
        else:
            model = self.estimator.fit(X_tr, y_tr)
    except Exception as e:
        raise ValueError('Please check your X and y variable. The provided '
                         'estimator cannot be fitted to your data.\n' + str(e))
    # build the explainer
    explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
    shap_values = explainer.shap_values(X_tt)
    # flatten to 2D if classification and lightgbm
    if is_classifier(self.estimator):
        if isinstance(shap_values, list):
            # for lightgbm clf sklearn api, shap returns list of arrays
            # https://github.com/slundberg/shap/issues/526
            shap_imp = np.zeros(shap_values[0].shape[1])
            for class_shap in shap_values:
                shap_imp += np.abs(class_shap).mean(0)
            shap_imp /= len(shap_values)
        else:
            shap_imp = np.abs(shap_values).mean(0)
    else:
        shap_imp = np.abs(shap_values).mean(0)
else:
    raise ValueError('Not a tree based model')


[ENH] train_test_split doesn't support splitting by group as required for pairwise learning

Current Situation

In Check_if_chose_train_or_test_and_train_model(), if "test" is chosen, the split between train and test is done irrespective of the group the observation belongs to, making it inappropriate for pairwise learning.

Enhancement

Improve splitting to take groups into account.

Reasoning

It would make the package more in line with xgboost capabilities.

Implementation

I am not sure such a splitting function exists in sklearn. xgboost provides a split function that is group-aware, but there is not much explanation of what it does exactly.
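For reference, scikit-learn does ship a group-aware splitter; a sketch of using it for the split discussed here (groups being one label per row):

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]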

[FEATURE] get the feature importance scores in a dataframe

Hi,
Thanks a lot for this package!
One question: is there any way to get the final feature importance scores directly in a dataframe, without creating the csv file (with Feature_Selector.results_to_csv(filename='feature_importance')) and then reopening it?
Many thanks,
Best
Yedid
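A possible workaround, assuming the fitted selector's history_x attribute (the per-trial importance table mentioned in another issue here) holds what results_to_csv writes out:

# average each feature's importance across trials, highest first
history = Feature_Selector.history_x
mean_importance = history.mean().sort_values(ascending=False)
print(mean_importance.head())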

[ENH] Randomize train/test split

Randomize train/test split

The random_state argument in BorutaShap.fit is set to 0 by default. This means that the train/test split performed in Check_if_chose_train_or_test_and_train_model is the same for every iteration. For the shadow features this doesn't really matter, but for the real features it means that the same subset of the data is used for training at every iteration. Is this by design, or would it be better to perform a random split each iteration?

Will it work for a multidimensional matrix, like a tif?

This is an excellent improved Boruta algorithm, but I want to do feature selection on multiband images. The data type is tif and each band represents a kind of feature. I wonder whether the algorithm is able to work with a multidimensional matrix ([200, 200, 11]) as input features?
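BorutaShap works on tabular (samples x features) data, so one approach is to flatten the image so that each pixel becomes a row and each band a column (a sketch with random stand-in data):

import numpy as np
import pandas as pd

image = np.random.rand(200, 200, 11)                   # stand-in for the tif bands
X = pd.DataFrame(image.reshape(-1, image.shape[-1]),   # shape (40000, 11)
                 columns=[f'band_{i}' for i in range(image.shape[-1])])
# y would be the per-pixel label, flattened the same way (ravel)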

[BUG] check_missing_values() for methods that can handle missing data is faulty

Description

In the function check_missing_values() the check for models that can handle missing values is faulty. Simply comparing against the models_to_check tuple is wrong. This leads to the second if statement never getting executed, even if, for example, an XGBRegressor() instance that can handle missing data is passed.

models_to_check = ('xgb', 'catboost', 'lgbm', 'lightgbm')
model_name = str(type(self.model)).lower()        

if X_missing or Y_missing:
    if any([x in model_name for x in models_to_check]):                
        print('Warning there are missing values in your data !')
    else:                
         raise ValueError('There are missing values in your Data')

Reasoning

str(type(self.model)).lower() returns the instance type, which for XGBRegressor() is <class 'xgboost.sklearn.xgbregressor'>. Checking whether this string starts with an entry of models_to_check results in False. Consequently, the second if statement will not get executed.

[FEATURE] Add a transform method

Description

I was trying to chain BorutaShap inside a sklearn pipeline, as I can do with BorutaPy, but it returns this error:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<BorutaShap.BorutaShap object at 0x000002038843DB20>' (type <class 'BorutaShap.BorutaShap'>) doesn't.

It would be nice if we could chain BorutaShap inside a sklearn pipeline as a feature selector.

Implementation

Maybe you could implement something similar to the BorutaPy transform method?
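A minimal sketch of such a wrapper (not part of BorutaShap; it leans on the selector's accepted attribute and standard sklearn mixins):

from sklearn.base import BaseEstimator, TransformerMixin
from BorutaShap import BorutaShap

class BorutaShapTransformer(BaseEstimator, TransformerMixin):
    """Thin fit/transform wrapper so BorutaShap can sit inside a Pipeline."""

    def __init__(self, model=None, importance_measure='shap', classification=True, n_trials=100):
        self.model = model
        self.importance_measure = importance_measure
        self.classification = classification
        self.n_trials = n_trials

    def fit(self, X, y):
        self.selector_ = BorutaShap(model=self.model,
                                    importance_measure=self.importance_measure,
                                    classification=self.classification)
        self.selector_.fit(X=X, y=y, n_trials=self.n_trials, verbose=False)
        return self

    def transform(self, X):
        return X[self.selector_.accepted]   # keep only accepted features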

[BUG] Incorrect `calculate_hits` documentation

The docstring of calculate_hits in BorutaShap.py states the following:

Percentile : value ranging from 0-1

This should be

Percentile : value ranging from 0-100

since numpy.percentile expects a percentage.

BorutaShap throws error if there are missing values in the data even if the model used is lightgbm [BUG]

Describe the bug

Currently, BorutaShap throws an error if there are missing values in the data even if the model used is lightgbm. The reason is that the code checks for the string 'lgbm' instead of 'lightgbm' as a marker to see whether the model used is lightgbm.

Line 179 in BorutaShap.py:
models_to_check = ('xgb', 'catboost', 'lgbm')
should be
models_to_check = ('xgb', 'catboost', 'lgbm', 'lightgbm')
since:
model = LGBMClassifier()
print(type(model))
# lightgbm.sklearn.LGBMClassifier

Cannot make RF or XGBoost run multithread

Hi! I think this is more a question than a bug/request. I tried the defaults for RandomForest and XGBoost, and also tried the parameter n_jobs=10, but there was no change in core usage. Do you have an example of using multithreading? With a lot of data the process is really slow.
Thanks!!
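In case it helps, parallelism generally lives in the base estimator rather than in BorutaShap itself, so something like the following should at least parallelise the tree building (the SHAP step may still be a single-threaded bottleneck):

from sklearn.ensemble import RandomForestClassifier
from BorutaShap import BorutaShap

model = RandomForestClassifier(n_jobs=-1)   # use all cores for training
Feature_Selector = BorutaShap(model=model,
                              importance_measure='shap',
                              classification=True)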

[INF] Merge to BorutaPy (sklearn-contrib)

Description

Merge this package into BorutaPy.
I did a similar PR, with almost all the same modifications.

Reasoning

It makes sense to merge into the parent BorutaPy package to avoid duplicates.
The existing PR can then be replaced or merged with the present package (FYI scikit-learn-contrib/boruta_py#77).

It'll be part of sklearn-contrib, which is actually a nice way to have visibility and a significant impact on the ML community.

Tasks

  • Reach out to the BorutaPy maintainers and discuss the merge

How to cite this library in an academic text

Hello,

I need to cite this library in my thesis, but I don't know if you have a preferred citation style. Also, does this repo have a DOI? (https://github.blog/2014-05-14-improving-github-for-science/)

I am currently using this style:

@misc{BorutaShap,
  author = {Eoghan, Keany},
  title = {Boruta-Shap},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Ekeany/Boruta-Shap}},
  DOI = {},
  commit = {ffc1ca2b71235ebe353e89185021635abfffc712}
}

Please do let me know if you'd like a different style or if you can provide a DOI!

[BUG] sample_fraction parameter is not implemented

Describe the bug

The documentation describes the function parameters for fit, including a sample_fraction parameter that is used if sample=True; however, that parameter is not actually implemented in the function.

To Reproduce

Steps to reproduce the behavior:

  1. Go to boruta.py
  2. Look at the parameters that are implemented vs the docs and you will see that it is missing

Expected behavior

sample_fraction should sample some portion of the data to enhance speed.
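Until it is implemented, a manual workaround is to subsample before calling fit (a sketch; the fraction is arbitrary):

X_small = X.sample(frac=0.3, random_state=0)
y_small = y.loc[X_small.index]

Feature_Selector.fit(X=X_small, y=y_small, n_trials=100)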


[BUG] Misplaced important features

Hello and thank you for this wonderful tool.

I work with a regression Boruta setup as :

Feature_Selector.fit(X=X, y=y, n_trials=100, sample=False,
                     train_or_test='test', normalize=True,
                     verbose=True)
Feature_Selector.plot(which_features='all')

The ranking of the main features works pretty well and makes sense considering a cross-validated RF model I ran earlier. However, a bunch of these features have a 'funny' rank.

For instance, some 'confirmed important' features are located below the max_shadow, and symmetrically, for another dataset, some 'confirmed unimportant' features are located above the max_shadow.

here are two Feature_Selector.plots showing the issue. I hope you can help me with this.

(plots ex_1 and ex_2 attached)

Best wishes, and I'm sorry if I forgot anything that would help the understanding of my problem.
