aerdem4 / lofo-importance
Leave One Feature Out Importance
License: MIT License
I was getting a weird error when passing the titanic dataset through lofo-importance, and I think I know why.
Take the code example below: including the parameter model=rf triggers an error, whereas removing it lets the LOFO importance calculation proceed with the default model.
import lofo
import seaborn as sns
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# "sex" is a raw string column; the default (inferred) model handles
# its encoding, but a user-supplied RandomForestClassifier does not.
titanic = sns.load_dataset('titanic')
df = titanic[['sex', 'survived']]
dataset = lofo.Dataset(df=df, target="survived", features=[col for col in df.columns if col != "survived"])
rf = RandomForestClassifier()
lofo_imp = lofo.LOFOImportance(dataset, model=rf, scoring='accuracy')  # model=rf is what triggers the error
importance_df = lofo_imp.get_importance()
Root cause:
Looking at lofo_importance.py, lines 32-33 specify that infer_model is only called when model=None. Is there a particular reason for this, or shouldn't that step run regardless of the model passed to LOFOImportance?
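For reference, a paraphrased sketch of that gating (approximate names, not the exact source):

# Paraphrased sketch of the gating at lofo_importance.py lines 32-33:
if model is None:
    # infer_model both picks a default LightGBM model and encodes
    # object/categorical columns; both steps are skipped when a
    # model is supplied, so raw string columns reach the user model.
    model, df, categoricals = infer_model(df, features, y)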
Median and IQR would be more robust and more useful when the distribution of importances is not normal.
Something like this (stats here is scipy.stats):
importance_df["importance_md"] = lofo_cv_scores_normalized.median(axis=1)
importance_df["importance_iqr"] = stats.iqr(lofo_cv_scores_normalized, axis=1)
Also, for plot_importance there could be a choice between plain std error bars and a 95% CI.
For the std-based version it would be:
importance_df.plot(x="feature",
y="importance_mean",
xerr=1.96 * importance_df.importance_std,
kind='barh',
color=importance_df["color"],
figsize=figsize)
and for the IQR:
importance_df.plot(x="feature",
                   y="importance_md",
                   xerr=1.57 * importance_df.importance_iqr / np.sqrt(n),  # n: num_sampling for FLOFO, number of folds for LOFO
                   kind='barh',
                   color=importance_df["color"],
                   figsize=figsize)
The algorithm doesn't support multiclass classification.
In the infer_model function, the classification task is defined only for targets with exactly two unique values.
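A minimal sketch of the check being described, assuming the task is inferred from the number of unique target values (hypothetical names):

from lightgbm import LGBMClassifier, LGBMRegressor

# Hedged sketch: binary classification is detected by exactly two
# unique target values; anything else falls through to regression.
if df[target].nunique() == 2:
    model = LGBMClassifier()
else:
    model = LGBMRegressor()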
I don't know how to file a feature request properly XD. It would be great if you could add a categorical_feature parameter to your Dataset, just like in the LightGBM docs. Thanks!!
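For context, a minimal LightGBM-only sketch of the behavior being requested: LightGBM's sklearn API picks up pandas 'category' dtype columns automatically, without any extra parameter.

import lightgbm as lgb
import seaborn as sns

# LightGBM-only illustration: 'category' dtype columns are treated
# as categorical features with no explicit categorical_feature arg.
df = sns.load_dataset('titanic')[['sex', 'pclass', 'survived']].dropna()
df['sex'] = df['sex'].astype('category')
model = lgb.LGBMClassifier()
model.fit(df[['sex', 'pclass']], df['survived'])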
Is there any way to pass a sample weight column into a Dataset object, or is there any plan to add this functionality? It should fit pretty smoothly into the sklearn framework I would expect, but maybe I'm missing something.
Thanks for the wonderful package!
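For context on the sample-weight question above, sklearn's own cross-validation utilities already thread sample weights through fit params and slice them per fold, so the downstream plumbing exists; a minimal sklearn-only sketch:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# sklearn-only illustration: array-like fit_params are sliced per fold
# by the CV machinery, so weights line up with each training split.
# (fit_params was renamed to params in newer sklearn releases.)
X, y = make_classification(n_samples=200, random_state=0)
w = np.random.default_rng(0).random(len(y))  # stand-in for a weight column
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5,
                         fit_params={"sample_weight": w})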
Hi,
I was running the example notebook when this problem came up:
from lofo.lofo_importance import LOFOImportance
~/Desktop/lofo-importance/lofo/flofo_importance.py in ()
4 import multiprocessing
5 import warnings
----> 6 from sklearn.metrics import check_scoring
7
8
ImportError: cannot import name 'check_scoring'
It seems that in the sklearn version I have installed, the import needs to be
'from sklearn.metrics.scorer import check_scoring'
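A common compatibility shim for this kind of relocation, since both import paths have existed across sklearn releases:

# Try the public location first, fall back to the older private module.
try:
    from sklearn.metrics import check_scoring
except ImportError:
    from sklearn.metrics.scorer import check_scoring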
Hello,
The requirements.txt file, which is referenced in setup.py, is not included in the source distribution. This causes an error when installing from source because the file is missing; this is a problem, for example, when installing with conda. The file can be included by means of a MANIFEST.in file, as specified here: https://docs.python.org/3/distutils/sourcedist.html
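For example, a MANIFEST.in at the project root containing

include requirements.txt

would pull the file into the sdist.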
--Kellen
Tried your package today with pandas 2.0.3 and got the following error:
AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
After downgrading to Version 1.5.3 the issue was gone.
Can you please double-check and, if needed, add support for pandas 2.0.x? Thanks!
How to perform feature selection with hyperparameter tuning?
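One hedged approach, assuming any sklearn-style estimator is accepted as the model argument: wrap the base model in a tuner, so every LOFO fit runs its own search on that fold's training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from lofo import LOFOImportance

# Sketch: GridSearchCV is itself an estimator, so each LOFO fit
# re-tunes hyperparameters internally ('dataset' is assumed to be
# a lofo Dataset built as in the other examples on this page).
tuned = GridSearchCV(RandomForestClassifier(),
                     param_grid={"max_depth": [3, 6, None]},
                     cv=3)
lofo_imp = LOFOImportance(dataset, model=tuned, scoring="accuracy")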
When there are many features, the task takes a long time and can easily crash. Therefore, I think it is necessary to add a logging function and a checkpoint/resume function to lofo.
Feature request: having mean and stdev metrics is useful for understanding variable importance, but I think it would be quite valuable to also have a statistical significance test, especially for academic researchers.
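For instance, a hedged sketch of a per-feature one-sample t-test against zero importance, built from the reported mean and std (this assumes the std is computed across CV folds; n_folds below is an assumption):

import numpy as np
from scipy import stats

# Hedged sketch: t-test of H0 "importance == 0" per feature, using the
# mean/std columns that get_importance() reports and the fold count.
n_folds = 5  # assumption: matches the CV used for get_importance()
t_stat = importance_df["importance_mean"] / (importance_df["importance_std"] / np.sqrt(n_folds))
importance_df["p_value"] = 2 * stats.t.sf(np.abs(t_stat), df=n_folds - 1)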
Thanks for putting this repo together. Who developed this method? Is it the same as LOCO?
https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1307116?casa_token=HAl_ErrKi18AAAAA:YyDJybfbzaLMDU1Zlzq8D4OQnZmUeEwukWJFagcsFB7_JA-W-7ifcINhc8N0FTbtbImLjezWESo
Thanks,
Daniel
Thanks for the very cool package. I was wondering if it is possible to pass a group_id somehow in order to do GroupKFold or GroupShuffleSplit cross-validation? I don't see anywhere obvious, but wanted to check in case I'm missing anything.
Thanks.
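If lofo forwards its cv argument to sklearn's cross-validation utilities (which also accept a precomputed list of splits), a hedged workaround for the group question above is to materialize the group-aware splits yourself:

from sklearn.model_selection import GroupKFold

# Hedged workaround: precompute (train, test) index pairs with the
# groups, then pass the list as cv. 'group_id' is a hypothetical
# column; 'dataset', 'features' and 'target' follow the usual setup.
groups = df["group_id"].values
cv = list(GroupKFold(n_splits=5).split(df[features], df[target], groups))
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="roc_auc")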
Hi,
For a neural network, if you change the number of features, you also need to change the input dimension and therefore the number of neurons in the first layer.
So, we could have an option like:
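(A hypothetical sketch of such an option; model_builder is not an existing lofo parameter, just an illustration of the idea.)

from sklearn.neural_network import MLPClassifier

# Hypothetical API sketch -- model_builder is NOT a real lofo parameter.
# The idea: lofo calls the factory with the fold's feature count so the
# network can be rebuilt before each fit.
def model_builder(n_features):
    # Frameworks that need an explicit input dimension (e.g. Keras)
    # would use n_features to size the first layer; this sklearn
    # stand-in just scales the hidden layer with it.
    return MLPClassifier(hidden_layer_sizes=(2 * n_features,))

lofo_imp = LOFOImportance(dataset, model_builder=model_builder, scoring="accuracy")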
What do you think?
Do you have a tutorial for working with high-dimensional data like SNP data?
Is it possible to use sklearn's TimeSeriesSplit with lofo?
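It should be, given that the cv argument takes sklearn splitter objects (see the KFold usage in the example further down); a minimal sketch:

from sklearn.model_selection import TimeSeriesSplit

# Time-ordered splits instead of random folds; 'dataset' is assumed
# to be a lofo Dataset built from time-sorted rows.
cv = TimeSeriesSplit(n_splits=5)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="neg_mean_absolute_error")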
Hi,
I would like to apply lofo to my multi-target regression dataset; here is a reprex of my data:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=int(1e3), n_features=5, n_targets=4)
Some models (like CatBoost) allow multiple targets without changing any parameter. Since lofo.Dataset takes the target= argument as a string, I couldn't perform feature selection with lofo-importance. Can you add this compatibility?
Thanks
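Until such support exists, a hedged workaround sketch: run lofo once per target column, reusing X and y from the reprex above.

import pandas as pd
from lofo import Dataset, LOFOImportance

# Hedged workaround: one independent LOFO run per target column.
feature_cols = [f"f{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_cols)
importances = {}
for t in range(y.shape[1]):
    df["target"] = y[:, t]
    dataset = Dataset(df=df, target="target", features=feature_cols)
    importances[t] = LOFOImportance(dataset, scoring="neg_mean_squared_error").get_importance()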
The code below works as expected and produces importance_df:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
data = load_breast_cancer(as_frame=True)  # load as dataframe
df = data.data
df['target'] = data.target.values
# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)
But if we change load_breast_cancer to load_iris, the importance_df values are all NaN.
Does lofo-importance only support binary classification?
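One likely culprit, independent of lofo: sklearn's "f1" scorer is binary-only, so it errors on the 3-class iris target, and those errors can surface as NaN scores. A macro-averaged scorer is the multiclass equivalent; a minimal change, reusing the objects from the example above:

# "f1" is binary-only in sklearn; "f1_macro" handles multiclass targets.
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1_macro", model=model)
importance_df = lofo_imp.get_importance()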
This is not an issue, but rather a quick question for clarification.
From the brief definition of the method, it is a little hard to tell how LOFO and RFE/Backward Selection differ from each other. Could you please compare & contrast?
Thank you again for sharing this lib with the community!
Serdar
Hi, I love your package. I'd like to install it in a cloud environment where I do not have git available.
I can only install packages from PyPi, or if I upload an .egg or .whl file. Could you maybe provide details on how to install your package, given these circumstances or point me in the right direction?
Thank you, Flo
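If the package is published on PyPI (it appears to be, under the name lofo-importance), a plain pip install works without git:

pip install lofo-importance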
from lofo import Dataset
doesn't work and gives the error
cannot import name 'Dataset' from 'lofo'
But if I write from lofo.dataset import Dataset, it works fine.
I think it has something to do with __init__.py.
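A hedged sketch of the re-export that lofo/__init__.py would need for the top-level import to work (both module paths are the ones shown in these issues):

# lofo/__init__.py -- sketch of the missing re-exports
from lofo.dataset import Dataset
from lofo.lofo_importance import LOFOImportance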
Hi,
Consider the following code:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from lofo import LOFOImportance
brca = load_breast_cancer()
df = pd.DataFrame(brca["data"], columns=brca["feature_names"]).assign(target=brca["target"])
lofo = LOFOImportance(df, brca["feature_names"], "target",
                      cv=KFold(n_splits=5, shuffle=True, random_state=4),  # shuffle=True so random_state takes effect
                      scoring="roc_auc")
importance_df = lofo.get_importance()
This produces an importance ranking of the features. Some of the means are positive and some are negative.
I will now select all features whose mean is above zero and calculate the importance again.
new_features = importance_df.query("importance_mean > 0")["feature"].tolist()
lofo = LOFOImportance(df, new_features, "target",
                      cv=KFold(n_splits=5, shuffle=True, random_state=4),
                      scoring="roc_auc")
importance_df2 = lofo.get_importance()
Again, there are some positive means and some negative means. This is not what I was expecting: I removed the supposedly unnecessary features after the first step, so why are there new unnecessary features after the second step?