lofo-importance's People

Contributors

aerdem4, kamenialexnea, kingychiu, melonhead901, rafah-ek, stephanecollot


lofo-importance's Issues

Variable Grouping Only Works When Model Parameter is Kept To Default

I was getting a strange error when passing the titanic dataset through lofo-importance, and I think I know why.

Take the code example below. Including the parameter model=rf triggers an error, whereas removing it lets the LOFO importance calculation proceed with the default model.

import lofo
import seaborn as sns
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

titanic = sns.load_dataset('titanic')

df = titanic[['sex','survived']]
dataset = lofo.Dataset(df=df, target="survived", features=[col for col in df.columns if col != "survived"])
rf = RandomForestClassifier()
lofo_imp = lofo.LOFOImportance(dataset, model=rf, scoring='accuracy')
importance_df = lofo_imp.get_importance()

Root cause:
Looking at lofo_importance.py, lines 32-33 only call infer_model when model is None. Is there a particular reason for this, or should it run regardless of the model passed to LOFOImportance?
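A hedged workaround sketch, assuming the root cause above is correct (with a custom model, the automatic categorical encoding in infer_model is skipped): encode string columns yourself before passing the model. The tiny DataFrame below is a stand-in for the titanic columns used in the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Inline stand-in for titanic[['sex', 'survived']]
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "survived": [0, 1, 1, 0, 1, 0],
})

# Factorize string/object columns into integer codes before fitting.
for col in df.select_dtypes(include="object").columns:
    df[col], _ = pd.factorize(df[col])

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(df[["sex"]], df["survived"])  # now fits without a dtype error
```

With the features numeric, the same custom model can be handed to LOFOImportance as in the example above.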

Add the choice between Mean/Std and Median/IQR

Median and IQR could be more robust and useful when the distribution of importances is not normal.

Something like this (with stats from scipy):

from scipy import stats

importance_df["importance_md"] = lofo_cv_scores_normalized.median(axis=1)
importance_df["importance_iqr"] = stats.iqr(lofo_cv_scores_normalized, axis=1)

Also, for plot_importance there could be a choice between the standard error and a 95% CI.

For std it would be

importance_df.plot(x="feature",
                   y="importance_mean",
                   xerr=1.96 * importance_df.importance_std,
                   kind='barh',
                   color=importance_df["color"],
                   figsize=figsize)

and for IQR

import numpy as np

importance_df.plot(x="feature",
                   y="importance_md",
                   xerr=1.57 * importance_df.importance_iqr / np.sqrt(n),  # n: num_sampling for FLOFO, number of folds for LOFO
                   kind='barh',
                   color=importance_df["color"],
                   figsize=figsize)

Multiclass models

The algorithm doesn't support multiclass classification: in the infer_model function, the classification task is defined only for targets with two unique values.
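A sketch of the limitation described above and one possible fix. This is a hypothetical reimplementation of the task-inference check, not lofo's actual code: treating only two-class targets as classification leaves a multiclass target mislabeled as regression.

```python
import numpy as np

def infer_task(y):
    # Current behavior per the issue: classification only for 2 unique values.
    if len(np.unique(y)) == 2:
        return "classification"
    return "regression"

def infer_task_multiclass(y):
    # Proposed fix (illustrative heuristic): any integer-coded target with
    # fewer unique values than samples is treated as classification.
    y = np.asarray(y)
    if np.issubdtype(y.dtype, np.integer) and len(np.unique(y)) < len(y):
        return "classification"
    return "regression"

y_iris = np.array([0, 1, 2] * 10)  # 3-class target, as in the iris dataset
```

Under the current rule the iris-style target falls through to regression, which matches the symptom reported here.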

add categorical_feature like lightgbm

I don't know how to file a feature request XD.

It would be great if you could add a categorical_feature parameter to your Dataset, just like in the LightGBM docs. Thanks!!

Sample_weight?

Is there any way to pass a sample-weight column into a Dataset object, or is there any plan to add this functionality? I would expect it to fit pretty smoothly into the sklearn framework, but maybe I'm missing something.
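A hedged workaround sketch, assuming (per this issue) that lofo's public API has no sample_weight hook: an sklearn-compatible wrapper can bind a fixed weight vector to every fit() call, so the wrapped estimator can be passed as a custom model. All names here are illustrative, and the index-based alignment assumes a DataFrame with a default RangeIndex.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression

class WeightedModel(BaseEstimator):
    """Forwards a pre-bound sample_weight vector to the wrapped estimator."""
    def __init__(self, estimator, weights):
        self.estimator = estimator
        self.weights = np.asarray(weights)

    def fit(self, X, y):
        self.model_ = clone(self.estimator)
        # Align weights with whatever rows this CV fold received; falls back
        # to positional alignment when X is a plain array.
        idx = getattr(X, "index", np.arange(len(X)))
        self.model_.fit(X, y, sample_weight=self.weights[idx])
        return self

    def predict(self, X):
        return self.model_.predict(X)

# Minimal usage on toy data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
wm = WeightedModel(LogisticRegression(), weights=[1, 1, 2, 2]).fit(X, y)
```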

Thanks for the wonderful package!

ImportError about check_scoring

Hi,

I was running the example notebook, and here is the problem:

from lofo.lofo_importance import LOFOImportance

~/Desktop/lofo-importance/lofo/flofo_importance.py in ()
4 import multiprocessing
5 import warnings
----> 6 from sklearn.metrics import check_scoring
7
8

ImportError: cannot import name 'check_scoring'

I found that my version of sklearn exposes it as
'from sklearn.metrics.scorer import check_scoring'
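A small compatibility shim (a sketch, not the library's fix): check_scoring has lived in different locations across scikit-learn versions, so trying the modern public import first and falling back to the legacy path covers both.

```python
# Try the modern public location first, then the legacy one.
try:
    from sklearn.metrics import check_scoring          # newer sklearn
except ImportError:
    from sklearn.metrics.scorer import check_scoring   # older sklearn
```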

Pandas 2.0.x compatibility

Tried your package today with pandas 2.0.3 and got the following error:

AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'

After downgrading to Version 1.5.3 the issue was gone.

Can you please double-check and, if needed, upgrade to support pandas 2.0.x? Thanks!

Add logging or restart mechanism

When there are many features, the task takes a long time and can easily crash. I therefore think it is necessary to add a logging function and a breakpoint-recovery (restart) function to lofo.
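A minimal restart sketch of the idea above, outside lofo's current API: persist each feature's score as it finishes so a crashed run can resume where it left off. The evaluate callable and file name are illustrative stand-ins for one leave-one-feature-out evaluation.

```python
import json
import os

def run_with_checkpoint(features, evaluate, path="lofo_checkpoint.json"):
    """Evaluate features one by one, checkpointing scores to disk."""
    scores = {}
    if os.path.exists(path):
        with open(path) as f:
            scores = json.load(f)      # resume from a previous run
    for feat in features:
        if feat in scores:
            continue                   # already computed before the crash
        scores[feat] = evaluate(feat)
        with open(path, "w") as f:
            json.dump(scores, f)       # checkpoint after every feature
    return scores

# Demo with a dummy evaluator
result = run_with_checkpoint(["a", "b"], lambda f: 0.5, path="demo_ckpt.json")
os.remove("demo_ckpt.json")            # clean up the demo checkpoint
```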

Feature selection using statistical significance

Feature request: mean and stdev metrics are useful for understanding variable importance, but I think a statistical significance test would be quite valuable, especially for academic researchers.
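One way the requested test could look (a sketch, not a lofo feature): with per-fold LOFO scores in hand, a one-sample t-test against zero gives a p-value per feature. The fold scores below are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold importance scores for a single feature (5 folds)
fold_importances = np.array([0.021, 0.018, 0.025, 0.019, 0.022])

# H0: the feature's true importance is 0
t_stat, p_value = stats.ttest_1samp(fold_importances, popmean=0.0)
significant = p_value < 0.05
```

With few folds the test has little power, so a nonparametric alternative (e.g. a sign test over folds) might be preferable when normality is doubtful, which ties back to the Median/IQR request above.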

GroupKFold or GroupShuffleSplit Cross-Validation

Thanks for the very cool package. Is it possible to pass a group_id somehow in order to do GroupKFold or GroupShuffleSplit cross-validation? I don't see anything obvious, but wanted to check in case I'm missing something.

Thanks.
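A hedged workaround, assuming lofo forwards its cv argument to sklearn's cross-validation machinery (which accepts an iterable of precomputed splits wherever a splitter is accepted): materialize the (train, test) index pairs with the groups up front and pass the list instead of the splitter.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 samples in 4 groups (e.g. a patient or session id)
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

# GroupKFold only sees groups at split() time, so bake the splits in now.
cv_splits = list(GroupKFold(n_splits=4).split(X, y, groups))

# cv_splits can then be passed as LOFOImportance(..., cv=cv_splits)
```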

Performing feature selection on a multi-target regression dataset

Hi,

I would like to apply lofo to my multi-target regression dataset; here is a reprex of my data:


from sklearn.datasets import make_regression

X, y = make_regression(n_samples=int(1e3), n_features=5, n_targets=4)

Some models (like CatBoost) allow multiple targets without changing any parameters.

Since lofo.Dataset takes the target= argument as a string, I couldn't perform feature selection with lofo-importance.

Can you add this compatibility?

Thanks
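A hedged interim approach, given that Dataset accepts a single target name: run the importance computation once per target column and aggregate the results. The compute_importance function below is a deliberately simple stand-in (absolute correlation); the real loop would build a lofo.Dataset and LOFOImportance per target instead.

```python
import pandas as pd
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, n_targets=4, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
targets = [f"t{i}" for i in range(4)]
for name, col in zip(targets, y.T):
    df[name] = col

def compute_importance(frame, target):
    # Stand-in for one lofo run against a single target column.
    features = [c for c in frame.columns if c.startswith("f")]
    return frame[features].corrwith(frame[target]).abs()

# One importance Series per target, averaged into a single ranking
per_target = pd.concat({t: compute_importance(df, t) for t in targets}, axis=1)
mean_importance = per_target.mean(axis=1).sort_values(ascending=False)
```

Averaging rankings assumes the targets are on comparable scales; normalizing each target's importances first would be safer when they are not.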

Having a lot of features + Using LOFO?

Hi,

I have 1673 features. When I try to use LOFO importance, the result is the following:

[screenshot: feature labels overlapping in the importance plot]

Are the features showing up one on top of the other because the plot isn't long enough? What would you suggest to fix this problem?

Thank you
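Two hedged options for a plot with ~1700 rows: scale the figure height with the number of bars, or plot only the top of the ranking. The importance_df below is a stand-in with the feature/importance_mean columns that get_importance() returns; the 0.3-inches-per-bar figure height is a rule of thumb, not a lofo default.

```python
import pandas as pd

# Stand-in for a get_importance() result with 1673 features
importance_df = pd.DataFrame({
    "feature": [f"f{i}" for i in range(1673)],
    "importance_mean": [1.0 / (i + 1) for i in range(1673)],
})

# Option 1: keep only the most important features before plotting
top_n = 30
subset = importance_df.nlargest(top_n, "importance_mean")

# Option 2: scale the height so each bar gets ~0.3 inches of room
figsize = (12, 0.3 * len(subset))
# e.g. plot_importance(subset, figsize=figsize) with lofo's plotting helper
```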

Support multiclass classification ?

The code below is okay to get importance_df.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

data = load_breast_cancer(as_frame=True)  # load as dataframe
df = data.data
df['target'] = data.target.values

# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

But if we change load_breast_cancer to load_iris, the importance_df values are all NaN.

Does lofo-importance only support binary classification?
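A plausible explanation and hedged check: plain "f1" is a binary-only scorer, so each fold errors out on the 3-class iris target (surfacing as NaN), while a multiclass average such as "f1_macro" scores cleanly. The check below exercises the scorer directly with cross_val_score rather than through lofo.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

data = load_iris(as_frame=True)
cv = KFold(n_splits=5, shuffle=True, random_state=666)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# "f1_macro" handles 3 classes; plain "f1" would fail here.
scores = cross_val_score(model, data.data, data.target, cv=cv,
                         scoring="f1_macro", error_score=np.nan)
```

If this is the cause, passing scoring="f1_macro" (or "f1_weighted") to LOFOImportance may be all that's needed once the infer_model multiclass limitation above is addressed.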

usage question

This is not an issue, but rather a quick question for clarification.

From the brief definition of the method, it is a little hard to tell how LOFO differs from RFE/backward selection. Could you please compare and contrast them?

Thank you again for sharing this lib with the community!

Serdar

Installation without git

Hi, I love your package. I'd like to install it in a cloud environment where I do not have git available.
I can only install packages from PyPI, or by uploading an .egg or .whl file. Could you provide details on how to install your package under these circumstances, or point me in the right direction?

Thank you, Flo

I am not able to import Dataset

from lofo import Dataset doesn't work and gives the error

cannot import name 'Dataset' from 'lofo'

But if I write from lofo.dataset import Dataset, it works fine.

I think it has something to do with __init__.py.

Understanding LOFO Importance

Hi,
Consider the following code:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from lofo import LOFOImportance

brca = load_breast_cancer()
df = pd.DataFrame(brca["data"], columns=brca["feature_names"]).assign(target=brca["target"])

lofo = LOFOImportance(df, brca["feature_names"], "target",
                      cv=KFold(n_splits=5, shuffle=True, random_state=4),
                      scoring="roc_auc")
importance_df = lofo.get_importance()

[screenshot: first importance ranking plot]

This produces an importance ranking of the features. Some features have positive means and some have negative means.

I will now select all features whose mean is above zero and calculate the importance again.

new_features = importance_df.query("importance_mean > 0")["feature"].tolist()
lofo = LOFOImportance(df, new_features, "target",
                      cv=KFold(n_splits=5, shuffle=True, random_state=4),
                      scoring="roc_auc")
importance_df2 = lofo.get_importance()

[screenshot: second importance ranking plot]

Again, some means are positive and some are negative. This is not what I was expecting: I removed the supposedly unnecessary features after the first step, so why are there new unnecessary features after the second step?
