aerdem4 / lofo-importance
Leave One Feature Out Importance
License: MIT License
I was getting a weird error when passing the titanic dataset through lofo-importance, and I think I know why.
Take the code example below: including the parameter model=rf triggers an error, whereas removing it lets the LOFO importance calculation proceed with the default model.
import lofo
import seaborn as sns
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# "sex" is a raw string column; the default (inferred) model handles
# its encoding, but a user-supplied RandomForestClassifier does not.
titanic = sns.load_dataset('titanic')
df = titanic[['sex', 'survived']]
dataset = lofo.Dataset(df=df, target="survived", features=[col for col in df.columns if col != "survived"])
rf = RandomForestClassifier()
lofo_imp = lofo.LOFOImportance(dataset, model=rf, scoring='accuracy')  # model=rf is what triggers the error
importance_df = lofo_imp.get_importance()
Root cause:
Looking at lofo_importance.py, lines 32-33 specify that infer_model is only called when model=None. Is there a particular reason for this, or shouldn't that step run regardless of the model passed to LOFOImportance?
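For reference, a paraphrased sketch of that gating (approximate names, not the exact source):

# Paraphrased sketch of the gating at lofo_importance.py lines 32-33:
if model is None:
    # infer_model both picks a default LightGBM model and encodes
    # object/categorical columns; both steps are skipped when a
    # model is supplied, so raw string columns reach the user model.
    model, df, categoricals = infer_model(df, features, y)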
Median and IQR would be more robust and more useful when the distribution of importances is not normal.
Something like this (stats here is scipy.stats):
importance_df["importance_md"] = lofo_cv_scores_normalized.median(axis=1)
importance_df["importance_iqr"] = stats.iqr(lofo_cv_scores_normalized, axis=1)
Also, for plot_importance there could be a choice between plain std error bars and a 95% CI.
For the std-based version it would be:
importance_df.plot(x="feature",
y="importance_mean",
xerr=1.96 * importance_df.importance_std,
kind='barh',
color=importance_df["color"],
figsize=figsize)
and for the IQR:
importance_df.plot(x="feature",
                   y="importance_md",
                   xerr=1.57 * importance_df.importance_iqr / np.sqrt(n),  # n: num_sampling for FLOFO, number of folds for LOFO
                   kind='barh',
                   color=importance_df["color"],
                   figsize=figsize)
The algorithm doesn't support multiclass classification.
In the infer_model function, the classification task is defined only for targets with exactly two unique values.
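A minimal sketch of the check being described, assuming the task is inferred from the number of unique target values (hypothetical names):

from lightgbm import LGBMClassifier, LGBMRegressor

# Hedged sketch: binary classification is detected by exactly two
# unique target values; anything else falls through to regression.
if df[target].nunique() == 2:
    model = LGBMClassifier()
else:
    model = LGBMRegressor()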
I don't know how to file a feature request properly XD. It would be great if you could add a categorical_feature parameter to your Dataset, just like in the LightGBM docs. Thanks!!
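For context, a minimal LightGBM-only sketch of the behavior being requested: LightGBM's sklearn API picks up pandas 'category' dtype columns automatically, without any extra parameter.

import lightgbm as lgb
import seaborn as sns

# LightGBM-only illustration: 'category' dtype columns are treated
# as categorical features with no explicit categorical_feature arg.
df = sns.load_dataset('titanic')[['sex', 'pclass', 'survived']].dropna()
df['sex'] = df['sex'].astype('category')
model = lgb.LGBMClassifier()
model.fit(df[['sex', 'pclass']], df['survived'])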
Is there any way to pass a sample weight column into a Dataset object, or is there any plan to add this functionality? It should fit pretty smoothly into the sklearn framework I would expect, but maybe I'm missing something.
Thanks for the wonderful package!
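For context on the sample-weight question above, sklearn's own cross-validation utilities already thread sample weights through fit params and slice them per fold, so the downstream plumbing exists; a minimal sklearn-only sketch:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# sklearn-only illustration: array-like fit_params are sliced per fold
# by the CV machinery, so weights line up with each training split.
# (fit_params was renamed to params in newer sklearn releases.)
X, y = make_classification(n_samples=200, random_state=0)
w = np.random.default_rng(0).random(len(y))  # stand-in for a weight column
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5,
                         fit_params={"sample_weight": w})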
Hi,
I was running the example notebook when this problem came up:
from lofo.lofo_importance import LOFOImportance
~/Desktop/lofo-importance/lofo/flofo_importance.py in ()
4 import multiprocessing
5 import warnings
----> 6 from sklearn.metrics import check_scoring
7
8
ImportError: cannot import name 'check_scoring'
It seems that in the sklearn version I have installed, the import needs to be
'from sklearn.metrics.scorer import check_scoring'
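A common compatibility shim for this kind of relocation, since both import paths have existed across sklearn releases:

# Try the public location first, fall back to the older private module.
try:
    from sklearn.metrics import check_scoring
except ImportError:
    from sklearn.metrics.scorer import check_scoring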
Hello,
The requirements.txt file, which is referenced in setup.py, is not included in the source distribution. This causes an error when installing from source because the file is missing; this is a problem, for example, when installing with conda. The file can be included by means of a MANIFEST.in file, as specified here: https://docs.python.org/3/distutils/sourcedist.html
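For example, a MANIFEST.in at the project root containing

include requirements.txt

would pull the file into the sdist.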
--Kellen
Tried your package today with pandas 2.0.3 and got the following error:
AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
After downgrading to Version 1.5.3 the issue was gone.
Can you please double-check and, if needed, add support for pandas 2.0.x? Thanks!
How to perform feature selection with hyperparameter tuning?
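One hedged approach, assuming any sklearn-style estimator is accepted as the model argument: wrap the base model in a tuner, so every LOFO fit runs its own search on that fold's training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from lofo import LOFOImportance

# Sketch: GridSearchCV is itself an estimator, so each LOFO fit
# re-tunes hyperparameters internally ('dataset' is assumed to be
# a lofo Dataset built as in the other examples on this page).
tuned = GridSearchCV(RandomForestClassifier(),
                     param_grid={"max_depth": [3, 6, None]},
                     cv=3)
lofo_imp = LOFOImportance(dataset, model=tuned, scoring="accuracy")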
When there are many features, the task takes a long time and can easily crash. Therefore, I think it is necessary to add a logging function and a checkpoint/resume function to lofo.
Feature request: having mean and stdev metrics is useful for understanding variable importance, but I think it would be quite valuable to also have a statistical significance test, especially for academic researchers.
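For instance, a hedged sketch of a per-feature one-sample t-test against zero importance, built from the reported mean and std (this assumes the std is computed across CV folds; n_folds below is an assumption):

import numpy as np
from scipy import stats

# Hedged sketch: t-test of H0 "importance == 0" per feature, using the
# mean/std columns that get_importance() reports and the fold count.
n_folds = 5  # assumption: matches the CV used for get_importance()
t_stat = importance_df["importance_mean"] / (importance_df["importance_std"] / np.sqrt(n_folds))
importance_df["p_value"] = 2 * stats.t.sf(np.abs(t_stat), df=n_folds - 1)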
Thanks for putting this repo together. Who developed this method? Is it the same as LOCO?
https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1307116?casa_token=HAl_ErrKi18AAAAA:YyDJybfbzaLMDU1Zlzq8D4OQnZmUeEwukWJFagcsFB7_JA-W-7ifcINhc8N0FTbtbImLjezWESo
Thanks,
Daniel
Thanks for the very cool package. I was wondering if it is possible to pass a group_id somehow in order to do GroupKFold or GroupShuffleSplit cross-validation? I don't see anywhere obvious, but wanted to check in case I'm missing anything.
Thanks.
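If lofo forwards its cv argument to sklearn's cross-validation utilities (which also accept a precomputed list of splits), a hedged workaround for the group question above is to materialize the group-aware splits yourself:

from sklearn.model_selection import GroupKFold

# Hedged workaround: precompute (train, test) index pairs with the
# groups, then pass the list as cv. 'group_id' is a hypothetical
# column; 'dataset', 'features' and 'target' follow the usual setup.
groups = df["group_id"].values
cv = list(GroupKFold(n_splits=5).split(df[features], df[target], groups))
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="roc_auc")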
Hi,
For a neural network, if you change the number of features, you also need to change the input dimension and therefore the number of neurons in the first layer.
So, we could have an option like:
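(A hypothetical sketch of such an option; model_builder is not an existing lofo parameter, just an illustration of the idea.)

from sklearn.neural_network import MLPClassifier

# Hypothetical API sketch -- model_builder is NOT a real lofo parameter.
# The idea: lofo calls the factory with the fold's feature count so the
# network can be rebuilt before each fit.
def model_builder(n_features):
    # Frameworks that need an explicit input dimension (e.g. Keras)
    # would use n_features to size the first layer; this sklearn
    # stand-in just scales the hidden layer with it.
    return MLPClassifier(hidden_layer_sizes=(2 * n_features,))

lofo_imp = LOFOImportance(dataset, model_builder=model_builder, scoring="accuracy")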
What do you think?
Do you have a tutorial for working with high-dimensional data like SNP data?
Is it possible to use sklearn's TimeSeriesSplit with lofo?
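It should be, given that the cv argument takes sklearn splitter objects (see the KFold usage in the example further down); a minimal sketch:

from sklearn.model_selection import TimeSeriesSplit

# Time-ordered splits instead of random folds; 'dataset' is assumed
# to be a lofo Dataset built from time-sorted rows.
cv = TimeSeriesSplit(n_splits=5)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="neg_mean_absolute_error")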
Hi,
I would like to apply lofo to my multi-target regression dataset; here is a reprex of my data:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=int(1e3), n_features=5, n_targets=4)
Some models (like CatBoost) allow multiple targets without changing any parameter. Since lofo.Dataset takes the target= argument as a string, I couldn't perform feature selection with lofo-importance. Can you add this compatibility?
Thanks
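Until such support exists, a hedged workaround sketch: run lofo once per target column, reusing X and y from the reprex above.

import pandas as pd
from lofo import Dataset, LOFOImportance

# Hedged workaround: one independent LOFO run per target column.
feature_cols = [f"f{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_cols)
importances = {}
for t in range(y.shape[1]):
    df["target"] = y[:, t]
    dataset = Dataset(df=df, target="target", features=feature_cols)
    importances[t] = LOFOImportance(dataset, scoring="neg_mean_squared_error").get_importance()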
The code below works as expected and produces importance_df:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
data = load_breast_cancer(as_frame=True)  # load as dataframe
df = data.data
df['target'] = data.target.values
# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)
But if we change load_breast_cancer to load_iris, the importance_df values are all NaN.
Does lofo-importance only support binary classification?
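One likely culprit, independent of lofo: sklearn's "f1" scorer is binary-only, so it errors on the 3-class iris target, and those errors can surface as NaN scores. A macro-averaged scorer is the multiclass equivalent; a minimal change, reusing the objects from the example above:

# "f1" is binary-only in sklearn; "f1_macro" handles multiclass targets.
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1_macro", model=model)
importance_df = lofo_imp.get_importance()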
This is not an issue, but rather a quick question for clarification.
From the brief definition of the method, it is a little hard to tell how LOFO and RFE/Backward Selection differ from each other. Could you please compare & contrast?
Thank you again for sharing this lib with the community!
Serdar
Hi, I love your package. I'd like to install it in a cloud environment where I do not have git available.
I can only install packages from PyPi, or if I upload an .egg or .whl file. Could you maybe provide details on how to install your package, given these circumstances or point me in the right direction?
Thank you, Flo
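If the package is published on PyPI (it appears to be, under the name lofo-importance), a plain pip install works without git:

pip install lofo-importance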
from lofo import Dataset
doesn't work and gives the error
cannot import name 'Dataset' from 'lofo'
But if I write from lofo.dataset import Dataset, it works fine.
I think it has something to do with __init__.py.
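A hedged sketch of the re-export that lofo/__init__.py would need for the top-level import to work (both module paths are the ones shown in these issues):

# lofo/__init__.py -- sketch of the missing re-exports
from lofo.dataset import Dataset
from lofo.lofo_importance import LOFOImportance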
Hi,
Consider the following code:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from lofo import LOFOImportance
brca = load_breast_cancer()
df = pd.DataFrame(brca["data"], columns=brca["feature_names"]).assign(target=brca["target"])
lofo = LOFOImportance(df, brca["feature_names"], "target",
                      cv=KFold(n_splits=5, shuffle=True, random_state=4),  # shuffle=True so random_state takes effect
                      scoring="roc_auc")
importance_df = lofo.get_importance()
This produces an importance ranking of the features. Some of the means are positive and some are negative.
I will now select all features whose mean is above zero and calculate the importance again.
new_features = importance_df.query("importance_mean > 0")["feature"].tolist()
lofo = LOFOImportance(df, new_features, "target",
                      cv=KFold(n_splits=5, shuffle=True, random_state=4),
                      scoring="roc_auc")
importance_df2 = lofo.get_importance()
Again, there are some positive means and some negative means. This is not what I was expecting: I removed the supposedly unnecessary features after the first step, so why are there new unnecessary features after the second step?