Comments (5)
The issue you describe could have several causes:
- Hyperparameter optimization. You do not apply it in your case, but it would naturally increase the validation AUC, because many runs with different parameters are compared.
- Even in a randomly generated dataset, some features can have predictive power by chance, especially in a small sample. If that is what happens here, it is not a bug: ShapRFECV exploits this better than RFECV because it uses SHAP as the proxy for feature importance.
I repeated the test with Lgbm for sample of:
I think the deviation from 0.5 AUC becomes very marginal and appears because some of the random variables have some predictive power by chance. The more data we use, the smaller the deviation and the later in the process it appears.
I would say this is not a bug but a feature of ShapRFECV (I always wanted to say that). What do you think?
from probatus.
Yes, that is my observation as well. For a small number of features and rows, the results are always optimistic compared to RFECV. Although this is a feature, I think we should mention it somewhere in the docs in the form of a warning or a disclaimer:
We have observed that for smaller datasets (n_rows < 5k, n_cols < 10) the validation results can be marginally optimistic, since they depend on the underlying SHAP values and the trained model.
With random data, a validation ROC AUC of 0.52 versus 0.5 may not matter much; however, depending on the use case, a difference of 0.01-0.02 can be huge.
What do you think?
I wouldn't consider it optimistic in that case. If you have randomly generated features, especially in small datasets, some of them will have some predictive power by chance. Generating the data randomly does not guarantee that the features carry no predictive power; that only becomes highly probable as the sample grows large. I would also argue that if you have a very small sample (<1k rows) and start the feature elimination process with many random features (>100), the AUC can be even higher towards the end of the process, because the initial number of features that could have predictive power by chance is high.
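To make the point about chance predictive power concrete, here is a minimal standard-library sketch. It is not the code from this issue and does not use probatus or LightGBM: it draws purely random Gaussian features and balanced random labels, computes a univariate in-sample AUC per feature, and reports the largest deviation from 0.5. The helper names (`auc`, `max_auc_deviation`) and the sample sizes are illustrative assumptions, and an in-sample univariate AUC is only a rough stand-in for the cross-validated AUC discussed here.

```python
import random


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC. Assumes continuous scores without ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of the 1-based ranks of the positive examples.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def max_auc_deviation(n_rows, n_features, seed=0):
    """Largest |AUC - 0.5| over purely random features (hypothetical helper)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_rows)]  # balanced labels
    rng.shuffle(labels)
    worst = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        worst = max(worst, abs(auc(feature, labels) - 0.5))
    return worst


small = max_auc_deviation(n_rows=200, n_features=50)
large = max_auc_deviation(n_rows=5000, n_features=50)
print(f"max |AUC - 0.5| with 200 rows:  {small:.3f}")
print(f"max |AUC - 0.5| with 5000 rows: {large:.3f}")
```

Under these assumptions, the best random feature should sit visibly further from 0.5 in the small sample than in the large one, which is consistent with the deviation shrinking as more data is used.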
The fact that the AUC differs from 0.5 comes from the predictive power of these features and from ShapRFECV using a more efficient proxy for selecting them over others. ShapRFECV removes noisy features and prevents the model from overfitting on the fully random features. RFECV uses impurity metrics, which are known to be biased towards high-cardinality features, and randomly generated values are exactly that. Therefore, that proxy is less effective for this use case. There are no other major differences between the two approaches in the provided code example.
In my view, the observation shows that if there is even marginal predictive power in the features, ShapRFECV extracts it more efficiently than RFECV. It is not optimistic, because in both cases we use 20-fold cross-validation to show that the validation AUC is in fact higher than 50%.
This experiment explains how impurity metrics are fooled by random variables. Random variables can have very high importance even if they have no predictive power at all. That is why RFECV does not get the boost in AUC that we observe with ShapRFECV.
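The mechanism behind that bias can be sketched with the standard library alone. The example below is a simplified illustration, not how RFECV, scikit-learn, or LightGBM actually compute importances: it measures the best Gini impurity reduction a single split can achieve on a feature that is unrelated to the label, once for a binary feature (one possible split) and once for a continuous feature (many candidate thresholds). The function names and the sizes (`n_rows=200`, `n_repeats=100`) are assumptions chosen for illustration.

```python
import random


def gini(pos, total):
    """Gini impurity of a node with `pos` positives out of `total` rows."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_gini_gain(values, labels):
    """Best impurity reduction over all thresholds between distinct values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    total_pos = sum(labels)
    parent = gini(total_pos, n)
    best = 0.0
    left_pos = 0
    for k in range(1, n):  # k = number of rows sent to the left child
        left_pos += labels[order[k - 1]]
        if values[order[k - 1]] == values[order[k]]:
            continue  # cannot split between equal values
        right_pos = total_pos - left_pos
        child = (k / n) * gini(left_pos, k) + ((n - k) / n) * gini(right_pos, n - k)
        best = max(best, parent - child)
    return best


def mean_best_gain(make_feature, n_rows=200, n_repeats=100, seed=0):
    """Average best gain for a random feature and independent random labels."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        labels = [rng.randint(0, 1) for _ in range(n_rows)]
        values = [make_feature(rng) for _ in range(n_rows)]
        total += best_gini_gain(values, labels)
    return total / n_repeats


binary_gain = mean_best_gain(lambda rng: rng.randint(0, 1))
continuous_gain = mean_best_gain(lambda rng: rng.random())
print(f"mean best gain, binary noise feature:     {binary_gain:.4f}")
print(f"mean best gain, continuous noise feature: {continuous_gain:.4f}")
```

Both features are pure noise, yet the continuous one should show a larger average gain simply because it offers more thresholds to choose from. That is the sense in which an impurity-based ranking can favour high-cardinality random variables.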
Yes, the generated random variables do have a very low correlation, and SHAP is able to pick that up, which may explain the slight boost in performance. Thanks for investigating this. I will close the issue.