
Comments (5)

Matgrb commented on May 22, 2024

The issue you mention could occur for multiple reasons:

  • Hyperparameter optimization - in your case you do not apply it, but it would naturally increase the validation AUC, because multiple runs are performed for different parameters.
  • Even in a randomly created dataset, some features can have predictive power purely by chance, especially for a small sample. If that is what happens here, then it is not a bug, and ShapRFECV exploits this better than RFECV because it uses SHAP as a proxy for feature importance.
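To make the second point concrete, here is a minimal sklearn sketch (not probatus code; a random forest stands in for any classifier, and the sizes are illustrative) showing that cross-validated AUC on purely random data rarely lands exactly at 0.5 for a small sample:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))   # small sample of pure noise features
y = rng.randint(0, 2, size=200)  # labels generated independently of X

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
# the mean CV AUC hovers around 0.5, but essentially never hits it exactly
print(scores.mean())
```

Re-running with different seeds shows the deviation is random noise, not a systematic bias of either selector.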

I repeated the test with LightGBM for samples of 1k, 10k, and 20k:

[AUC plots for the 1k, 10k, and 20k samples omitted]

I think the deviation from 0.5 AUC becomes very marginal, and appears because some of the random variables have predictive power by chance. The more data we use, the smaller the deviation and the later in the elimination process it appears.
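A quick sklearn-only sketch of the same sample-size effect (logistic regression stands in for LightGBM here, which is my own substitution; the sizes mirror two of the runs above): the deviation of the cross-validated AUC from 0.5 on pure noise typically shrinks as the sample grows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
deviations = {}
for n in (1_000, 10_000):
    X = rng.normal(size=(n, 10))   # pure noise features
    y = rng.randint(0, 2, size=n)  # labels independent of X
    auc = cross_val_score(LogisticRegression(), X, y,
                          cv=5, scoring="roc_auc").mean()
    deviations[n] = abs(auc - 0.5)
# the deviation from 0.5 typically shrinks as n grows
print(deviations)
```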

I would say this is not a bug, it is a feature of ShapRFECV (always wanted to say it). What do you think?

from probatus.

anilkumarpanda commented on May 22, 2024

Yes, that is my observation as well. For a small number of features and rows, the results are always optimistic compared to RFECV. Although this is a feature, I think we should mention it somewhere in the docs in the form of a warning or a disclaimer, e.g.:

We have observed that for smaller datasets (n_rows < 5k, n_cols < 10) the validation results can be marginally optimistic, since they depend on the underlying SHAP values and the trained model.

When dealing with random data, a validation AUC of 0.52 vs. 0.5 may not matter much; however, depending on the use case, a difference of 0.01-0.02 can be huge.
What do you think?
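To illustrate where such optimism can come from, here is a minimal sketch (plain sklearn, not ShapRFECV; the single-feature scoring loop is a deliberately simplified stand-in for any selection procedure): picking the best of many noise features by the same cross-validation that is used for reporting inflates the reported score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(1)
X = rng.normal(size=(300, 50))   # 50 pure-noise features, small sample
y = rng.randint(0, 2, size=300)  # labels independent of X

# score each feature alone with the same CV that is used for reporting
aucs = [
    cross_val_score(LogisticRegression(), X[:, [j]], y,
                    cv=5, scoring="roc_auc").mean()
    for j in range(X.shape[1])
]
# the best-scoring noise feature typically lands above 0.5 purely by chance
print(max(aucs))
```

An unbiased estimate would require scoring the selected subset on data that was never used for the selection itself.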


Matgrb commented on May 22, 2024

I wouldn't consider it optimistic in that case. If you have randomly generated features, especially for small datasets, some of the features will have predictive power by chance. The fact that you randomly generated the data does not guarantee that there is no predictive power in it; only when the sample becomes large does that become highly probable. I would also argue that if you have a very small sample (<1k) and start the feature elimination process with many random features (>100), the AUC can be even higher towards the end of the process, because initially the number of features that could have predictive power by chance is high.

The fact that the AUC differs from 0.5 comes from the predictive power of these features and from ShapRFECV using a more efficient proxy for selecting these features over others. ShapRFECV removes noisy features and prevents the model from overfitting on the fully random features. RFECV uses impurity-based importance, which is known to be biased towards high-cardinality features (which is exactly the case for randomly generated values), so that proxy is less effective for this use case. There are no other major differences between the two approaches in the provided code example.
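The impurity-bias point can be sketched with plain sklearn; permutation importance is used below as a model-agnostic stand-in for SHAP (my own substitution, probatus uses actual SHAP values), and the two-feature dataset is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
n = 1_000
signal = rng.randint(0, 2, size=n)      # weak binary signal feature
noise = rng.normal(size=n)              # high-cardinality pure noise
# labels follow the signal feature, with 35% of them flipped
y = (signal ^ (rng.rand(n) < 0.35)).astype(int)
X = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)

# impurity importance tends to favour the high-cardinality noise column
print("impurity:   ", clf.feature_importances_)
# permutation importance on held-out data recovers the true signal feature
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean)
```

A model-agnostic, held-out-data proxy is what lets the elimination keep the genuinely predictive feature while discarding the noise.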

In my view, the observation shows that if there is even marginal predictive power in the features, ShapRFECV manages to extract it more efficiently than RFECV. It is not optimistic, because in both cases we use 20-fold cross-validation, which shows that the validation AUC is in fact higher than 50%.


Matgrb commented on May 22, 2024

This experiment explains how impurity metrics are fooled by random variables. The random variables can have very high importance even when there is no predictive power in them at all. That is why RFECV does not get the boost in AUC that we observe with ShapRFECV.


anilkumarpanda commented on May 22, 2024

Yes, the random variables generated do have very low correlation, and SHAP is able to pick that up, which may explain the slight boost in performance. Thanks for investigating this. I will close the issue.

