Comments (8)
I just watched a presentation by @nanne-aben on covariate shift that details a different approach:
- train a resemblance model (he calls it an adversarial model) between train and test
- determine the sample weight w_i as p(non-train | X_i) / p(train | X_i) for each train instance
- train your actual model using the sample weights

A benefit of that approach is that you do not have to subsample your training data, so you lose no information.
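The weighting step above can be sketched in a few lines. This is a minimal illustration, not probatus API; the dataset, the variable names, and the use of RandomForestClassifier as the resemblance model are all assumptions for the sake of the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy train/test sets with a deliberate shift between them (illustrative only)
X_train, _ = make_classification(n_samples=500, n_features=5, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=5, shift=0.5, random_state=1)

# Label train rows 0 and test rows 1, then fit the resemblance model
X_all = np.vstack([X_train, X_test])
y_all = np.hstack([np.zeros(len(X_train)), np.ones(len(X_test))])
clf = RandomForestClassifier(random_state=0).fit(X_all, y_all)

# w_i = p(non-train | X_i) / p(train | X_i) for each train instance;
# column 0 is P(train), column 1 is P(non-train), clipped to avoid /0
proba = clf.predict_proba(X_train)
weights = proba[:, 1] / np.clip(proba[:, 0], 1e-6, None)
```

Note that predicting on the same rows the resemblance model was fit on is optimistic; the cross-validation point raised below addresses exactly that.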
probatus already offers SHAPImportanceResemblance for 1). For 2), I think a helper method might actually be really useful. For 3), passing sample weights is straightforward enough already :)
Definition of done would be the helper method for sample weights + a tutorial on "dealing with covariate shift" in the probatus docs.
Thoughts?
This is definitely something that would be nice to have. A couple of thoughts:
- How do we see this feature being further used? In some way we would use quite a lot of information from the test set, even if we don't use the labels. Wouldn't this cause a bias when we measure the OOT Test score?
- Implementing this would require some work on how we handle data. Currently we do a train/test split within the resemblance model (there, train and test are created from the combined X1 and X2; unfortunately, in your example these are also called train and test). In order to calculate the sample weights, we would need to compute the predictions on all samples of X1, which would require the use of cross-validation.
That is why making this work would either require building a completely separate feature, similar to SHAPImportanceResemblance but implementing the CV correctly, or reworking the entire sample_similarity module to use CV instead of a train/test split. I would vote for the first option, because in the resemblance model you don't really need CV: it is a simple test, and it is not about squeezing the most out of the model's performance.
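Getting predictions on all samples of X1 without a held-out split can be done with out-of-fold predictions. A sketch under assumed names (X1, X2, and the choice of classifier are illustrative, not existing probatus code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for X1 (train-period data) and X2 (test-period data)
X1, _ = make_classification(n_samples=400, n_features=5, random_state=0)
X2, _ = make_classification(n_samples=400, n_features=5, shift=0.3, random_state=1)

X = np.vstack([X1, X2])
y = np.hstack([np.zeros(len(X1)), np.ones(len(X2))])  # 0 = X1, 1 = X2

# cross_val_predict gives every row a prediction from a fold it was not trained on,
# so all samples of X1 get an unbiased resemblance probability
oof_proba = cross_val_predict(
    RandomForestClassifier(random_state=0), X, y, cv=5, method="predict_proba"
)
p_x2 = oof_proba[: len(X1), 1]  # P(row looks like X2) for every X1 sample
weights = p_x2 / np.clip(1 - p_x2, 1e-6, None)
```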
I like it, especially the second option that you have presented with the use of CV; it is more data efficient. Another tweak that can be done there is using a model with class_weight='balanced'.
Could you share the experiments? I am interested in how this works in practice.
Regarding the bias, this is tricky. Imagine having an OOT Test set which covers the entire Covid-19 pandemic. In that period the dataset changed dramatically compared to the pre-pandemic Train set. If you use the data distribution during the pandemic to make training on the pre-pandemic dataset better suited, this will cause a strong leakage of information from test to train. The model will definitely be better suited for the future, assuming that the situation doesn't change much post-pandemic, but the estimated performance is less realistic, because in this case the model "knew" about the upcoming data shift, even though in production it would not. This is of course an extreme example, but I wanted to illustrate where this could go wrong. In the end it is the user's choice whether this bias is an issue for a given problem.
A couple of usage scenarios I can think of that would decrease the possible impact of such bias:
- Set the last month of Train data as the validation set. In this case, the older Train data can be weighted to better represent the most recent period, and no bias is introduced by using information from the test set.
- Split the Test set into two parts and use one part for adversarial validation. Then the performance on the first and second parts of the test set can be compared to indicate whether any bias was introduced (in case the performance between Test1 and Test2 differs).
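The second scenario is just a split of the OOT test set; a minimal sketch (all names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for the out-of-time test set
X_test, y_test = make_classification(n_samples=600, n_features=5, random_state=0)

# Half the OOT data feeds adversarial validation, the other half stays untouched
X_test1, X_test2, y_test1, y_test2 = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0
)
# X_test1 -> adversarial validation / sample-weight fitting
# X_test2 -> held out; comparing performance on the two halves flags possible bias
```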
Interesting discussion.
Framed slightly differently, you could use adversarial/resemblance modelling to calibrate your model as a last (retraining) step, in order to improve performance in production where there is a (known) distribution shift like Covid-19.
To do that without leakage, you need to get X_train_adversarial by splitting your out-of-time test set into two, or by taking previously unused out-of-time data for which you don't have labels yet. Then you train a resemblance model on X_train_adversarial, use that model to set instance weights for your original model, and retrain it one more time using those. You can then measure the performance difference between your original model and your calibrated model using the same out-of-time test dataset.
Back to probatus. I think there is an opportunity to build some tooling & documentation for this in a new probatus.calibration module. Some pseudo code:
```python
# We have X_train, y_train, X_test, y_test, X_adversarial

# Normal model
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from lightgbm import LGBMClassifier

model = GradientBoostingClassifier().fit(X_train, y_train)

# Resemblance model
from probatus.sample_similarity import SHAPImportanceResemblance

clf = RandomForestClassifier()
rm = SHAPImportanceResemblance(clf)
shap_resemblance_model = rm.fit_compute(X_train, X_adversarial)

# Model calibration
resemblance_model = shap_resemblance_model.model  # new method
probs = resemblance_model.predict_proba(X_train)[:, 1]  # P(adversarial | x)
weights = calculate_weight(probs)  # new function
calibrated_model = LGBMClassifier().fit(X_train, y_train, sample_weight=weights)

# Compare performance
# get AUC from model's predictions on X_test vs y_test
# get AUC from calibrated_model's predictions on X_test vs y_test
```
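Since calculate_weight is flagged above as a new function, here is one way it could look. This is a hypothetical sketch, not existing probatus code, assuming the resemblance model outputs P(adversarial | x):

```python
import numpy as np

def calculate_weight(probs, clip=1e-6):
    """Hypothetical helper (not in probatus): map resemblance-model
    probabilities p = P(adversarial | x) to sample weights w = p / (1 - p),
    clipped away from 0 and 1 to avoid division by zero."""
    p = np.clip(np.asarray(probs, dtype=float), clip, 1 - clip)
    return p / (1 - p)

# A sample that looks equally train-like and adversarial gets weight 1;
# the more adversarial it looks, the more it is upweighted
weights = calculate_weight([0.5, 0.8, 0.2])
```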
The new parts are in the model calibration section. I think we can simplify that process a bit more, maybe something like:

```python
ac = probatus.calibration.AdversarialCalibrator()
# returns a pd.DataFrame comparing the calibrated model with the non-calibrated model
ac.fit_compute(model, resemblance_model, X_train, y_train, X_train, X_test, X_adversarial)
```
Thoughts?