Comments (8)

timvink avatar timvink commented on May 22, 2024

I just watched a presentation by @nanne-aben on covariate shift that details a different approach:

  1. train a resemblance model (he calls it an adversarial model) between train and test
  2. determine the sample weight w as p(non-train | Xi) / p(train | Xi) for each train instance
  3. train your actual model using the sample weights

The benefit of that approach is that you do not have to subsample your training data, so you do not lose any information.
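The three steps above can be sketched roughly like this (a minimal sketch with synthetic data and illustrative names, not probatus API; note that scoring the resemblance model on its own training rows leaks information, which is discussed further down the thread):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(300, 3))          # training population
y_train = (X_train[:, 0] > 0).astype(int)              # actual target
X_test = rng.normal(0.5, 1.0, size=(300, 3))           # covariate-shifted population

# 1) resemblance (adversarial) model: label 0 = train, 1 = test
X = np.vstack([X_train, X_test])
is_test = np.hstack([np.zeros(len(X_train)), np.ones(len(X_test))])
resemblance = RandomForestClassifier(random_state=0).fit(X, is_test)

# 2) w_i = p(test | x_i) / p(train | x_i) for each training instance,
#    with probabilities clipped to avoid zero or infinite weights
p_test = np.clip(resemblance.predict_proba(X_train)[:, 1], 0.01, 0.99)
w = p_test / (1.0 - p_test)

# 3) train the actual model using the sample weights
model = LogisticRegression().fit(X_train, y_train, sample_weight=w)
```

Training instances that look like the test population get weights above 1, instances that look unlike it get weights below 1, so no rows need to be dropped.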

probatus already offers SHAPImportanceResemblance for 1). For 2), I think a helper method might actually be really useful. For 3), passing sample weights is straightforward enough already :)

Definition of done would be the helper method for sample weights + a tutorial on "dealing with covariate shift" in the probatus docs.

Thoughts?

Matgrb avatar Matgrb commented on May 22, 2024

This is definitely something that would be nice to have. A couple of thoughts:

  1. How do we see this feature being used further? In a way, we would use quite a lot of information from the test set, even if we don't use its labels. Wouldn't this bias the OOT Test score we measure?

  2. Implementing this would require some work on how we handle data.

Right now we do a train/test split within the resemblance model (there, train and test are created from the combined X1 and X2; unfortunately, in your example these are also called train and test). In order to calculate the sample weights, we would need predictions on all samples of X1, which would require cross-validation.

That is why building this would either require a completely separate feature, similar to SHAPImportanceResemblance but implementing the CV correctly, or a rework of the entire sample_similarity module to use CV instead of a train/test split. I would vote for the first option, because in the resemblance model you don't really need CV: it is a simple test, and it is not about squeezing the most performance out of the model.
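The CV requirement described above can be sketched with sklearn's `cross_val_predict`, which produces an out-of-fold probability for every row of the combined data, so no sample of X1 is scored by a model that saw it during fitting (illustrative sketch with synthetic data, not existing probatus code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, size=(200, 4))   # first sample, e.g. train
X2 = rng.normal(0.3, 1.0, size=(200, 4))   # second sample, e.g. test

X = np.vstack([X1, X2])
y = np.hstack([np.zeros(len(X1)), np.ones(len(X2))])

# 5-fold CV: each row is predicted by a model fitted on the other folds
proba = cross_val_predict(
    RandomForestClassifier(random_state=1), X, y, cv=5, method="predict_proba"
)

# unbiased p(X2-like | x) for every sample of X1, usable for sample weights
p_test_for_X1 = proba[: len(X1), 1]
```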

Matgrb avatar Matgrb commented on May 22, 2024

I like it, especially the second option you presented, with the use of CV; it is more data efficient. Another tweak that could be applied there is using a model with class_weight='balanced'.

Could you share the experiments? I am interested in how this works in practice.

Regarding the bias, this is tricky. Imagine having an OOT Test set that covers the entire Covid-19 pandemic. In that period, the dataset changed dramatically compared to the pre-pandemic Train set. If you use the data distribution during the pandemic to make the pre-pandemic training set better suited, you cause a strong leakage of information from test to train. The model will certainly be better suited for the future, assuming the situation doesn't change much post-pandemic, but the estimated performance is less realistic, because in this case the model "knew" about the upcoming data shift, even though in production it would not. This is of course an extreme example, but I wanted to illustrate where this could go wrong. In the end it is the user's choice whether this bias is an issue for a given problem.

Couple of use scenarios I can think of that would decrease possible impact of such bias:

  • Set the last month of Train data aside as a validation set. In this case, the older Train data can be weighted to better represent recent times, and no bias is introduced by using information from the test set.
  • Split the Test set into two parts and use one part for adversarial validation. The performance on the first and second parts can then be compared to indicate whether any bias was introduced (if the performance between Test1 and Test2 differs).
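The second scenario can be sketched as follows (illustrative names and synthetic data; the resemblance/weighting step itself is omitted):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_test = rng.normal(size=(400, 5))          # out-of-time test set
y_test = rng.integers(0, 2, size=400)

# Test1 feeds the adversarial validation / reweighting step;
# Test2 is never touched during calibration.
X_test1, X_test2, y_test1, y_test2 = train_test_split(
    X_test, y_test, test_size=0.5, random_state=2
)

# After retraining with weights derived from Test1, a large gap between
# score(Test1) and score(Test2) would flag leakage-induced bias.
```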

timvink avatar timvink commented on May 22, 2024

Interesting discussion.

Framed slightly differently, you could use adversarial/resemblance modelling to calibrate your model as a last (retraining) step, in order to improve performance in production where there is a (known) distribution shift like covid-19.

To do that without leakage, you need to get X_train_adversarial by splitting your out-of-time test set into two, or by taking previously unused out-of-time data for which you don't have labels yet. Then you train a resemblance model on X_train_adversarial, use that model to set instance weights for your original model, and retrain it one more time using those weights. You can then measure the performance difference between your original model and your calibrated model on the same out-of-time test dataset.

Back to probatus. I think there is an opportunity to build some tooling & documentation for this in a new probatus.calibration module. Some pseudo code:

# We have X_train, y_train, X_test, y_test, X_adversarial

# Normal model
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from lightgbm import LGBMClassifier
model = GradientBoostingClassifier().fit(X_train, y_train)

# Resemblance model
from probatus.sample_similarity import SHAPImportanceResemblance
clf = RandomForestClassifier()
rm = SHAPImportanceResemblance(clf)
shap_resemblance_model = rm.fit_compute(X_train, X_adversarial)

# Model calibration
resemblance_model = shap_resemblance_model.model  # new method
probs = resemblance_model.predict_proba(X_train)[:, 1]
weights = calculate_weight(probs)  # new function
calibrated_model = LGBMClassifier().fit(X_train, y_train, sample_weight=weights)

# Compare performance
# get AUC from model.predict_proba(X_test) vs y_test
# get AUC from calibrated_model.predict_proba(X_test) vs y_test

The new parts are in the model calibration section. I think we can simplify that process a bit more, maybe something like:

ac = probatus.calibration.AdversarialCalibrator()
ac.fit_compute(model, resemblance_model, X_train, y_train, X_test, y_test, X_adversarial) # returns pd.DataFrame comparing calibrated model with non-calibrated model
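For the `calculate_weight` helper mentioned in the pseudocode, one possible implementation (purely a sketch; the name and signature are hypothetical, not existing probatus API) is the standard density-ratio form p/(1-p), with clipping to keep weights finite:

```python
import numpy as np

def calculate_weight(p_test, clip=0.99):
    """Turn resemblance-model probabilities P(test | x) into sample
    weights p(test | x) / p(train | x), clipped to avoid 0 or inf."""
    p = np.clip(np.asarray(p_test, dtype=float), 1.0 - clip, clip)
    return p / (1.0 - p)

weights = calculate_weight([0.5, 0.25, 0.75])
```

A probability of 0.5 maps to weight 1 (the instance looks equally train-like and test-like), while higher probabilities up-weight instances that resemble the adversarial set.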

Thoughts?
