
Comments (5)

iancovert commented on May 18, 2024

Re: the subject of removing features properly, this has been a big point of contention in the literature! We have a paper that reviews the many options people have tried (Section 4) and provides some analysis to justify sampling features from their conditional distribution (Section 8). I personally think this is a reasonable approach in many scenarios, and I agree that it seems wise to group highly correlated features.

There are many other papers supporting this conditional sampling approach (here, here, and here), and the main other approach folks argue for is sampling from the marginal distribution (here). Masking with a specific value (e.g., zero or the mean) seems a little less compelling to me, except in the case where features are independent and the model is linear (see eq. 12 in the original SHAP paper), or when the model is trained with random masking so that the masking values represent the absence of information (which we discuss in Section 8.3 of our paper and is surely discussed in some others).

Anyway, there's a lot of nuance in these discussions about which approach is theoretically best. But only a couple approaches are easy to implement in practice.
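To make the trade-off concrete, here is a minimal sketch (hypothetical code, not the sage API) of the three removal strategies mentioned above applied to a toy linear model: fixed-value masking, marginal sampling, and a crude nearest-neighbor stand-in for conditional sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))                          # background data
X_train[:, 1] = X_train[:, 0] + 0.1 * rng.normal(size=1000)   # correlated pair

def f(x):
    # Toy model standing in for any trained predictor.
    return x @ np.array([1.0, 2.0, -1.0])

x = np.array([1.5, 1.4, 0.3])
held_out = [1]                                   # feature(s) being removed
kept = [i for i in range(3) if i not in held_out]

# 1) Fixed-value masking: replace removed features with a constant (the mean).
x_mask = x.copy()
x_mask[held_out] = X_train[:, held_out].mean(axis=0)
pred_mask = f(x_mask)

# 2) Marginal sampling: draw removed features from the training set, ignoring
#    their dependence on the retained features (can go off-manifold).
idx = rng.integers(0, len(X_train), size=256)
x_marg = np.tile(x, (256, 1))
x_marg[:, held_out] = X_train[idx][:, held_out]
pred_marginal = f(x_marg).mean()

# 3) Conditional sampling, approximated here by nearest neighbors: draw
#    removed features only from rows whose retained features resemble x.
dists = np.linalg.norm(X_train[:, kept] - x[kept], axis=1)
nn = np.argsort(dists)[:50]
x_cond = np.tile(x, (50, 1))
x_cond[:, held_out] = X_train[nn][:, held_out]
pred_conditional = f(x_cond).mean()

print(pred_mask, pred_marginal, pred_conditional)
```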


KarelZe commented on May 18, 2024

Thank you for all the insights. The paper I was referring to is https://arxiv.org/pdf/2011.14878.pdf. I found Figure 2 particularly helpful for my work. 😉

I think we can close this issue. Hope it's helpful to others, too.


KarelZe commented on May 18, 2024

@iancovert Did you have the chance to look into this yet?


iancovert commented on May 18, 2024

So sorry again for the delayed response. And thanks for making the colab notebook to show the behavior you're talking about, that was very helpful.

In your notebook with the airbnb dataset, it seems like the differences we see aren't overly concerning? For example, the "room_type" score is very close relative to the scale. The "location (grouped)" score has a bit more of a discrepancy, but since this is the one with grouping that's where I would expect more difference. So is it mainly the cases with your private datasets that are more concerning?

In general, calculating SAGE values (or SHAP values for that matter) with grouping doesn't guarantee 1) that groups have scores equal to the sum of their constituents, or 2) that singletons have the same score regardless of grouping in other features. If that were the case, I think grouping would mainly be a trick to reduce compute (note that it does this by reducing the number of orderings to consider and total values to estimate). Instead, it actually redefines the importance scores in a subtle way, which I think can be better or more meaningful.

The reformulation is the following: in the no-grouping formulation, we have a set of features, and the SAGE value is calculated by averaging a feature's marginal contributions over all possible orderings (eq. 6 in the paper). That means that in the Airbnb example, we consider how much "room_type" improves the model when added to every subset of features, including subsets that contain partial groups (like "latitude" but not "longitude"). In the grouped version, grouped features like "latitude" and "longitude" are either all in or all out, so this affects the marginal contributions that we account for, even for a singleton like "room_type." So we shouldn't expect even singleton features to stay exactly the same, although I would agree that any huge differences would be surprising.
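To make this concrete, here is a toy illustration (hypothetical code, not the sage implementation) where a made-up value function v(S) stands in for model performance with only the features in S retained. Because room_type interacts with partial location information, its average marginal contribution changes once latitude and longitude are forced to appear together:

```python
from itertools import permutations

def shapley(players, v):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        included = set()
        for p in order:
            totals[p] += v(included | {p}) - v(included)
            included = included | {p}
    return {p: totals[p] / len(orderings) for p in players}

def v_features(S):
    # Made-up performance function over the raw features.
    full_loc = 'latitude' in S and 'longitude' in S
    partial_loc = ('latitude' in S) != ('longitude' in S)
    score = 2.0 if full_loc else (0.5 if partial_loc else 0.0)
    if 'room_type' in S:
        # room_type interacts with partial location info, so its contribution
        # depends on which subsets it gets added to.
        score += 1.0 + (0.5 if partial_loc else 0.0)
    return score

# No grouping: room_type is also added to subsets with partial location info.
print(shapley(['room_type', 'latitude', 'longitude'], v_features))

# Grouping: latitude/longitude are all in or all out, so room_type is only
# ever added alongside "no location" or "full location", and its score changes.
def v_grouped(S):
    expanded = set()
    for p in S:
        expanded |= {'latitude', 'longitude'} if p == 'location' else {p}
    return v_features(expanded)

print(shapley(['room_type', 'location'], v_grouped))
```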

There's also another interesting thing that happens when you group features, which has to do with off-manifold sampling when we hold out different feature subsets. You may or may not know this, but there's a big debate in the feature importance literature about the correct way to handle held-out features (I could go on at length about this), which is the backbone behind many popular feature importance methods (SHAP, SAGE, LIME, permutation tests, etc). The approach we adopted in this repo is to sample values for the held-out features from the training set. This can lead to some weird scenarios where if you have highly correlated features that aren't in the same group, and you're holding out some but not all of them, you'll retain the observed values for some but replace the others with potentially impossible values sampled from the dataset. (E.g., you could get an implausible combination of latitude/longitude/neighborhood.) This issue is partially resolved by grouping, in that we can guarantee highly correlated features in the same groups will maintain plausible combinations of values.
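Here is a toy illustration of that off-manifold issue (again hypothetical code, not taken from the repo): with two geographic clusters of listings, replacing only longitude with a value sampled from the training set frequently pairs a downtown latitude with a suburban longitude, whereas sampling the grouped pair from a single training row always yields a real location.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two clusters of listings: "downtown" and "suburb" (toy coordinates).
downtown = rng.normal([40.75, -73.99], 0.01, size=(500, 2))
suburb = rng.normal([40.60, -73.75], 0.01, size=(500, 2))
X_train = np.vstack([downtown, suburb])        # columns: latitude, longitude

x = np.array([40.75, -73.99])                  # a downtown listing

# Hold out longitude only: keep the observed latitude, sample longitude from
# the training set. Roughly half the samples pair a downtown latitude with a
# suburban longitude, a combination that never occurs in the data.
sampled_lon = X_train[rng.integers(0, len(X_train), 256), 1]
off_manifold = np.column_stack([np.full(256, x[0]), sampled_lon])

# Hold out the {latitude, longitude} group jointly: both values come from the
# same training row, so every imputed pair is a real, plausible location.
rows = rng.integers(0, len(X_train), 256)
on_manifold = X_train[rows]

print("fraction of implausible lat/lon pairs:",
      np.mean(np.abs(off_manifold[:, 1] - (-73.75)) < 0.05))
```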

Finally, for one of your other questions about the imputers: if the groups consist entirely of singletons, we should get the exact same result as if there were no grouping. If that doesn't happen, that's a bug in the imputer and something I need to fix.
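For reference, that sanity check could look roughly like the following, based on the usage pattern in the sage README (argument names may differ across versions, and the permutation estimator is stochastic, so the two runs should agree only up to Monte Carlo error):

```python
import numpy as np
import sage
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data and model, just for the comparison.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=1000)
model = LinearRegression().fit(X, Y)
background = X[:512]

# No grouping.
imputer = sage.MarginalImputer(model, background)
estimator = sage.PermutationEstimator(imputer, 'mse')
values_plain = estimator(X, Y)

# Grouping where every group is a singleton: should match the run above
# (up to sampling noise) if the grouped imputer behaves as intended.
groups = [[i] for i in range(X.shape[1])]
grouped_imputer = sage.GroupedMarginalImputer(model, background, groups)
grouped_estimator = sage.PermutationEstimator(grouped_imputer, 'mse')
values_singletons = grouped_estimator(X, Y)

print(values_plain)
print(values_singletons)
```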

Apologies for the long response, but let me know what you think.


KarelZe commented on May 18, 2024

@iancovert thanks for your detailed answer.

In your notebook with the airbnb dataset, it seems like the differences we see aren't overly concerning? For example, the "room_type" score is very close relative to the scale. The "location (grouped)" score has a bit more of a discrepancy, but since this is the one with grouping that's where I would expect more difference. So is it mainly the cases with your private datasets that are more concerning?

Yes, effects were much stronger in the private repo. Unfortunately, I'm not allowed to share data or screenshots.

In general, calculating SAGE values (or SHAP values for that matter) with grouping doesn't guarantee 1) that groups have scores equal to the sum of their constituents, or 2) that singletons have the same score regardless of grouping in other features. If that were the case, I think grouping would mainly be a trick to reduce compute (note that it does this by reducing the number of orderings to consider and total values to estimate). Instead, it actually redefines the importance scores in a subtle way, which I think can be better or more meaningful.

Yes, I had assumed that features would contribute the same scores within groups as they do as singletons. Sorry for my misunderstanding.

The reformulation is the following: in the no-grouping formulation, we have a set of features, and the SAGE value is calculated by averaging a feature's marginal contributions over all possible orderings (eq. 6 in the paper). That means that in the Airbnb example, we consider how much "room_type" improves the model when added to every subset of features, including subsets that contain partial groups (like "latitude" but not "longitude"). In the grouped version, grouped features like "latitude" and "longitude" are either all in or all out, so this affects the marginal contributions that we account for, even for a singleton like "room_type." So we shouldn't expect even singleton features to stay exactly the same, although I would agree that any huge differences would be surprising.

Thanks for your striking example. Now it's clear to me.

There's also another interesting thing that happens when you group features, which has to do with off-manifold sampling when we hold out different feature subsets. You may or may not know this, but there's a big debate in the feature importance literature about the correct way to handle held-out features (I could go on at length about this), which is the backbone behind many popular feature importance methods (SHAP, SAGE, LIME, permutation tests, etc). The approach we adopted in this repo is to sample values for the held-out features from the training set. This can lead to some weird scenarios where if you have highly correlated features that aren't in the same group, and you're holding out some but not all of them, you'll retain the observed values for some but replace the others with potentially impossible values sampled from the dataset. (E.g., you could get an implausible combination of latitude/longitude/neighborhood.) This issue is partially resolved by grouping, in that we can guarantee highly correlated features in the same groups will maintain plausible combinations of values.

Thanks for bringing up this interesting point. I've only very recently learned about the issue (e.g., zero-out, marginal distribution, etc.) from a paper (https://jmlr.csail.mit.edu/papers/volume22/20-1316/20-1316.pdf) comparing different feature importance measures based on the removal principle. My conclusion from your comment would be that it's wise to combine highly correlated features into groups to sample more realistic combinations.

Finally, for one of your other questions about the imputers: if the groups consist entirely of singletons, we should get the exact same result as if there were no grouping. If that doesn't happen, that's a bug in the imputer and something I need to fix.

I didn't test this scenario (only the other way around).

Apologies for the long response, but let me know what you think.

Thanks again; your comments are very insightful!

