
Comments (10)

aazuspan commented on June 29, 2024

I had thought of this design as well, but wondered if the optional X would raise any issues when doing estimator checks

Good thought! Nothing specific comes to mind about why it would cause issues, but it's possible. I would lean towards using the design that makes the most sense, and dealing with incompatible estimator checks as needed, even if we have to xfail them. I'm starting to wonder how much value the estimator checks are actually adding, given the complexity of working around incompatible checks, but that's a discussion for #44...

I think that scared me off, but if folks are used to passing an X then maybe just an explanation in the docstring might be sufficient?

Yeah, I would think so. Another option I just thought of while trying to come up with similar use cases in sklearn is something more like random forest OOB error. In that case, the score and predictions are set as fit attributes oob_score_ and oob_prediction_ (assuming you instantiate the estimator with oob_score=True). What do you think about using attributes to store the independent accuracy score and predictions, rather than modifying the methods? I'm not sure what my preference is between the two options...
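Roughly, I'm picturing something like this (totally untested; the attribute names and class names here are just placeholders):

from sklearn.neighbors import KNeighborsRegressor


class IndependentAttributeMixin:
    """Sketch: store independent predictions/score as fit attributes."""

    def fit(self, X, y):
        super().fit(X, y)
        # kneighbors(X=None) excludes each training sample from its own
        # neighborhood, so predicting/scoring with X=None is "independent".
        self.independent_prediction_ = super().predict(None)
        self.independent_score_ = super().score(None, y)
        return self


class IndependentKNNRegressor(IndependentAttributeMixin, KNeighborsRegressor):
    pass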

In the case where we have two samples that are identical (and thus identical X vectors) but only one was used in fitting, then the sample that was not used in fitting does not need an independent neighbor and would use the identical sample as its first nearest neighbor (the "identical-twin" case).

Good point! I had a vague thought that we could exclude zero-distance neighbors, but of course there is a possibility for legitimately identical samples, especially with a small number of covariates.

But I think we discussed that this is really outside the scope of the initial sknnr package because we aren't even yet suggesting that these estimators are used in a spatial context. In a wrapping package that deals with accuracy assessment at the pixel scale in plot footprints, my current thought is that we would:

Yeah, I thought that's where we'd landed. Your outline for how to solve this in the wrapper package sounds like a good approach!

Note that for map prediction, pixels are allowed to use their own samples/plots as neighbors, so the above just applies to independent AA. Have I totally confused things?

There are definitely a lot of little corner cases and gotchas to watch out for, but I think you've laid them out as clearly as possible! I appreciate how you spaced them out rather than laying them all on me at the outset of the project, which would have been pretty overwhelming!


aazuspan commented on June 29, 2024

@grovduck thanks for the detailed write-up! It took me a while to wrap my head around this, but I think I'm up to speed now.

This is a bit hacky since it's obviously not the intended behavior, but can we accomplish these independent methods by calling the standard predict and score methods with X=None? If you think this is a common use case, that might be too weird of an API to expect users to use, but we could potentially simplify the predict_independent and score_independent methods to use the signature you suggested and simply call their corresponding methods with X=None, e.g.

def predict_independent(self):
    return self.predict(X=None)

EDIT: Alternatively, if we wanted predict and score to act more like kneighbors and just default to running independently unless an X is provided (rather than having separate independent methods), we could do:

def predict(self, X=None):
    return super().predict(X)
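Usage would then mirror kneighbors, where omitting X means "use the training data, excluding each sample's own neighborhood" (purely illustrative; the estimator and variable names are made up):

est = MySknnrRegressor(n_neighbors=5).fit(X_train, y_train)  # hypothetical estimator

y_independent = est.predict()   # independent predictions for the training samples
y_new = est.predict(X_new)      # standard predictions for new data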

We will want to retain predict and score as-is for when we send new X data to an already fit model.

So if the user chose to score a sample that was in the training set (e.g. a pixel that had a corresponding plot, leading to an identical X matrix), we're not currently planning to provide a "second-nearest neighbor" in that case, right? I know we've discussed that and ran into a lot of complexity around excluding neighborhoods of plot pixels, but just wanted to make sure I'm on the same page.


grovduck commented on June 29, 2024

This is a bit hacky since it's obviously not the intended behavior, but can we accomplish these independent methods by calling the standard predict and score methods with X=None? If you think this is a common use case, that might be too weird of an API to expect users to use, but we could potentially simplify the predict_independent and score_independent methods to use the signature you suggested and simply call their corresponding methods with X=None

Yes, good thinking! I had thought of this design as well, but wondered if the optional X would raise any issues when doing estimator checks. I just did a quick search of scikit-learn and only see one estimator where they use this pattern (e.g. def predict(self, X=None)) and it looks to be in an outlier detection estimator. I think that scared me off, but if folks are used to passing an X then maybe just an explanation in the docstring might be sufficient?

EDIT: Alternatively, if we wanted predict and score to act more like kneighbors and just default to running independently unless an X is provided (rather than having separate independent methods), we could do:

def predict(self, X=None):
    return super().predict(X)

I think I like this pattern better for sure.

So if the user chose to score a sample that was in the training set (e.g. a pixel that had a corresponding plot, leading to an identical X matrix), we're not currently planning to provide a "second-nearest neighbor" in that case, right? I know we've discussed that and ran into a lot of complexity around excluding neighborhoods of plot pixels, but just wanted to make sure I'm on the same page.

I think there are two issues here:

  1. In the case where we have two samples that are identical (and thus identical X vectors) but only one was used in fitting, then the sample that was not used in fitting does not need an independent neighbor and would use the identical sample as its first nearest neighbor (the "identical-twin" case).
  2. In the case where a pixel is part of the "footprint" of a sample that was used in fitting, then it is no longer independent and needs to pick a different sample. (Note that there is the added complexity that a pixel in a footprint often has a different X than the average signature of the footprint itself.) But I think we discussed that this is really outside the scope of the initial sknnr package because we aren't even yet suggesting that these estimators are used in a spatial context. In a wrapping package that deals with accuracy assessment at the pixel scale in plot footprints, my current thought is that we would:
    1. Associate pixels with their plot ID (or sample index from fit)
    2. Run kneighbors(X) with all pixels and oversample k
    3. Filter the list to exclude any neighbors matching the plot ID
    4. Cut down to the desired k

Note that for map prediction, pixels are allowed to use their own samples/plots as neighbors, so the above just applies to independent AA. Have I totally confused things?
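To make steps 2-4 above concrete, here's the rough shape I have in mind for the wrapper (completely untested; the function and argument names are just placeholders):

import numpy as np


def independent_kneighbors(estimator, X_pixels, pixel_plot_ids, fit_plot_ids, k, oversample=5):
    """Return k neighbors per pixel, excluding neighbors from the pixel's own plot."""
    # 2. Run kneighbors with an oversampled k so we can afford to drop some neighbors
    dist, idx = estimator.kneighbors(X_pixels, n_neighbors=k + oversample)

    # 3. Flag neighbors whose plot ID matches the pixel's own plot ID
    #    (step 1 associates each pixel with its plot ID / sample index from fit)
    neighbor_plots = np.asarray(fit_plot_ids)[idx]
    keep = neighbor_plots != np.asarray(pixel_plot_ids)[:, None]

    # 4. Cut each pixel's remaining neighbors down to the desired k
    out_dist = np.full((len(X_pixels), k), np.nan)
    out_idx = np.full((len(X_pixels), k), -1)
    for i in range(len(X_pixels)):
        kept = np.flatnonzero(keep[i])[:k]
        out_dist[i, : len(kept)] = dist[i, kept]
        out_idx[i, : len(kept)] = idx[i, kept]
    # NaN / -1 entries mean a pixel ran out of candidates (increase oversample)
    return out_dist, out_idx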


grovduck commented on June 29, 2024

Another option I just thought of while trying to come up with similar use cases in sklearn is something more like random forest OOB error. In that case, the score and predictions are set as fit attributes oob_score_ and oob_prediction_ (assuming you instantiate the estimator with oob_score=True). What do you think about using attributes to store the independent accuracy score and predictions, rather than modifying the methods? I'm not sure what my preference is between the two options...

How interesting! I hadn't even considered this approach. It really feels like these developers found themselves in a similar situation as we are now and had to come up with a solution 😄. I think I lean toward the predict(self, X=None) call a bit more, but can be convinced otherwise. For now, I'll plan to go with this approach and see what issues it brings up.

There are definitely a lot of little corner cases and gotchas to watch out for ...

It feels like my life is full of gotchas!


aazuspan commented on June 29, 2024

I think I lean toward the predict(self, X=None) call a bit more, but can be convinced otherwise. For now, I'll plan to go with this approach and see what issues it brings up.

Sounds good to me!


grovduck commented on June 29, 2024

@aazuspan, I've made some headway with this issue, mostly by creating data from yaImpute to support the tests, so I can verify that this code is working. But I'm currently blocked on a couple of issues that I could use your help with.

First, I was leaning toward the implementation like you suggested:

def predict(self, X=None):
    return super().predict(X)

but recognizing that I need to do the same with score, I come up against providing a keyword argument X before the required argument y, i.e.

def score(self, X=None, y):  # invalid: a parameter with a default (X) can't precede a required parameter (y)
    return super().score(X, y)

I can flip the arguments, but that causes many checks to fail. I'm not sure there is a good way to do this without violating the familiar API? For now, I went back to naming these predict_independent and score_independent and implementing them as:

from numpy.typing import NDArray


class IndependentPredictionMixin:
    """
    Mixin for KNeighborsRegressor-derived classes that return predictions
    that don't include the sample itself in the nearest neighbors.
    """

    def predict_independent(self) -> NDArray:
        return super().predict(X=None)

    def score_independent(self, y) -> float:
        return super().score(X=None, y=y)
We can also revisit storing these as fit attributes like your other suggestion.

This design, however, brings up an additional issue. I wanted to add this as a mix-in, but mypy is complaining that these two methods are undefined in superclass.

src\sknnr\_base.py:95: error: "predict" undefined in superclass  [misc]
src\sknnr\_base.py:98: error: "score" undefined in superclass  [misc]

This actually makes sense to me as IndependentPredictionMixin isn't a subclass of KNeighborsRegressor, although all our estimators are. But what doesn't make sense to me is why mypy isn't complaining about similar use of super() in KNeighborsDFIndexCrosswalkMixin and TransformedKNeighborsMixin which also aren't subclasses of KNeighborsRegressor. I feel like I'm missing something obvious, but I'm not seeing it.

EDIT: Ah, finally realized it was because I'm giving return type information to these methods, which is causing mypy to actually check them. I'm guessing we'll run into this issue down the road with #7. For now, I'm going to leave off the typing information and get a draft PR up so you can see it.


aazuspan commented on June 29, 2024

I come up against providing a keyword argument X before the required argument y ... I can flip the arguments, but that causes many checks to fail. I'm not sure there is a good way to do this without violating the familiar API?

Yeah, I didn't think about that! I think you're right that there's no good workaround here that would allow us to override score. Reversing the order of X and y would be very weird, even if we could get the tests to pass. We could make y "optional" and raise an error if it's not provided, but that also seems too odd.

We can also revisit storing these as fit attributes like your other suggestion.

I have a slight preference for this just because there's a precedent for it with the OOB attributes, but if you think the independent methods are clearer or would have some practical advantages, I'm happy to go with those instead.

Ah, finally realized it was because I'm giving return type information to these methods, which is causing mypy to actually check them. I'm guessing we'll run into this issue down the road with #7.

Good diagnosing! I didn't realize mypy would ignore methods without type info. That's definitely going to bite us once we start getting serious about type annotations... I think your solution of skipping those mypy checks for now is good, but we will need to give this some thought down the road.

The fact that the mixins refer to their superclass methods without actually having a superclass has always felt a little weird.
Thinking out loud with untested code, I wonder if something like this would make mypy happy (and be a little more explicit):

    def score_independent(self, y) -> float:
        sup: KNeighborsRegressor = super()
        return sup.score(X=None, y=y)


grovduck commented on June 29, 2024

Thinking out loud with untested code, I wonder if something like this would make mypy happy (and be a little more explicit):

    def score_independent(self, y) -> float:
        sup: KNeighborsRegressor = super()
        return sup.score(X=None, y=y)

Beautiful, that actually does it (at least in this method)! I'm guessing we'll want to do this on all mixins, so perhaps hold off on this issue/PR, but implement as part of #7?


aazuspan commented on June 29, 2024

Cool! There may be a better approach, but it's good to know there's an easy workaround.

I'm guessing we'll want to do this on all mixins, so perhaps hold off on this issue/PR, but implement as part of #7?

Agreed!


grovduck commented on June 29, 2024

Closed via #50

