Code Monkey home page Code Monkey logo

Comments (5)

grovduck avatar grovduck commented on June 25, 2024

@aazuspan, great explanation of the issue and I understand what's going on. Correct me if I'm wrong, but this definitely feels like a bit of a corner case, in that a user would have to read in X as an array, then convert to a dataframe without specifying column names, right?

I'm a bit conflicted over what the expected behavior should be. One argument would be that given that the user (intentionally/unintentionally) passed X as a dataframe, it feels like we should respect that choice (as far as column names) even if column names are meaningless. But I think that goes against your proposed fix, because I think you'd set the output mode to be default, correct?

from sknnr.

aazuspan avatar aazuspan commented on June 25, 2024

I think you're right that this is definitely an edge case, and probably not worth worrying about beyond maintaining consistency with sklearn. However...

While trying to solve this, I realized that I added a ton of unnecessary complexity in #34. This commit message goes into some depth, but essentially I think I got stuck on the idea that we needed the transformer to pass dataframes to our estimator so that we could get feature names, but that became totally unnecessary once we added the feature_names_in_ property since we now just pull the names straight from the transformer.

Removing that requirement simplifies the _apply_transform method a lot and allows us to ditch the overly confusing set_temp_output function.

With those gone, and after a few hours tracing through exactly what methods get called with what data, I realized that we could also simplify _check_feature_names. Basically, that method gets called in two circumstances:

  1. When fitting the transformed estimator, it gets called by KNeighborsRegressor.fit and compares the transformed X (e.g. cca0) against the original feature names (e.g. PSME_BA). This is the check that gives us problems and the reason we need to override it, because unlike sklearn, we don't want those to match. This is also the check that was causing the error in this issue, and is indirectly tested by test_estimators_support_dataframes.
  2. When predicting, it gets called by KNeighborsMixin.kneighbors and compares the new X (e.g. PSME_BA) to the X that the transformer was fit with (hopefully PSME_BA). We do want this check to ensure that a user doesn't accidentally predict with different features than they fit with. That's the behavior tested by test_estimators_warn_for_missing_features.

Trying to accomplish both of those objectives (don't warn when fitting but do warn when predicting) was why I had to copy some of the logic from the sklearn implementation. However, I realize now that we can avoid that complexity (and the licensing) by just disabling that check and instead running that check via the transformer in _apply_transform. That allows us to totally disable the check in case 1 above, but get the full check via the transformer in case 2.

Sorry for the whiplash of adding a bunch of features then pulling them all out again! I realize this is a lot to digest, so I'm happy to clarify or discuss further. But if this all makes sense and you don't see any glaring holes in my logic, I'm ready to make a PR that would fix this and simplify the validation.

from sknnr.

grovduck avatar grovduck commented on June 25, 2024

I'm so glad you understand everything, but I'm right at about 30-40% on this 😄. I'm going to take the cop-out that I generally understand how you've simplified things (and the code is much simpler) and trust that you have a good handle on all of this. I'm good with you moving forward with the PR and promise to stare at it more such that I get to 60-80%!

from sknnr.

aazuspan avatar aazuspan commented on June 25, 2024

That's fair enough!

from sknnr.

aazuspan avatar aazuspan commented on June 25, 2024

Resolved by #36

from sknnr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.