Comments (15)
Regarding estimator names, what do you think about using `Imputer` as opposed to `Classifier` or `Regressor`? To me this seems like the most technically accurate term, but I'm guessing you shied away from it to avoid confusion with the imputers (including `KNNImputer`) built in to sklearn? The tricky thing seems to be that our estimators actually very closely match the functionality of the `KNNImputer`, so we will need to be careful to distinguish how this package differs from the built-in features of sklearn...
from scikit-learn-knn-regression.
@aazuspan, honestly I hadn't come across the `Imputer` classes in sklearn until a few weeks ago when we were diving in. I definitely agree with you that our estimators basically do both straightforward and fancy imputation and really closely mimic the behavior of `KNNImputer` (which is basically our `Raw` estimator).
The only reason I've shied away from using this terminology on our estimators is because they seem to treat imputation as a way to fill missing values on the way to transformation/classification/regression. That is, it's a necessary pre-filling step such that other estimators down the line don't choke.
The expectation for our estimators is that the `X` we pass to `fit` is complete, whereas in `KNNImputer`, the expectation is that the `X` passed to `fit` will be transformed to get rid of missing values.
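To illustrate the contrast, here's a minimal sketch using sklearn's built-in `KNNImputer` (not this package's estimators):

```python
import numpy as np
from sklearn.impute import KNNImputer

# KNNImputer expects missing values in the X passed to fit/transform,
# and fills them in from the nearest complete rows.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
# The NaN is replaced by the mean of the two nearest rows' values
# (here, the mean of 2.0 and 6.0)
```

The estimators in this package instead assume a complete `X` at fit time and produce predictions for new samples, which is why the imputation vocabulary only maps onto them loosely.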
But recognize that I haven't yet come up with a better term ... how about `GNNClassifierAndRegressorAndReallyAnImputer` 😄. I'm willing to go with `Imputer` if you think we can effectively explain the differences.
> `GNNClassifierAndRegressorAndReallyAnImputer`
Perfect, I'll just close this issue as resolved 😆
> they seem to treat imputation as a way to fill missing values on the way to transformation/classification/regression
I completely agree. It's a challenging distinction to make because they function almost identically, but have very different purposes.
> I'm willing to go with `Imputer` if you think we can effectively explain the differences.
No, I think your instincts were right. The association between imputation and filling missing data is just too strong for sklearn users, even if it's technically an accurate description of what we're doing.
Back to the drawing board!
Based on discussions in #2 and #13, starting to lean toward using `Regressor` here to describe these custom estimators. If we do indeed subclass our estimators from `KNeighborsRegressor`, logical names would be:

- `RawRegressor`
- `EuclideanRegressor`
- `MahalanobisRegressor`
- `MSNRegressor`
- `GNNRegressor`
I might be tempted to give a nod to being explicit that these are KNearest regressors, so we could consider names like:

- `RawKnnRegressor`
- ...
It gets a bit awkward/redundant with MSN (most similar neighbor) and GNN (gradient nearest neighbor) because they already have NN implied. So, my current thinking is:

- `RawKnnRegressor`
- `EuclideanKnnRegressor`
- `MahalanobisKnnRegressor`
- `MSNRegressor`
- `GNNRegressor`

(and eventually `RFNNRegressor`)
Does that make `scikit-learn-knn-regression` (package) and `sklearn-knn-regression` (imported module) too unwieldy as names?
> So, my current thinking is:
I'm happy with these names, and I'm also in favor of being more explicit by including `Knn` while avoiding the redundancy with the `MSN` and `GNN` estimators.
In the interest of thinking through every option, would there be any advantage to combining similar estimators into a single estimator that could be selected with a parameter like `method`? For example, say a `KNNRegressor` where `method` could be raw, Euclidean, or Mahalanobis? I'm not sure I like this idea, but I'm curious if you have any inclination.
> Does that make `scikit-learn-knn-regression` (package) and `sklearn-knn-regression` (imported module) too unwieldy as names?
They are a little wordy... I think the package name would be okay since it's only used once or twice, but the module name is definitely a mouthful to import every time. I don't want to criticize without tossing out some alternatives, so just to put a few ideas on the board for consideration:
- `sknn` (this is already in use by a relatively popular archived project, so probably not a good choice)
- `skknn`
- `skneighbors`
- `sknear`
> In the interest of thinking through every option, would there be any advantage to combining similar estimators into a single estimator that could be selected with a parameter like `method`? For example, say a `KNNRegressor` where `method` could be raw, Euclidean, or Mahalanobis? I'm not sure I like this idea, but I'm curious if you have any inclination.
I definitely agree that there is very little difference between the estimators we've introduced and that they likely could be combined into a single estimator. My aversion to this is mainly based on trying to decipher the `yai` function in `yaImpute`, which takes exactly this approach (`method` is a parameter). The complexity of that function is quite high, with a lot of conditionals (granted, it has even more parameters than just `method` that cause this complexity). So it's entirely possible that I've gone overboard (too many distinct classes) when porting these estimators over. I'm curious what you feel is the right balance.
I think one goal that I have in mind is inter-comparison between the methods along with their hyperparameters through `GridSearchCV` or `RandomizedSearchCV`. Perhaps having `method` as a hyperparameter for a single estimator versus different estimators makes this comparison easier? Is that what you were thinking?
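For what it's worth, `GridSearchCV` can already compare across distinct estimator classes by treating a pipeline step as a searchable parameter, so separate classes don't preclude inter-comparison. A rough sketch, using stock `KNeighborsRegressor` variants as stand-ins since the package's own estimators don't exist yet:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=60, n_features=4, random_state=0)

# Each dict in the grid swaps a different estimator into the "est" step,
# each with its own hyperparameter ranges.
pipe = Pipeline([("est", KNeighborsRegressor())])
grid = GridSearchCV(
    pipe,
    param_grid=[
        {"est": [KNeighborsRegressor(metric="euclidean")], "est__n_neighbors": [3, 5]},
        {"est": [KNeighborsRegressor(metric="manhattan")], "est__n_neighbors": [3, 5]},
    ],
    cv=3,
)
grid.fit(X, y)
```

The winning estimator and its hyperparameters both come back in `grid.best_params_`.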
> I don't want to criticize without tossing out some alternatives, so just to put a few ideas on the board for consideration
I agree we'd probably want to avoid an import name that has been used before (unfortunately, "nearest-neighbors" and "neural-networks" share the same abbreviation). All of your suggestions are better than mine, but I thought I'd throw one more out: `sknnr` (aka "skinner", a nod to the Simpsons principal). It looks like there is a package on PyPI called "skinner", but it's not under active development. I like `skneighbors` as well.
> My aversion to this is mainly based on trying to decipher the `yai` function in `yaImpute` which takes exactly this approach (`method` is a parameter). The complexity of that function is quite high with a lot of conditionals
Yes, I think that kind of parameter spaghetti is definitely something we should try to avoid. The sklearn framework gives us an advantage there since it forces separation of the model configuration from the fitting, predicting, and scoring parameters, but I still think you're right to err on the side of atomic over configurable.
> I'm curious what you feel is the right balance.
I think either approach is reasonable, but I'm leaning towards the way you have it set up now. I agree that simpler functions are better, and in particular I'm not a fan of string params with predefined options because they're harder for users to remember and force us to handle invalid inputs. In any case, I don't think we're locking ourselves into anything and can always reconsider later. I imagine if we wanted to switch to a `method` param, we would probably leave the estimators as is and just add a `KNNRegressor` function that dispatches to the appropriate estimator, so no big code changes needed.
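If we ever did want that, the dispatcher could stay very thin. A sketch, where the estimator classes are hypothetical stand-ins subclassing sklearn's `KNeighborsRegressor` (the real ones would come from this package):

```python
from sklearn.neighbors import KNeighborsRegressor


# Stand-ins for illustration only; the real RawKnnRegressor and
# EuclideanKnnRegressor would be this package's estimators.
class RawKnnRegressor(KNeighborsRegressor): ...
class EuclideanKnnRegressor(KNeighborsRegressor): ...


def KNNRegressor(method="raw", **kwargs):
    """Return the estimator matching `method`, leaving the classes untouched."""
    estimators = {
        "raw": RawKnnRegressor,
        "euclidean": EuclideanKnnRegressor,
    }
    if method not in estimators:
        raise ValueError(f"Unknown method {method!r}; expected one of {sorted(estimators)}")
    return estimators[method](**kwargs)
```

Since the factory just forwards `**kwargs` to the selected class, the individual estimators keep their simple signatures and the conditional logic lives in one place.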
> I think one goal that I have in mind is inter-comparison between the methods along with their hyperparameters through `GridSearchCV` or `RandomizedSearchCV`. Perhaps having `method` as a hyperparameter for a single estimator versus different estimators makes this comparison easier?
I don't have a ton of experience with those, but the conclusion I came to was that a `method` parameter would only help for comparing within that single estimator, and might make it harder to compare between other estimators. This is a good thing to keep in mind though.
> All of your suggestions are better than mine, but thought I'd throw one more out - sknnr (aka "skinner" - a nod to the Simpsons principal). It looks like there is a package on PyPi called "skinner", but not under active development. I like "skneighbors" as well.
I like `sknnr`! That matches the proposed package name better and looks like it should be more searchable than `skneighbors`, which Google thinks is a typo for kneighbors. That's at the top of my list.
All sounds good to me, @aazuspan. I agree that we can revisit the decision of the `method` parameter, but let's stick with what we have now. Are we ready to implement this? I'm happy to take a cut at implementing it, although I could use some advice on the cleanest way to rename the package without really messing things up. I guess that's probably a simple rename on Github and it shouldn't impact any local repos (because you can give the local repo whatever name you want, right?)
> Are we ready to implement this?
I think so! It will be a little painful to merge that change in with any branches we have ongoing since I think git will consider all the old src files deleted, but we'll have to deal with it at some point, so I'm not sure we gain anything by waiting.
> I could use some advice on the cleanest way to rename the package without really messing things up
I think locally it should be as simple as a find-replace in VS Code and renaming the module folder. There could be some other snags I'm not remembering, but I've done this a few times and can't remember any major issues. It's also possible we'll have to rebuild our hatch environments using `hatch env prune`, but I'm not sure.
> I guess that's probably a simple rename on Github and it shouldn't impact any local repos (because you can give the local repo whatever name you want, right?)
Yeah, that should be pretty painless. You can continue to push to the old repo name and Github will redirect it, although it might raise a warning. I think the long-term fix is to run `git remote set-url origin ...` with the new URL.
> It will be a little painful to merge that change in with any branches we have ongoing since I think git will consider all the old src files deleted, but we'll have to deal with it at some point, so I'm not sure we gain anything by waiting.
I may be showing my ignorance here, but considering our flow to this point, it seems like we only have one or maybe two active feature branches going, and right now they are branching off `fb_add_estimators` (itself supposedly a feature branch). I don't necessarily think we need to retain a `develop` branch based on how we've been working (likely more trouble than it's worth). If we make the naming change now, continue to develop on `fb_add_estimators` until we've got the first five estimators implemented, and then rebase onto `main`, do we avoid the problem you mention?
I propose renaming `MyStandardScaler` to `StandardScalerWithDOF` as part of this as well. What do you think of this implementation?
```python
import numpy as np
from sklearn.preprocessing import StandardScaler


class StandardScalerWithDOF(StandardScaler):
    def __init__(self, ddof=0):
        super().__init__()
        self.ddof_ = ddof

    def fit(self, X, y=None, sample_weight=None):
        scaler = super().fit(X, y, sample_weight)
        # Override the fitted scale with a std using the requested ddof
        scaler.scale_ = np.std(X, axis=0, ddof=self.ddof_)
        return scaler
```
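For context on what `ddof` buys us: `StandardScaler` always scales by the population standard deviation (`ddof=0`), while `ddof=1` applies Bessel's correction to get the sample standard deviation. A quick numpy check:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
pop = np.std(x, ddof=0)   # divides the squared deviations by n
samp = np.std(x, ddof=1)  # divides by n - 1 (Bessel's correction)
# samp is always slightly larger than pop for n > 1
```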
> If we make the naming change now, continue to develop on `fb_add_estimators` until we've got the first five estimators implemented and then rebase onto `main`, do we avoid the problem you mention?
The problem I'm thinking of is with unpushed local branches. For example, if you're working on `src/sklearn_knn/msn.py` and we rename that to `src/sknnr/msn.py`, I think that merging those changes back to your local branch will require manually pulling code out of the "deleted" original file and into the renamed file. Does that sound right to you?
> I don't necessarily think we need to retain a `develop` branch based on how we've been working (likely more trouble than it's worth).
I've never worked with a `develop` branch before, so unless you think there would be some advantages once we start making releases, I'm happy to do without it.
> I propose renaming `MyStandardScaler` to `StandardScalerWithDOF` as part of this as well. What do you think of this implementation?
I like the name and the added flexibility with the implementation. My only concern would be how that fits in with #15. If it's a quick fix to make it dataframe compatible, maybe we could tackle two issues with one commit, so to speak.
I was also wondering if that should be moved out of `_base` and into the `transformers` module?
> The problem I'm thinking of is with unpushed local branches
Ah, yep, I totally agree with this. At this time, I don't have any unpushed local branches (I stashed the work on `MSN` and deleted that branch before bringing in your work on `use_regressors`). You're right that I'll still need to do some manual transfer after popping the stash, but I think it should be fairly painless at this stage. And, like you say, better now than later.
> I've never worked with a `develop` branch before, so unless you think there would be some advantages once we start making releases, I'm happy to do without it.
Again, we can always introduce it down the line if we find that it becomes necessary.
> My only concern would be how that fits in with #15. If it's a quick fix to make it dataframe compatible, maybe we could tackle two issues with one commit, so to speak. I was also wondering if that should be moved out of `_base` and into the `transformers` module?
Good thoughts for both of these. I'm leaning toward a separate PR once we do the renaming. I'll do the renaming now.
> I'm leaning toward a separate PR once we do the renaming. I'll do the renaming now.
Sounds good to me!
Resolved all but `StandardScaler` via #18