lemma-osu / scikit-learn-knn-regression Goto Github PK
View Code? Open in Web Editor NEWscikit-learn compatible estimators for various kNN imputation methods
Home Page: https://sknnr.readthedocs.io
scikit-learn compatible estimators for various kNN imputation methods
Home Page: https://sknnr.readthedocs.io
At present, CCA
and CCorA
are implemented as classes that mainly use @property
to access attributes. As such, properties call other properties within the class and there are often cases when properties are used repeatedly, which forces recalculation. This is particularly expensive when evaluating methods from the np.linalg
module such as singular value decomposition, QR decomposition, and solving for systems of linear equations. We need to ensure that attributes are only calculated once when there is no possibility of change to attribute values.
This issue is closely tied to #47, but we are opening it in a separate issue for a few different reasons:
transform
method on the enclosing estimators (CCATransformer
and CCorATransformer
) is the method that is called most often and the fixes in #48 address the issue of repeated calls to properties by setting estimator attributes during fit
. Because the fit
method on these transformers calls the classes of interest in this issue (CCA
and CCorA
), it still suffers from repeated calls to properties, but fit
is typically called only once, so delaying the refactor shouldn't be too problematic.yaImpute
and wanted to verify that "checkpoints" in the porting process were lining up with output from yaImpute. As such, there are properties that correspond to multiple steps in each ordination process, very few of which we actually use. Now that the porting seems to be behaving correctly, we can treat many of these properties as local variables and only retain the attributes/properties/methods needed for clients and enclosing classes. However, if there is value in retaining some of these "internal" attributes for diagnosing model goodness-of-fit (e.g. charting capabilities for ordinations), we should perhaps retain these as (at the very least) private properties/methods. This will require a bit of research into what attributes are retained for diagnostics, using yaImpute
and vegan
for inspiration.CCA
, there are actually a number of other ordination techniques (redundancy analysis (RDA), distance-based RDA) that use common code already implemented in this class from which we may choose to create additional estimators. Before optimizing the two existing classes, we'll want to think through the design of any new estimators and reuse as much of the existing code as possible.Note that none of these reasons really stops us from refactoring out the duplicate calls at present, but changes to the existing code and potential expansion into a family of classes makes it prudent to delay until after the Core Estimators
milestone.
Currently, we test the accuracy of fit
and kneighbors
against yaImpute
in test_port.py, but we don't yet test the accuracy of predict
, which makes it easy for small mistakes that affect results but not functionality (like #23) to go unnoticed.
As discussed in that issue, we should add some tests that compare prediction results of our estimators to those generated with yaImpute
.
We need to add docstrings for the public API, and probably the flake8-docstrings plugin to lint them once #8 is merged.
@grovduck Any preference on docstring conventions?
This might set a record for the shortest time between merging a PR and finding a new bug that it caused...
When fitting a transformed estimator with a dataframe without feature names, _check_feature_names
incorrectly warns: UserWarning: X has feature names, but GNNRegressor was fitted without feature names
.
To reproduce:
from sklearn.datasets import load_linnerud
import pandas as pd
from sknnr import GNNRegressor
X, y = load_linnerud(return_X_y=True)
# Note that columns are not specified
X_df = pd.DataFrame(X)
est = GNNRegressor().fit(X_df, y)
This wasn't caught by our tests because we specify column names for all of our dataframe inputs, and this will only occur when the dataframe has no column names. What happens is that est.transform_
gets fit with a dataframe with no feature names, but outputs a dataframe that does have feature names (in this case, [cca0
, cca1
]). Once fit works its way through the MRO to KNeighborsRegressor.fit
, it calls _validate_data
which calls our overridden _check_feature_names
that complains about the mismatched feature names.
A quick fix for this is to change the condition under which we use pandas
output mode on the transformer so that we only use it if sklearn can extract feature names, but I feel like that whole section might be unnecessarily complex, so I'll give this a little more thought before I push anything...
In early experiments with putting a spatial wrapper around sknnr estimators (and specifically GNNRegressor
) using sknrr-spatial, @aazuspan noticed that the performance of CCATransformer.transform
was not covarying with KNeighborsRegressors.kneighbors
as expected with different chunk sizes. Smaller chunk sizes would lead to CCATransformer.transform
actually taking a majority of the CPU time as profiled with pyinstrument. We're capturing the conversation here to decide on next steps to fix this issue.
We briefly discussed in #8 switching from setup.py
to pyproject.toml
and possibly switching the build system from setuptools
to something like hatch
.
PEP 517 explains the benefits of pyproject.toml
, and there are lots of good resources like this and this about setting up a pyproject.toml
.
pypa/setuptools#3683 discusses some headaches around editable installs with pyproject.toml
, and is probably worth reviewing to see how relevant it is to our project and whether we might need to also include a setup.cfg
. EDIT: This is only relevant to setuptools
1.
I've used hatch in the past and am a fan of the amount of features (it could effectively replace bumpversion
and twine
and add some additional functionality like custom scripts), but I know there are other popular options like flit.
We should also consider future compatibility regarding things like packaging data files, distributing through conda-forge, and maybe having to compile Cython2.
Editable installs look like they're pretty seamless in hatch
. Basically, instead of creating an environment and running pip install -e .[dev]
, the command hatch shell
will create and enter an environment with all the necessary dependencies and the current source code installed. It looks like flit also supports editable installs. ↩
It looks like Cython support is planned but not available yet in hatch
. I'm guessing it's not supported in flit
because flit
seems to focus on simple, pure Python builds, but I haven't found any conclusive answer. Assuming we're able to lean on existing Cython sklearn
functions or vectorized numpy
operations, I'm guessing we'll never need to package our own Cython, but let me know what you think @grovduck. ↩
See discussion in #1.
Some estimators can be fit with different data than they will predict (e.g. fitting with species composition data and predicting structural attributes using GNN), so we need to be able to set that data (tentatively called y_predict
) somehow. We also need to the flexibility to set different predicted data without requiring refitting.
A few options discussed:
y_predict
as a parameter of fit
to store for later use. If this is the only way to pass y_predict
, it would require refitting when switching predicted attributes, so this alone is not a complete solution.set_target_data
method on estimators after fitting. This gives the flexibility of being able to switch data without refitting, but would interrupt the normal fit
then predict
then score
workflow.y_predict
as a parameter of predict
. Need to think this through further.Hey @grovduck, I'm starting to sound like a broken record, but there's another issue blocking dataframe indexes in #2. This is a bit of a long one, so I tried to lay it out below.
All of our TransformedKNeighborsMixin
estimators are incompatible with dataframes in that they don't store feature names. I didn't think to check this before, but updating the dataframe test to check for feature names fails for everything but RawKNNRegressor
.
@pytest.mark.parametrize("estimator", get_kneighbor_estimator_instances())
def test_estimators_support_dataframes(estimator, moscow_euclidean):
"""All estimators should fit and predict data stored as dataframes."""
num_features = moscow_euclidean.X.shape[1]
feature_names = [f"col_{i}" for i in range(num_features)]
X_df = pd.DataFrame(moscow_euclidean.X, columns=feature_names)
y_df = pd.DataFrame(moscow_euclidean.y)
estimator.fit(X_df, y_df)
estimator.predict(X_df)
assert_array_equal(estimator.feature_names_in_, feature_names)
This happens because they all run X
through transformers that convert the dataframes to arrays before they get to KNeighborsRegressor.fit
where the features would be retrieved and stored. The same thing would happen with sklearn
transformers, so I think we should probably solve this in TransformedKNeighborsMixin
rather than in the transformers.
EDIT: As detailed in the next post, once we solve the issue of losing feature names when fitting, we need to also retain feature names when predicting to avoid warnings.
First of all, I think we should move the actual transformation out of the fit
method for each estimator and into a fit
method for TransformedKNeighborsMixin
. That should probably be done regardless of this issue just to reduce some duplication, and also allows us to make sure everything gets fit the same way. Then, I think we need to modify that fit
method to make sure it sets appropriate feature names after transformation.
To get feature names, all that sklearn does is look for a columns
attribute on X
. If we could copy that columns
attribute onto the transformed array before passing it to KNeighborsRegressor.fit
we'd be set, but there's no way to directly set attributes on Numpy arrays because they are implemented in C.
I think that leaves us with a few options:
sklearn.utils.validation._get_feature_names
to get and validate the feature names before transforming, then manually set them as feature_names_in_
after fitting. I don't love this because it requires us to use a private method that could disappear, get renamed, change return types, etc. The upside is that we would know our feature names are retrieved consistently with sklearn
.sklearn.utils.validation._get_feature_names
into our code base. That bypasses the private method issue, but adds some maintenance cost and we would need to carefully consider how to do that consistently with the sklearn
license. As with option 1, we would still need to handle setting the feature_names_in_
attribute.ndarray
to support a column
attribute and pass that in to fit. As long as sklearn
doesn't change how they identify features (which seems unlikely), we could let sklearn
handle getting and setting feature names, and I think it would be transparent to users. I did confirm that the _fit_X
attribute seems to store a numpy array regardless of what goes into it. Like option 2, this adds some maintenance cost._check_feature_names
method with the non-transformed X
after fitting. This will set feature names on the model and fix the issue of losing feature names when fitting. The downside is that we're again using a private method.I don't love any of these options, so let me know what you think or if any other solutions occur to you.
I noticed that the sklearn.estimator_checks
would have caught this, so I wonder if we should prioritize getting those checks to pass before we add any more functionality? I think that may be a big lift, but would at least prevent us from accidentally breaking estimators in the future. Also, it may be easier to do now than after they get more complex and would keep us from accidentally writing a lot of redundant tests.
EDIT: This would also catch warnings for predicting without feature names that I mention in the next post.
This ties in to #2, but I think is going to need its own issue to track: now that we're planning on passing the target data (e.g. composition) directly in as y
, subclassing from KNeighborsClassifier
no longer works because it doesn't support continuous multioutput targets. For example:
clf = Mahalanobis()
X = [[0, 1], [1, 1], [2, 2], [3, 3]] # environmental features
y = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5]] # composition targets
clf.fit(X, y)
ValueError: Unknown label type: 'continuous-multioutput'
The easiest solution is to subclass KNeighborsRegressor
instead, which supports continuous multioutput targets. This will break the current kneighbor_ids
implementation which relies on the classes_
attribute that doesn't exist on regressors, but I think that was going to need to be modified regardless based our discussion on handling plot IDs. There might be other side effects to this change I haven't considered...
The other option would be to patch continuous multioutput support into a subclass of KNeighborsClassifier
, and subclass the rest of our estimators from that instead. This sort of gets back to #6 and the question of whether we're going to consider our estimators as regressors, classifiers, or something in between.
Let me know what you think!
This will add static typing to the package. Type-checking will be added in #4, so we'll probably want to wait for that. This is a small enough code-base that we could probably tackle this all at once, but mypy has a handy guide for gradual adoption if we'd rather go that route.
It's worth noting that sklearn isn't currently typed and doesn't offer official type stubs yet. I don't think this is a big problem, but it will limit how much we can rely on type checking for our sklearn-subclasses.
The experimental stubs may end up being a useful reference, but I don't think there's any way we can directly integrate them without including them in the package (I could be wrong).
Mimic the sklearn.datasets
module with functions to load the example datasets from yaImpute
(pending approval for us to share the datasets). Unlike sklearn
, probably package datasets in a purpose-made class rather than a Bunch
object, but try to keep a consistent API.
It may make sense for us to define a Dataset
class for public examples with a TestingDataset
subclass that contains additional neighbors and distances for internal testing, but we can figure that out later.
In #22, we brought fit
, predict
, and kneighbors
methods under the TransformedKNeighborsMixin
class and took them out of the estimators. On predict
, I believe we are unintentionally applying each estimator's transform twice. The current code for TransformedKNeighborsMixin.predict
is this:
def predict(self, X):
"""Predict using transformed feature data."""
X_transformed = self._apply_transform(X)
return super().predict(X_transformed)
This will apply the transform to the X array and then call KNeighborsRegressor.predict
. However, this method then in turn calls self.kneighbors
which returns the call back to TransformedKNeighborsMixin.kneighbors
which again applies the transform, this time to the already transformed data.
This probably went unnoticed because we aren't explicitly testing predict
in test_port.py
at present, only kneighbors
. Presumably the check distances would not have matched after the double transformation.
This did surface in the implementation of the MSNRegressor
(not yet pushed) where the canonical correlation analysis transform reduces the number of features based on significant CCA axes. This failed both the test_estimators_support_continuous_multioutput
and test_estimators_support_dataframes
test because on the first transform pass, we subset the X
array down to fewer features, but then try to pass this reduced dimension array back to StandardScalerWithDOF
which was originally fit with the full number of features.
If this diagnosis seems correct, I believe the solution is to just remove the TransformedKNeighborsMixin.predict
method. Any call to an estimator's predict
gets routed to KNeighborsRegressor.predict
which in turn will call TransformedKNeighborsMixin.kneighbors
and will do the appropriate transformation before neighbor finding. I have tested locally with this approach (and on the new MSNRegressor
estimator) and it passes all current tests.
The new Hatch release will auto-install missing Python versions when using an environment matrix for testing! I did a quick test by adding:
[[tool.hatch.envs.test.matrix]]
python = ["3.8", "3.9", "3.10", "3.11", "3.12"]
to the pyproject.toml
, updating my hatch
, and running hatch run test:all
, and it successfully ran the whole test matrix (takes a while the first time, but it looks like it caches installs for future runs). This solves all the trouble I've had in the past getting tox
to find the right Python installations.
@grovduck I don't think this is a high priority, or something we necessarily even want to run every time, but it would be nice to have the option to test old Python versions and catch those sneaky typing incompatibilities before they hit the CI. I figured you might be interested in this for other projects that you've migrated to Hatch as well.
@grovduck I have a proposal to run by you, inspired by your refactor of the inheritance in #50. This design builds off of the changes there, so that would be merged before tackling this. I thought about proposing this as part of that PR, but didn't want to derail things with even more refactoring.
I suggest we turn TransformedKNeighborsRegressor
into an abstract class by inheriting from ABC, then add an abstract method _get_transform
that all subclasses would be required to implement, e.g.:
class EuclideanKNNRegressor(TransformedKNeighborsRegressor):
def _get_transform(self) -> TransformerMixin:
return StandardScalerWithDOF(ddof=1)
Next, we would get rid of the fit
methods on those estimators and move that functionality into TransformedKNeighborsRegressor.fit
, using a _set_fit_transform
method to handle the instantiation and fitting of the transformer.
class TransformedKNeighborsRegressor(RawKNNRegressor, ABC):
...
@abstractmethod
def _get_transform(self) -> TransformerMixin:
"""Return the transformer to use for fitting. Must be implemented by subclasses."""
...
def _set_fit_transform(self, X, y) -> TransformerMixin:
self.transform_ = self._get_transform().fit(X, y)
def fit(self, X, y):
"""Fit using transformed feature data."""
self._validate_data(X, y, force_all_finite=True, multi_output=True)
self._set_dataframe_index_in(X)
self._set_fit_transform(X, y)
X_transformed = self.transform_.transform(X)
return super().fit(X_transformed, y)
To accommodate for fitting transformers with separate y data in GNN
and MSN
, we could add a kNN-independent YFitMixin
that overrides the fit
method to accept the additional argument, store it as an attribute, and fit the transformer with it using an overriden _set_fit_transform
:
class YFitMixin:
"""Mixin for transformed estimators that can fit the transformer with a separate y."""
def _set_fit_transform(self, X, y):
y_fit = self.y_fit_ if self.y_fit_ is not None else y
self.transform_ = self._get_transform().fit(X, y_fit)
def fit(self, X, y, y_fit=None):
self.y_fit_ = y_fit
return super().fit(X, y)
Overall, this should reduce some code duplication in the fit
methods, prevent instantiation of TransformedKNeighborsRegressor
without having to rely on making it private, and add a runtime check to ensure that all transformed estimators define a transformer function. The main downside is the need to store a reference to y_fit_
. There may be another way to handle the YFitMixin
, but that was the best solution I could come up with after trying a few different strategies.
Curious to hear your thoughts on this design, and if you foresee any limitations.
As discussed in #20, we've run into a number of issues that probably would have been caught earlier with the built-in test suite in the sklearn.utils.estimator_checks
module, so it's probably in our best interest to focus on getting those passing. This should also potentially allow us to remove some redundant tests from test_estimators.py
.
Note that it seems like some valuable tests like check_dataframe_column_names_consistency aren't running automatically, so we'll need to make sure we run everything that's relevant.
EDIT: Our tests currently contain commented-out checks for our estimators, but it looks like there are also transformer checks that we should run. In fact, maybe it would make sense to get our transformers passing first since those will affect the estimator results?
Another aspect that I'm not very familiar with but that we'll probably need to dive into for this is the sklearn tag system, which allows you to specify different compatibility options for estimators and transformers that inform which checks will be run.
Currently, our estimators are named to follow method names introduced in R's yaImpute
package. We likely want to conform to scikit-learn's model of being more explicit about whether estimators are classifiers or regressors. kNN methods fit into a funny space here - most often they are thought of as classifiers as they use the k
nearest neighbors in prediction. However, some attributes are continuous and are calculated as (weighted) means from the k
neighbors attributes, which is more similar to KNeighborsRegressor.
Also need to decide whether it makes sense to stick with scikit-learn-knn
as a package name as knn
is already a meaningful term in scikit-learn
. We could also decide to "pay homage" to yaImpute
which is serving as the inspiration for this package.
While working on #52, I noticed that we don't currently have any tests to confirm the functionality or accuracy of predictions that use separate y_fit
data. This caused a few bugs (e.g. 7f92a90) to slip by since the tests were passing even if y_fit
was ignored.
@grovduck, I'm not 100% sure we need a full-blown port test for this. We already know that we can accurately fit transformers, so maybe all we need to do is confirm that they're getting fit with the correct data? I know it's a pain to add in a whole new set of test data (and I'm not even sure if this is doable with yaImpute
?), so I'm open to an easy solution, as long as we're confident that it can work as expected. What do you think?
pyupgrade standardizes syntax, including type annotations, to a minimum Python version. That should make it easier to maintain consistency, especially as we drop support for older versions of Python.
Good overview of the tool: How to Upgrade Syntax with pyupgrade
EDIT: Rather than adding pyupgrade
, we could switch to ruff which supports almost all of the features of pyupgrade
, flake8
, isort
, and a variety of other linters (including flake8
extensions like bugbear
and docstrings
.) In addition to being much faster than our current tools, this would consolidate almost everything into a single tool.
@grovduck let me know if you have any preferences here on tooling. Otherwise I'll probably try migrating sankee
to ruff
over the weekend to get a better idea of whether there are any sticking points, and report back.
Our previous plan was to use scikit-learn-knn-regression
as the installed name of the package and sknnr
as the imported name, but per the discussion in #59, we're now tentatively planning to use sknnr
across the board.
sknnr
:sknnr
:scikit-learn
Making the switch will be pretty straightforward, and should just require updating the pyproject.toml
, the repository name on Github, and links to the repository throughout the docs (although Github will continue to redirect).
If we do make a final decision to go with sknnr
on PyPI, I suggest we make an alpha release just to claim the package name (pre-releases can only be installed with pip install sknnr --pre
, so it should be clear that it is not an official, stable release if anyone comes across it).
Currently, we test our estimator accuracy against manually generated results from yaImpute
and pynnmap
. Once we have all major functionality implemented and are confident in our results, we can switch to regression testing to ensure no errors are introduced. This was briefly discussed in #40:
Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of sknnr ... pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results.
We'll need to do some experimenting to find the right tool for our use case, but the advantage should be a massive simplification of our testing system (e.g. the removal of KNNTestDataset
) and the ability to easily test against a wider range of parameters (e.g. values of n_neighbors
and n_components
).
Before release, we need to:
sknnr
.yaImpute
throughout, which is primarily GPL v3 but also contains some public domain code. Their GPL license was inherited and upgraded from the LGPL license of the ANN library.statsmodels
for the implementation of CCorA
(although we can hopefully migrate to the sklearn
implementation of CCA
), which is BSD-3 licensed.scikit-learn
for the implementation of dataset loading, which is BSD-3 licensed.See #27 for further discussion.
There are functions in sknnr.datasets._base
that manage the loading of CSV files into Dataset
objects (similar to sklearn's Bunch
objects). At present, however, these utility functions (e.g. _load_dataset_from_csv_filenames
and _load_csv_data
) are meant for internal use and are currently used in load_moscow_stjoes
and load_swo_ecoplot
. As @aazuspan points out in lemma-osu/synthetic-knn#2, these functions (and callees) rely on "specific restrictions like requiring an integer index in the first column, and floats in the remaining columns" (see here).
Nonetheless, even with these restrictions, this functionality is likely useful across other packages that require sknnr
or for users that may have similarly formatted CSV files. In order to use these functions in a public API, we would need to generalize _load_csv_data
so that it possibly accepts DATA_MODULE
as a keyword argument (as in sklearn.datasets.load_csv_data
. This would work well for other packages that want to use this functionality, although perhaps users might have to create a module structure with their data to use it as well.
To consider:
sknnr.datasets
or in a more generic "utilities" module?importlib.resources
) or from generic directories as well?@aazuspan, further thoughts on this?
While trying to implement #2, I discovered another snag: Euclidean
and Mahalanobis
cannot be fit by passing a dataframe as X
. The errors seem to originate in the transformers, but I haven't dug any deeper than that.
from sklearn.datasets import load_linnerud
from sklearn_knn import Euclidean
X, y = load_linnerud(return_X_y=True)
X_df = pd.DataFrame(X)
# Fits fine with array
Euclidean().fit(X, y)
# Fails with dataframe [ValueError: Length of values (20) does not match length of index (3)]
Euclidean().fit(X_df, y)
While sklearn
definitely doesn't claim to be interoperable with pandas, my past experience and a quick test (below) indicates that most if not all regressors do support dataframes transparently.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
X, y = load_iris(return_X_y=True)
X_df = pd.DataFrame(X)
# All the regressors below support dataframes
regressors = [
RandomForestRegressor(),
GradientBoostingRegressor(),
KNeighborsRegressor(),
LinearRegression(),
MLPRegressor(),
SVR(),
]
for reg in regressors:
reg.fit(X_df, y)
My inclination is that we should too, but let me know what you think.
I just came across scikit-learn/scikit-learn#27037 that points out that using hasattr
or getattr
to retrieve a DataFrame index
will unintentionally retrieve the list.index
method. For us, this means that an estimator fitted with a list
will store that method as dataframe_index_in_
. Then, if you used return_dataframe_index=True
with kneighbors
, you'd get a weird IndexError
.
from sknnr import EuclideanKNNRegressor
from sknnr.datasets import load_swo_ecoplot
X, y = load_swo_ecoplot(return_X_y=True)
X = X.tolist()
est = EuclideanKNNRegressor().fit(X, y)
print(est.dataframe_index_in_) # <built-in method index of list object at 0x000001E8BB330BC0>
est.kneighbors(return_dataframe_index=True) # IndexError: too many indices for array
This is the offending line:
They solve this in scikit-learn/scikit-learn#27044 by using a new _is_pandas_df
helper, but as far I can tell that's not released yet, so I think it'll be a while before we want to rely on that.
We could add a similar function to sknnr
, but I wonder if we might break some implicit duck-typed support for DataFrame
-like objects by adding an explicit class check here. An alternative would just be to check that index
is an array using _is_arraylike
, which would prevent this bug and should cover us for just about any other case.
The predict
method for KNeighborsRegressor uses an X
matrix to calculate a predicted score for each row in X. If the model was fit from this same X
matrix and k
is set to 1, predict
will (by definition) return the y
matrix used in fitting the model. This is because the nearest neighbor is always itself (ignoring duplicate values) and the y
values associated with this neighbor are returned, i.e.
>>> from sklearn.neighbors import KNeighborsRegressor
>>> from sklearn.datasets import load_linnerud
>>> X, y = load_linnerud(return_X_y=True)
>>> est = KNeighborsRegressor(n_neighbors=1).fit(X, y)
>>> print(np.all(est.predict(X) == y))
True
The kneighbors
method, on the other hand, allows the option of not passing the X
matrix in the call, in which case "neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor".
For kNN methods, it is common to use the "second-nearest neighbor" (truly, the first independent neighbor) to use in independent accuracy assessment. This is very similar to a leave-one-out (LOO) analysis, except that the model is fit with all plots rather than being fit n times with n-1 plots and predicting the left-out plot.
To accomplish this, we can:
kneighbors
with no X
array to get independent neighbors for each rowpredict
to get the predicted values for each column in X
, but with independent neighborsCurrently, all of our estimators are using the predict
and score
methods from KNeighborsRegressors (i.e. these methods have not been overridden). We will want to retain predict
and score
as-is for when we send new X
data to an already fit model. But we also need to provide additional methods for independent prediction and scoring, in the least disruptive way.
I don't see a particularly elegant way to do this, other than repeating most of what is already in KNeighborsRegressor.predict. In essence,
def predict_independent(self):
# Note that kneighbors is called without `X` parameter
if self.weights == "uniform":
neigh_ind = self.kneighbors(return_distance=False)
neigh_dist = None
else:
neigh_dist, neigh_ind = self.kneighbors()
weights = _get_weights(neigh_dist, self.weights)
# Remainder of predict method
def score_independent(self, y, sample_weight=None):
# Here I'm using root mean squared error instead of R^2 which is what is returned from `score`
y_pred = self.predict_independent()
return mean_squared_error(
y,
y_pred,
sample_weight=sample_weight,
squared=False,
multioutput="raw_values",
)
These would likely be new methods in an overridden KNeighborsRegressor
or possibly as a separate mix-in. @aazuspan, I'm looking for feedback on the best way to bring this functionality in if you have any thoughts. In early testing, I'm getting the correct values (based on a comparison with yaImpute
), but am uneasy about the duplication of code.
We started to discuss the design and implementation of RF-NN in #22, but thought it would be better to track in a separate issue. Because RF-NN is quite different than all other estimators in this package in terms of how distances are calculated, we'll need to likely use different sklearn
base classes or derive our own to implement this. The goal of this issue is to lay out the design of RF-NN and decide how best to tackle it. As RF-NN was first introduced in the R yaImpute
package, we rely heavily on their implementation details.
One or more y
attributes (along with multiple X
covariates) are used to fit different random forests, notably one forest per each y
attribute. In yaImpute
, each forest is actually a classification (i.e. RandomForestClassifier
) - the y
attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).
Somewhat non-intuitively, in RF-NN, the actual values of the terminal nodes (i.e. the class values) in each random forest doesn't matter, as we only care about the terminal nodes' IDs of the forests' trees where the references land. For example, if there are q
y
attributes passed, q
different forests of n
trees will be created. For each reference observation passed to fit
, we want to capture the terminal node ID that it falls into for each tree and each forest. Therefore, the needed output of the fit
process is both the specification of the forests that were created (to run additional targets) and the "node matrix" of p
rows (reference plots) by q * n
columns that stores node IDS. We'll be leaning on the apply method of each forest to return the leaf node IDs.
Functionally, the predict
process for each new target is very similar to standard random forests, although the target will be run through all q
forests rather than just a single forest. As with the references, we only use the IDs of the terminal nodes that the target falls into - therefore, we get a (q * n)
vector of node IDs as a result of passing a target through all forests. At this point, we use a node similarity metric to define the distances from the target to each candidate reference plot. As opposed to all other estimators that we've introduced so far which have a neighbor space greater than one dimension, distances in RF-NN are calculated as the inverse of the number of nodes that the reference and target share in common. Neighbors are ranked on this distance to find the k
nearest neighbors for each target.
An initial suggestion for the naming of this estimator would be RandomForestNNRegressor
. We could also go with RFNNRegressor
as the term RFNN
is somewhat well understood by those in the imputation community. To be explict, I propose the former name and use it in subsequent discussion.
We can rely on RandomForestClassifier
to create the forests for each y
attribute passed. As of now, it seems reasonable to have RandomForestNNRegressor
composed of q
RandomForestClassifier
s rather than inherit from it. fit
would introduce two estimator attributes: 1) the forests themselves (proposed name: forests_
); and 2) the node matrix of the data that was used to fit the model (proposed name: node_matrix_
). We'll also likely want to have easy access to a few counts, e.g. number of forests and number of trees in a forest. These could easily be properties of the class derived from information stored in self.forests_
.
We will likely need to introduce a new method with the same signature as estimators derived from TransformedKNeighborsMixin
, i.e. def kneighbors(self, X=None, n_neighbors=None, return_distance=True)
. The method would run X
through self.forests_
, capture a node ID matrix for X
and then row-wise calculate distances from self.node_matrix
. The return value would be the (optional) distances and row indexes of the n_neighbors
. Note that n_neighbors
will be set on the initialization of RandomForestNNRegressor
just like our other estimators, but kneighbors
can override this by passing a value to n_neighbors
.
Initially we had thought we may possibly be able to derive from KNeighborsRegressor
if we use the callable weights parameter, because given weights and y
attributes, the actual calculation of predicted attributes will be exactly the same as in all estimators that derive from TransformedKNeighborsMixin
. However, in looking at that implementation more closely, predict
still relies on calculating the n-dimensional distances and then applying the callable function to the distance array. We are probably better off by introducing a new predict
function that is simply np.average
with a weights parameter that could represent the inverse distances calculated from kneighbors
.
As this estimator deviates substantially from the other estimators already introduced, I propose we first merge fb_add_estimators
back into main
and treat this as a separate feature branch.
yaImpute
.GNNRegressor
and MSNRegressor
) apply dimensionality reduction such that the resultant axes are ordered based on the amount of variation explained. The eigenvalues from these axes are typically used to weight these axes when finding neighbors. As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forest themselves) that may provide weights when calculating distances. I've not seen anything to capture this, but also have not looked very closely at all diagnostics from RandomForestClassifier
.Hey @grovduck, I was poking around sklearn and came across a built-in implementation of Mahalanobis as a distance metric. After reading through some stackoverflow discussions, I was able to bodge together an alternate implementation of MahalanobisKNNRegressor
that doesn't require a custom transformer and produces (as far as I can tell) identical results.
class MahalanobisKNNRegressor(IDNeighborsRegressor):
def fit(self, X, y):
self.metric = "mahalanobis"
self.metric_params = {"VI": np.linalg.inv(np.cov(X, rowvar=False))}
return super().fit(X, y)
This implementation is pretty rough and I don't really understand how it works, but it does pass all tests. There are definitely upsides to this simpler approach in that we can remove some code that's ported from yaImpute
and cut down on the number of estimator checks we need to pass, but do you think we're losing anything in terms of future flexibility? Or is there additional value to including the MahalanobisTransformer
beyond its use in this estimator? Curious to hear your thoughts!
While working on #20 I noticed that CCATransformer
and CCorATransformer
both specify y
as an optional parameter but won't actually work if it is None
(unless the optional spp
is provided instead). To reproduce:
from sklearn.datasets import load_linnerud
from sknnr.transformers import CCATransformer, CCorATransformer
X, y = load_linnerud(return_X_y=True)
# AxisError: axis 1 is out of bounds for array of dimension 0
CCATransformer().fit(X)
My first thought was that we should throw a ValueError
if both y
and spp
are None
, but now that I think about it I'm wondering if those transformers should instead just take a required y
parameter and let the estimator handle the selection of fitting data. So for example, GNNRegressor.fit
would pass either y
or spp
, but not both, to CCATransformer.fit
.
Currently, CCATransformer
really only takes an spp
parameter because the estimator that uses it might pass that, which is probably a bit backwards and could lead to some confusion, as well as some unnecessary complexity. Making the estimators responsible for choosing the right fitting data would remove a couple parameters and the need for special error handling since both estimators already require y
parameters.
Let me know what you think, @grovduck.
Probably sphinx or mkdocs. I've used sphinx
in the past but would much rather write markdown than restructured text, so I think mkdocs
is worth looking into.
Apparently mkdocs
doesn't handle automatically building docs from docstrings, so we'd probably need to throw in mkdocstrings.
We've touched on this in a few previous issues (#20, #22) but for estimators that perform dimensionality reduction and create "axes" that are linear combinations of their input covariates (i.e. CCorATransformer
, CCATransformer
), we want the flexibility that a user can define the number of axes to use from these estimators. Typically, these estimators will create the number of axes up to the number of X
variables provided. For example, CCorATransformer
uses a test internally to reduce the number of axes down based on the statistical significance of the axes.
We need to solve the following problems with this issue:
__init__
and would default to None
.transform
.MSNRegression
and GNNRegression
), test that predict
and kneighbors
return the expected values. Because yaImpute doesn't support a user-defined option for fewer axes, we will need to create new test files that implement the expected behavior and produce these results.A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.