lemma-osu / scikit-learn-knn-regression Goto Github PK

0.0 4.0 1.0 761 KB

scikit-learn compatible estimators for various kNN imputation methods

Python 100.00%

gnn imputation k-nearest-neighbor knn msn rfnn gradient-nearest-neighbor most-similar-neighbor random-forest-nearest-neighbor scikit-learn sklearn-estimator classification regression

scikit-learn-knn-regression's Issues

Refactor `CCA` and `CCorA` classes to reduce repeated operations

At present, CCA and CCorA are implemented as classes that mainly use @property to access attributes. As such, properties call other properties within the class and there are often cases when properties are used repeatedly, which forces recalculation. This is particularly expensive when evaluating methods from the np.linalg module such as singular value decomposition, QR decomposition, and solving for systems of linear equations. We need to ensure that attributes are only calculated once when there is no possibility of change to attribute values.

This issue is closely tied to #47, but we are opening it in a separate issue for a few different reasons:

The transform method on the enclosing estimators (CCATransformer and CCorATransformer) is the method that is called most often and the fixes in #48 address the issue of repeated calls to properties by setting estimator attributes during fit. Because the fit method on these transformers calls the classes of interest in this issue (CCA and CCorA), it still suffers from repeated calls to properties, but fit is typically called only once, so delaying the refactor shouldn't be too problematic.
Initially, when we first wrote the code for these classes, we were mostly concerned with the correct porting behavior from yaImpute and wanted to verify that "checkpoints" in the porting process were lining up with output from yaImpute. As such, there are properties that correspond to multiple steps in each ordination process, very few of which we actually use. Now that the porting seems to be behaving correctly, we can treat many of these properties as local variables and only retain the attributes/properties/methods needed for clients and enclosing classes. However, if there is value in retaining some of these "internal" attributes for diagnosing model goodness-of-fit (e.g. charting capabilities for ordinations), we should perhaps retain these as (at the very least) private properties/methods. This will require a bit of research into what attributes are retained for diagnostics, using yaImpute and vegan for inspiration.
For at least CCA, there are actually a number of other ordination techniques (redundancy analysis (RDA), distance-based RDA) that use common code already implemented in this class from which we may choose to create additional estimators. Before optimizing the two existing classes, we'll want to think through the design of any new estimators and reuse as much of the existing code as possible.

Note that none of these reasons really stops us from refactoring out the duplicate calls at present, but changes to the existing code and potential expansion into a family of classes makes it prudent to delay until after the Core Estimators milestone.

Test accuracy of `predict` methods

Currently, we test the accuracy of fit and kneighbors against yaImpute in test_port.py, but we don't yet test the accuracy of predict, which makes it easy for small mistakes that affect results but not functionality (like #23) to go unnoticed.

As discussed in that issue, we should add some tests that compare prediction results of our estimators to those generated with yaImpute.

Add docstrings and lint them

We need to add docstrings for the public API, and probably the flake8-docstrings plugin to lint them ~~once #8 is merged~~.

@grovduck Any preference on docstring conventions?

Warning when fitting transformed estimator with dataframe with no feature names

This might set a record for the shortest time between merging a PR and finding a new bug that it caused...

When fitting a transformed estimator with a dataframe without feature names, _check_feature_names incorrectly warns: UserWarning: X has feature names, but GNNRegressor was fitted without feature names.

To reproduce:

from sklearn.datasets import load_linnerud
import pandas as pd
from sknnr import GNNRegressor

X, y = load_linnerud(return_X_y=True)
# Note that columns are not specified
X_df = pd.DataFrame(X)
est = GNNRegressor().fit(X_df, y)

This wasn't caught by our tests because we specify column names for all of our dataframe inputs, and this will only occur when the dataframe has no column names. What happens is that est.transform_ gets fit with a dataframe with no feature names, but outputs a dataframe that does have feature names (in this case, [cca0, cca1]). Once fit works its way through the MRO to KNeighborsRegressor.fit, it calls _validate_data which calls our overridden _check_feature_names that complains about the mismatched feature names.

A quick fix for this is to change the condition under which we use pandas output mode on the transformer so that we only use it if sklearn can extract feature names, but I feel like that whole section might be unnecessarily complex, so I'll give this a little more thought before I push anything...

Using properties as accessors in CCA/CCorA leads to slow run times

In early experiments with putting a spatial wrapper around sknnr estimators (and specifically GNNRegressor) using sknrr-spatial, @aazuspan noticed that the performance of CCATransformer.transform was not covarying with KNeighborsRegressors.kneighbors as expected with different chunk sizes. Smaller chunk sizes would lead to CCATransformer.transform actually taking a majority of the CPU time as profiled with pyinstrument. We're capturing the conversation here to decide on next steps to fix this issue.

pyproject.toml and build system

We briefly discussed in #8 switching from setup.py to pyproject.toml and possibly switching the build system from setuptools to something like hatch.

pyproject.toml

PEP 517 explains the benefits of pyproject.toml, and there are lots of good resources like this and this about setting up a pyproject.toml.

pypa/setuptools#3683 discusses some headaches around editable installs with pyproject.toml, and is probably worth reviewing to see how relevant it is to our project and whether we might need to also include a setup.cfg. EDIT: This is only relevant to setuptools¹.

Build system

I've used hatch in the past and am a fan of the amount of features (it could effectively replace bumpversion and twine and add some additional functionality like custom scripts), but I know there are other popular options like flit.

We should also consider future compatibility regarding things like packaging data files, distributing through conda-forge, and maybe having to compile Cython².

Editable installs look like they're pretty seamless in hatch. Basically, instead of creating an environment and running pip install -e .[dev], the command hatch shell will create and enter an environment with all the necessary dependencies and the current source code installed. It looks like flit also supports editable installs. ↩
It looks like Cython support is planned but not available yet in hatch. I'm guessing it's not supported in flit because flit seems to focus on simple, pure Python builds, but I haven't found any conclusive answer. Assuming we're able to lean on existing Cython sklearn functions or vectorized numpy operations, I'm guessing we'll never need to package our own Cython, but let me know what you think @grovduck. ↩

Decide how to handle fitting, prediction, and ID data

See discussion in #1.

Some estimators can be fit with different data than they will predict (e.g. fitting with species composition data and predicting structural attributes using GNN), so we need to be able to set that data (tentatively called y_predict) somehow. We also need to the flexibility to set different predicted data without requiring refitting.

A few options discussed:

Accept y_predict as a parameter of fit to store for later use. If this is the only way to pass y_predict, it would require refitting when switching predicted attributes, so this alone is not a complete solution.
Use a set_target_data method on estimators after fitting. This gives the flexibility of being able to switch data without refitting, but would interrupt the normal fit then predict then score workflow.
Accept y_predict as a parameter of predict. Need to think this through further.

Transformed regressors drop dataframe feature names

Hey @grovduck, I'm starting to sound like a broken record, but there's another issue blocking dataframe indexes in #2. This is a bit of a long one, so I tried to lay it out below.

The problem

All of our TransformedKNeighborsMixin estimators are incompatible with dataframes in that they don't store feature names. I didn't think to check this before, but updating the dataframe test to check for feature names fails for everything but RawKNNRegressor.

@pytest.mark.parametrize("estimator", get_kneighbor_estimator_instances())
def test_estimators_support_dataframes(estimator, moscow_euclidean):
    """All estimators should fit and predict data stored as dataframes."""
    num_features = moscow_euclidean.X.shape[1]
    feature_names = [f"col_{i}" for i in range(num_features)]

    X_df = pd.DataFrame(moscow_euclidean.X, columns=feature_names)
    y_df = pd.DataFrame(moscow_euclidean.y)

    estimator.fit(X_df, y_df)
    estimator.predict(X_df)
    assert_array_equal(estimator.feature_names_in_, feature_names)

This happens because they all run X through transformers that convert the dataframes to arrays before they get to KNeighborsRegressor.fit where the features would be retrieved and stored. The same thing would happen with sklearn transformers, so I think we should probably solve this in TransformedKNeighborsMixin rather than in the transformers.

EDIT: As detailed in the next post, once we solve the issue of losing feature names when fitting, we need to also retain feature names when predicting to avoid warnings.

Possible solutions

First of all, I think we should move the actual transformation out of the fit method for each estimator and into a fit method for TransformedKNeighborsMixin. That should probably be done regardless of this issue just to reduce some duplication, and also allows us to make sure everything gets fit the same way. Then, I think we need to modify that fit method to make sure it sets appropriate feature names after transformation.

To get feature names, all that sklearn does is look for a columns attribute on X. If we could copy that columns attribute onto the transformed array before passing it to KNeighborsRegressor.fit we'd be set, but there's no way to directly set attributes on Numpy arrays because they are implemented in C.

I think that leaves us with a few options:

Use sklearn.utils.validation._get_feature_names to get and validate the feature names before transforming, then manually set them as feature_names_in_ after fitting. I don't love this because it requires us to use a private method that could disappear, get renamed, change return types, etc. The upside is that we would know our feature names are retrieved consistently with sklearn.
Same as above but copy sklearn.utils.validation._get_feature_names into our code base. That bypasses the private method issue, but adds some maintenance cost and we would need to carefully consider how to do that consistently with the sklearn license. As with option 1, we would still need to handle setting the feature_names_in_ attribute.
Subclass ndarray to support a column attribute and pass that in to fit. As long as sklearn doesn't change how they identify features (which seems unlikely), we could let sklearn handle getting and setting feature names, and I think it would be transparent to users. I did confirm that the _fit_X attribute seems to store a numpy array regardless of what goes into it. Like option 2, this adds some maintenance cost.
Use the _check_feature_names method with the non-transformed X after fitting. This will set feature names on the model and fix the issue of losing feature names when fitting. The downside is that we're again using a private method.

I don't love any of these options, so let me know what you think or if any other solutions occur to you.

Estimator checks

I noticed that the sklearn.estimator_checks would have caught this, so I wonder if we should prioritize getting those checks to pass before we add any more functionality? I think that may be a big lift, but would at least prevent us from accidentally breaking estimators in the future. Also, it may be easier to do now than after they get more complex and would keep us from accidentally writing a lot of redundant tests.

EDIT: This would also catch warnings for predicting without feature names that I mention in the next post.

Support continuous multioutput targets

This ties in to #2, but I think is going to need its own issue to track: now that we're planning on passing the target data (e.g. composition) directly in as y, subclassing from KNeighborsClassifier no longer works because it doesn't support continuous multioutput targets. For example:

clf = Mahalanobis()
X = [[0, 1], [1, 1], [2, 2], [3, 3]]  # environmental features
y = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5]]  # composition targets
clf.fit(X, y)

ValueError: Unknown label type: 'continuous-multioutput'

The easiest solution is to subclass KNeighborsRegressor instead, which supports continuous multioutput targets. This will break the current kneighbor_ids implementation which relies on the classes_ attribute that doesn't exist on regressors, but I think that was going to need to be modified regardless based our discussion on handling plot IDs. There might be other side effects to this change I haven't considered...

The other option would be to patch continuous multioutput support into a subclass of KNeighborsClassifier, and subclass the rest of our estimators from that instead. This sort of gets back to #6 and the question of whether we're going to consider our estimators as regressors, classifiers, or something in between.

Let me know what you think!

Add typing

This will add static typing to the package. ~~Type-checking will be added in #4, so we'll probably want to wait for that.~~ This is a small enough code-base that we could probably tackle this all at once, but mypy has a handy guide for gradual adoption if we'd rather go that route.

It's worth noting that sklearn isn't currently typed and doesn't offer official type stubs yet. I don't think this is a big problem, but it will limit how much we can rely on type checking for our sklearn-subclasses.

The experimental stubs may end up being a useful reference, but I don't think there's any way we can directly integrate them without including them in the package (I could be wrong).

Add load functions for example datasets

Mimic the sklearn.datasets module with functions to load the example datasets from yaImpute ~~(pending approval for us to share the datasets)~~. Unlike sklearn, probably package datasets in a purpose-made class rather than a Bunch object, but try to keep a consistent API.

It may make sense for us to define a Dataset class for public examples with a TestingDataset subclass that contains additional neighbors and distances for internal testing, but we can figure that out later.

Unwanted double transformation using TransformedKNeighborsMixin.predict

In #22, we brought fit, predict, and kneighbors methods under the TransformedKNeighborsMixin class and took them out of the estimators. On predict, I believe we are unintentionally applying each estimator's transform twice. The current code for TransformedKNeighborsMixin.predict is this:

def predict(self, X):
    """Predict using transformed feature data."""
    X_transformed = self._apply_transform(X)
    return super().predict(X_transformed)

This will apply the transform to the X array and then call KNeighborsRegressor.predict. However, this method then in turn calls self.kneighbors which returns the call back to TransformedKNeighborsMixin.kneighbors which again applies the transform, this time to the already transformed data.

This probably went unnoticed because we aren't explicitly testing predict in test_port.py at present, only kneighbors. Presumably the check distances would not have matched after the double transformation.

This did surface in the implementation of the MSNRegressor (not yet pushed) where the canonical correlation analysis transform reduces the number of features based on significant CCA axes. This failed both the test_estimators_support_continuous_multioutput and test_estimators_support_dataframes test because on the first transform pass, we subset the X array down to fewer features, but then try to pass this reduced dimension array back to StandardScalerWithDOF which was originally fit with the full number of features.

If this diagnosis seems correct, I believe the solution is to just remove the TransformedKNeighborsMixin.predict method. Any call to an estimator's predict gets routed to KNeighborsRegressor.predict which in turn will call TransformedKNeighborsMixin.kneighbors and will do the appropriate transformation before neighbor finding. I have tested locally with this approach (and on the new MSNRegressor estimator) and it passes all current tests.

Test all Python versions with Hatch 1.8.0

The new Hatch release will auto-install missing Python versions when using an environment matrix for testing! I did a quick test by adding:

[[tool.hatch.envs.test.matrix]]
python = ["3.8", "3.9", "3.10", "3.11", "3.12"]

to the pyproject.toml, updating my hatch, and running hatch run test:all, and it successfully ran the whole test matrix (takes a while the first time, but it looks like it caches installs for future runs). This solves all the trouble I've had in the past getting tox to find the right Python installations.

@grovduck I don't think this is a high priority, or something we necessarily even want to run every time, but it would be nice to have the option to test old Python versions and catch those sneaky typing incompatibilities before they hit the CI. I figured you might be interested in this for other projects that you've migrated to Hatch as well.

Refactor transformed estimators

@grovduck I have a proposal to run by you, inspired by your refactor of the inheritance in #50. This design builds off of the changes there, so that would be merged before tackling this. I thought about proposing this as part of that PR, but didn't want to derail things with even more refactoring.

I suggest we turn TransformedKNeighborsRegressor into an abstract class by inheriting from ABC, then add an abstract method _get_transform that all subclasses would be required to implement, e.g.:

class EuclideanKNNRegressor(TransformedKNeighborsRegressor):
    def _get_transform(self) -> TransformerMixin:
        return StandardScalerWithDOF(ddof=1)

Next, we would get rid of the fit methods on those estimators and move that functionality into TransformedKNeighborsRegressor.fit, using a _set_fit_transform method to handle the instantiation and fitting of the transformer.

class TransformedKNeighborsRegressor(RawKNNRegressor, ABC):
    ...

    @abstractmethod
    def _get_transform(self) -> TransformerMixin:
        """Return the transformer to use for fitting. Must be implemented by subclasses."""
        ...

    def _set_fit_transform(self, X, y) -> TransformerMixin:
        self.transform_ = self._get_transform().fit(X, y)

    def fit(self, X, y):
        """Fit using transformed feature data."""
        self._validate_data(X, y, force_all_finite=True, multi_output=True)        
        self._set_dataframe_index_in(X)
        self._set_fit_transform(X, y)

        X_transformed = self.transform_.transform(X)
        return super().fit(X_transformed, y)

To accommodate for fitting transformers with separate y data in GNN and MSN, we could add a kNN-independent YFitMixin that overrides the fit method to accept the additional argument, store it as an attribute, and fit the transformer with it using an overriden _set_fit_transform:

class YFitMixin:
    """Mixin for transformed estimators that can fit the transformer with a separate y."""
    def _set_fit_transform(self, X, y):
        y_fit = self.y_fit_ if self.y_fit_ is not None else y
        self.transform_ = self._get_transform().fit(X, y_fit)

    def fit(self, X, y, y_fit=None):
        self.y_fit_ = y_fit
        return super().fit(X, y)

Overall, this should reduce some code duplication in the fit methods, prevent instantiation of TransformedKNeighborsRegressor without having to rely on making it private, and add a runtime check to ensure that all transformed estimators define a transformer function. The main downside is the need to store a reference to y_fit_. There may be another way to handle the YFitMixin, but that was the best solution I could come up with after trying a few different strategies.

Curious to hear your thoughts on this design, and if you foresee any limitations.

Get all `sklearn` estimator checks passing

As discussed in #20, we've run into a number of issues that probably would have been caught earlier with the built-in test suite in the sklearn.utils.estimator_checks module, so it's probably in our best interest to focus on getting those passing. This should also potentially allow us to remove some redundant tests from test_estimators.py.

Note that it seems like some valuable tests like check_dataframe_column_names_consistency aren't running automatically, so we'll need to make sure we run everything that's relevant.

EDIT: Our tests currently contain commented-out checks for our estimators, but it looks like there are also transformer checks that we should run. In fact, maybe it would make sense to get our transformers passing first since those will affect the estimator results?

Another aspect that I'm not very familiar with but that we'll probably need to dive into for this is the sklearn tag system, which allows you to specify different compatibility options for estimators and transformers that inform which checks will be run.

Set up dev tools

Finalize naming conventions for estimators and package

Currently, our estimators are named to follow method names introduced in R's yaImpute package. We likely want to conform to scikit-learn's model of being more explicit about whether estimators are classifiers or regressors. kNN methods fit into a funny space here - most often they are thought of as classifiers as they use the k nearest neighbors in prediction. However, some attributes are continuous and are calculated as (weighted) means from the k neighbors attributes, which is more similar to KNeighborsRegressor.

Also need to decide whether it makes sense to stick with scikit-learn-knn as a package name as knn is already a meaningful term in scikit-learn. We could also decide to "pay homage" to yaImpute which is serving as the inspiration for this package.

Test estimators with `y_fit` data

While working on #52, I noticed that we don't currently have any tests to confirm the functionality or accuracy of predictions that use separate y_fit data. This caused a few bugs (e.g. 7f92a90) to slip by since the tests were passing even if y_fit was ignored.

@grovduck, I'm not 100% sure we need a full-blown port test for this. We already know that we can accurately fit transformers, so maybe all we need to do is confirm that they're getting fit with the correct data? I know it's a pain to add in a whole new set of test data (and I'm not even sure if this is doable with yaImpute?), so I'm open to an easy solution, as long as we're confident that it can work as expected. What do you think?

Add ruff to pre-commit hooks

pyupgrade standardizes syntax, including type annotations, to a minimum Python version. That should make it easier to maintain consistency, especially as we drop support for older versions of Python.

Good overview of the tool: How to Upgrade Syntax with pyupgrade

EDIT: Rather than adding pyupgrade, we could switch to ruff which supports almost all of the features of pyupgrade, flake8, isort, and a variety of other linters (including flake8 extensions like bugbear and docstrings.) In addition to being much faster than our current tools, this would consolidate almost everything into a single tool.

@grovduck let me know if you have any preferences here on tooling. Otherwise I'll probably try migrating sankee to ruff over the weekend to get a better idea of whether there are any sticking points, and report back.

Change installed name to `sknnr`

Our previous plan was to use scikit-learn-knn-regression as the installed name of the package and sknnr as the imported name, but per the discussion in #59, we're now tentatively planning to use sknnr across the board.

Pros for switching to `sknnr`:

Shorter, catchier, more memorable
No confusion about what name to install and/or import

Cons for switching to `sknnr`:

Less descriptive
No explicit reference to scikit-learn

Making the switch will be pretty straightforward, and should just require updating the pyproject.toml, the repository name on Github, and links to the repository throughout the docs (although Github will continue to redirect).

If we do make a final decision to go with sknnr on PyPI, I suggest we make an alpha release just to claim the package name (pre-releases can only be installed with pip install sknnr --pre, so it should be clear that it is not an official, stable release if anyone comes across it).

Replace existing port tests with automated regression tests

Currently, we test our estimator accuracy against manually generated results from yaImpute and pynnmap. Once we have all major functionality implemented and are confident in our results, we can switch to regression testing to ensure no errors are introduced. This was briefly discussed in #40:

Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of sknnr ... pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results.

We'll need to do some experimenting to find the right tool for our use case, but the advantage should be a massive simplification of our testing system (e.g. the removal of KNNTestDataset) and the ability to easily test against a wider range of parameters (e.g. values of n_neighbors and n_components).

Licensing

Before release, we need to:

Select a license for sknnr.
Ensure we comply with the license of ported code. We currently port code from:
1. yaImpute throughout, which is primarily GPL v3 but also contains some public domain code. Their GPL license was inherited and upgraded from the LGPL license of the ANN library.
2. statsmodels for the implementation of CCorA (although we can hopefully migrate to the sklearn implementation of CCA), which is BSD-3 licensed.
3. scikit-learn for the implementation of dataset loading, which is BSD-3 licensed.

See #27 for further discussion.

Expose dataset loading functions as public API

There are functions in sknnr.datasets._base that manage the loading of CSV files into Dataset objects (similar to sklearn's Bunch objects). At present, however, these utility functions (e.g. _load_dataset_from_csv_filenames and _load_csv_data) are meant for internal use and are currently used in load_moscow_stjoes and load_swo_ecoplot. As @aazuspan points out in lemma-osu/synthetic-knn#2, these functions (and callees) rely on "specific restrictions like requiring an integer index in the first column, and floats in the remaining columns" (see here).

Nonetheless, even with these restrictions, this functionality is likely useful across other packages that require sknnr or for users that may have similarly formatted CSV files. In order to use these functions in a public API, we would need to generalize _load_csv_data so that it possibly accepts DATA_MODULE as a keyword argument (as in sklearn.datasets.load_csv_data. This would work well for other packages that want to use this functionality, although perhaps users might have to create a module structure with their data to use it as well.

To consider:

Should these data loading functions continue to live in sknnr.datasets or in a more generic "utilities" module?
Do we need to support loading not only from modules (using importlib.resources) or from generic directories as well?

@aazuspan, further thoughts on this?

Make estimators compatible with `pandas` dataframes?

While trying to implement #2, I discovered another snag: Euclidean and Mahalanobis cannot be fit by passing a dataframe as X. The errors seem to originate in the transformers, but I haven't dug any deeper than that.

from sklearn.datasets import load_linnerud
from sklearn_knn import Euclidean

X, y = load_linnerud(return_X_y=True)
X_df = pd.DataFrame(X)

# Fits fine with array
Euclidean().fit(X, y)

# Fails with dataframe [ValueError: Length of values (20) does not match length of index (3)]
Euclidean().fit(X_df, y)

While sklearn definitely doesn't claim to be interoperable with pandas, my past experience and a quick test (below) indicates that most if not all regressors do support dataframes transparently.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

X, y = load_iris(return_X_y=True)
X_df = pd.DataFrame(X)
# All the regressors below support dataframes
regressors = [
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    KNeighborsRegressor(),
    LinearRegression(),
    MLPRegressor(),
    SVR(),
]

for reg in regressors:
    reg.fit(X_df, y)

My inclination is that we should too, but let me know what you think.

`list.index` accidentally stored as `dataframe_index_in_`

I just came across scikit-learn/scikit-learn#27037 that points out that using hasattr or getattr to retrieve a DataFrame index will unintentionally retrieve the list.index method. For us, this means that an estimator fitted with a list will store that method as dataframe_index_in_. Then, if you used return_dataframe_index=True with kneighbors, you'd get a weird IndexError.

from sknnr import EuclideanKNNRegressor
from sknnr.datasets import load_swo_ecoplot

X, y = load_swo_ecoplot(return_X_y=True)
X = X.tolist()

est = EuclideanKNNRegressor().fit(X, y)
print(est.dataframe_index_in_)  # <built-in method index of list object at 0x000001E8BB330BC0>
est.kneighbors(return_dataframe_index=True)  # IndexError: too many indices for array

This is the offending line:

https://github.com/lemma-osu/scikit-learn-knn-regression/blob/96a4230cbbd113136aa619f869afabe1d648e054/src/sknnr/_base.py#L14-L15

They solve this in scikit-learn/scikit-learn#27044 by using a new _is_pandas_df helper, but as far I can tell that's not released yet, so I think it'll be a while before we want to rely on that.

We could add a similar function to sknnr, but I wonder if we might break some implicit duck-typed support for DataFrame-like objects by adding an explicit class check here. An alternative would just be to check that index is an array using _is_arraylike, which would prevent this bug and should cover us for just about any other case.

Independent predict and score methods for KNeighborsRegressor derived estimators

The predict method for KNeighborsRegressor uses an X matrix to calculate a predicted score for each row in X. If the model was fit from this same X matrix and k is set to 1, predict will (by definition) return the y matrix used in fitting the model. This is because the nearest neighbor is always itself (ignoring duplicate values) and the y values associated with this neighbor are returned, i.e.

>>> from sklearn.neighbors import KNeighborsRegressor
>>> from sklearn.datasets import load_linnerud  
>>> X, y = load_linnerud(return_X_y=True)
>>> est = KNeighborsRegressor(n_neighbors=1).fit(X, y)
>>> print(np.all(est.predict(X) == y))
True

The kneighbors method, on the other hand, allows the option of not passing the X matrix in the call, in which case "neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor".

For kNN methods, it is common to use the "second-nearest neighbor" (truly, the first independent neighbor) to use in independent accuracy assessment. This is very similar to a leave-one-out (LOO) analysis, except that the model is fit with all plots rather than being fit n times with n-1 plots and predicting the left-out plot.

To accomplish this, we can:

Call kneighbors with no X array to get independent neighbors for each row
Continue with the same logic as in predict to get the predicted values for each column in X, but with independent neighbors

Currently, all of our estimators are using the predict and score methods from KNeighborsRegressors (i.e. these methods have not been overridden). We will want to retain predict and score as-is for when we send new X data to an already fit model. But we also need to provide additional methods for independent prediction and scoring, in the least disruptive way.

I don't see a particularly elegant way to do this, other than repeating most of what is already in KNeighborsRegressor.predict. In essence,

def predict_independent(self):
    # Note that kneighbors is called without `X` parameter
    if self.weights == "uniform":
        neigh_ind = self.kneighbors(return_distance=False)
        neigh_dist = None
    else:
        neigh_dist, neigh_ind = self.kneighbors()
    weights = _get_weights(neigh_dist, self.weights)
    # Remainder of predict method

def score_independent(self, y, sample_weight=None):
    # Here I'm using root mean squared error instead of R^2 which is what is returned from `score`
    y_pred = self.predict_independent()
    return mean_squared_error(
        y,
        y_pred,
        sample_weight=sample_weight,
        squared=False,
        multioutput="raw_values",
    )

These would likely be new methods in an overridden KNeighborsRegressor or possibly as a separate mix-in. @aazuspan, I'm looking for feedback on the best way to bring this functionality in if you have any thoughts. In early testing, I'm getting the correct values (based on a comparison with yaImpute), but am uneasy about the duplication of code.

Design and implementation of random forest nearest neighbors (RF-NN)

We started to discuss the design and implementation of RF-NN in #22, but thought it would be better to track in a separate issue. Because RF-NN is quite different than all other estimators in this package in terms of how distances are calculated, we'll need to likely use different sklearn base classes or derive our own to implement this. The goal of this issue is to lay out the design of RF-NN and decide how best to tackle it. As RF-NN was first introduced in the R yaImpute package, we rely heavily on their implementation details.

Fitting RF-NN models

One or more y attributes (along with multiple X covariates) are used to fit different random forests, notably one forest per each y attribute. In yaImpute, each forest is actually a classification (i.e. RandomForestClassifier) - the y attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).

Somewhat non-intuitively, in RF-NN, the actual values of the terminal nodes (i.e. the class values) in each random forest doesn't matter, as we only care about the terminal nodes' IDs of the forests' trees where the references land. For example, if there are q y attributes passed, q different forests of n trees will be created. For each reference observation passed to fit, we want to capture the terminal node ID that it falls into for each tree and each forest. Therefore, the needed output of the fit process is both the specification of the forests that were created (to run additional targets) and the "node matrix" of p rows (reference plots) by q * n columns that stores node IDS. We'll be leaning on the apply method of each forest to return the leaf node IDs.

Predicting targets using node similarity

Functionally, the predict process for each new target is very similar to standard random forests, although the target will be run through all q forests rather than just a single forest. As with the references, we only use the IDs of the terminal nodes that the target falls into - therefore, we get a (q * n) vector of node IDs as a result of passing a target through all forests. At this point, we use a node similarity metric to define the distances from the target to each candidate reference plot. As opposed to all other estimators that we've introduced so far which have a neighbor space greater than one dimension, distances in RF-NN are calculated as the inverse of the number of nodes that the reference and target share in common. Neighbors are ranked on this distance to find the k nearest neighbors for each target.

Estimator naming

An initial suggestion for the naming of this estimator would be RandomForestNNRegressor. We could also go with RFNNRegressor as the term RFNN is somewhat well understood by those in the imputation community. To be explict, I propose the former name and use it in subsequent discussion.

Design considerations

fit

We can rely on RandomForestClassifier to create the forests for each y attribute passed. As of now, it seems reasonable to have RandomForestNNRegressor composed of q RandomForestClassifiers rather than inherit from it. fit would introduce two estimator attributes: 1) the forests themselves (proposed name: forests_); and 2) the node matrix of the data that was used to fit the model (proposed name: node_matrix_). We'll also likely want to have easy access to a few counts, e.g. number of forests and number of trees in a forest. These could easily be properties of the class derived from information stored in self.forests_.

kneighbors

We will likely need to introduce a new method with the same signature as estimators derived from TransformedKNeighborsMixin, i.e. def kneighbors(self, X=None, n_neighbors=None, return_distance=True). The method would run X through self.forests_, capture a node ID matrix for X and then row-wise calculate distances from self.node_matrix. The return value would be the (optional) distances and row indexes of the n_neighbors. Note that n_neighbors will be set on the initialization of RandomForestNNRegressor just like our other estimators, but kneighbors can override this by passing a value to n_neighbors.

predict

Initially we had thought we may possibly be able to derive from KNeighborsRegressor if we use the callable weights parameter, because given weights and y attributes, the actual calculation of predicted attributes will be exactly the same as in all estimators that derive from TransformedKNeighborsMixin. However, in looking at that implementation more closely, predict still relies on calculating the n-dimensional distances and then applying the callable function to the distance array. We are probably better off by introducing a new predict function that is simply np.average with a weights parameter that could represent the inverse distances calculated from kneighbors.

Timing

As this estimator deviates substantially from the other estimators already introduced, I propose we first merge fb_add_estimators back into main and treat this as a separate feature branch.

Other possible enhancements

Rather than capturing a flat node matrix that is the combination of all forests and all trees within forests, we could capture a node matrix for each forest such that it behaves like an "axis", more akin to our other estimators. Distances could then be calculated per axis and n-dimensional Euclidean distance could yield neighbors. I believe this would be a new twist on RF-NN that is not implemented in yaImpute.
Many of the other estimators (e.g. GNNRegressor and MSNRegressor) apply dimensionality reduction such that the resultant axes are ordered based on the amount of variation explained. The eigenvalues from these axes are typically used to weight these axes when finding neighbors. As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forest themselves) that may provide weights when calculating distances. I've not seen anything to capture this, but also have not looked very closely at all diagnostics from RandomForestClassifier.

Possibility of using built-in Mahalanobis distance

Hey @grovduck, I was poking around sklearn and came across a built-in implementation of Mahalanobis as a distance metric. After reading through some stackoverflow discussions, I was able to bodge together an alternate implementation of MahalanobisKNNRegressor that doesn't require a custom transformer and produces (as far as I can tell) identical results.

class MahalanobisKNNRegressor(IDNeighborsRegressor):
    def fit(self, X, y):
        self.metric = "mahalanobis"
        self.metric_params = {"VI": np.linalg.inv(np.cov(X, rowvar=False))}
        return super().fit(X, y)

This implementation is pretty rough and I don't really understand how it works, but it does pass all tests. There are definitely upsides to this simpler approach in that we can remove some code that's ported from yaImpute and cut down on the number of estimator checks we need to pass, but do you think we're losing anything in terms of future flexibility? Or is there additional value to including the MahalanobisTransformer beyond its use in this estimator? Curious to hear your thoughts!

Simplify fit parameters for `CCATransformer` and `CCorATransformer`?

While working on #20 I noticed that CCATransformer and CCorATransformer both specify y as an optional parameter but won't actually work if it is None (unless the optional spp is provided instead). To reproduce:

from sklearn.datasets import load_linnerud
from sknnr.transformers import CCATransformer, CCorATransformer

X, y = load_linnerud(return_X_y=True)
# AxisError: axis 1 is out of bounds for array of dimension 0
CCATransformer().fit(X)

My first thought was that we should throw a ValueError if both y and spp are None, but now that I think about it I'm wondering if those transformers should instead just take a required y parameter and let the estimator handle the selection of fitting data. So for example, GNNRegressor.fit would pass either y or spp, but not both, to CCATransformer.fit.

Currently, CCATransformer really only takes an spp parameter because the estimator that uses it might pass that, which is probably a bit backwards and could lead to some confusion, as well as some unnecessary complexity. Making the estimators responsible for choosing the right fitting data would remove a couple parameters and the need for special error handling since both estimators already require y parameters.

Let me know what you think, @grovduck.

Set up docs

Probably sphinx or mkdocs. I've used sphinx in the past but would much rather write markdown than restructured text, so I think mkdocs is worth looking into.

Apparently mkdocs doesn't handle automatically building docs from docstrings, so we'd probably need to throw in mkdocstrings.

Support use of reduced number of axes in CCorATransformer, CCATransformer

We've touched on this in a few previous issues (#20, #22) but for estimators that perform dimensionality reduction and create "axes" that are linear combinations of their input covariates (i.e. CCorATransformer, CCATransformer), we want the flexibility that a user can define the number of axes to use from these estimators. Typically, these estimators will create the number of axes up to the number of X variables provided. For example, CCorATransformer uses a test internally to reduce the number of axes down based on the statistical significance of the axes.

We need to solve the following problems with this issue:

Allow the user to specify the number of axes to keep. This would likely be done through the estimator's __init__ and would default to None.
After the model is fit, use the minimum value from either the user-specified value or from the number of axes calculated during model fit.
If the number of axes is reduced, possibly modify and/or normalize the axis weightings to account for this. Each estimator handles this differently.
Use the updated number of axes and axis weightings in transform.
For these estimators' "wrappers" (i.e. MSNRegression and GNNRegression), test that predict and kneighbors return the expected values. Because yaImpute doesn't support a user-defined option for fewer axes, we will need to create new test files that implement the expected behavior and produce these results.