
Comments (9)

grovduck commented on September 27, 2024

From @aazuspan:

Just a quick follow-up after running a bit of performance profiling with pyinstrument on GNNRegressor.kneighbors: the proportion of CPU time spent on KNeighborsRegressor.kneighbors vs. GNNRegressor._apply_transform varies with the size of X. With 4096x4096 samples, _apply_transform only takes up about 8.5% of the time.

[pyinstrument profile: 4096x4096 samples]

On the other hand, if the dask chunks were really small, that proportion might shift to where _apply_transform takes the majority of the time. Here's the result with 64x64 samples, where _apply_transform takes 86% of the CPU time for kneighbors.

[pyinstrument profile: 64x64 samples]

Maybe there's a substantial overhead to the _apply_transform method that makes it relatively inefficient for small datasets?

from sknnr.

grovduck commented on September 27, 2024

From @grovduck:

Your results really surprise me. I varied chunk sizes from 32x32 up to 4096x4096, and I'm completely flummoxed as to why these two methods (_apply_transform and kneighbors) wouldn't scale linearly relative to one another, independent of chunk size. Do you have any ideas off the top of your head why chunk size would affect their relative performance?


grovduck commented on September 27, 2024

From @aazuspan:

(@aazuspan provided a Jupyter notebook with basic profiling code to replicate the issue):

[cell]
%load_ext pyinstrument
%load_ext autoreload
%autoreload 2

[cell]
from sknnr.datasets import load_swo_ecoplot
from sknnr import GNNRegressor
import numpy as np

[cell]
X, y = load_swo_ecoplot(return_X_y=True, as_frame=True)
est = GNNRegressor().fit(X, y)
dummy_X = np.random.rand(256 * 256, X.shape[1])

[cell]
%%pyinstrument
est.kneighbors(dummy_X)

After some more experimentation, I'm starting to notice that even with small sample sizes, the sklearn call sometimes takes longer than the transform on the first run, but not on subsequent runs. That makes me think there may be some compiling overhead or caching going on in the sklearn/scipy/numpy code that's allowing it to run very fast with repeated runs on small datasets.

The only other explanation I can think of for why their relative performance would vary with sample size is time complexity of the algorithms, e.g. if the CCA transform scales linearly while the k-neighbors scales exponentially. I can't think of a good reason why that would be the case, but I certainly don't understand either algorithm well enough to be confident that it's not.


grovduck commented on September 27, 2024

From @grovduck:

After some more experimentation, I'm starting to notice that even with small sample sizes, the sklearn call sometimes takes longer than the transform on the first run, but not on subsequent runs. That makes me think there may be some compiling overhead or caching going on in the sklearn/scipy/numpy code that's allowing it to run very fast with repeated runs on small datasets.

I have a sneaking suspicion (although I could be wrong) this has to do with my use of properties for most of the code in CCA (and CCorA). Because I’m calculating these “on-the-fly” (i.e. “projector” calls the properties axis_weights and coefficients which call subsequent properties) and not caching any of these properties, I’m guessing they are being recalculated each time? Does that make sense?

This wouldn’t address what you’ve found with running it subsequent times, though. I’m still not understanding that.

The only other explanation I can think of for why their relative performance would vary with sample size is time complexity of the algorithms, e.g. if the CCA transform scales linearly while the k-neighbors scales exponentially. I can't think of a good reason why that would be the case, but I certainly don't understand either algorithm well enough to be confident that it's not.

I can’t think of any reason that transform and k-neighbors shouldn’t covary as a function of the number of array elements. The number of neighbors used in kneighbors would stay the same, the transform would stay the same, only the number of operations should differ, right? The only edge case I can think of is if the transformed score puts the point into a dense area of plots and the KDTree has to run more iterations to find the nearest neighbors.


grovduck commented on September 27, 2024

From @aazuspan:

I have a sneaking suspicion (although I could be wrong) this has to do with my use of properties for most of the code in CCA (and CCorA). Because I’m calculating these “on-the-fly” (i.e. “projector” calls the properties axis_weights and coefficients which call subsequent properties) and not caching any of these properties, I’m guessing they are being recalculated each time? Does that make sense?

Yes, those get recalculated every time they're called, so it might be worth experimenting with cached_property instead. There's some overhead to caching too, though, so a quick operation might not benefit at all. Worth a shot!
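As a minimal sketch of that idea (the class and property names here are hypothetical stand-ins, not sknnr's actual CCA implementation):

```python
from functools import cached_property

class Ordination:
    """Toy stand-in for CCA, where projector() chains derived properties."""

    def __init__(self):
        self.compute_count = 0  # tracks how often the expensive step runs

    @cached_property
    def coefficients(self):
        # The expensive linear algebra would live here; we just count calls.
        self.compute_count += 1
        return [[1.0, 0.0], [0.0, 1.0]]

    def projector(self):
        # With a plain @property this would recompute coefficients on every
        # call; cached_property computes it once and stores it on the instance.
        return self.coefficients

ord_ = Ordination()
ord_.projector()
ord_.projector()
```

After two projector() calls, the expensive step has run only once; with an uncached property it would have run twice.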

This wouldn’t address what you’ve found with running it subsequent times, though. I’m still not understanding that.

Agreed... I ran a few more, slightly more systematic tests and confirmed that with 64x64 samples, the ndarray.kneighbors portion of the call takes about 10x longer on the first run after restarting the kernel than it does on subsequent runs. It continues to run fast even if you change the input data, which to me points to some kind of compilation step rather than caching. With that said, the 10x overhead on the first call is only ~150ms and doesn't seem to scale with sample size, so it's negligible with larger datasets. Probably not worth worrying about.

I can’t think of any reason that transform and k-neighbors shouldn’t covary as a function of the number of array elements. The number of neighbors used in kneighbors would stay the same, the transform would stay the same, only the number of operations should differ, right? The only edge case I can think of is if the transformed score puts the point into a dense area of plots and the KDTree has to run more iterations to find the nearest neighbors.

I don't understand the sklearn implementation of kneighbors well enough to even guess at a cause, but your thought about KDTree sounds reasonable to me. I assume we don't want to dive into optimizing sklearn, so as far as kneighbors performance goes, the takeaways are 1) there's some overhead that may affect small datasets, and 2) the sklearn side of kneighbors takes most of the time with reasonably large datasets, but maybe we can optimize our transformers further with some caching, right?


grovduck commented on September 27, 2024

Using an editable install of sknnr, I first decorated CCA.env_center, CCA.axis_weights, and CCA.coefficients with @cached_property. This seemed to help considerably: at all chunk sizes from 32x32 to 8192x8192, KNeighborsRegressor.kneighbors and CCATransformer.transform covaried as expected, at roughly 85% and 15% of the CPU time, respectively.

Revisiting CCATransformer though, I think it might make more sense to set estimator attributes in fit for the matrices needed for transformation, e.g.:

class CCATransformer(ComponentReducerMixin, TransformerMixin, BaseEstimator):
    def fit(self, X, y):
        ...
        self.ordination_ = CCA(X, y)
        self.set_n_components()
        # Persist the matrices needed by transform so they are computed only once.
        self.env_center_ = self.ordination_.env_center
        self.projector_ = self.ordination_.projector(n_components=self.n_components_)
        return self

    def transform(self, X, y=None):
        ...
        # A single centered matrix multiply; nothing is re-derived per call.
        return (X - self.env_center_) @ self.projector_

This could be done without changing CCA (although it could be argued that CCA is in need of refactoring anyway) and persists the data in the scope of CCATransformer. This proposed change preserves the profiling behavior I observed after introducing the @cached_property decorators. If this is acceptable, we will want a PR to set these minimal estimator attributes on CCATransformer and CCorATransformer.
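A standalone sketch of that fit-time pattern (CenteredProjector is a hypothetical name, and an identity matrix stands in for the real CCA projection; this assumes NumPy is available):

```python
import numpy as np

class CenteredProjector:
    """Hypothetical sketch of the fit-time pattern: precompute the centering
    vector and projection matrix once, so transform() is a single matmul."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.center_ = X.mean(axis=0)  # stands in for env_center_
        # Identity matrix as a placeholder for the real CCA projector.
        self.projector_ = np.eye(X.shape[1])
        return self

    def transform(self, X):
        # Nothing is re-derived here; both arrays were set during fit().
        return (np.asarray(X, dtype=float) - self.center_) @ self.projector_

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Xt = CenteredProjector().fit(X).transform(X)
# Each row of Xt is centered on the column means of the training data.
```

The design point is the same as in the snippet above: repeated transform() calls pay only for one matrix multiply, not for re-deriving the ordination.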

Also, while profiling I noticed that we call CCATransformer._validate_data twice per kneighbors call: once in TransformedKNeighborsMixin._apply_transform and then again in CCATransformer.transform. If I comment out the former, tests still pass for all estimators. Thoughts on this?

[pyinstrument profile showing the duplicate _validate_data calls]


grovduck commented on September 27, 2024

Just one other oddity when using the dask dashboard (and really the genesis of this issue). In the profiling tab, it doesn't look like the profiler is always pointing to the correct line of code when calculating time. For example, in the below image the bar associated with the mouse pointer takes up 94% of the time to run GNNRegressor.predict, but it points to this code:

def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
    """Return neighbor indices and distances using transformed feature data."""
    X_transformed = self._apply_transform(X) if X is not None else X  # <-- line flagged by the profiler
    return super().kneighbors(
        X=X_transformed, n_neighbors=n_neighbors, return_distance=return_distance
    )

when I believe it should be pointing to the line below. This has the potential for confusion when profiling slow run times.

[dask dashboard profiling-tab screenshot]


aazuspan commented on September 27, 2024

@grovduck Thanks for compiling all of this!

Revisiting CCATransformer though, I think it might make more sense to set estimator attributes in fit for the matrices needed for transformation...

Totally agreed! It makes sense to me to store anything that can be calculated during fitting and needs to be accessed repeatedly as an attribute.

it looks like we are calling CCATransformer._validate_data twice within TransformedKNeighborsMixin._apply_transform, once in TransformedKNeighborsMixin._apply_transform and then again in CCATransformer.transform. If I comment out the former, tests still pass for all estimators. Thoughts on this?

Good catch. I think the former was added first and was needed to catch mismatched feature names. When the latter was added to pass an estimator check, the former must have become redundant. I'd say let's get rid of it!

In the profiling tab, it doesn't look like the profiler is always pointing to the correct line of code when calculating time.

This is very interesting. I can't think of what would cause this other than a Dask bug, but I also haven't seen any references to this elsewhere. In any case, definitely something to watch out for.


grovduck commented on September 27, 2024

Resolved by #48

