
Comments (4)

slevang commented on July 3, 2024

Very nice, thanks for the quick detective work! I hadn't thought through the math very closely and was mistakenly assuming these quantities should be identical regardless of what comes out of the decomposition, but that's obviously wrong. The dask algorithm we're using is of course an approximation (svd_compressed) and this all makes total sense.

From a user perspective, if the compute time scales less than linearly with increasing n_power_iter, then I think a more conservative default here would make sense (at least 1 or 2, as suggested in the paper). Climate data will often match the "slowly decaying modes" description. I would happily take 2x the compute time to be guaranteed a good answer, and could always iterate downwards on this parameter if desired.

An ability to warn the user about an inaccurate decomposition, and a suggestion about how to fix it, is even better. No strong opinions on your points above, either computing an extra mode or computing the error on k-1 modes seems fine. And these types of tolerance definitions are always somewhat heuristic.

One other note: to maintain the fully lazy execution mode, these types of checks would have to go in the new _post_compute() section I added.


nicrie commented on July 3, 2024

I was curious about the default number of power iterations used by Scikit-learn's randomized_svd, and it turns out they use either 4 or 7. For consistency, I propose setting our default to 4 as well, and leaving it to users to decide whether to lower the number, and thus the computation time, at the cost of accuracy.

From sklearn's documentation:

n_iter int or ‘auto’, default=’auto’

Number of power iterations. It can be used to deal with very noisy problems. When ‘auto’, it is set to 4, unless n_components is small (< .1 * min(X.shape)) in which case n_iter is set to 7. This improves precision with few components. Note that in general users should rather increase n_oversamples before increasing n_iter as the principle of the randomized method is to avoid usage of these more costly power iterations steps. When n_components is equal or greater to the effective matrix rank and the spectrum does not present a slow decay, n_iter=0 or 1 should even work fine in theory (see [1] page 9).
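As a quick illustration of that rule, here is a minimal sketch (the data and n_components below are arbitrary placeholders, not taken from our use case):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 500))  # placeholder data

# sklearn's 'auto' rule: 7 iterations if n_components < 0.1 * min(X.shape), else 4
n_components = 10
n_iter = 7 if n_components < 0.1 * min(X.shape) else 4
print(n_iter)  # -> 7 here, since 10 < 0.1 * 500

# Passing n_iter="auto" (the default) applies exactly this rule internally
U, S, Vt = randomized_svd(X, n_components=n_components, n_iter="auto", random_state=0)
```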


nicrie commented on July 3, 2024

Good catch @slevang! It's not just the scores that are off; the entire decomposition appears to be inaccurate. One thing that matches between Dask's svd_compressed and Scikit-learn's randomized_svd is the total variance. However, the explained variance per mode already differs between them, indicating that it's distributed differently. The same applies to the eigenvectors, which ultimately explains the score differences.

I noticed that when we increase the n_power_iter parameter (which defaults to 0), the differences disappear. According to the documentation:

n_power_iter: int, default=0

Number of power iterations, useful when the singular values decay slowly. Error decreases exponentially as n_power_iter increases. In practice, set n_power_iter <= 4.

With a value of 4, the differences virtually vanish. But even a lower value like 2 already results in differences on the order of about $10^{-5}$, which is likely acceptable for many applications.
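For illustration, a rough sketch of this comparison on synthetic data with a decaying spectrum (not your original example; the sizes, chunking, and number of components are arbitrary):

```python
import numpy as np
import dask.array as da
from sklearn.utils.extmath import randomized_svd

# Synthetic data with a decaying singular spectrum (placeholder for the real data)
rng = np.random.default_rng(0)
U0 = rng.standard_normal((5000, 50))
V0 = rng.standard_normal((50, 300))
X = (U0 * (0.9 ** np.arange(50))) @ V0
Xd = da.from_array(X, chunks=(1000, 300))

# Reference singular values from sklearn's randomized SVD (default n_iter='auto')
_, s_ref, _ = randomized_svd(X, n_components=10, random_state=0)

for q in [0, 2, 4]:
    u, s, v = da.linalg.svd_compressed(Xd, k=10, n_power_iter=q, seed=0)
    max_diff = np.abs(s.compute() - s_ref).max()
    print(f"n_power_iter={q}: max |difference in singular values| = {max_diff:.2e}")
```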

In the original paper by Halko et al. (paragraph 1.6), they even suggest using a higher value than 0:

Unfortunately, the simplest version of this scheme is inadequate in many applications because the singular spectrum of the input matrix may decay slowly. To address this difficulty, we incorporate q steps of a power iteration, where q = 1 or q = 2 usually suffices in practice.

Since it's challenging to estimate the error order before computing the singular spectrum, we might consider using a higher default value for n_power_iter. However, this would come at the cost of increased computation time. I ran a quick experiment using your example to see how increasing the number of power iterations affects computation time (see the figure below). While I'm not sure how generalizable these results are, in this example, 2 power iterations increased the computation time by roughly 60%.

[Figure: computation time as a function of the number of power iterations]
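A rough sketch of how such a timing comparison can be run (placeholder data; absolute numbers will depend heavily on the scheduler, chunking, and array size):

```python
import time
import dask.array as da

# Placeholder array standing in for the data from the example above
X = da.random.random((20000, 2000), chunks=(2000, 2000))

for q in range(5):
    t0 = time.perf_counter()
    u, s, v = da.linalg.svd_compressed(X, k=20, n_power_iter=q, seed=0)
    da.compute(u, s, v)  # force execution so the elapsed time is meaningful
    print(f"n_power_iter={q}: {time.perf_counter() - t0:.1f} s")
```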

I'll also explore whether equation 1.10 in the paper can provide us with insights that we can relay to users, suggesting an increase in n_power_iter if the potential error appears to be too large. What are your thoughts on this approach?


nicrie commented on July 3, 2024

Calculating the reconstruction error $\Delta$ using equation (1.10) seems straightforward, but there are a couple of considerations to keep in mind:

Number of Modes:
To compute $\Delta$ for a rank-$k$ matrix approximation, we need to include the $(k+1)$-th singular value. So, for example, if you want to estimate the error for 5 modes, you actually need to compute 6 modes. To address this, I think we have two options: either refactor the code to always compute $k+1$ modes instead of $k$, or compute and report the error for $k-1$ modes.

When to Issue a Warning
To obtain a normalized value for $\Delta$ that can be used for any input matrix, independent of the data's scale, I think we can normalize $\Delta$ by the Frobenius norm of the input matrix. This can be calculated as $\sqrt{\kappa (m-1)},$ where $m$ is the number of samples, and $\kappa$ is the total variance (which is already computed and stored in self.data["total_variance"]). The key question here is when to issue a warning.
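To make this concrete, here is a minimal sketch of the normalization on synthetic, centered data. It computes the rank-$k$ residual norm explicitly for illustration only; in practice the estimate from equation (1.10), based on the $(k+1)$-th singular value, would replace the explicit residual:

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 200))
X = X - X.mean(axis=0)                  # centered data, as in a PCA/EOF setting
m, k = X.shape[0], 5

# Compute k+1 modes so the (k+1)-th singular value is available for the estimate
u, s, v = da.compute(*da.linalg.svd_compressed(da.from_array(X), k=k + 1, seed=0))

# Rank-k reconstruction error (formed explicitly here only for illustration)
delta = np.linalg.norm(X - (u[:, :k] * s[:k]) @ v[:k, :])

# Frobenius norm of X from the total variance kappa: ||X||_F = sqrt(kappa * (m - 1))
kappa = X.var(axis=0, ddof=1).sum()
delta_norm = delta / np.sqrt(kappa * (m - 1))
print(delta_norm)
```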

Taking your previous example, we can calculate Frobenius-normalized reconstruction errors for different values of n_power_iter:

| n_power_iter | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| $\Delta_{norm}$ | 9.81 | 0.50 | 0.27 | 0.21 | 0.18 |

I also conducted similar analyses with different data sets and various normalizations (e.g., standardize=True and use_coslat=True), and the results appear consistent. Generally, it seems that $\Delta_{norm} < 0.5$ indicates reasonably accurate results. However, I'd like to establish a less heuristic criterion for issuing warnings.
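Purely as an illustration, such a check could look like the following (the function name is hypothetical and the 0.5 threshold is just the provisional value from above, not an established criterion):

```python
import warnings

def check_accuracy(delta_norm: float, threshold: float = 0.5) -> None:
    """Warn if the Frobenius-normalized reconstruction error exceeds the threshold."""
    if delta_norm > threshold:
        warnings.warn(
            f"Normalized reconstruction error is {delta_norm:.2f}; the decomposition "
            "may be inaccurate. Consider increasing n_power_iter.",
            UserWarning,
        )

check_accuracy(0.81)  # example value above the provisional threshold -> warns
```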

