Comments (4)
Very nice, thanks for the quick detective work! I hadn't thought through the math very closely and was mistakenly assuming these quantities should be identical regardless of what comes out of the decomposition, but that's obviously wrong. The dask algorithm we're using is of course an approximation (svd_compressed) and this all makes total sense.
From a user perspective, if the compute time scales less than linearly with increasing `n_power_iter`, then I think a more conservative default here would make sense (at least 1 or 2, as suggested in the paper). Climate data will often match the "slowly decaying modes" description. I would happily take 2x compute time and be guaranteed a good answer, and could always iterate downwards on this parameter if desired.
An ability to warn the user about an inaccurate decomposition, plus a suggestion for how to fix it, would be even better. No strong opinions on your points above; either computing an extra mode or computing the error on `k-1` modes seems fine. And these types of tolerance definitions are always somewhat heuristic.
One other note: to maintain the fully lazy execution mode, these types of checks would have to go in the new `_post_compute()` section I added.
from xeofs.
I was curious about the default number of power iterations used by Scikit-learn's `randomized_svd`, and it turns out they use either 4 or 7. For consistency, I propose setting the default to 4 as well, and then leaving it to users to decide whether to decrease the number, trading accuracy for reduced computation time.
From sklearn's documentation:
`n_iter` : int or 'auto', default='auto'
Number of power iterations. It can be used to deal with very noisy problems. When ‘auto’, it is set to 4, unless n_components is small (< .1 * min(X.shape)) in which case n_iter is set to 7. This improves precision with few components. Note that in general users should rather increase n_oversamples before increasing n_iter as the principle of the randomized method is to avoid usage of these more costly power iterations steps. When n_components is equal or greater to the effective matrix rank and the spectrum does not present a slow decay, n_iter=0 or 1 should even work fine in theory (see [1] page 9).
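For illustration, the "auto" rule quoted above can be reproduced explicitly (the matrix and sizes here are arbitrary):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(42)
X = rng.standard_normal((500, 200))

# Reproduce the "auto" rule: 7 iterations when n_components is small
# (< 0.1 * min(X.shape)), otherwise 4.
n_components = 10
n_iter = 7 if n_components < 0.1 * min(X.shape) else 4

U, s, Vt = randomized_svd(X, n_components=n_components, n_iter=n_iter, random_state=0)
```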
Good catch @slevang! It's not just the scores that are off; the entire decomposition appears to be inaccurate. One thing that matches between Dask's `svd_compressed` and Scikit-learn's `randomized_svd` is the total variance. However, the explained variance per mode already differs between them, indicating that the variance is distributed differently across modes. The same applies to the eigenvectors, which ultimately explains the score differences.
I noticed that when we increase the `n_power_iter` parameter (which defaults to 0), the differences disappear. According to the documentation:
n_power_iter: int, default=0
Number of power iterations, useful when the singular values decay slowly. Error decreases exponentially as n_power_iter increases. In practice, set n_power_iter <= 4.
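For reference, a minimal side-by-side call with and without power iterations might look like this (the shapes and `k` are arbitrary):

```python
import dask.array as da

x = da.random.random((2000, 400), chunks=(500, 400))

# Default: no power iterations (fastest, least accurate).
u0, s0, v0 = da.linalg.svd_compressed(x, k=10, n_power_iter=0, seed=0)

# A few power iterations, as recommended for slowly decaying spectra.
u4, s4, v4 = da.linalg.svd_compressed(x, k=10, n_power_iter=4, seed=0)

s0, s4 = da.compute(s0, s4)
```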
With a value of 4, the differences virtually vanish, and even a lower value like 2 already reduces them substantially.
In the original paper by Halko et al. (paragraph 1.6), they even suggest using a higher value than 0:
Unfortunately, the simplest version of this scheme is inadequate in many applications because the singular spectrum of the input matrix may decay slowly. To address this difficulty, we incorporate q steps of a power iteration, where q = 1 or q = 2 usually suffices in practice.
Since it's challenging to estimate the order of the error before computing the singular spectrum, we might consider using a higher default value for `n_power_iter`. However, this would come at the cost of increased computation time. I ran a quick experiment using your example to see how increasing the number of power iterations affects computation time (see the figure below). While I'm not sure how generalizable these results are, in this example 2 power iterations increased the computation time by roughly 60%.
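A rough way to measure this scaling on one's own data (the array sizes are arbitrary, and wall-clock numbers will of course vary by machine and scheduler):

```python
import time
import dask.array as da

x = da.random.random((2000, 500), chunks=(500, 500))

# Time the full compute for increasing numbers of power iterations.
timings = {}
for q in (0, 1, 2, 4):
    t0 = time.perf_counter()
    u, s, v = da.linalg.svd_compressed(x, k=20, n_power_iter=q, seed=0)
    da.compute(u, s, v)
    timings[q] = time.perf_counter() - t0
```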
I'll also explore whether equation 1.10 in the paper can give us an error estimate that we can relay to users, suggesting an increase in `n_power_iter` if the potential error appears to be too large. What are your thoughts on this approach?
Calculating the reconstruction error

To obtain a normalized value, the reconstruction error can be divided by the total variance (`self.data["total_variance"]`). The key question here is when to issue a warning.
Taking your previous example, we can calculate Frobenius-normalized reconstruction errors for different values of `n_power_iter`:
| `n_power_iter` | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| reconstruction error | 9.81 | 0.50 | 0.27 | 0.21 | 0.18 |
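For reference, here is one way such a Frobenius-normalized reconstruction error could be computed; this sketch uses Scikit-learn's `randomized_svd` and synthetic data, and the exact normalization used above is an assumption:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def normalized_reconstruction_error(X, k, n_iter):
    """Frobenius norm of (X - X_k), relative to the Frobenius norm of X,
    where X_k is the rank-k reconstruction from a randomized SVD."""
    U, s, Vt = randomized_svd(X, n_components=k, n_iter=n_iter, random_state=0)
    X_k = (U * s) @ Vt  # scale columns of U by singular values
    return np.linalg.norm(X - X_k) / np.linalg.norm(X)

# Synthetic data with a slowly decaying spectrum (columns scaled by 1/sqrt(j)).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 100)) * (1.0 / np.sqrt(np.arange(1, 101)))

errors = {q: normalized_reconstruction_error(X, k=10, n_iter=q) for q in (0, 2, 4)}
```

With this kind of spectrum, the error should shrink as `n_iter` grows, mirroring the trend in the table.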
I also conducted similar analyses with different data sets and various normalizations (e.g., `standardize=True` and `use_coslat=True`), and the results appear consistent.