Comments (8)
Concerning compute_persistence_pm: I found a way. I transfer the initialisation time axis into the index values of control.time; then I can easily add lags to the index.
import numpy as np
import xarray as xr
from xskillscore import pearson_r, rmse

dim = 'time'
nlags = ds.time.size
init_month_index = 0  # January init; use 10 for November
init_years = ds['initialization'].values

# collect the control timestamps that correspond to each initialization
init_cftimes = []
for year in init_years:
    init_cftimes.append(control.sel(time=str(year)).isel(time=init_month_index).time)
init_cftimes = xr.concat(init_cftimes, 'time')

# translate those timestamps into integer positions along control.time
control_times = list(control.time.values)
init_index = [control_times.index(t) for t in init_cftimes.time.values]

# persistence skill: correlate the control at the init positions with itself `lag` steps later
metric = pearson_r  # or rmse
plag = []
for lag in range(1, 1 + nlags):
    init_index_plus_lag = [i + lag for i in init_index]
    ref = control.isel({dim: init_index_plus_lag})
    fct = control.isel({dim: init_index})
    ref[dim] = fct[dim]
    plag.append(metric(ref, fct, dim=dim))
pers_new = xr.concat(plag, 'time')
pers_new['time'] = np.arange(1, 1 + nlags)
Decide if we want to maintain a separate significance level for init/uninit and persistence. If this is the case, a dimension distinction such as "quantile_persistence" and "quantile_ensemble" should be made to make plotting easy. The graphics plot was breaking in the notebook if the significance levels were different.
I want to keep different significance levels for the calculation. For the plotting I haven't implemented that yet; it now raises an error.
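A minimal sketch of that dimension split, assuming two bootstrapped confidence-interval arrays with different significance levels; the names quantile_persistence and quantile_ensemble are only placeholders. Renaming the quantile dimension per result keeps both in one Dataset without the coordinate clash that broke the plot.

import numpy as np
import xarray as xr

lead = np.arange(1, 11)
# hypothetical CI bounds: persistence at 5-95%, ensemble at 2.5-97.5%
pers_ci = xr.DataArray(np.random.rand(2, lead.size), dims=['quantile', 'lead'],
                       coords={'quantile': [0.05, 0.95], 'lead': lead})
ens_ci = xr.DataArray(np.random.rand(2, lead.size), dims=['quantile', 'lead'],
                      coords={'quantile': [0.025, 0.975], 'lead': lead})
# merging both on the same 'quantile' dim would align to the union of the coords and pad with NaN;
# renaming the dimension per result avoids that
ds_ci = xr.Dataset({
    'persistence': pers_ci.rename({'quantile': 'quantile_persistence'}),
    'ensemble': ens_ci.rename({'quantile': 'quantile_ensemble'}),
})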
I am still unsure how to gather all the results into a dataset nicely (I would prefer to have only the variable name as a single data_var here).
Fix pytest to deal with datetime64[ns]. (I think this was the problem you identified?)
Changed compute_persistence_pm. Didn't adapt compute_persistence yet, but it should be easily adaptable.
Revise the perfect_model notebook with your new bootstrap functions so that the whole thing compiles as you see fit.
Compiles.
Separate significance level: actually it shouldn't be different levels, although it is much harder to beat the persistence forecast than the uninitialized one in the first lead years.
I like the consolidated approach, but it leads to some data_vars which will only contain NaN. As soon as we put them all in a dataset, all dimensions are available to all DataArrays and many will end up as NaN. In theory we should avoid useless NaN fields.
As we use Datasets, this should in the end allow users to get results for more variables. Therefore I would opt for a result where the only data_var is the variable itself and the rest of the information is somehow stored in the coordinates.
But somehow you also have that kind of problem, because you have a threshold for your p-value from the z-score (at least implicitly) and then decide whether the p-value is acceptable or not.
I like the consolidated approach, but it leads to some data_vars which will only contain NaN. As soon as we put them all in a dataset, all dimensions are available to all DataArrays and many will end up as NaN. In theory we should avoid useless NaN fields.
I think xarray handles this with its broadcasting. So when it goes into a dataset, the DataArrays only maintain the dimensions they have going in. The only time NaNs appear is if the dimensions mismatch, like when the quantile coordinates mismatched with different significance levels.
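A small illustration of that behaviour with two toy DataArrays: variables with disjoint dimensions keep their own shape inside a Dataset, while two variables sharing a dimension with different coordinate values get aligned to the union of the coordinates and padded with NaN.

import numpy as np
import xarray as xr

a = xr.DataArray(np.ones(3), dims='lead', coords={'lead': [1, 2, 3]})
b = xr.DataArray(np.ones(2), dims='member', coords={'member': [0, 1]})
ds_ok = xr.Dataset({'a': a, 'b': b})
# no NaN: 'a' stays 1-D over lead, 'b' stays 1-D over member

c = xr.DataArray(np.ones(2), dims='quantile', coords={'quantile': [0.05, 0.95]})
d = xr.DataArray(np.ones(2), dims='quantile', coords={'quantile': [0.025, 0.975]})
ds_nan = xr.Dataset({'c': c, 'd': d})
# NaNs appear: both variables are reindexed onto the union [0.025, 0.05, 0.95, 0.975]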
As we use Datasets, this should in the end allow users to get results for more variables. Therefore I would opt for a result where the only data_var is the variable itself and the rest of the information is somehow stored in the coordinates.
Yeah, I agree that the current implementation isn't perfect, although see my discussion comments in https://github.com/bradyrx/climpred/pull/86. I think the current bootstrap_perfect_model does too much. It should only bootstrap, i.e., return a bootstrapped form of the control run. Currently it has switches to do all sorts of significance testing which can get wrapped into the class-based system. Perhaps just bootstrapping each variable in a dataset will prevent these issues above.
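A hedged sketch of what a bootstrap-only helper could look like under that reading, i.e. just resampling random chunks of the control run into a pseudo-ensemble; the function name bootstrap_control and its arguments are hypothetical, not climpred API.

import numpy as np
import xarray as xr

def bootstrap_control(control, nmember, length, dim='time'):
    """Resample `nmember` random chunks of `length` steps from the control run
    and stack them as members of an uninitialized pseudo-ensemble."""
    nt = control[dim].size
    members = []
    for _ in range(nmember):
        start = np.random.randint(0, nt - length + 1)
        chunk = control.isel({dim: slice(start, start + length)})
        chunk[dim] = np.arange(length)  # relabel so chunks align along `dim`
        members.append(chunk)
    return xr.concat(members, dim='member')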
I also had the idea to somehow put the p-values and CIs in coordinates.
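One hedged way of expressing that idea, assuming a skill DataArray over a lead dimension; the p-values and CI bounds ride along as non-dimension coordinates, so the resulting Dataset keeps only the variable itself as a data_var (all names here are illustrative).

import numpy as np
import xarray as xr

lead = np.arange(1, 11)
skill = xr.DataArray(np.random.rand(lead.size), dims='lead', coords={'lead': lead}, name='tos')
skill = skill.assign_coords(
    p=('lead', np.random.rand(lead.size)),        # bootstrapped p-value per lead
    ci_low=('lead', np.random.rand(lead.size)),   # lower confidence bound
    ci_high=('lead', np.random.rand(lead.size)),  # upper confidence bound
)
ds_skill = skill.to_dataset()  # single data_var 'tos'; p/ci_low/ci_high travel as coordinates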
How would you split up bootstrap_perfect_model? I thought about it but didn't really come to a better idea yet:
- If we only do bootstrapping, we get very large arrays (1000 x nlon x nlat). I didn't want to keep these, therefore I just calculate the CIs and p_value and return those.
What do you mean by a bootstrapped form?
It should only bootstrap, i.e., return a bootstrapped form of the control run.
Totally agree on:
Currently it has switches to do all sorts of significance testing which can get wrapped into the class-based system.
Well, we could split the for _ in range(bootstrap): part into a separate function, but there are few things to do with that output. Therefore in bootstrap_pm I just extract p_value and CI.
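For context, a hedged sketch of that extraction step on its own, assuming the resampling loop has already produced a skill array with a 'bootstrap' dimension (names and the p-value convention are placeholders, not the actual bootstrap_pm internals):

def ci_and_pvalue(bootstrapped_skill, reference_skill, ci=(0.05, 0.95), dim='bootstrap'):
    # confidence interval: quantiles over the bootstrap dimension
    ci_bounds = bootstrapped_skill.quantile(list(ci), dim=dim)
    # p-value: fraction of bootstrap samples in which the reference beats the bootstrapped skill
    p_value = (bootstrapped_skill < reference_skill).mean(dim)
    return ci_bounds, p_value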
We could also write the function so that it does only one comparison: vs. persistence or vs. uninitialized. But persistence and uninitialized have different dimensions anyway (persistence has lead years, uninitialized not really in a meaningful way), therefore I just put them all in this one big function.
What do you mean by a bootstrapped form?
I think I was confusing this with the _pseudo_ens function. My thinking was that, so you don't have to run _pseudo_ens many times, you could append its output to a special category on the PerfectModelEnsemble object that can be referenced. I understand it as generating an ensemble with the same dimensions as the initialized one, but simulating an uninitialized form.
How would you split up bootstrap_perfect_model? I thought about it but didn't really come to a better idea yet: If we only do bootstrapping, we get very large arrays (1000 x nlon x nlat). I didn't want to keep these, therefore I just calculate the CIs and p_value and return those.
I see now that bootstrap_perfect_model is mainly just there to get CIs and p_values. So perhaps it should just return an object with the same lat/lon dimensions with variables "p", "upper", and "lower" for the confidence intervals, or something similar. But this would only work for DataArrays, I think.
Well, we could split the for _ in range(bootstrap): part into a separate function, but there are few things to do with that output. Therefore in bootstrap_pm I just extract p_value and CI.
We could also write the function so that it does only one comparison: vs. persistence or vs. uninitialized. But persistence and uninitialized have different dimensions anyway (persistence has lead years, uninitialized not really in a meaningful way), therefore I just put them all in this one big function.
Agreed on these points. You can clean it up a little bit in https://github.com/bradyrx/climpred/pull/87, but don't worry too much about it. Let's get the bootstrapping working, persistence fixed, pytest, etc. in https://github.com/bradyrx/climpred/pull/87 and then get the object-oriented system merged. Then with the object-oriented system we can work on cleaning things up "under the hood".
Implemented in https://github.com/bradyrx/climpred/pull/87