
baycomp's People

Contributors

dpaetzel, janezd, luccaportes, skylogic004


baycomp's Issues

Is Stan a requirement to run this package?

It appears that you use PyStan under the hood to perform the hierarchical t-test. Is that true, or is it simply an experimental feature that you are still working on?
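
For context, this is roughly the behaviour I am observing (as far as I can tell, Stan only enters the picture when per-fold matrices are passed, while 1-D score vectors go through the signed-rank test; please correct me if I misread the code):

import numpy as np
from baycomp import two_on_multiple

rng = np.random.default_rng(0)

# 1-D inputs (one average score per data set): signed-rank test, no pystan needed
a = rng.uniform(0.80, 0.90, 20)
b = rng.uniform(0.80, 0.90, 20)
print(two_on_multiple(a, b, rope=0.01))

# 2-D inputs (data sets x folds): the hierarchical model, which imports pystan
a_folds = rng.uniform(0.80, 0.90, (20, 10))
b_folds = rng.uniform(0.80, 0.90, (20, 10))
print(two_on_multiple(a_folds, b_folds, rope=0.01))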

Anyway, thanks a lot for sharing this great package! Keep up the good work!

ValueError: could not broadcast input array from shape (2) into shape (3)

Hi,

I am running the hierarchical-model version of two_on_multiple on two cross-validation matrices x and y.
I have run the test many times without problems, but for some matrices I get a ValueError.

Reproducing code:

import re
from ast import literal_eval

import numpy as np
from baycomp import two_on_multiple

x="""[[0.82080925 0.95953757 0.89595376 0.97109827 0.90116279]
 [0.90667808 0.91780822 0.90239726 0.89974293 0.9143102 ]
 [0.88695652 0.89565217 0.92982456 0.92982456 0.92105263]
 [0.84782609 0.86956522 0.7826087  0.84444444 0.91111111]
 [0.8        0.8        0.7        0.88333333 0.86666667]
 [0.92222222 0.92222222 0.93333333 0.95555556 0.94444444]
 [1.         0.96666667 0.9        0.96666667 0.93333333]
 [0.97369068 0.98180477 0.9771274  0.97909493 0.97442204]
 [1.         0.97777778 0.97777778 0.97777778 0.97777778]
 [0.99213287 0.98688811 0.98951049 0.99125874 0.98776224]
 [0.99388112 0.99344406 0.99562937 0.99606643 0.99519231]
 [0.93337731 0.91029024 0.94327177 0.88778878 0.88976898]
 [0.77       0.765      0.765      0.755      0.765     ]
 [0.75250836 0.77725753 0.79451138 0.74364123 0.76639893]
 [0.94782609 0.94891304 0.95271739 0.95541055 0.94453507]
 [0.95       0.96428571 0.97142857 0.9547619  0.96428571]
 [0.82222222 0.86666667 0.81818182 0.88636364 0.88636364]
 [0.81176471 0.90588235 0.89411765 0.81176471 0.82142857]
 [0.99361314 0.99635036 0.99178832 0.99726027 1.        ]
 [0.97368421 0.97974342 0.97501688 0.96623903 0.98109386]
 [0.77142857 0.8        0.79885057 0.82758621 0.7816092 ]
 [0.95804196 0.98601399 0.97902098 0.97902098 0.94366197]
 [0.95348837 0.95348837 0.95348837 0.97619048 1.        ]
 [0.72294372 0.64718615 0.69264069 0.72017354 0.72234273]
 [0.82025678 0.78601997 0.81597718 0.82881598 0.80571429]] """
y="""[[0.9132948  0.87283237 0.87861272 0.83815029 0.84883721]
 [0.89554795 0.8989726  0.89297945 0.88860326 0.90488432]
 [0.92173913 0.88695652 0.9122807  0.93859649 0.9122807 ]
 [0.91304348 0.82608696 0.89130435 0.84444444 0.93333333]
 [0.81666667 0.85       0.78333333 0.86666667 0.88333333]
 [0.94444444 0.95555556 0.91111111 0.91111111 0.93333333]
 [1.         0.96666667 0.9        0.96666667 0.93333333]
 [0.95279075 0.95918367 0.95696016 0.94564683 0.96261682]
 [0.98888889 0.98888889 0.97777778 0.98888889 0.98888889]
 [0.97902098 0.96853147 0.9798951  0.98164336 0.9798951 ]
 [0.99038462 0.98339161 0.98907343 0.9916958  0.9881993 ]
 [0.91358839 0.92612137 0.93139842 0.90825083 0.91881188]
 [0.775      0.805      0.85       0.755      0.81      ]
 [0.94715719 0.96120401 0.96385542 0.95046854 0.95448461]
 [0.92880435 0.93858696 0.93695652 0.94888526 0.94127243]
 [0.98571429 0.96666667 0.97619048 0.94285714 0.98571429]
 [0.82222222 0.93333333 0.90909091 0.79545455 0.86363636]
 [0.91764706 0.92941176 0.91764706 0.91764706 0.91666667]
 [0.97718978 0.97718978 0.96624088 0.97077626 0.9716895 ]
 [0.97300945 0.97366644 0.96556381 0.96961512 0.972316  ]
 [0.74857143 0.73714286 0.7183908  0.72413793 0.75287356]
 [0.94405594 0.97202797 0.97902098 0.95104895 0.95774648]
 [0.90697674 1.         1.         1.         1.        ]
 [0.77056277 0.71212121 0.73809524 0.72885033 0.7462039 ]
 [0.81740371 0.77746077 0.80884451 0.82738944 0.83571429]] """
x = re.sub(r"([^[])\s+([^]])", r"\1, \2", x)
y = re.sub(r"([^[])\s+([^]])", r"\1, \2", y)
x = np.array(literal_eval(x))
y = np.array(literal_eval(y))

print(x)
print(y)

probs = two_on_multiple(x, y, rope=0, plot=False, names=['x', 'y'])

Output:

ValueError                                Traceback (most recent call last)
<ipython-input-15-71d98cb2e0f9> in <module>
     60 print(y)
     61 
---> 62 probs= two_on_multiple(x, y, rope=0, plot=False, names=['x', 'y'])

~/anaconda3/envs/bayesian/lib/python3.7/site-packages/baycomp/multiple.py in two_on_multiple(x, y, rope, runs, names, plot, **kwargs)
    485     else:
    486         test = SignedRankTest
--> 487     return call_shortcut(test, x, y, rope, names=names, plot=plot, **kwargs)

~/anaconda3/envs/bayesian/lib/python3.7/site-packages/baycomp/utils.py in call_shortcut(test, x, y, rope, plot, names, *args, **kwargs)
     18 
     19 def call_shortcut(test, x, y, rope, *args, plot=False, names=None, **kwargs):
---> 20     sample = test(x, y, rope, *args, **kwargs)
     21     if plot:
     22         return sample.probs(), sample.plot(names)

~/anaconda3/envs/bayesian/lib/python3.7/site-packages/baycomp/multiple.py in __new__(cls, x, y, rope, nsamples, **kwargs)
    151 
    152     def __new__(cls, x, y, rope=0, *, nsamples=50000, **kwargs):
--> 153         return Posterior(cls.sample(x, y, rope, nsamples=nsamples, **kwargs))
    154 
    155     @classmethod

~/anaconda3/envs/bayesian/lib/python3.7/site-packages/baycomp/multiple.py in sample(cls, x, y, rope, runs, lower_alpha, upper_alpha, lower_beta, upper_beta, upper_sigma, chains, nsamples)
    443 
    444         rope, diff = scaled_data(x, y, rope)
--> 445         mu, stdh, nu = run_stan(diff)
    446         samples = np.empty((len(nu), 3))
    447         for mui, std, df, sample_row in zip(mu, stdh, nu, samples):

~/anaconda3/envs/bayesian/lib/python3.7/site-packages/baycomp/multiple.py in run_stan(diff)
    426 
    427         def run_stan(diff):
--> 428             stan_data = prepare_stan_data(diff)
    429 
    430             # check if the last pickled result can be reused

~/anaconda3/envs/bayesian/lib/python3.7/site-packages/baycomp/multiple.py in prepare_stan_data(diff)
    401                 if np.var(sample) == 0:
    402                     sample[:nscores_2] = np.random.uniform(-rope, rope, nscores_2)
--> 403                     sample[nscores_2:] = -sample[:nscores_2]
    404 
    405             std_within = np.mean(np.std(diff, axis=1))  # may be different from std_diff!

ValueError: could not broadcast input array from shape (2) into shape (3)

New release on PyPI

Hi!

Thanks for this awesome lib. I use it downstream in a convenience package for statistical analysis called autorank, and a user suggested that we also make the Bayesian analysis reproducible via the random state (sherbold/autorank#21). I think that is a great idea and would like to expose this. However, I would like to avoid doing so by importing the current version directly from GitHub.

Are there any plans to push a new release to PyPI? That would make this a lot easier for me :)

Best,
Steffen

Questions about baycomp.two_on_single

Hi there! Thank you for the insightful work, and for the baycomp library! I have (what I hope are) a few quick clarification questions regarding baycomp.two_on_single.

In particular, I am in a scenario where I have computed the accuracy of method A and method B over a single k-fold cross-validation on a single dataset -- for each algorithm, I have a vector of k performance scores (A[i] and B[i]), where each value gives the performance on the "test" split of a given fold (e.g., A[3] and B[3] give the performance, according to some metric, on the 3rd cross-validation fold). I am trying to decide whether the performance differences I observe between A and B are likely to be meaningful or not.

I think that baycomp.two_on_single is the function I should be using. However, the documentation was a bit confusing to me. In particular, the function requests "vectors of scores" for each model, and I was unsure whether these vectors correspond to per-fold performance or to average scores over multiple cross-validations. In short: is it okay to pass the length-k vectors A[i] and B[i] as input? If I were to run m repeated cross-validations, do I pass vectors of length m*k to this function? And does the ordering of these values matter -- if so, what is the restriction on the ordering?
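
For concreteness, this is how I would currently call it, under my (possibly wrong) reading that per-fold scores are passed directly and that repeated cross-validations are reported through the runs argument:

import numpy as np
from baycomp import two_on_single

rng = np.random.default_rng(0)
k = 10

# single k-fold cross-validation: pass the k per-fold scores directly
A = rng.uniform(0.80, 0.90, k)
B = rng.uniform(0.80, 0.90, k)
print(two_on_single(A, B, rope=0.01))

# m repeated cross-validations: concatenate the m*k per-fold scores and set runs=m
m = 5
A_rep = rng.uniform(0.80, 0.90, m * k)
B_rep = rng.uniform(0.80, 0.90, m * k)
print(two_on_single(A_rep, B_rep, rope=0.01, runs=m))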

How do I generate larger images?

Hello,

I'm publishing a paper and I need to enlarge the plots generated from my tests. I didn't find anything related in the docs.
Thanks.
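
A workaround I am using for now, in case it helps others: with plot=True the shortcut functions return the probabilities together with a matplotlib figure, which can then be resized and saved at a higher resolution with the usual matplotlib calls (this is a workaround, not a documented option):

import numpy as np
from baycomp import two_on_multiple

rng = np.random.default_rng(0)
x = rng.uniform(0.8, 0.9, 20)
y = rng.uniform(0.8, 0.9, 20)

# with plot=True the shortcut returns (probabilities, matplotlib figure)
probs, fig = two_on_multiple(x, y, rope=0.01, plot=True, names=["A", "B"])

fig.set_size_inches(8, 6)                                    # enlarge the figure
fig.savefig("comparison.png", dpi=300, bbox_inches="tight")  # export at print quality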

Add random state to two_on_multiple

Hi, I think it would be useful to have a random_state parameter in the two_on_multiple function, so we can provide reproducible code. I will submit a pull request with this modification. I made as few modifications as possible: the random_state parameter follows the same "path" as nsamples, i.e. it is treated as a kwarg until the Test object is instantiated.

I also had to update one test so it matches the call with the new parameter.
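
A sketch of the intended usage (random_state is the parameter proposed here; it is not part of any released version yet):

import numpy as np
from baycomp import two_on_multiple

rng = np.random.default_rng(0)
x = rng.uniform(0.8, 0.9, 20)
y = rng.uniform(0.8, 0.9, 20)

# With the proposed parameter, repeated calls with the same seed
# should produce identical posterior probabilities.
print(two_on_multiple(x, y, rope=0.01, random_state=42))
print(two_on_multiple(x, y, rope=0.01, random_state=42))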

Bagging versus cross validation

Is there a suggested usage for comparing models that were created using bagging (e.g. N samples of training data drawn M times with replacement) without cross-validation? For example, given a dataset with 1000 training samples and 100 test samples:

  1. randomly sample 10% (N=100) of the training data M=5 times with replacement;
  2. train 2 models on each training-data sample (e.g. random forest versus linear regression);
  3. test the trained models on the 100 test samples;
  4. take the mean or median of the results of each trained model.

To see how the models perform with additional data, repeat the above with varying amounts of the training data (2%: N=20, 10%: N=100, 20%: N=200), again taking the mean or median for each amount of training data.
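
To make the setup concrete, here is a minimal sketch of the procedure described above (synthetic data and scikit-learn models stand in for the real setup; the question is which baycomp test, if any, is appropriate for the resulting per-draw scores):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1100, n_features=10, noise=5.0, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

M = 5  # number of bootstrap draws per training-set size
for frac in (0.02, 0.10, 0.20):                 # 2%, 10%, 20% of the training data
    n = int(frac * len(X_train))
    scores_rf, scores_lr = [], []
    for _ in range(M):
        idx = rng.choice(len(X_train), size=n, replace=True)   # draw with replacement
        rf = RandomForestRegressor(random_state=0).fit(X_train[idx], y_train[idx])
        lr = LinearRegression().fit(X_train[idx], y_train[idx])
        scores_rf.append(r2_score(y_test, rf.predict(X_test)))
        scores_lr.append(r2_score(y_test, lr.predict(X_test)))
    print(frac, np.median(scores_rf), np.median(scores_lr))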

pip has old source code?

Hello,

I installed baycomp today with 'python -m pip install baycomp' and tried to use the 'two_on_multiple' function.

Then I got the error message: 'ImportError: Hierarchical model requires 'pystan'; install it by 'pip install pystan''

According to the latest update of the source code this should be fixed already, but it seems that this update was never published to PyPI. On the website https://pypi.org/project/baycomp/#files the latest version is 1.0.2 from 2019, and that is the version I got with pip install, too.

Maybe you want to update this? :-)

`HierarchicalTest` when `nfolds == 1`

I'm sorry for bothering you with another question. I have the case where I do
not need to use cross-validation (I can generate as much data as I want). This
means that the number of runs is equal to the number of scores I have for each
learning task, runs == nscores.

In that case, HierarchicalTest computes nfolds correctly as being 1.
This, however, results in the correlation being computed as

rho == 1 / nfolds == 1

I may be mistaken but, based on the hierarchical model, rho being 1 here
indicates maximal correlation between the runs for learning task i, whereas
intuitively the correlation should be smaller than, say, the correlation of
k-fold cross-validation for any k. (I'm not entirely sure, but shouldn't rho
be 0 in this case, i.e. a diagonal covariance matrix for learning task i,
since the runs are essentially uncorrelated?)
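
To make the concern concrete, this is essentially the computation I am describing (variable names paraphrased from my reading of the source, not exact):

# One learning task: `runs` independent repetitions, no cross-validation,
# so each repetition contributes exactly one score.
nscores = 30   # scores per learning task
runs = 30      # independent repetitions

nfolds = nscores // runs   # == 1 in this setting
rho = 1 / nfolds           # == 1: maximal assumed correlation between repetitions

# Intuitively, independent repetitions should be (nearly) uncorrelated,
# which would correspond to rho == 0, i.e. a diagonal covariance matrix.
print(nfolds, rho)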

Random training/test sets versus k-fold cross-val

Hi again! I had another question about the assumptions behind baycomp's two_on_single function. In particular, footnote 2 of the JMLR paper mentions that Nadeau and Bengio's (2003) correction was originally conceived for random train/test splits (rather than k-fold cross-validation), but this setting is not mentioned elsewhere in the work. Is it acceptable to use two_on_single for random training/test splits (rather than k-fold cross-validation) as well?
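
For reference, the correction in question is (as I understand Nadeau and Bengio, 2003) the corrected resampled t statistic, where the naive 1/J variance term is inflated by the test-to-training-set size ratio to account for the overlap between the J training sets; a sketch:

import numpy as np

def corrected_t(diffs, n_train, n_test):
    # diffs: score differences of the two methods over the J train/test splits
    J = len(diffs)
    d_mean = np.mean(diffs)
    d_var = np.var(diffs, ddof=1)
    # corrected variance: (1/J + n_test/n_train) * sigma^2 instead of sigma^2/J
    return d_mean / np.sqrt((1.0 / J + n_test / n_train) * d_var)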

Correct comparison method - need advice

Hello,
Could you please recommend a right comparison method for my problem?
I have N timeseries and predict K (usually K=4) last observations for each timeseries during a cross-validation (one predicted observation per fold). Specifics: a) this is timeseries-related walk-forward validation more similar to Leave-One-Out; b) this is regression problem. At the end, I have K*N scores. Each timeseries has different magnitude of forecasting errors/scores due to different amount of noise in the data.

Which comparison method should I use? What comes to mind:

  1. Treat all timeseries as a single dataset and use two_on_single() with vectors of K*N length and runs=1 (or runs=K?)
  2. Use two_on_multiple() with vectors of length N, where each item is the average of the K folds
  3. Use two_on_multiple() in hierarchical mode and pass matrices of (N,K) size and runs=1

#1 seems to be a bad choice due to the different magnitude of scores between series (the resulting distribution of scores is heavy-tailed), #3 seems optimal but slow, and #2 is a much faster but less precise alternative to #3. Are my conclusions correct?
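
A sketch of how I would call options 2 and 3 (the rope value is only a placeholder and would have to be chosen for the error metric at hand):

import numpy as np
from baycomp import two_on_multiple

rng = np.random.default_rng(0)
N, K = 40, 4
scores_a = rng.uniform(0.0, 1.0, (N, K))   # per-series, per-fold scores of model A
scores_b = rng.uniform(0.0, 1.0, (N, K))   # per-series, per-fold scores of model B

# Option 2: average the K folds per series, then use the faster 1-D test
probs_2 = two_on_multiple(scores_a.mean(axis=1), scores_b.mean(axis=1), rope=0.01)

# Option 3: pass the full (N, K) matrices so the hierarchical model is used
probs_3 = two_on_multiple(scores_a, scores_b, rope=0.01, runs=1)

print(probs_2, probs_3)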

All points stick to wall

Hi, I was doing some tests with my results using two_on_multiple and I got some weird plots.

First, if I set my rope to 0.96 (my results range from 0 to 100), I get this:
[plot omitted]

However, if I set it to 0.97, I get this:
[plot omitted]

I am using all the default parameters except "plot" and "rope", on version 1.0.2.

Any idea why this happens? Is there an interpretation for it?

`HierarchicalTest`: Better(?) priors for the data set means

Here and here (Stan implicitly samples this from a uniform, which is the intended behaviour as described in the publications on the hierarchical model) the prior on delta is set as Uniform(-max(abs(x)), max(abs(x))). In the publications on the hierarchical model, this is Uniform(-1, 1), presumably since accuracy is the metric under consideration whose maximum is 1 (although this does not explain the lower bound of -1).

My question is: Wouldn't Uniform(min(x), max(x)) be better-suited in general than the current choice of Uniform(-max(abs(x)), max(abs(x)))? Many metrics are asymmetric around 0:

  • logarithmic metrics may take on highly negative values but only moderately positive ones
  • most error measures (MSE, MAE, …) cannot take on values below 0

Because of that, using Uniform(-max(abs(x)), max(abs(x))) as the default prior on delta seems unnatural to me, but I may be mistaken.

A possible way to overcome this issue flexibly is to introduce another parameter to HierarchicalTest.sample, named e.g. data_set_mean_prior, and then handle the most common cases while also allowing users to specify the lower and upper bounds themselves:

# The original choice.
if data_set_mean_prior == "symmetric-max":
    delta_upper = np.max(np.abs(diff))
    delta_lower = -delta_upper
# If only positive values are sensible, this may be a better choice.
elif data_set_mean_prior == "min-max":
    delta_lower = np.min(diff)
    delta_upper = np.max(diff)
elif data_set_mean_prior == "zero-max":
    delta_lower = 0
    delta_upper = np.max(diff)
elif isinstance(data_set_mean_prior, tuple):
    delta_lower = data_set_mean_prior[0]
    delta_upper = data_set_mean_prior[1]
else:
    raise ValueError("data_set_mean_prior has to be one of "
                     "\"symmetric-max\", \"min-max\", \"zero-max\", "
                     "or of type tuple")

What do you think? Should I create a PR for this?
